This book is very strong on the fundamentals, while the R code is minimal and easy to follow.
I understand how SGD is just taking a step proportional to the gradient and how backprop computes the partial derivative of the loss function with respect to each model weight.
But with more advanced optimizers the gradient is not really used directly: it gets per-weight normalization, is blended with momentum, clipped, etc.
So really, how important is computing the exact gradient using calculus, vs just knowing the general direction to step? Would that be cheaper to calculate than full derivatives?
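For concreteness, here's a minimal numpy sketch (names and numbers are mine, purely illustrative) contrasting a plain SGD step, a simplified Adam-style step, and the "general direction only" idea, which actually exists in the literature as signSGD (https://arxiv.org/abs/1802.04434):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=10)   # model weights
g = rng.normal(size=10)   # gradient of the loss w.r.t. w, from backprop
lr = 1e-2

# Plain SGD: step proportional to the raw gradient.
w_sgd = w - lr * g

# Simplified Adam-style step: m and v stand in for running averages of the
# gradient and the squared gradient accumulated over earlier steps. With
# fresh buffers like these the update reduces to lr * sign(g), one reason
# Adam is sometimes described as a smoothed signSGD.
m, v = g, g**2
w_adam = w - lr * m / (np.sqrt(v) + 1e-8)  # per-weight normalization

# "General direction only": keep just the sign of each component (signSGD).
w_sign = w - lr * np.sign(g)
```

Note the catch for the "would that be cheaper?" question: `np.sign(g)` is computed *from* `g`, so the full backprop pass is still needed. What signSGD saves is communication bandwidth in distributed training, not the cost of obtaining the gradient itself.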
> how important is computing the exact gradient using calculus
Normally the gradient is computed with a small "minibatch" of examples, meaning that on average over many steps the true gradient is followed, but no individual step moves exactly along the true gradient. This noisy walk is actually quite beneficial for the final performance of the network (https://arxiv.org/abs/2006.15081 , https://arxiv.org/abs/1609.04836), so much so that people started wondering what the best way is to "corrupt" this approximate gradient even more to improve performance (https://arxiv.org/abs/2202.02831 , and many other works relating to SGD noise).
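A toy illustration of that noise (my own sketch, using linear regression rather than a neural net): each minibatch gradient is an unbiased estimate of the full-batch gradient, but any single one points somewhere slightly different.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))                 # full dataset
y = X @ np.array([1., -2., 0.5, 3., -1.]) + 0.1 * rng.normal(size=10_000)
w = np.zeros(5)

def grad(Xb, yb, w):
    # Gradient of mean squared error on a batch.
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

full_g = grad(X, y, w)                           # the "true" gradient

idx = rng.choice(len(y), size=32, replace=False)
mini_g = grad(X[idx], y[idx], w)                 # noisy minibatch estimate

# Unbiased on average, but any single minibatch deviates:
print(np.linalg.norm(mini_g - full_g) / np.linalg.norm(full_g))
```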
> vs just knowing the general direction to step
I can't find relevant papers now, but I seem to recall that the Hessian eigenvalues of the loss function decay rather quickly, which means that taking a step in most directions will not change the loss very much. That is to say, you have to know which direction to go quite precisely for an SGD-like method to work. People have been trying to visualize the loss and trajectory taken during optimization https://arxiv.org/pdf/1712.09913 , https://losslandscape.com/
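To see why a fast-decaying spectrum punishes imprecise directions, here's a toy quadratic with eigenvalues 1/k^2 (my own stand-in for the reported structure, not taken from those papers): a step along the top eigenvector changes the loss hundreds of times more than a step along a random direction.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1000

# Toy quadratic loss L(w) = 0.5 * w^T H w with a rapidly decaying
# Hessian spectrum (eigenvalues 1/k^2).
eigvals = 1.0 / np.arange(1, d + 1) ** 2
Q = np.linalg.qr(rng.normal(size=(d, d)))[0]   # random orthonormal eigenbasis
H = Q @ np.diag(eigvals) @ Q.T

def loss(w):
    return 0.5 * w @ H @ w

w = np.zeros(d)
step = 0.1

top_dir = Q[:, 0]                   # direction of largest curvature
rand_dir = rng.normal(size=d)
rand_dir /= np.linalg.norm(rand_dir)

print(loss(w + step * top_dir))     # ~0.5 * step^2 * 1.0
print(loss(w + step * rand_dir))    # ~100-1000x smaller on average
```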
Scientists and academics demand an entirely different level of rigor compared to customers of LLM providers.
The iPhone 17 is the same price as the Pixel 10.
> better
But the iPhone 17 has better hardware features, like UWB, better cameras, and a _far_ faster CPU.
> open source
Only if you install Graphene, and then never install anything that requires Google Play Services, which is basically every commercial app.
The (discrete) Fourier transform is also a linear transformation, which is why the initial effort of thinking abstractly in terms of vector spaces and transformations between them pays lots of dividends when it's time to understand more advanced topics such as the DFT, which is "just" a change of basis.
My appreciation for the subject grew considerably after working through the book "Linear Algebra Done Right" by Axler https://linear.axler.net
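A quick numpy check of the change-of-basis claim above: build the DFT matrix explicitly and confirm it agrees with `np.fft.fft`, which computes the same linear map, just via a faster algorithm.

```python
import numpy as np

# The DFT as an explicit matrix: F[j, k] = exp(-2*pi*i*j*k/N).
# Applying the transform is just multiplying by this matrix, i.e.
# expressing the signal in the Fourier basis.
N = 8
j, k = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
F = np.exp(-2j * np.pi * j * k / N)

x = np.random.default_rng(0).normal(size=N)

# The matrix form agrees with the FFT.
assert np.allclose(F @ x, np.fft.fft(x))
```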
All progress starts out as a fringe belief.
https://en.wikipedia.org/wiki/Pharmacovigilance#Adverse_even...