Paper: An overview of gradient descent optimization algorithms
But I kinda prefer reading the blog post of the same:
Another great reference on the same topic from CS231n: here
This note is just for me to have a quick refresher to look up whenever things feel muddy; I'll keep adding and updating optimizers and their explanations as needed.
Gradient Descent
Batch gradient descent
```python
for i in range(nb_epochs):
    params_grad = evaluate_gradient(loss_function, data, params)
    params = params - learning_rate * params_grad
```
- Gradients are calculated once per epoch over the FULL dataset, and one update is performed.
Batch gradient descent is guaranteed to converge to the global minimum for convex error surfaces and to a local minimum for non-convex surfaces.
Stochastic Gradient Descent
```python
import numpy as np

for i in range(nb_epochs):
    np.random.shuffle(data)
    for example in data:
        params_grad = evaluate_gradient(loss_function, example, params)
        params = params - learning_rate * params_grad
```
- Gradients are calculated after every sample seen, and an update is performed per sample.
- High-variance updates; can still eventually converge to the global minimum (for convex surfaces) if learning-rate decay is used.
Mini-batch gradient descent
```python
import numpy as np

for i in range(nb_epochs):
    np.random.shuffle(data)
    for batch in get_batches(data, batch_size=50):
        params_grad = evaluate_gradient(loss_function, batch, params)
        params = params - learning_rate * params_grad
```
- Lower-variance updates compared to vanilla SGD.
- Can make use of hardware-optimized matrix multiplications with well-chosen mini-batch sizes.
Gradient Descent Optimisations
Momentum
- Amplifies the consistent direction of GD while also dampening oscillations.
- Done by adding a fraction $\gamma$ of the previous update vector when calculating the current one: $v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta)$, then $\theta = \theta - v_t$.
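A minimal NumPy sketch of the momentum update (the toy quadratic objective and the function/variable names here are mine, not from the paper):

```python
import numpy as np

def momentum_step(params, grad, velocity, lr=0.01, gamma=0.9):
    # v_t = gamma * v_{t-1} + lr * grad; params are moved by the full v_t,
    # so repeated gradients in the same direction build up speed.
    velocity = gamma * velocity + lr * grad
    return params - velocity, velocity

# Toy example: minimize f(x) = x^2, whose gradient is 2x.
x = np.array([5.0])
v = np.zeros_like(x)
for _ in range(200):
    x, v = momentum_step(x, 2 * x, v)
# x is now close to the minimum at 0.
```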
Nesterov accelerated gradient
- Uses the idea of momentum but performs a look-ahead on the cost function, giving the updates an anticipatory, adaptive behaviour: $v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta - \gamma v_{t-1})$, then $\theta = \theta - v_t$.
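The look-ahead is easiest to see in code. A toy sketch (the `grad_fn` callable and the quadratic objective are my own illustrative choices):

```python
import numpy as np

def nag_step(params, grad_fn, velocity, lr=0.01, gamma=0.9):
    # Evaluate the gradient at the anticipated future position
    # params - gamma * velocity, instead of at the current position.
    lookahead = params - gamma * velocity
    velocity = gamma * velocity + lr * grad_fn(lookahead)
    return params - velocity, velocity

x = np.array([5.0])
v = np.zeros_like(x)
for _ in range(200):
    x, v = nag_step(x, lambda p: 2 * p, v)  # f(x) = x^2
```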
Adagrad
- Provides adaptive learning rates per parameter
- Instead of all params getting the same learning_rate for an update, params associated with infrequent features get larger updates and those associated with frequent features get smaller ones.
Let $g_{t,i} = \nabla_\theta J(\theta_{t,i})$ be the gradient at time step $t$ for the $i$'th param. Then the plain SGD update rule is:

$$\theta_{t+1,i} = \theta_{t,i} - \eta \cdot g_{t,i}$$

Now Adagrad modifies the learning rate such that instead of a static $\eta$ it is adapted per param $\theta_i$:

$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \cdot g_{t,i}$$

With $G_t \in \mathbb{R}^{d \times d}$ a diagonal matrix containing the sum of the squares of the past gradients w.r.t. all params along its diagonal (hence the $t,ii$ in $G_{t,ii}$), the above can be vectorized as an element-wise matrix-vector product between $G_t$ and $g_t$:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \odot g_t$$
- Main advantage of Adagrad: we do not need to manually tune the learning_rate.
- Main drawback: the accumulation of squared gradients in the denominator $G_t$. Since all the squares are positive, the sum keeps growing during training, which in turn shrinks the effective step size until, after a point, it becomes too small to cause meaningful updates to the weights.
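Both the per-parameter adaptation and the ever-growing denominator show up in a small sketch (the toy objective and the learning rate are my own choices for the demo; 0.01 is the commonly cited default):

```python
import numpy as np

def adagrad_step(params, grad, G, lr=0.5, eps=1e-8):
    # G accumulates squared gradients per parameter and only ever grows,
    # so the effective step lr / sqrt(G) keeps shrinking.
    G = G + grad ** 2
    return params - lr * grad / (np.sqrt(G) + eps), G

x = np.array([5.0, -3.0])  # f(x) = ||x||^2, gradient 2x
G = np.zeros_like(x)
for _ in range(500):
    x, G = adagrad_step(x, 2 * x, G)
```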
Adadelta
- Adadelta fixes this Adagrad drawback by restricting the accumulation to a window of the last $w$ gradients, which reduces the effect of the shrinking learning_rate.
- Instead of inefficiently storing $w$ previous squared gradients, they keep an exponentially decaying moving average: at each time step the previous average $E[g^2]_{t-1}$ and the current gradient give the next moving avg as:

$$E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2$$

- Now in the update term we just replace the $G_t$ from Adagrad with the above moving average:

$$\Delta\theta_t = -\frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t$$

- The denominator is just the root mean square (RMS) of the gradients, hence rewriting it as:

$$\Delta\theta_t = -\frac{\eta}{RMS[g]_t} g_t$$

- Then the authors do something fun! Dimensional analysis on the update term (where params, loss, and grads, i.e. $\partial L / \partial \theta$ units, are treated as having their own units) shows the LHS and RHS units don't match. So they balance it out by adding another RMS, this time of the param updates themselves, via a second moving average $E[\Delta\theta^2]_t = \gamma E[\Delta\theta^2]_{t-1} + (1 - \gamma) \Delta\theta_t^2$, and the update becomes:

$$\Delta\theta_t = -\frac{RMS[\Delta\theta]_{t-1}}{RMS[g]_t} g_t$$

Now this doesn't even require a default learning rate! It's auto-tuned by the ratio of the param-update RMS to the gradient RMS! Super cool!
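A minimal sketch of the two moving averages and the units-corrected step (toy objective and names are mine; note how the step size bootstraps itself from $\epsilon$, so progress on this toy is slow at first):

```python
import numpy as np

def adadelta_step(params, grad, Eg2, Edx2, rho=0.9, eps=1e-6):
    # Decaying average of squared gradients (the "window"):
    Eg2 = rho * Eg2 + (1 - rho) * grad ** 2
    # Units-corrected step: RMS of past updates over RMS of gradients.
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * grad
    # Decaying average of squared updates, used by the *next* step:
    Edx2 = rho * Edx2 + (1 - rho) * dx ** 2
    return params + dx, Eg2, Edx2

x = np.array([5.0])  # f(x) = x^2, gradient 2x
Eg2 = np.zeros_like(x)
Edx2 = np.zeros_like(x)
for _ in range(100):
    x, Eg2, Edx2 = adadelta_step(x, 2 * x, Eg2, Edx2)
# x has moved toward 0, but only slightly: the first steps are ~sqrt(eps)-sized.
```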
RMSprop
- RMSprop is just Adadelta without the units correction, and with a fixed $\gamma = 0.9$ assumed in calculating the moving avg of the squared gradients, as below:

$$E[g^2]_t = 0.9 E[g^2]_{t-1} + 0.1 g_t^2$$

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t$$
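As a sketch (toy objective and names mine; an explicit learning rate is back, unlike Adadelta):

```python
import numpy as np

def rmsprop_step(params, grad, Eg2, lr=0.01, gamma=0.9, eps=1e-8):
    # Same decaying average of squared grads as Adadelta, but the step
    # is still scaled by an explicit learning rate.
    Eg2 = gamma * Eg2 + (1 - gamma) * grad ** 2
    return params - lr * grad / (np.sqrt(Eg2) + eps), Eg2

x = np.array([5.0])  # f(x) = x^2, gradient 2x
Eg2 = np.zeros_like(x)
for _ in range(2000):
    x, Eg2 = rmsprop_step(x, 2 * x, Eg2)
```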
Adam
- Adaptive Moment Estimation (Adam)
- Like RMSprop and Adadelta, Adam also keeps track of a decaying avg of past squared grads in $v_t$, as well as of past grads (like momentum) in $m_t$:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

- $m_t$ and $v_t$ are estimates of the first moment (mean) and second moment (uncentered variance) of the gradients respectively, hence the name.
- Since the above $m_t$ and $v_t$ are vectors initialized to 0s, they are biased towards that initial 0 value (especially in early steps), hence they're de-biased by:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

- And the update rule becomes:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$
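The whole thing fits in a short sketch (toy objective and names mine; defaults are the commonly cited $\beta_1 = 0.9$, $\beta_2 = 0.999$):

```python
import numpy as np

def adam_step(params, grad, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad          # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - b1 ** t)             # bias correction (t starts at 1)
    v_hat = v / (1 - b2 ** t)
    return params - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

x = np.array([5.0])  # f(x) = x^2, gradient 2x
m = np.zeros_like(x)
v = np.zeros_like(x)
for t in range(1, 2001):
    x, m, v = adam_step(x, 2 * x, m, v, t)
```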
AdamW
- Popular notion: a larger magnitude of weights = overfitting, so controlling that is seemingly useful. To do that we have "weight decay".
- The classic way to fold it in is to add the decay term to the gradient at timestep $t$: $g_t = \nabla f(\theta_t) + \lambda \theta_t$, which is what Adam with L2 regularization effectively does.
- AdamW instead decouples the decay from the gradient and applies it directly in the parameter update:

$$\theta_{t+1} = \theta_t - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_t \right)$$

- Turns out this makes a difference in practice.
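A sketch of the decoupling (names mine; `wd` is the decay coefficient $\lambda$). The point is that the moment estimates are computed from the raw gradient only, while the decay hits the params directly:

```python
import numpy as np

def adamw_step(params, grad, m, v, t, lr=0.01, b1=0.9, b2=0.999,
               eps=1e-8, wd=0.01):
    m = b1 * m + (1 - b1) * grad          # raw gradient only: no wd here,
    v = b2 * v + (1 - b2) * grad ** 2     # so the moments never see the decay
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    # Decay is applied straight to the params, outside the adaptive part.
    return params - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * params), m, v

# Even with a zero gradient, the weights still shrink by a factor (1 - lr*wd):
x = np.array([1.0])
m = np.zeros_like(x)
v = np.zeros_like(x)
x, m, v = adamw_step(x, np.zeros_like(x), m, v, t=1)
```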
Adam Vs AdamW Insight
Basically, when Adam (with L2) decays, it decays the gradient term, which affects both the momentum estimate $m_t$ and the variance estimate $v_t$.
Because the gradient is decayed, and the momentum & variance are calculated from it, the optimizer remembers the decay. So the weight decay plays only a partial role in the update, entangled with other variables like $\beta_1$, $\beta_2$, and the grad history.
When AdamW decays, it directly affects the parameters and not the gradient term, hence causing a shrinkage irrespective of the gradients.
So AdamW is classically just doing what it says: decaying the params directly, without affecting the momentum or variance.
PS: Split-screen the docs' pseudocode for Adam and AdamW to see it for yourself.