Paper: An overview of gradient descent optimization algorithms
But I kinda prefer reading the blog post of the same:
Another great reference on the same topic from CS231n: here
This note is just for me to have a quick refresher to look up whenever things feel muddy; I'll keep adding and updating optimizers and their explanations as needed.
Gradient Descent
Batch gradient descent
```python
for i in range(nb_epochs):
    params_grad = evaluate_gradient(loss_function, data, params)
    params = params - learning_rate * params_grad
```
- Gradients are calculated once per epoch over the FULL dataset, and one update is performed.
Batch gradient descent is guaranteed to converge to the global minimum for convex error surfaces and to a local minimum for non-convex surfaces.
Stochastic Gradient Descent
```python
import numpy as np

for i in range(nb_epochs):
    np.random.shuffle(data)
    for example in data:
        params_grad = evaluate_gradient(loss_function, example, params)
        params = params - learning_rate * params_grad
```
- Gradients are calculated after every sample seen, and an update is performed per sample.
- High-variance updates; can still eventually converge to the global minimum (for convex surfaces) if learning-rate decay is used.
Mini-batch gradient descent
```python
import numpy as np

for i in range(nb_epochs):
    np.random.shuffle(data)
    for batch in get_batches(data, batch_size=50):
        params_grad = evaluate_gradient(loss_function, batch, params)
        params = params - learning_rate * params_grad
```
- Lower-variance updates compared to vanilla SGD.
- Can make use of hardware-optimized matrix multiplications with well-chosen mini-batch sizes.
Gradient Descent Optimisations
Momentum
- Amplifies the consistent direction of GD while also dampening oscillations.
- Done by adding a fraction $\gamma$ of the previous update vector when calculating the current one: $v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta)$, then $\theta = \theta - v_t$.
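A minimal NumPy sketch of the momentum update (the toy quadratic objective and the function/variable names here are mine, not from the paper):

```python
import numpy as np

def momentum_step(params, grad, velocity, lr=0.01, gamma=0.9):
    # v_t = gamma * v_{t-1} + lr * grad; params are moved by the full v_t,
    # so repeated gradients in the same direction build up speed.
    velocity = gamma * velocity + lr * grad
    return params - velocity, velocity

# Toy example: minimize f(x) = x^2, whose gradient is 2x.
x = np.array([5.0])
v = np.zeros_like(x)
for _ in range(200):
    x, v = momentum_step(x, 2 * x, v)
# x is now close to the minimum at 0.
```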
Nesterov accelerated gradient
- Uses the idea of momentum but performs a look-ahead on the cost function, giving the updates an anticipatory, adaptive behaviour: $v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta - \gamma v_{t-1})$, then $\theta = \theta - v_t$.
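The look-ahead is easiest to see in code. A toy sketch (the `grad_fn` callable and the quadratic objective are my own illustrative choices):

```python
import numpy as np

def nag_step(params, grad_fn, velocity, lr=0.01, gamma=0.9):
    # Evaluate the gradient at the anticipated future position
    # params - gamma * velocity, instead of at the current position.
    lookahead = params - gamma * velocity
    velocity = gamma * velocity + lr * grad_fn(lookahead)
    return params - velocity, velocity

x = np.array([5.0])
v = np.zeros_like(x)
for _ in range(200):
    x, v = nag_step(x, lambda p: 2 * p, v)  # f(x) = x^2
```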
Adagrad
- Provides adaptive learning rates per parameter
- Instead of all params getting the same learning_rate for an update, params associated with infrequent features get larger updates and those associated with frequent features get smaller ones.
Let $g_{t,i} = \nabla_\theta J(\theta_{t,i})$ be the gradient at time step $t$ for the $i$'th param. Then the plain SGD update rule is:

$$\theta_{t+1,i} = \theta_{t,i} - \eta \cdot g_{t,i}$$

Now Adagrad modifies the learning rate such that instead of a static $\eta$ it is adapted per param $\theta_i$:

$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \cdot g_{t,i}$$

With $G_t \in \mathbb{R}^{d \times d}$ a diagonal matrix containing the sum of the squares of the past gradients w.r.t. all params along its diagonal (hence the $t,ii$ in $G_{t,ii}$), the above can be vectorized as an element-wise matrix-vector product between $G_t$ and $g_t$:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \odot g_t$$
- Main advantage of Adagrad: we do not need to manually tune the learning_rate.
- Main drawback: the accumulation of squared gradients in the denominator $G_t$. Since all the squares are positive, the sum keeps growing during training, which in turn shrinks the effective step size until, after a point, it becomes too small to cause meaningful updates to the weights.
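Both the per-parameter adaptation and the ever-growing denominator show up in a small sketch (the toy objective and the learning rate are my own choices for the demo; 0.01 is the commonly cited default):

```python
import numpy as np

def adagrad_step(params, grad, G, lr=0.5, eps=1e-8):
    # G accumulates squared gradients per parameter and only ever grows,
    # so the effective step lr / sqrt(G) keeps shrinking.
    G = G + grad ** 2
    return params - lr * grad / (np.sqrt(G) + eps), G

x = np.array([5.0, -3.0])  # f(x) = ||x||^2, gradient 2x
G = np.zeros_like(x)
for _ in range(500):
    x, G = adagrad_step(x, 2 * x, G)
```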
Adadelta
- Adadelta fixes this Adagrad drawback by restricting the accumulation to a window of the last $w$ gradients, which reduces the effect of the shrinking learning_rate.
- Instead of inefficiently storing $w$ previous squared gradients, they keep an exponentially decaying moving average: at each time step the previous average $E[g^2]_{t-1}$ and the current gradient give the next moving avg as:

$$E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2$$

- Now in the update term we just replace the $G_t$ from Adagrad with the above moving average:

$$\Delta\theta_t = -\frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t$$

- The denominator is just the root mean square (RMS) of the gradients, hence rewriting it as:

$$\Delta\theta_t = -\frac{\eta}{RMS[g]_t} g_t$$

- Then the authors do something fun! Dimensional analysis on the update term (where params, loss, and grads, i.e. $\partial L / \partial \theta$ units, are treated as having their own units) shows the LHS and RHS units don't match. So they balance it out by adding another RMS, this time of the param updates themselves, via a second moving average $E[\Delta\theta^2]_t = \gamma E[\Delta\theta^2]_{t-1} + (1 - \gamma) \Delta\theta_t^2$, and the update becomes:

$$\Delta\theta_t = -\frac{RMS[\Delta\theta]_{t-1}}{RMS[g]_t} g_t$$

Now this doesn't even require a default learning rate! It's auto-tuned by the ratio of the param-update RMS to the gradient RMS! Super cool!
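A minimal sketch of the two moving averages and the units-corrected step (toy objective and names are mine; note how the step size bootstraps itself from $\epsilon$, so progress on this toy is slow at first):

```python
import numpy as np

def adadelta_step(params, grad, Eg2, Edx2, rho=0.9, eps=1e-6):
    # Decaying average of squared gradients (the "window"):
    Eg2 = rho * Eg2 + (1 - rho) * grad ** 2
    # Units-corrected step: RMS of past updates over RMS of gradients.
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * grad
    # Decaying average of squared updates, used by the *next* step:
    Edx2 = rho * Edx2 + (1 - rho) * dx ** 2
    return params + dx, Eg2, Edx2

x = np.array([5.0])  # f(x) = x^2, gradient 2x
Eg2 = np.zeros_like(x)
Edx2 = np.zeros_like(x)
for _ in range(100):
    x, Eg2, Edx2 = adadelta_step(x, 2 * x, Eg2, Edx2)
# x has moved toward 0, but only slightly: the first steps are ~sqrt(eps)-sized.
```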
RMSprop
- RMSprop is just Adadelta without the units correction, and with a fixed $\gamma = 0.9$ assumed in calculating the moving avg of the squared gradients, as below:

$$E[g^2]_t = 0.9 E[g^2]_{t-1} + 0.1 g_t^2$$

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t$$
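As a sketch (toy objective and names mine; an explicit learning rate is back, unlike Adadelta):

```python
import numpy as np

def rmsprop_step(params, grad, Eg2, lr=0.01, gamma=0.9, eps=1e-8):
    # Same decaying average of squared grads as Adadelta, but the step
    # is still scaled by an explicit learning rate.
    Eg2 = gamma * Eg2 + (1 - gamma) * grad ** 2
    return params - lr * grad / (np.sqrt(Eg2) + eps), Eg2

x = np.array([5.0])  # f(x) = x^2, gradient 2x
Eg2 = np.zeros_like(x)
for _ in range(2000):
    x, Eg2 = rmsprop_step(x, 2 * x, Eg2)
```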
Adam
- Adaptive Moment Estimation (Adam)
- Like RMSprop and Adadelta, Adam also keeps track of a decaying avg of past squared grads in $v_t$, as well as of past grads (like momentum) in $m_t$:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

- $m_t$ and $v_t$ are estimates of the first moment (mean) and second moment (uncentered variance) of the gradients respectively, hence the name.
- Since the above $m_t$ and $v_t$ are vectors initialized to 0s, they are biased towards that initial 0 value (especially in early steps), hence they're de-biased by:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

- And the update rule becomes:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$
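The whole thing fits in a short sketch (toy objective and names mine; defaults are the commonly cited $\beta_1 = 0.9$, $\beta_2 = 0.999$):

```python
import numpy as np

def adam_step(params, grad, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad          # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - b1 ** t)             # bias correction (t starts at 1)
    v_hat = v / (1 - b2 ** t)
    return params - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

x = np.array([5.0])  # f(x) = x^2, gradient 2x
m = np.zeros_like(x)
v = np.zeros_like(x)
for t in range(1, 2001):
    x, m, v = adam_step(x, 2 * x, m, v, t)
```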
AdamW
- Popular notion: a larger magnitude of weights = overfitting, so controlling that is seemingly useful. To do that we have "weight decay".
- The classic way to fold it in is to add the decay term to the gradient at timestep $t$: $g_t = \nabla f(\theta_t) + \lambda \theta_t$, which is what Adam with L2 regularization effectively does.
- AdamW instead decouples the decay from the gradient and applies it directly in the parameter update:

$$\theta_{t+1} = \theta_t - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_t \right)$$

- Turns out this makes a difference in practice.
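A sketch of the decoupling (names mine; `wd` is the decay coefficient $\lambda$). The point is that the moment estimates are computed from the raw gradient only, while the decay hits the params directly:

```python
import numpy as np

def adamw_step(params, grad, m, v, t, lr=0.01, b1=0.9, b2=0.999,
               eps=1e-8, wd=0.01):
    m = b1 * m + (1 - b1) * grad          # raw gradient only: no wd here,
    v = b2 * v + (1 - b2) * grad ** 2     # so the moments never see the decay
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    # Decay is applied straight to the params, outside the adaptive part.
    return params - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * params), m, v

# Even with a zero gradient, the weights still shrink by a factor (1 - lr*wd):
x = np.array([1.0])
m = np.zeros_like(x)
v = np.zeros_like(x)
x, m, v = adamw_step(x, np.zeros_like(x), m, v, t=1)
```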
Adam Vs AdamW Insight
Basically, when Adam (with L2) decays, it decays the gradient term, which affects both the momentum estimate $m_t$ and the variance estimate $v_t$.
Because the gradient is decayed, and the momentum & variance are calculated from it, the optimizer remembers the decay. So the weight decay plays only a partial role in the update, entangled with other variables like $\beta_1$, $\beta_2$, and the grad history.
When AdamW decays, it directly affects the parameters and not the gradient term, hence causing a shrinkage irrespective of the gradients.
So AdamW is classically just doing what it says: decaying the params directly, without affecting the momentum or variance.
PS: Split-screen the docs' pseudocode for Adam and AdamW to see it for yourself.