I would like to start off by putting down what I currently think optimizers do before we go deep and unpack them for what they are. I believe optimizers are the “vehicle” we ride down the loss landscape: the gradient gives us direction (the GPS) and the learning rate acts as our throttle, but the optimizer is the one that puts these two together (in different ways depending on the optimizer) and actually takes a step in the loss terrain.
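To make that metaphor concrete, here is a minimal sketch (my own toy code, not any library's implementation) of the simplest possible optimizer, plain gradient descent, combining the gradient (direction) with the learning rate (throttle) to take a step:

```python
# Toy sketch of a single gradient-descent step: the gradient supplies
# the direction, the learning rate scales the step size, and the
# "optimizer" is just the rule that combines the two.

def gd_step(params, grads, lr=0.1):
    # Move each parameter a small amount opposite its gradient.
    return [p - lr * g for p, g in zip(params, grads)]

# Toy loss L(w) = w^2, so dL/dw = 2w. Starting at w = 1.0,
# repeated steps shrink w toward the minimum at 0.
w = [1.0]
for _ in range(5):
    grads = [2 * wi for wi in w]  # gradient of w^2
    w = gd_step(w, grads, lr=0.1)
```

Every fancier optimizer (momentum, RMSprop, Adam, ...) is a variation on this one line: they change *how* the gradient and the learning rate get combined into the step.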

I will be taking a step back historically to truly grasp what optimizers mean and do: how does one think about them in a mathematically explainable way? That is the goal of this exercise.

Reading: An Overview Of Gradient Descent Optimization Algorithms

Took a short stab at reading the Adam implementation in torch; pretty interesting that they implement three versions: (1) a simple for-loop over parameters, (2) a batched for-each version, and (3) a CUDA kernel for Adam.
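All three torch versions compute the same underlying update. As a reference point, the Adam update rule itself can be sketched in a few lines of plain Python; this is my own simplified single-parameter version of the algorithm from the Adam paper, not the torch code:

```python
import math

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Update the biased first-moment (mean) and second-moment
    # (uncentered variance) estimates of the gradient.
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    # Bias-correct them (matters most in early steps, since m and v
    # are initialized at zero).
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    # Take the step: the moments pick the direction and per-parameter
    # scaling, and the learning rate sets the overall throttle.
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v
```

The torch for-loop version essentially runs logic like this per parameter tensor, the for-each version batches those tensor ops together, and the CUDA kernel fuses the whole update into one launch.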