This is part of my “Deeply Learning” repo, where I ramp up my chops both by reading seminal papers and by picking up the nitty-gritty of torch.

Repo: Pythonista7/deeply-learning

Notes on Batch Norm

When training a deep model, the distribution of each layer’s inputs keeps changing as the parameters of the preceding layers change. This forces a low learning rate (hence slower training) and careful weight initialization to keep the activations from saturating. To avoid all this: welcome batch norm.

Batch norm:

  • normalization as part of architecture.
  • normalization for each training mini-batch.
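The “normalize each mini-batch” step can be sketched in a few lines of torch (my own toy example, not code from the paper; the batch size, feature count, and eps are arbitrary):

```python
import torch

# A mini-batch of 32 samples with 4 features, deliberately shifted/scaled
x = torch.randn(32, 4) * 5 + 3

mean = x.mean(dim=0)                   # per-feature mean over the batch
var = x.var(dim=0, unbiased=False)     # per-feature (biased) variance
x_hat = (x - mean) / torch.sqrt(var + 1e-5)  # normalized activations

print(x_hat.mean(dim=0))  # approx. 0 per feature
print(x_hat.std(dim=0))   # approx. 1 per feature
```

Each feature is normalized independently using statistics of the current mini-batch only, which is exactly what makes the statistics change batch to batch during training.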

Normalization vs Regularization

Normalization : “Make the landscape smoother so SGD can walk without tripping.”

Regularization : “Don’t memorize; learn a rule that survives reality.”


Covariate Shift

The change in the distributions of layers’ inputs presents a problem because the layers need to continuously adapt to the new distribution. When the input distribution to a learning system changes, it is said to experience covariate shift (Shimodaira, 2000).

The paper argues by analogy: if a stable input distribution helps gradient descent for the network as a whole, it must also help any sub-network, i.e. the layers feeding into an internal node.

Internal Covariate Shift

We refer to the change in the distributions of internal nodes of a deep network, in the course of training, as Internal Covariate Shift.

Batch-Norm reduces Internal Covariate Shift, which in turn accelerates training of Deep Neural Nets.

All the advantages of using batch-norm

  • Higher learning rates can be used without risk of divergence
  • Improved gradient flow through the network: it removes the dependence of gradients on the scale of the parameters or their initial values!
  • Has a regularization effect and reduces need for dropout

Batch Normalization makes it possible to use saturating nonlinearities by preventing the network from getting stuck in the saturated modes.

  • ^^ (what it really means is) batch-norm keeps the inputs to an activation out of its “dead zones”, the regions where the activation’s gradient is nearly zero and so blocks the flow of information during backprop. E.g. for very large or very small inputs, the derivative of the sigmoid is very close to 0, so almost no gradient flows through to the earlier layers; the activations have become “saturated”, the gradients get into a “saturated mode”, and effective learning stalls.

    ChatGPT gave me this fun visualizer to best understand this , check it out: https://chatgpt.com/canvas/shared/6968e4cf6a188191980853cf179bda75
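You can see the saturation directly with autograd (a toy demo of mine, the input values are arbitrary): the sigmoid’s gradient is largest at 0 and collapses toward 0 for large inputs.

```python
import torch

# Gradient of sigmoid at a few points: near 0 it's healthy, far out it dies
x = torch.tensor([0.0, 2.0, 10.0], requires_grad=True)
torch.sigmoid(x).sum().backward()

print(x.grad)  # ~[0.25, 0.105, 4.5e-05]: at x=10 almost nothing flows back
```

Whatever upstream gradient arrives at the x=10 unit gets multiplied by ~4.5e-05, so the layers behind it barely learn; that is the “saturated mode”.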

Important point:

  • During training the distribution statistics for normalization are collected at a batch level, but during inference the statistics are derived from the full population instead of the mini-batch.
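torch exposes this train/inference split directly: in `train()` mode `nn.BatchNorm1d` normalizes with batch statistics while updating running (population) estimates, and in `eval()` mode it switches to those running estimates. A small sketch (the batch distribution and counts are my own arbitrary choices):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4)  # default momentum=0.1 for the running averages

bn.train()
for _ in range(100):
    # feed batches drawn from a distribution with mean ~5, std ~2
    bn(torch.randn(32, 4) * 2 + 5)

bn.eval()  # now normalization uses the accumulated population estimates
print(bn.running_mean)  # approx. 5 per feature
print(bn.running_var)   # approx. 4 per feature
```

So at inference time the layer’s output for a given input is deterministic and independent of whatever batch it happens to be grouped with.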

Read Sections 2 AND 3 to deeply understand the implementation. But the basic idea is to have a differentiable way to normalize the inputs to activations, in order to avoid vanishing/exploding grads. BN does this by forming “mini-batches” and computing basic distribution statistics (mean and variance) of each mini-batch; using those values it generates a normalized value, which is then passed through an affine map with learnable params (scale and shift, which come under the grad update). This affine map restores the representational ability that would otherwise be taken away if we just normalized.

My chat with GPT about the specifics of batch-norm, where I ask my doubts:

https://chatgpt.com/share/69694b14-3c00-800d-9a9a-38ac65e25f12