Notes on Layer Norm
Pre-read: RNN Overview
Main reason batch-norm doesn’t really work for RNNs:
the summed inputs to the recurrent neurons in a recurrent neural network (RNN) often vary with the length of the sequence so applying batch normalization to RNNs appears to require different statistics for different time-steps.
What layer-norm does:
the proposed method directly estimates the normalization statistics from the summed inputs to the neurons within a hidden layer, so the normalization does not introduce any new dependencies between training cases.
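That statistic is cheap to write down: one mean and one variance per example, taken over the hidden units, followed by the paper's learned gain g and bias b. A minimal numpy sketch (the function name and the epsilon are my additions, not the paper's code):

```python
import numpy as np

def layer_norm(a, g, b, eps=1e-5):
    """Normalize the summed inputs `a` (one row per example) over the
    hidden units, then apply learned gain `g` and bias `b`."""
    mu = a.mean(axis=-1, keepdims=True)    # per-example mean over units
    sigma = a.std(axis=-1, keepdims=True)  # per-example std over units
    return g * (a - mu) / (sigma + eps) + b

# summed inputs for a batch of 2 examples, 4 hidden units each
a = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 20.0, 30.0, 40.0]])
h = layer_norm(a, g=np.ones(4), b=np.zeros(4))
print(h.mean(axis=-1))  # each row now has mean ~0
```

Note the `axis=-1`: everything is computed within a single training case, which is exactly why no mini-batch (and no per-timestep bookkeeping) is needed.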
As per Section 2, “Background”: the main drawback of batch norm is that it needs mean and variance statistics during training. For an ordinary feed-forward NN these can be (empirically) approximated from the current mini-batch, but such approximations are not always possible for an RNN.
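The “no new dependencies between training cases” claim is easy to check numerically: a layer-normed example gives the same output regardless of what else is in the batch, while a batch-normed one does not. A quick sketch (both functions are my own simplified versions, without gain/bias):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # normalize each unit over the batch axis -> output depends on batch composition
    return (x - x.mean(axis=0)) / (x.std(axis=0) + eps)

def layer_norm(x, eps=1e-5):
    # normalize each example over its own units -> no cross-example dependency
    return (x - x.mean(axis=1, keepdims=True)) / (x.std(axis=1, keepdims=True) + eps)

np.random.seed(0)
x = np.random.randn(4, 8)   # batch of 4 examples, 8 units
small = x[:2]               # the same two examples, smaller batch

print(np.allclose(layer_norm(x)[:2], layer_norm(small)))  # True
print(np.allclose(batch_norm(x)[:2], batch_norm(small)))  # False: stats changed
```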
Important, Section 5.2: the normalization scale (σ) effectively changes the curvature / step sizes, acting like an implicit learning-rate control / stabilization mechanism as the weights grow.
Why is this so?
Learning, however, can behave very differently under different parameterizations, even though the models express the same underlying function
That gradient is taken in parameter space, not “function space.” If you change coordinates, the gradient direction and effective step size generally change.
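One concrete consequence of the Section 5 analysis: layer norm's output is invariant to rescaling the incoming weight matrix, so as the weights grow the function doesn't change but the gradient w.r.t. them shrinks, which is the implicit step-size control mentioned above. A numpy sketch of the invariance half (names are mine; epsilon is deliberately omitted to exhibit the exact invariance):

```python
import numpy as np

def ln(a):
    # no epsilon term: purely to show the exact scale invariance
    return (a - a.mean()) / a.std()

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))  # hypothetical weight matrix
x = rng.standard_normal(8)       # one input vector

# scaling W rescales the summed inputs x @ W, but mean and std rescale by
# the same factor, so the normalized output is unchanged
print(np.allclose(ln(x @ W), ln(x @ (10 * W))))  # True
```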
When is it not great?
They basically admit LN underperforms BN on convnets in their prelim experiments, and give a plausible reason: units near image boundaries have very different activation statistics, breaking the “all units in a layer behave similarly” assumption that LN leans on.
There are a lot of experimental findings I don’t want to rabbit-hole into for now. The core idea is captured.