Notes on Layer Norm
Pre-read: RNN Overview
Main reason batch-norm doesn’t really work for RNNs:
the summed inputs to the recurrent neurons in a recurrent neural network (RNN) often vary with the length of the sequence so applying batch normalization to RNNs appears to require different statistics for different time-steps.
What layer-norm does:
the proposed method directly estimates the normalization statistics from the summed inputs to the neurons within a hidden layer, so the normalization does not introduce any new dependencies between training cases.
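That statistic is cheap to write down: one mean and one variance per example, taken over the hidden units, followed by the paper's learned gain g and bias b. A minimal numpy sketch (the function name and the epsilon are my additions, not the paper's code):

```python
import numpy as np

def layer_norm(a, g, b, eps=1e-5):
    """Normalize the summed inputs `a` (one row per example) over the
    hidden units, then apply learned gain `g` and bias `b`."""
    mu = a.mean(axis=-1, keepdims=True)    # per-example mean over units
    sigma = a.std(axis=-1, keepdims=True)  # per-example std over units
    return g * (a - mu) / (sigma + eps) + b

# summed inputs for a batch of 2 examples, 4 hidden units each
a = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 20.0, 30.0, 40.0]])
h = layer_norm(a, g=np.ones(4), b=np.zeros(4))
print(h.mean(axis=-1))  # each row now has mean ~0
```

Note the `axis=-1`: everything is computed within a single training case, which is exactly why no mini-batch (and no per-timestep bookkeeping) is needed.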
As per Section 2, “Background”: the main drawback of batch norm is that it needs mean and variance statistics during training. For an ordinary feed-forward NN these can be (empirically) approximated from the current mini-batch, but such approximations are not always possible for an RNN.
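The “no new dependencies between training cases” claim is easy to check numerically: a layer-normed example gives the same output regardless of what else is in the batch, while a batch-normed one does not. A quick sketch (both functions are my own simplified versions, without gain/bias):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # normalize each unit over the batch axis -> output depends on batch composition
    return (x - x.mean(axis=0)) / (x.std(axis=0) + eps)

def layer_norm(x, eps=1e-5):
    # normalize each example over its own units -> no cross-example dependency
    return (x - x.mean(axis=1, keepdims=True)) / (x.std(axis=1, keepdims=True) + eps)

np.random.seed(0)
x = np.random.randn(4, 8)   # batch of 4 examples, 8 units
small = x[:2]               # the same two examples, smaller batch

print(np.allclose(layer_norm(x)[:2], layer_norm(small)))  # True
print(np.allclose(batch_norm(x)[:2], batch_norm(small)))  # False: stats changed
```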
Important, Section 5.2: the normalization scale (σ) effectively changes the curvature / step sizes, acting like an implicit learning-rate control / stabilization mechanism as the weights grow.
Why is this so?
Learning, however, can behave very differently under different parameterizations, even though the models express the same underlying function
That gradient is taken in parameter space, not “function space.” If you change coordinates, the gradient direction and effective step size generally change.
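One concrete consequence of the Section 5 analysis: layer norm's output is invariant to rescaling the incoming weight matrix, so as the weights grow the function doesn't change but the gradient w.r.t. them shrinks, which is the implicit step-size control mentioned above. A numpy sketch of the invariance half (names are mine; epsilon is deliberately omitted to exhibit the exact invariance):

```python
import numpy as np

def ln(a):
    # no epsilon term: purely to show the exact scale invariance
    return (a - a.mean()) / a.std()

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))  # hypothetical weight matrix
x = rng.standard_normal(8)       # one input vector

# scaling W rescales the summed inputs x @ W, but mean and std rescale by
# the same factor, so the normalized output is unchanged
print(np.allclose(ln(x @ W), ln(x @ (10 * W))))  # True
```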
When is it not great?
They basically admit LN underperforms BN on convnets in their prelim experiments, and give a plausible reason: units near image boundaries have very different activation statistics, breaking the “all units in a layer behave similarly” assumption that LN leans on.
There are a lot of experimental findings I don’t want to rabbit-hole into for now. The core idea is captured.