## 0. Experiment Meta

- **Experiment Name:** Pre vs Post Norms, With & Without Warmups
- **Category:** Architecture
- **Date:** 25/03/2026
- **Code:** Github


## 1. Objective

What single question is this experiment answering?

How does norm-layer placement affect training stability, and how does the effect behave across models of different depths? We run an ablation study to verify gradient stability across model depths, and to see how it is affected by the warmup schedule advised in the post-norm paper, which we expect to have little or no relevance once we move to pre-norm.


## 2. Hypothesis

**Mechanism**: As per This Paper, the claim is that with post-norm the gradients of the last layers are many times larger than those at the shallow layers, rendering the model and further training unstable right from the first run, with the effect only compounding as we train the model more. Concretely:

At initialisation, the expected squared gradient norm of the second FFN weight matrix $W_l^{(2)}$ in a Post-LN stack grows linearly with the layer index $l$, so the deepest layers of an $N$-layer model see the largest gradients:

$$\mathbb{E}\left[\left\lVert \frac{\partial \mathcal{L}}{\partial W_l^{(2)}} \right\rVert^2 \right] = O(l)$$
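As a sanity check on this claim, per-layer gradient norms at initialisation can be probed with a toy stack of norm-configurable blocks. This is a minimal sketch with made-up dimensions, not the experiment code; absolute magnitudes depend heavily on width, init, and the presence of attention, so only the qualitative shape is meaningful:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Toy residual FFN block; `pre_norm` toggles where LayerNorm sits."""
    def __init__(self, d_model: int, pre_norm: bool):
        super().__init__()
        self.pre_norm = pre_norm
        self.ln = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        if self.pre_norm:                 # Pre-LN: x + F(LN(x))
            return x + self.ff(self.ln(x))
        return self.ln(x + self.ff(x))    # Post-LN: LN(x + F(x))

def grad_norms_at_init(pre_norm: bool, depth: int = 12, d_model: int = 64):
    torch.manual_seed(0)
    blocks = nn.ModuleList(Block(d_model, pre_norm) for _ in range(depth))
    h = torch.randn(8, d_model)
    for b in blocks:
        h = b(h)
    h.pow(2).mean().backward()            # dummy loss, just to populate grads
    # norm of the gradient of each block's second FF weight, i.e. W_l^{(2)}
    return [b.ff[2].weight.grad.norm().item() for b in blocks]

post = grad_norms_at_init(pre_norm=False)
pre = grad_norms_at_init(pre_norm=True)
print("post-LN:", [f"{g:.2e}" for g in post])
print("pre-LN: ", [f"{g:.2e}" for g in pre])
```

Plotting the returned lists against layer index gives the layer-index vs grad-norm curve referenced below.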

But when we use pre-norm, the gradient magnitude is $O(1)$ irrespective of depth, which is why pre-norm provides much more stable gradients for updates.

**Prediction**: We should be able to reproduce the plot from the paper showing layer index vs layer grad norm, with high-magnitude gradients concentrated in the final layers of the bigger Post-LN models.

---

## 3. Controlled Variables (Frozen)

Everything below must remain identical across runs:

- Dataset: Wikitext-2
- Tokenizer config: BPE (same as previous baseline experiments)
- Model config (layers, d_model, heads, etc.): 128
- Sequence length: 128
- Effective tokens per run: `3,693,440`
- Optimizer: `adam`
- Weight decay: 0
- Dropout: 0.1
- Scheduler shape: Linear
- Mixed precision setting: `BF16`
- Hardware (GPU type): `A100`
- Seed policy: `42`

---

## 4. Independent Variable(s)

| Variable      | Values                        |
| ------------- | ----------------------------- |
| Layer Depth   | [`shallow`(5), `deep`(20)]    |
| LR_Warmup     | [`0`, `1e-4 -> 1e-3`]         |
| Norm Position | [`pre`, `post`]               |

---

## 5. Experimental Design

### Budget

- Tokens per run: `3,693,440`
- Steps per run: `902`
- Max VRAM allowed: `40GB`

### Evaluation Protocol

- Train loss
- Validation loss
- Perplexity
- Per-layer grad norm
- Gradient norm (global)
- BPB

---

## 6. Metrics to Log

### Core Metrics

- Grad norm per layer
- Global grad norm
- Validation loss
- Validation PPL
- Validation BPB

### Stability Metrics

- Global gradient norm
- Loss spikes

---

## 7. Expected Outcomes

- Post-norm deeper models should see a spike in layer grad norm as depth increases.
- This depth-dependent gradient spiking should not appear in the `pre`-norm regime, leading to more stable and better training, achieving lower loss and loss-dependent metrics (PPL, BPB).

---

## 8. Results

### Raw Observations

![](https://r2.ashwinms.com/assets/tel/phase_0/media_images_avg_layer_grad_norm_vs_layer_index_0_450e4057b3c10c0b5097.png)

### Quantitative Summary

| Run                       | Layers | d_model | Warmup | val_loss  | val_bpb   |
| ------------------------- | ------ | ------- | ------ | --------- | --------- |
| post-ln-shallow-warmup0   | 4      | 128     | 0      | 3.689     | 2.015     |
| post-ln-shallow-warmup500 | 4      | 128     | 500    | 3.397     | 1.855     |
| pre-ln-shallow-warmup0    | 4      | 128     | 0      | **3.381** | **1.846** |
| post-ln-deep-warmup0      | 12     | 512     | 0      | 4.401     | 2.403     |
| post-ln-deep-warmup500    | 12     | 512     | 500    | 4.420     | 2.414     |
| pre-ln-deep-warmup0       | 12     | 512     | 0      | **3.679** | **2.009** |

---

## 9. Interpretation

**Post-LN deep, no warmup**: the final layer is clearly doing all the work/learning, while the initial layers are barely updating their weights; gradients are close to 0 in the shallow layers of the deep Post-LN model. This matches what we expect in the post-norm case, and it is also evident in the model's val_loss numbers.

| Layer | Grad norm |
| ----- | --------- |
| 0     | 7.2e-12   |
| 3     | 6.1e-10   |
| 6     | 2.6e-7    |
| 9     | 9.7e-4    |
| 11    | **0.223** |

**Post-LN deep, 500 warmup**: this model actually performed slightly worse than the one without warmup. One could argue that it spent half its steps in warmup and did not receive enough post-warmup steps, but that itself is indicative of the slow convergence rate when using warmup: it essentially slows down model training in exchange for slightly better stability.

**Pre-LN deep, no warmup**: a stable gradient regime throughout the layers. The only spike is at layer 0, which is expected; the remaining layers show a stable average grad norm of ~0.002.

> Why is a spiky, high gradient at layer 0 expected in pre-LN?
> This is because in pre-LN the residual stream acts as a clean additive highway: gradients flow backwards along it and accumulate at layer 0. In addition, the layer-0 LN normalises the raw token and positional embeddings, which have never been normalised before.

**Shallow models**: barely show any difference in gradients or performance, as predicted, since the expected effects compound with model depth.

---

## 10. Decision

Going forward we will use pre-norm transformers only. We should, however, also keep an eye on how model width and head dims interact with depth, and not just on how norm placement affects gradients.
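For reference, the per-layer grad-norm metric logged in these runs can be collected with a small helper along these lines. This is a hypothetical sketch (the helper name and the stand-in model are made up), not the actual experiment code:

```python
import torch
import torch.nn as nn

def per_layer_grad_norms(model: nn.Module) -> dict:
    """L2 norm of the gradient of every parameter tensor, keyed by name.

    Call after loss.backward(); grouping names by block index then gives
    one scalar per layer for the layer-index vs grad-norm plot.
    """
    return {
        name: p.grad.norm(2).item()
        for name, p in model.named_parameters()
        if p.grad is not None
    }

# usage sketch with a stand-in two-layer model
model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8))
model(torch.randn(4, 8)).sum().backward()
norms = per_layer_grad_norms(model)
print(norms)
```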