## 0. Experiment Meta

- **Experiment Name:** Pre vs Post Norms, With & Without Warmups
- **Category:** Architecture
- **Date:** 25/03/2026
- **Code:** Github


## 1. Objective

What single question is this experiment answering?

How does norm-layer placement affect training stability, and how does the effect behave across models of different depths? We run an ablation study to verify gradient stability across model depths, and to see how it is affected by the warmup schedule advised in the post-norm paper, which we expect to have little or no relevance once we move to pre-norm.


## 2. Hypothesis

**Mechanism**: As per This Paper, the claim is that with post-norm the gradients of the last layers are many times larger than those at the shallow layers, rendering the model and further training unstable right from the first run, with the effect only compounding as we train the model more. Concretely:

At initialisation, the expected squared gradient norm of the second FFN weight matrix $W_l^{(2)}$ in a Post-LN stack grows linearly with the layer index $l$, so the deepest layers of an $N$-layer model see the largest gradients:

$$\mathbb{E}\left[\left\lVert \frac{\partial \mathcal{L}}{\partial W_l^{(2)}} \right\rVert^2 \right] = O(l)$$
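As a sanity check on this claim, per-layer gradient norms at initialisation can be probed with a toy stack of norm-configurable blocks. This is a minimal sketch with made-up dimensions, not the experiment code; absolute magnitudes depend heavily on width, init, and the presence of attention, so only the qualitative shape is meaningful:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Toy residual FFN block; `pre_norm` toggles where LayerNorm sits."""
    def __init__(self, d_model: int, pre_norm: bool):
        super().__init__()
        self.pre_norm = pre_norm
        self.ln = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        if self.pre_norm:                 # Pre-LN: x + F(LN(x))
            return x + self.ff(self.ln(x))
        return self.ln(x + self.ff(x))    # Post-LN: LN(x + F(x))

def grad_norms_at_init(pre_norm: bool, depth: int = 12, d_model: int = 64):
    torch.manual_seed(0)
    blocks = nn.ModuleList(Block(d_model, pre_norm) for _ in range(depth))
    h = torch.randn(8, d_model)
    for b in blocks:
        h = b(h)
    h.pow(2).mean().backward()            # dummy loss, just to populate grads
    # norm of the gradient of each block's second FF weight, i.e. W_l^{(2)}
    return [b.ff[2].weight.grad.norm().item() for b in blocks]

post = grad_norms_at_init(pre_norm=False)
pre = grad_norms_at_init(pre_norm=True)
print("post-LN:", [f"{g:.2e}" for g in post])
print("pre-LN: ", [f"{g:.2e}" for g in pre])
```

Plotting the returned lists against layer index gives the layer-index vs grad-norm curve referenced below.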

But when we use pre-norm, the gradient magnitude is $O(1)$ irrespective of depth, which is why pre-norm provides much more stable gradients for updates.

**Prediction**: We should be able to reproduce the plot from the paper showing layer index vs layer grad norm, with high-magnitude gradients concentrated in the final layers of the bigger Post-LN models.

---

## 3. Controlled Variables (Frozen)

Everything below must remain identical across runs:

- Dataset: Wikitext-2
- Tokenizer config: BPE (same as previous baseline experiments)
- Model config (layers, d_model, heads, etc.): 128
- Sequence length: 128
- Effective tokens per run: `3,693,440`
- Optimizer: `adam`
- Weight decay: 0
- Dropout: 0.1
- Scheduler shape: Linear
- Mixed precision setting: `BF16`
- Hardware (GPU type): `A100`
- Seed policy: `42`

---

## 4. Independent Variable(s)

| Variable      | Values                        |
| ------------- | ----------------------------- |
| Layer Depth   | [`shallow`(5), `deep`(20)]    |
| LR_Warmup     | [`0`, `1e-4 -> 1e-3`]         |
| Norm Position | [`pre`, `post`]               |

---

## 5. Experimental Design

### Budget

- Tokens per run: `3,693,440`
- Steps per run: `902`
- Max VRAM allowed: `40GB`

### Evaluation Protocol

- Train loss
- Validation loss
- Perplexity
- Per-layer grad norm
- Gradient norm (global)
- BPB

---

## 6. Metrics to Log

### Core Metrics

- Grad norm per layer
- Global grad norm
- Validation loss
- Validation PPL
- Validation BPB

### Stability Metrics

- Global gradient norm
- Loss spikes

---

## 7. Expected Outcomes

- Post-norm deeper models should see a spike in layer grad norm as depth increases.
- This depth-dependent gradient spiking should not appear in the `pre`-norm regime, leading to more stable and better training, achieving lower loss and loss-dependent metrics (PPL, BPB).

---

## 8. Results

### Raw Observations

![](https://r2.ashwinms.com/assets/tel/phase_0/media_images_avg_layer_grad_norm_vs_layer_index_0_450e4057b3c10c0b5097.png)

### Quantitative Summary

| Run                       | Layers | d_model | Warmup | val_loss  | val_bpb   |
| ------------------------- | ------ | ------- | ------ | --------- | --------- |
| post-ln-shallow-warmup0   | 4      | 128     | 0      | 3.689     | 2.015     |
| post-ln-shallow-warmup500 | 4      | 128     | 500    | 3.397     | 1.855     |
| pre-ln-shallow-warmup0    | 4      | 128     | 0      | **3.381** | **1.846** |
| post-ln-deep-warmup0      | 12     | 512     | 0      | 4.401     | 2.403     |
| post-ln-deep-warmup500    | 12     | 512     | 500    | 4.420     | 2.414     |
| pre-ln-deep-warmup0       | 12     | 512     | 0      | **3.679** | **2.009** |

---

## 9. Interpretation

**Post-LN deep, no warmup**: the final layer is clearly doing all the work/learning, while the initial layers are barely updating their weights; gradients are close to 0 in the shallow layers of the deep Post-LN model. This matches what we expect in the post-norm case, and it is also evident in the model's val_loss numbers.

| Layer | Grad norm |
| ----- | --------- |
| 0     | 7.2e-12   |
| 3     | 6.1e-10   |
| 6     | 2.6e-7    |
| 9     | 9.7e-4    |
| 11    | **0.223** |

**Post-LN deep, 500 warmup**: this model actually performed slightly worse than the one without warmup. One could argue that it spent half its steps in warmup and did not receive enough post-warmup steps, but that itself is indicative of the slow convergence rate when using warmup: it essentially slows down model training in exchange for slightly better stability.

**Pre-LN deep, no warmup**: a stable gradient regime throughout the layers. The only spike is at layer 0, which is expected; the remaining layers show a stable average grad norm of ~0.002.

> Why is a spiky, high gradient at layer 0 expected in pre-LN?
> This is because in pre-LN the residual stream acts as a clean additive highway: gradients flow backwards along it and accumulate at layer 0. In addition, the layer-0 LN normalises the raw token and positional embeddings, which have never been normalised before.

**Shallow models**: barely show any difference in gradients or performance, as predicted, since the expected effects compound with model depth.

---

## 10. Decision

Going forward we will use pre-norm transformers only. We should, however, also keep an eye on how model width and head dims interact with depth, and not just on how norm placement affects gradients.
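For reference, the per-layer grad-norm metric logged in these runs can be collected with a small helper along these lines. This is a hypothetical sketch (the helper name and the stand-in model are made up), not the actual experiment code:

```python
import torch
import torch.nn as nn

def per_layer_grad_norms(model: nn.Module) -> dict:
    """L2 norm of the gradient of every parameter tensor, keyed by name.

    Call after loss.backward(); grouping names by block index then gives
    one scalar per layer for the layer-index vs grad-norm plot.
    """
    return {
        name: p.grad.norm(2).item()
        for name, p in model.named_parameters()
        if p.grad is not None
    }

# usage sketch with a stand-in two-layer model
model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8))
model(torch.randn(4, 8)).sum().backward()
norms = per_layer_grad_norms(model)
print(norms)
```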