0. Experiment Meta
Experiment Name: Optimizer Ablation Study - Adam vs AdamW
Category: Optimizer
Date: 02/03/2026
Commit Hash / Branch: 182359f
Seed(s): 42
Pre-Read And Notes: On Optimizers
1. Objective
What single question is this experiment answering?
Compare and understand how different optimizers behave with weight decay, and try to resolve some mysteries from the previous experiment.
While it is mostly obvious from the literature which optimizer performs best on such tasks, the goal here is to understand how weight decay affects our optimizer and to make sure our specific setup/runs don't hit issues involving the LR, optimizer, or schedule.
We will be performing an ablation with:
- 2 optimizers (Adam and AdamW) [ 2 ]
- 1 learning rate (`3e-5`)
- with and without weight decay [ 2 ]

So basically:
- Adam, wd=0
- Adam, wd=wd₀
- AdamW, wd=0
- AdamW, wd=wd₀

And then:
- 1 run with SGD @ lr=`3e-5`, just to uncover the mysteries [ +1 ]
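For clarity, the full run grid can be enumerated with a quick sketch (the dict keys and the wd₀ value of 0.1 are illustrative placeholders; wd₀ is pinned down in §4):

```python
from itertools import product

# Hypothetical enumeration of the ablation grid described above.
optimizers = ["Adam", "AdamW"]
weight_decays = [0.0, 0.1]   # wd0 = 0.1 (see section 4)
lr = 3e-5                    # the single learning rate used everywhere

runs = [{"optim": o, "wd": wd, "lr": lr} for o, wd in product(optimizers, weight_decays)]
runs.append({"optim": "SGD", "wd": 0.0, "lr": lr})  # the extra diagnostic run

print(len(runs))  # 2 x 2 ablation + 1 SGD run
```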
2. Hypothesis
Q1: Does weight decay really help in learning ?
Weight decay should mainly have a regularizing effect, in the sense that it keeps param values from growing, which can otherwise lead to overfitting. This can be measured by calculating the param norm |θ|. If weight decay is working, the model should see a stable/shrinking value for |θ|; but if |θ| keeps increasing, then weight decay has no effect in this regime.
Q2: If so then how is the decay mechanism of Adam different from AdamW ?
Adam applies weight decay by folding an L2 penalty into the gradients (which doesn't really have a true L2 effect, see Optim-Overview) and then uses that gradient to compute the momentum and variance estimates, which dilutes the effectiveness of weight decay. AdamW preserves it by directly shrinking the params, which serves as a better/stronger regularization helping learning.
So when we look at |θ| for Adam vs AdamW, the plots should diverge with weight decay on, with AdamW having the lower value for |θ|. But when weight decay is 0, they are likely to be identical.
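The difference can be made concrete with a minimal single-scalar sketch (pure Python, not our training code): Adam folds the decay term into the gradient before the moment estimates see it, while AdamW applies it directly to the parameter after the adaptive step.

```python
import math

def adam_step(theta, grad, m, v, t, lr=3e-5, b1=0.9, b2=0.999,
              eps=1e-8, wd=0.0, decoupled=False):
    """One optimizer step on a single scalar parameter.

    decoupled=False -> Adam: the L2 term enters m and v, so it gets
                       normalized away by the adaptive denominator.
    decoupled=True  -> AdamW: decay is applied directly to the parameter.
    """
    if not decoupled:
        grad = grad + wd * theta               # L2 penalty folded into grad
    m = b1 * m + (1 - b1) * grad               # first moment
    v = b2 * v + (1 - b2) * grad * grad        # second moment
    m_hat = m / (1 - b1 ** t)                  # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        theta = theta - lr * wd * theta        # decoupled weight decay
    return theta, m, v

def run(decoupled, wd, steps=1000):
    """Run 1000 steps on a toy constant gradient, from theta = 1.0."""
    theta, m, v = 1.0, 0.0, 0.0
    for t in range(1, steps + 1):
        theta, m, v = adam_step(theta, 0.01, m, v, t, wd=wd, decoupled=decoupled)
    return theta
```

On this toy, both variants coincide exactly at wd=0, while at wd=0.1 the decoupled variant ends with a strictly smaller weight, since Adam's adaptive denominator mostly cancels the penalty it just added.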
3. Controlled Variables (Frozen)
Everything below must remain identical across runs:
- Dataset: Wikitext-2
- Tokenizer config: custom BPE `encode()` which uses the pre-trained vocab from the HF dataset
- Model config (layers, d_model, heads, etc.): # TODO
- Sequence length: 1024
- Batch size: 20
- Learning rate: 3e-5
- Effective tokens per run: 4.1M tokens
- Optimizer: varied, see §4 Independent Variable(s)
- Weight decay: varied, see §4 Independent Variable(s)
- Dropout: 0.1
- Scheduler shape: None
- Mixed precision setting: BF16
- Hardware (GPU type): A100
- Seed policy: 42
4. Independent Variable(s)
What are we changing?
| Variable | Values |
|---|---|
| Optimizers | [ Adam, AdamW ] |
| Weight Decay | [0, 0.1] |
Why decay=`0.1`? Because we only have 1000 steps to see the effect, and setting it too small will not show any meaningful deviation in experiments, especially since the decay is multiplied by the LR, leaving the per-step shrinkage at a scale of roughly 3e-6 (lr × wd = 3e-5 × 0.1) even when we pick `0.1`.
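Back-of-the-envelope check on that scale, assuming the decoupled AdamW form where each step multiplies a weight by (1 − lr·wd):

```python
lr, wd = 3e-5, 0.1
per_step = lr * wd                     # fraction of each weight removed per step
steps = 1000
retained = (1.0 - per_step) ** steps   # fraction of the weight left after the run
print(per_step, retained)
```

Even at wd=0.1, only about 0.3% of the weight magnitude is decayed away over the whole 1000-step run, so anything smaller would be invisible at this budget.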
5. Experimental Design
Budget
- Tokens per run: 4.1M on Wikitext-2
- Batch size: 20
- Seq len: 1024
- Tokens per step: 20 * 1024 = 20,480
- Steps per run: 200 steps per epoch @ 5 epochs = 1000 steps
- Sanity check: 4.1M tokens at 20,480 tokens per step works out to 200 steps per epoch
- Expected wall-clock: UNK
- Max VRAM allowed: 40GB
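The budget arithmetic above (treating the 4.1M figure as the per-epoch token count, consistent with 200 steps/epoch) checks out:

```python
batch_size, seq_len = 20, 1024
tokens_per_step = batch_size * seq_len            # 20 * 1024 = 20,480
tokens_per_epoch = 4_100_000                      # ~4.1M Wikitext-2 tokens
steps_per_epoch = tokens_per_epoch // tokens_per_step
epochs = 5
total_steps = steps_per_epoch * epochs
print(tokens_per_step, steps_per_epoch, total_steps)
```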
6. Metrics to Log
Core Metrics
- Train loss
- Validation loss
- Perplexity
- Peak VRAM
- Gradient norm (global)
NEW
- Parameter norm
- Per-layer param norm
- Update-to-weight ratio
Stability Metrics
- Global gradient norm
- Loss spikes
Optional Diagnostics
- Activation norms (first/middle/last layer)
- LN weight/bias gradient norms
- Attention entropy
| What we’re testing | Leading indicator | Lagging indicator |
|---|---|---|
| Weight decay working | |θ| shrinking over time | Final val loss |
| AdamW vs Adam decay difference | Per-layer |θ| diverging between runs | Generalization gap |
| Optimizer stability | Grad norm variance | Loss spikes |
| Update efficiency | |Δθ| / |θ| | Convergence speed |
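A framework-agnostic sketch of the three new metrics (in the real run these loops would go over `model.parameters()`; here parameters are plain flat lists so the arithmetic is explicit):

```python
import math

def global_norm(tensors):
    """Global L2 norm across a list of flat parameter (or gradient) lists."""
    return math.sqrt(sum(x * x for t in tensors for x in t))

def per_layer_norms(named_tensors):
    """Per-layer |theta|, keyed by layer name."""
    return {name: math.sqrt(sum(x * x for x in t))
            for name, t in named_tensors.items()}

def update_to_weight_ratio(old_params, new_params):
    """|delta theta| / |theta|: size of the optimizer step relative to the weights."""
    delta = [[n - o for o, n in zip(ot, nt)]
             for ot, nt in zip(old_params, new_params)]
    return global_norm(delta) / global_norm(old_params)
```

For example, `global_norm([[3.0], [4.0]])` gives 5.0, and a step that moves one weight of a (3, 4) layer by 0.05 gives an update-to-weight ratio of about 0.01.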
7. Expected Outcomes
The run with no weight decay is just a sanity baseline and nothing else; both optimizers should behave identically under no weight decay.
Looking at the global and per-layer |θ| should make it evident that AdamW is more effective at maintaining the scale of params across layers, having a better regularizing effect especially on the later/final layers, which receive higher-magnitude grads due to their proximity to the loss function and have a higher chance of overfitting due to their larger parameter scale.
The update-to-weight ratio should show a divergence between Adam and AdamW, indicating the diminished effect of weight decay in Adam vs AdamW.
8. Results
Raw Observations

Well, this happened because I tried to clone the entire set of model params on the GPU to calculate param norms for reporting, which isn't a great idea; applying a patch to move/stream values over and do the metrics reporting from CPU tensors.
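The patch described above amounts to accumulating the squared norm one tensor at a time instead of cloning everything at once. A minimal pure-Python stand-in (assuming each yielded item is one parameter tensor already moved to CPU):

```python
import math

def streamed_param_norm(param_iter):
    """Accumulate |theta|^2 incrementally; never materializes a full model copy."""
    total_sq = 0.0
    for p in param_iter:                   # one (CPU) tensor at a time
        total_sq += sum(x * x for x in p)  # only a scalar survives each step
    return math.sqrt(total_sq)
```

In PyTorch this would accumulate `p.detach().pow(2).sum().item()` per parameter, so peak extra memory is one tensor rather than the whole model.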
For the detailed results see the wandb report below.
9. Interpretation
See my wandb report for this here
What do we adopt going forward?
Weight decay done properly, as in AdamW, does help models converge faster, avoid overfitting, and improves the overall learning dynamics. But an experiment at this scale is insufficient without trying multiple other learning rates and training for substantially longer. Still, from what we have observed and gathered, it is fair to say we can confidently adopt AdamW, and we have learnt to track and measure the core signals that help detect and correct any issues that may arise wrt the optimizer or update dynamics during training!
10. Follow-up Questions
- What remains unclear?
See the “What we don't know” section of the wandb report above.