0. Experiment Meta

Experiment Name: Optimizer Ablation Study - Adam vs AdamW

Category: Optimizer

Date: 02/03/2026

Commit Hash / Branch: 182359f

Seed(s): 42

Pre-Read And Notes: On Optimizers


1. Objective

What single question is this experiment answering?

Compare and understand how different optimizers handle weight decay, and try to resolve some mysteries from the previous experiment.

While the literature makes it fairly clear which optimizer performs best on such tasks, the goal here is to understand how weight decay affects our optimizer and to make sure our specific setup/runs don’t hit issues involving the LR/optimizer/schedule.

We will be performing an ablation with

  • 2 optimizers (Adam and AdamW) [ 2 ]
  • 1 learning-rate ( 3e-5 )
  • With and without weight decay [ 2 ]

So basically

  1. Adam, wd=0
  2. Adam, wd=wd₀
  3. AdamW, wd=0
  4. AdamW, wd=wd₀

And then,

  • 1 run with SGD @ lr=3e-5 just to uncover the mysteries [ +1 ]

2. Hypothesis

Q1: Does weight decay really help learning?

Weight decay should mainly have a regularizing effect, in the sense that it keeps parameter values from growing, which can otherwise lead to overfitting. This can be measured by tracking the parameter norm ‖θ‖. If weight decay is working, the model should show a stable or shrinking ‖θ‖; if ‖θ‖ keeps increasing, weight decay is having no effect in this regime.
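To make that measurable, the global parameter norm can be logged each eval step. A minimal PyTorch sketch, assuming the model is a standard `nn.Module` (the tiny `nn.Linear` stands in for the real model):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def global_param_norm(model: nn.Module) -> float:
    """Global L2 norm over all parameters: sqrt(sum_i ||p_i||^2)."""
    sq_sum = sum(p.pow(2).sum().item() for p in model.parameters())
    return sq_sum ** 0.5

# quick sanity check against a flattened computation
torch.manual_seed(0)
model = nn.Linear(8, 4)
flat = torch.cat([p.flatten() for p in model.parameters()])
print(global_param_norm(model), flat.norm().item())
```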

Q2: If so, how does Adam’s decay mechanism differ from AdamW’s?

Adam applies weight decay by adding an L2 term to the gradients (which doesn’t actually have a true L2 effect; see Optim-Overview) and then feeds that into the momentum and variance estimates, which dilutes the effectiveness of the decay. AdamW preserves it by shrinking the parameters directly, which acts as a stronger regularizer and helps learning.
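The difference can be sketched with a single-scalar update (a minimal illustration, not the exact PyTorch implementation; hyperparameter names are the usual Adam ones):

```python
import math

def adam_like_step(theta, grad, m, v, t,
                   lr=3e-5, b1=0.9, b2=0.999, eps=1e-8, wd=0.1, decoupled=False):
    """One bias-corrected Adam/AdamW step on a scalar, showing where decay enters."""
    if not decoupled:
        grad = grad + wd * theta              # Adam: L2 term folded into the gradient,
                                              # so it also passes through m and v below
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    m_hat = m / (1 - b1 ** t)                 # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        theta = theta - lr * wd * theta       # AdamW: plain multiplicative shrink,
                                              # untouched by the adaptive normalization
    return theta, m, v

# With zero gradient, Adam's decay term still gets rescaled by 1/sqrt(v_hat),
# while AdamW shrinks theta by exactly lr * wd.
adam_theta, _, _ = adam_like_step(1.0, 0.0, 0.0, 0.0, t=1, decoupled=False)
adamw_theta, _, _ = adam_like_step(1.0, 0.0, 0.0, 0.0, t=1, decoupled=True)
print(adam_theta, adamw_theta)
```

Note how in the Adam branch the decay term is normalized by 1/√v̂, so its effective strength depends on the gradient history, whereas AdamW’s shrink is always exactly lr · wd · θ.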

So when we plot ‖θ‖ for Adam vs AdamW, the curves should diverge when weight decay is on, with AdamW ending up with the lower ‖θ‖. With weight decay at 0, they should be practically identical.


3. Controlled Variables (Frozen)

Everything below must remain identical across runs:

  • Dataset: Wikitext-2

  • Tokenizer config: Custom BPE encode() which uses the pre-trained vocab from the HF dataset.

  • Model config (layers, d_model, heads, etc):

# TODO
  • Sequence length: 1024

  • Batch Size : 20

  • Learning rate: 3e-5

  • Effective tokens per run: 4.1M tokens

  • Optimizer: varied — see 4. Independent Variable(s)

  • Weight decay: varied — see 4. Independent Variable(s)

  • Dropout: 0.1

  • Scheduler shape: None

  • Mixed precision setting: BF16

  • Hardware (GPU type): A100

  • Seed policy: 42


4. Independent Variable(s)

What are we changing?

  Variable     | Values
  -------------|---------------
  Optimizers   | [Adam, AdamW]
  Weight decay | [0, 0.1]

Why decay = 0.1? Because we only have 1000 steps to see the effect, and setting it too small will not show any meaningful deviation between runs, especially since the decay is multiplied by the LR, leaving the decay updates on a scale of lr · wd = 3e-5 × 0.1 = 3e-6 per step even when we pick 0.1.
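The 2×2 grid above can be instantiated directly with `torch.optim`, where `Adam`’s `weight_decay` argument is the coupled L2 form and `AdamW`’s is decoupled. A sketch, with `make_runs` and the toy `nn.Linear` as stand-ins for the real setup:

```python
import itertools
import torch
import torch.nn as nn

def make_runs(model_fn, lr=3e-5, wd0=0.1):
    """Build the 2x2 (optimizer, weight decay) ablation grid."""
    runs = []
    for opt_cls, wd in itertools.product((torch.optim.Adam, torch.optim.AdamW),
                                         (0.0, wd0)):
        model = model_fn()  # fresh model per run so state doesn't leak across runs
        opt = opt_cls(model.parameters(), lr=lr, weight_decay=wd)
        runs.append((f"{opt_cls.__name__}_wd{wd}", model, opt))
    return runs

runs = make_runs(lambda: nn.Linear(8, 4))
print([name for name, _, _ in runs])
```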


5. Experimental Design

Budget

  • Tokens per run: 4.1M on Wikitext2
  • Batch size: 20
  • Seq Len: 1024
  • Tokens per step : 20 * 1024 = 20,480
  • Steps per run: 200 steps per epoch @ 5-epochs = 1000 steps

Check: 4.1M tokens per epoch ÷ 20,480 tokens per step ≈ 200 steps per epoch.
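The budget arithmetic above, as a quick check (all numbers come from this section):

```python
batch_size = 20
seq_len = 1024
tokens_per_step = batch_size * seq_len             # 20,480
epoch_tokens = 4_100_000                           # ~4.1M tokens per epoch on Wikitext-2
steps_per_epoch = epoch_tokens // tokens_per_step  # ~200
epochs = 5
total_steps = steps_per_epoch * epochs             # ~1000
print(tokens_per_step, steps_per_epoch, total_steps)
```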

  • Expected wall-clock: UNK

  • Max VRAM allowed: 40GB


6. Metrics to Log

Core Metrics

  • Train loss
  • Validation loss
  • Perplexity
  • Peak VRAM
  • Gradient norm (global)

NEW

  • parameter norm
  • per layer param norm
  • update to weight ratio
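The update-to-weight ratio can be logged by snapshotting params before `optimizer.step()` and comparing after. A PyTorch sketch, where the toy `nn.Linear` and squared-output loss are placeholders for the real training step:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def update_to_weight_ratios(model: nn.Module, prev: dict) -> dict:
    """Per-parameter ||delta theta|| / ||theta|| after an optimizer step."""
    return {
        name: ((p - prev[name]).norm() / (p.norm() + 1e-12)).item()
        for name, p in model.named_parameters()
    }

# usage: snapshot -> backward -> step -> ratios
torch.manual_seed(0)
model = nn.Linear(8, 4)
opt = torch.optim.AdamW(model.parameters(), lr=3e-5, weight_decay=0.1)
snap = {n: p.detach().clone() for n, p in model.named_parameters()}
model(torch.randn(2, 8)).pow(2).mean().backward()
opt.step()
ratios = update_to_weight_ratios(model, snap)
print(ratios)
```

With lr = 3e-5 the per-step ratio should sit in the ~1e-4 range here; a sustained ratio far above or below that during training is the signal this metric is meant to catch.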

Stability Metrics

  • Global gradient norm
  • Loss spikes

Optional Diagnostics

  • Activation norms (first/middle/last layer)
  • LN weight/bias gradient norms
  • Attention entropy
  What we’re testing             | Leading indicator                    | Lagging indicator
  -------------------------------|--------------------------------------|-------------------
  Weight decay working           | ‖θ‖ shrinking over time              | Final val loss
  AdamW vs Adam decay difference | Per-layer ‖θ‖ diverging between runs | Generalization gap
  Optimizer stability            | Grad norm variance                   | Loss spikes
  Update efficiency              | ‖Δθ‖ / ‖θ‖                           | Convergence speed

7. Expected Outcomes

The runs with no weight decay are just a sanity baseline and nothing else; both optimizers should behave identically when weight decay is 0.

Looking at the global and per-layer param norms should make it evident that AdamW is more effective at controlling the scale of params across layers, with a stronger regularizing effect especially on the later/final layers, which receive higher-magnitude grads due to their proximity to the loss and have a higher chance of overfitting due to the large scale of their params.

The update-to-weight ratio should show a divergence between Adam and AdamW, indicating the diminished effect of weight decay in Adam relative to AdamW.


8. Results

Raw Observations

Well, this happened because I tried to clone the entire set of model params on the GPU to compute param norms for reporting, which isn’t a great idea. Applying a patch to move/stream values and do the metrics reporting from CPU-side scalars instead.
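The patch amounts to reducing each tensor to a scalar on-device and only moving those scalars off the GPU, rather than cloning every parameter. A sketch of the approach (function name and the toy model are mine, not the actual codebase’s):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def param_norms_lowmem(model: nn.Module):
    """Global + per-tensor L2 norms; only one scalar per tensor leaves the
    device, so no full copy of the parameters is ever materialised."""
    per_tensor = {}
    total_sq = 0.0
    for name, p in model.named_parameters():
        sq = p.pow(2).sum().item()  # reduce on-device, transfer a single scalar
        per_tensor[name] = sq ** 0.5
        total_sq += sq
    return total_sq ** 0.5, per_tensor

torch.manual_seed(0)
model = nn.Linear(8, 4)
total, per = param_norms_lowmem(model)
print(total, per)
```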

For the detailed results see the wandb report below.


9. Interpretation

See my wandb report for this here

What do we adopt going forward?

Weight decay done properly, as in AdamW, does help models converge faster, avoid overfitting, and improves the overall learning dynamics. That said, an experiment at this scale is insufficient without trying multiple other learning rates and training for substantially longer. But from what we have observed, it is fair to say we can confidently adopt AdamW, and we have learnt to track and measure the core signals that help detect and correct any issues that may arise with the optimizer or update dynamics during training.


10. Follow-up Questions

  • What remains unclear?

See the “What we don’t know” section of the wandb report above.