0. Experiment Meta

Experiment Name: Optimizer Ablation Study - Adam vs AdamW

Category: Optimizer

Date: 02/03/2026

Commit Hash / Branch: 182359f

Seed(s): 42

Pre-Read And Notes: On Optimizers


1. Objective

What single question is this experiment answering?

Compare and understand how different optimizers handle weight decay, and try to resolve some mysteries from the previous experiment.

While the literature makes it fairly clear which optimizer performs best on such tasks, the goal here is to understand how weight decay affects our optimizer and to make sure our specific setup/runs don’t hit issues involving the LR/optimizer/schedule.

We will be performing an ablation with

  • 2 optimizers (Adam and AdamW) [ 2 ]
  • 1 learning-rate ( 3e-5 )
  • With and without weight decay [ 2 ]

So basically

  1. Adam, wd=0
  2. Adam, wd=wd₀
  3. AdamW, wd=0
  4. AdamW, wd=wd₀

And then,

  • 1 run with SGD @ lr=3e-5 just to uncover the mysteries [ +1 ]

2. Hypothesis

Q1: Does weight decay really help learning?

Weight decay should mainly have a regularizing effect, in the sense that it keeps parameter values from growing, which can otherwise lead to overfitting. This can be measured by tracking the parameter norm ‖θ‖. If weight decay is working, the model should show a stable or shrinking ‖θ‖; if ‖θ‖ keeps increasing, weight decay is having no effect in this regime.
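To make that measurable, the global parameter norm can be logged each eval step. A minimal PyTorch sketch, assuming the model is a standard `nn.Module` (the tiny `nn.Linear` stands in for the real model):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def global_param_norm(model: nn.Module) -> float:
    """Global L2 norm over all parameters: sqrt(sum_i ||p_i||^2)."""
    sq_sum = sum(p.pow(2).sum().item() for p in model.parameters())
    return sq_sum ** 0.5

# quick sanity check against a flattened computation
torch.manual_seed(0)
model = nn.Linear(8, 4)
flat = torch.cat([p.flatten() for p in model.parameters()])
print(global_param_norm(model), flat.norm().item())
```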

Q2: If so, how does Adam’s decay mechanism differ from AdamW’s?

Adam applies weight decay by adding an L2 term to the gradients (which doesn’t actually have a true L2 effect; see Optim-Overview) and then feeds that into the momentum and variance estimates, which dilutes the effectiveness of the decay. AdamW preserves it by shrinking the parameters directly, which acts as a stronger regularizer and helps learning.
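The difference can be sketched with a single-scalar update (a minimal illustration, not the exact PyTorch implementation; hyperparameter names are the usual Adam ones):

```python
import math

def adam_like_step(theta, grad, m, v, t,
                   lr=3e-5, b1=0.9, b2=0.999, eps=1e-8, wd=0.1, decoupled=False):
    """One bias-corrected Adam/AdamW step on a scalar, showing where decay enters."""
    if not decoupled:
        grad = grad + wd * theta              # Adam: L2 term folded into the gradient,
                                              # so it also passes through m and v below
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    m_hat = m / (1 - b1 ** t)                 # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        theta = theta - lr * wd * theta       # AdamW: plain multiplicative shrink,
                                              # untouched by the adaptive normalization
    return theta, m, v

# With zero gradient, Adam's decay term still gets rescaled by 1/sqrt(v_hat),
# while AdamW shrinks theta by exactly lr * wd.
adam_theta, _, _ = adam_like_step(1.0, 0.0, 0.0, 0.0, t=1, decoupled=False)
adamw_theta, _, _ = adam_like_step(1.0, 0.0, 0.0, 0.0, t=1, decoupled=True)
print(adam_theta, adamw_theta)
```

Note how in the Adam branch the decay term is normalized by 1/√v̂, so its effective strength depends on the gradient history, whereas AdamW’s shrink is always exactly lr · wd · θ.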

So when we plot ‖θ‖ for Adam vs AdamW, the curves should diverge when weight decay is on, with AdamW ending up with the lower ‖θ‖. With weight decay at 0, they should be practically identical.


3. Controlled Variables (Frozen)

Everything below must remain identical across runs:

  • Dataset: Wikitext-2

  • Tokenizer config: Custom BPE encode() which uses the pre-trained vocab from the HF dataset.

  • Model config (layers, d_model, heads, etc):

# TODO
  • Sequence length: 1024

  • Batch Size : 20

  • Learning rate: 3e-5

  • Effective tokens per run: 4.1M tokens

  • Optimizer: varied — see 4. Independent Variable(s)

  • Weight decay: varied — see 4. Independent Variable(s)

  • Dropout: 0.1

  • Scheduler shape: None

  • Mixed precision setting: BF16

  • Hardware (GPU type): A100

  • Seed policy: 42


4. Independent Variable(s)

What are we changing?

  Variable     | Values
  -------------|---------------
  Optimizers   | [Adam, AdamW]
  Weight decay | [0, 0.1]

Why decay = 0.1? Because we only have 1000 steps to see the effect, and setting it too small will not show any meaningful deviation between runs, especially since the decay is multiplied by the LR, leaving the decay updates on a scale of lr · wd = 3e-5 × 0.1 = 3e-6 per step even when we pick 0.1.
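The 2×2 grid above can be instantiated directly with `torch.optim`, where `Adam`’s `weight_decay` argument is the coupled L2 form and `AdamW`’s is decoupled. A sketch, with `make_runs` and the toy `nn.Linear` as stand-ins for the real setup:

```python
import itertools
import torch
import torch.nn as nn

def make_runs(model_fn, lr=3e-5, wd0=0.1):
    """Build the 2x2 (optimizer, weight decay) ablation grid."""
    runs = []
    for opt_cls, wd in itertools.product((torch.optim.Adam, torch.optim.AdamW),
                                         (0.0, wd0)):
        model = model_fn()  # fresh model per run so state doesn't leak across runs
        opt = opt_cls(model.parameters(), lr=lr, weight_decay=wd)
        runs.append((f"{opt_cls.__name__}_wd{wd}", model, opt))
    return runs

runs = make_runs(lambda: nn.Linear(8, 4))
print([name for name, _, _ in runs])
```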


5. Experimental Design

Budget

  • Tokens per run: 4.1M on Wikitext2
  • Batch size: 20
  • Seq Len: 1024
  • Tokens per step : 20 * 1024 = 20,480
  • Steps per run: 200 steps per epoch @ 5-epochs = 1000 steps

Check: 4.1M tokens per epoch ÷ 20,480 tokens per step ≈ 200 steps per epoch.
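The budget arithmetic above, as a quick check (all numbers come from this section):

```python
batch_size = 20
seq_len = 1024
tokens_per_step = batch_size * seq_len             # 20,480
epoch_tokens = 4_100_000                           # ~4.1M tokens per epoch on Wikitext-2
steps_per_epoch = epoch_tokens // tokens_per_step  # ~200
epochs = 5
total_steps = steps_per_epoch * epochs             # ~1000
print(tokens_per_step, steps_per_epoch, total_steps)
```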

  • Expected wall-clock: UNK

  • Max VRAM allowed: 40GB


6. Metrics to Log

Core Metrics

  • Train loss
  • Validation loss
  • Perplexity
  • Peak VRAM
  • Gradient norm (global)

NEW

  • parameter norm
  • per layer param norm
  • update to weight ratio
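The update-to-weight ratio can be logged by snapshotting params before `optimizer.step()` and comparing after. A PyTorch sketch, where the toy `nn.Linear` and squared-output loss are placeholders for the real training step:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def update_to_weight_ratios(model: nn.Module, prev: dict) -> dict:
    """Per-parameter ||delta theta|| / ||theta|| after an optimizer step."""
    return {
        name: ((p - prev[name]).norm() / (p.norm() + 1e-12)).item()
        for name, p in model.named_parameters()
    }

# usage: snapshot -> backward -> step -> ratios
torch.manual_seed(0)
model = nn.Linear(8, 4)
opt = torch.optim.AdamW(model.parameters(), lr=3e-5, weight_decay=0.1)
snap = {n: p.detach().clone() for n, p in model.named_parameters()}
model(torch.randn(2, 8)).pow(2).mean().backward()
opt.step()
ratios = update_to_weight_ratios(model, snap)
print(ratios)
```

With lr = 3e-5 the per-step ratio should sit in the ~1e-4 range here; a sustained ratio far above or below that during training is the signal this metric is meant to catch.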

Stability Metrics

  • Global gradient norm
  • Loss spikes

Optional Diagnostics

  • Activation norms (first/middle/last layer)
  • LN weight/bias gradient norms
  • Attention entropy
  What we’re testing             | Leading indicator                    | Lagging indicator
  -------------------------------|--------------------------------------|-------------------
  Weight decay working           | ‖θ‖ shrinking over time              | Final val loss
  AdamW vs Adam decay difference | Per-layer ‖θ‖ diverging between runs | Generalization gap
  Optimizer stability            | Grad norm variance                   | Loss spikes
  Update efficiency              | ‖Δθ‖ / ‖θ‖                           | Convergence speed

7. Expected Outcomes

The runs with no weight decay are just a sanity baseline and nothing else; both optimizers should behave identically when weight decay is 0.

Looking at the global and per-layer param norms should make it evident that AdamW is more effective at controlling the scale of params across layers, with a stronger regularizing effect especially on the later/final layers, which receive higher-magnitude grads due to their proximity to the loss and have a higher chance of overfitting due to the large scale of their params.

The update-to-weight ratio should show a divergence between Adam and AdamW, indicating the diminished effect of weight decay in Adam relative to AdamW.


8. Results

Raw Observations

Well, this happened because I tried to clone the entire set of model params on the GPU to compute param norms for reporting, which isn’t a great idea. Applying a patch to move/stream values and do the metrics reporting from CPU-side scalars instead.
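The patch amounts to reducing each tensor to a scalar on-device and only moving those scalars off the GPU, rather than cloning every parameter. A sketch of the approach (function name and the toy model are mine, not the actual codebase’s):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def param_norms_lowmem(model: nn.Module):
    """Global + per-tensor L2 norms; only one scalar per tensor leaves the
    device, so no full copy of the parameters is ever materialised."""
    per_tensor = {}
    total_sq = 0.0
    for name, p in model.named_parameters():
        sq = p.pow(2).sum().item()  # reduce on-device, transfer a single scalar
        per_tensor[name] = sq ** 0.5
        total_sq += sq
    return total_sq ** 0.5, per_tensor

torch.manual_seed(0)
model = nn.Linear(8, 4)
total, per = param_norms_lowmem(model)
print(total, per)
```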

For the detailed results see the wandb report below.


9. Interpretation

See my wandb report for this here

What do we adopt going forward?

Weight decay done properly, as in AdamW, does help models converge faster, avoid overfitting, and improves the overall learning dynamics. That said, an experiment at this scale is insufficient without trying multiple other learning rates and training for substantially longer. But from what we have observed, it is fair to say we can confidently adopt AdamW, and we have learnt to track and measure the core signals that help detect and correct any issues that may arise with the optimizer or update dynamics during training.


10. Follow-up Questions

  • What remains unclear?

See the “What we don’t know” section of the wandb report above.