Read the Premise here, and for better big-picture context, see this.
0. Experiment Meta
Experiment Name: LR vs Batch Size Empirical Sweep
Category: Hyperparam
Date: 27/02/2026
Commit Hash / Branch: 2bacec8
Seed(s): 42
1. Objective
What single question is this experiment answering?
A matrix comparison between different batch sizes and different learning rates. The goal is to determine a batch_size that maximizes GPU utilization and a stable range of `lr` for AdamW.
2. Hypothesis
Find practical values for LR <> batch_size combinations which help us optimally use the given hardware.
On a side note, this experiment would be more fruitful if re-run after the implementation of grad accumulation with microbatching.
3. Controlled Variables (Frozen)
Everything below must remain identical across runs:
- Dataset: Wikitext-2
- Tokenizer config: Custom BPE `encode()` which uses a pre-trained vocab from the HF dataset.
- Model config (layers, d_model, heads, etc.):
```python
ExperimentConfig(
    run=RunConfig(
        project_name="transformer-room-baseline",
        artifacts_root=str(project_root / "baseline" / "models"),
        run_name="p0-LRvsBSz-wikitext2_gpt2",
        resume_from_checkpoint=False,
        checkpoint_every_n_steps=10000,
        use_torch_compile=False,
        torch_compile_mode="default",
        torch_compile_fullgraph=False,
        torch_compile_dynamic=False,
        seed=42,
    ),
    dataset=HFTextDatasetConfig(
        dataset_name=dataset_name,
        dataset_config=dataset_config,
        split="train",
        text_field="text",
    ),
    tokenizer=BPETokenizerConfig(
        base_vocab_size=base_vocab_size,
        num_special_tokens=3,
        vocab_path=str(vocab_path),
    ),
    model=BaselineDecoderConfig(
        d_model=768,
        n_heads=8,
        layers=12,
        dropout=0.1,
    ),
    train=TrainConfig(
        epochs=5,
        learning_rate=0.001,  # INDEPENDENT VAR
        batch_size=64,        # INDEPENDENT VAR
        seq_len=1024,
        stride=1024,
        data_fraction=1.0,
    ),
    split=HoldoutSplitConfig(
        train_fraction=0.9,
        seed=42,
        shuffle=False,
    ),
    logging=LoggingConfig(provider="wandb"),
)
```
- Sequence length: 1024
- Effective tokens per run: ! DEPENDENT VARIABLE ! (covered in the "4. Independent Variable(s)" section below)
- Optimizer (unless this is the optimizer experiment): AdamW
- Weight decay: None
- Dropout: 0.1
- Scheduler shape: constant
- Mixed precision setting: BF16
- Hardware (GPU type): Nvidia A100
- Seed policy: fixed 42 (see the sketch below)
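A minimal sketch of what the fixed-seed policy could look like in PyTorch. This is illustrative; the actual seeding lives wherever `RunConfig.seed` is consumed, and the repo may not seed numpy at all:

```python
import random

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    # Fix all RNG sources so runs differ only in the swept variables
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
```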
4. Independent Variable(s)
What are we changing?
Since `lr` is multiplicative in nature, we will use a log-scale of values for the sweep.
| Variable | Values |
|---|---|
| learning_rate | [1e-5, 3e-5, 1e-4, 3e-4, 1e-3] |
| batch_size | [ 12, 20 ] |
We will directly assume the batch_size to be 20 for now, based on our assumptions in the Estimates here. This axis will be revisited once we have micro-batching + grad accumulation. A sketch of the resulting sweep grid follows.
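For reference, a minimal sketch of how the 5 × 2 sweep grid could be enumerated. The names here are illustrative, not the actual sweep driver:

```python
from itertools import product

# Log-spaced, since lr acts multiplicatively on the update size
learning_rates = [1e-5, 3e-5, 1e-4, 3e-4, 1e-3]
batch_sizes = [12, 20]

# 5 LRs x 2 batch sizes = 10 runs
sweep = [
    {"learning_rate": lr, "batch_size": bs}
    for lr, bs in product(learning_rates, batch_sizes)
]
```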
5. Experimental Design
Budget
- Tokens per run: 4.1M on Wikitext-2
- Batch size: 20
- Seq len: 1024
- Tokens per step: 20 * 1024 = 20,480
- Steps per run: ~200 (4.1M tokens at 20,480 tokens per step; arithmetic sketched below)
- Expected wall-clock: ~2min10sec per run × no_of_runs, i.e. ~130 sec × 10 = 21-22 mins. That comes to roughly 2,000 optimizer steps in total (10 runs × ~200 steps).
- Max VRAM allowed: using an A100, this is capped at the card's maximum capacity of 40GB.
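A quick back-of-the-envelope check of the budget numbers above (pure arithmetic from the stated config; the batch-12 figure is implied by the sweep but not spelled out in the budget):

```python
TOKENS_PER_RUN = 4.1e6  # Wikitext-2 tokens per run
SEQ_LEN = 1024

for batch_size in (20, 12):
    tokens_per_step = batch_size * SEQ_LEN
    steps_per_run = TOKENS_PER_RUN / tokens_per_step
    print(f"bs={batch_size}: {tokens_per_step:,} tokens/step, ~{steps_per_run:.0f} steps/run")

# bs=20: 20,480 tokens/step, ~200 steps/run
# bs=12: 12,288 tokens/step, ~334 steps/run
```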
Evaluation Protocol
- Train loss
- Validation loss
- Perplexity (see the note after this list)
- Throughput (tokens/sec)
- Peak VRAM
- Gradient norm (global)
- (Optional) Layer-wise grad norms
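For the perplexity entry above: perplexity is just the exponential of the mean cross-entropy loss, assuming the logged val loss is in nats per token (PyTorch's default cross-entropy):

```python
import math

def perplexity(mean_ce_loss_nats: float) -> float:
    # e.g. a val loss of 4.0 nats/token -> ppl ~54.6
    return math.exp(mean_ce_loss_nats)
```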
Comparison Axis
- Equal tokens seen
- Equal optimizer steps
- Equal wall-clock time (not necessarily; it is a confounding variable in this case)
6. Metrics to Log
Core Metrics
- Train loss vs tokens
- Val loss vs tokens
- Perplexity
- Step time
- Peak memory
Stability Metrics
- Global gradient norm (sketch after this section)
- NaNs / divergence events
- Loss spikes
Optional Diagnostics
- Activation norms (first/middle/last layer)
- LN weight/bias gradient norms
- Attention entropy
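A minimal sketch of how the global gradient norm could be computed each step after `backward()`, assuming a standard PyTorch loop. This is illustrative, not the repo's actual logging code:

```python
import torch

@torch.no_grad()
def global_grad_norm(model: torch.nn.Module) -> float:
    # L2 norm over all parameter gradients, i.e. the "global gradient norm" metric
    norms = [p.grad.norm(2) for p in model.parameters() if p.grad is not None]
    if not norms:
        return 0.0
    return torch.norm(torch.stack(norms), 2).item()
```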
7. Expected Outcomes
Write what you expect before running it.
- Small batch + high LR → unstable
- Large batch + scaled LR → smoother convergence (probably needs grad-acc)
- Diminishing returns beyond some batch size X (i.e., finding the critical batch size surely needs grad-acc with a single A100)
The main takeaway we can expect from this experiment for now is an optimal range for `lr` on AdamW, plus a sense of how batch size affects learning.
8. Results and 9. Interpretation
I learned to use wandb reports! I find that to be the most effective way to communicate this information, so please head over to here.
10. Decision
What do we adopt going forward?
- Selected LR: 3e-05
- Selected batch size: 20
- Selected optimizer: AdamW (the only one we used for now)
- Any architectural preference: No changes, retain everything.
I would like to expand on the decision of batch size 20 over 12 at lr 1e-05. Batch size 12 clearly performs better across most graphs, but we should note that the gains are largely due to the larger number of optimizer steps a batch-12 run gets compared to batch-20; the lr, rather than the batch size, is predominantly the factor working the main magic. It is also important to note that the last-layer attention entropy for the smaller batch-12 runs starts to drop off steeply, which is undesirable as it indicates over-specializing heads, while batch-20 tapers off much more slowly, making it more desirable.
Also, picking between 12 and 20 probably will not make much of a difference; picking the larger one based on the literature is still a safe bet. We will be working on expanding the batch size with gradient accumulation soon (a sketch of the idea follows).
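Since gradient accumulation keeps coming up as the unlock for larger effective batches, here is a minimal sketch of the idea, assuming a standard PyTorch training loop where the loader yields (inputs, targets) token batches. Names are illustrative, not the repo's actual trainer:

```python
import torch
import torch.nn.functional as F

def train_with_grad_accum(model, loader, optimizer, accum_steps: int = 4):
    # Effective batch size = micro-batch size * accum_steps, at ~constant VRAM
    optimizer.zero_grad(set_to_none=True)
    for i, (inputs, targets) in enumerate(loader):
        logits = model(inputs)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        (loss / accum_steps).backward()  # scale so accumulated grads average, not sum
        if (i + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)
```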
11. Follow-up Questions
What remains unclear?
While I suspect that the sharp drop in the loss/perplexity around the 500-700 step mark is due to Adam's momentum, there is no good way to prove this with the current metrics. It could also be that the model found a better, steeper loss surface at the time (step ~680), with global grad norms rising shortly after (steps 650→700), but due to the different sampling frequencies of the two metrics it is hard to tell, since both events happen in the same 50-step window.
- This can be verified in the next optimizer ablation study by swapping the optimizer for SGD: if the drop really did come from Adam's momentum, it should not appear with SGD.
- We can also track another Adam metric, the "effective step", i.e. the scalar multiplier applied to the gradient updates, and compare it against the grad norm we currently log. If it really was the Adam optimizer causing the drop, we should see an increase in the "effective step" rather than the grad norm rising. A sketch of how this metric could be computed follows.
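One way to operationalize that "effective step" metric is the norm of the bias-corrected Adam update, lr * m_hat / (sqrt(v_hat) + eps), read out of the optimizer state. The state field names below match torch.optim.AdamW's internals, but treat this as a hedged sketch rather than the metric we will necessarily ship:

```python
import torch

@torch.no_grad()
def adam_effective_step_norm(optimizer: torch.optim.AdamW) -> float:
    """Norm of the update Adam would apply this step: lr * m_hat / (sqrt(v_hat) + eps).
    Ignores the decoupled weight decay that AdamW applies separately."""
    sq_sum = 0.0
    for group in optimizer.param_groups:
        lr, eps = group["lr"], group["eps"]
        beta1, beta2 = group["betas"]
        for p in group["params"]:
            state = optimizer.state.get(p)
            if not state:
                continue  # param not yet stepped
            t = state["step"]
            m_hat = state["exp_avg"] / (1 - beta1 ** t)      # bias-corrected 1st moment
            v_hat = state["exp_avg_sq"] / (1 - beta2 ** t)   # bias-corrected 2nd moment
            update = lr * m_hat / (v_hat.sqrt() + eps)
            sq_sum += update.pow(2).sum().item()
    return sq_sum ** 0.5
```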
What needs a deeper sweep?
- Optimizer choice (see the SGD ablation proposed above)