Read the Premise here; for better big-picture context, see this.

0. Experiment Meta

Experiment Name: LR vs Batch Size Empirical Sweep

Category: Hyperparam

Date: 27/02/2026

Commit Hash / Branch: 2bacec8

Seed(s): 42


1. Objective

What single question is this experiment answering?


A matrix comparison between different batch sizes and different learning rates. The goal is to determine a batch_size that maximizes GPU utilization and a stable LR range for AdamW.


2. Hypothesis

Find practical values for LR<>batch_size combinations that help us optimally use the given hardware.

On a side note, this experiment would be more fruitful if re-run after gradient accumulation with micro-batching is implemented.


3. Controlled Variables (Frozen)

Everything below must remain identical across runs:

  • Dataset: Wikitext-2

  • Tokenizer config: Custom BPE encode() which uses a pre-trained vocab from the HF dataset.

  • Model config (layers, d_model, heads, etc):

ExperimentConfig(
        run=RunConfig(
            project_name="transformer-room-baseline",
            artifacts_root=str(project_root / "baseline" / "models"),
            run_name="p0-LRvsBSz-wikitext2_gpt2",
            resume_from_checkpoint=False,
            checkpoint_every_n_steps=10000,
            use_torch_compile=False,
            torch_compile_mode="default",
            torch_compile_fullgraph=False,
            torch_compile_dynamic=False,
            seed=42,
        ),
        dataset=HFTextDatasetConfig(
            dataset_name=dataset_name,
            dataset_config=dataset_config,
            split="train",
            text_field="text",
        ),
        tokenizer=BPETokenizerConfig(
            base_vocab_size=base_vocab_size,
            num_special_tokens=3,
            vocab_path=str(vocab_path),
        ),
        model=BaselineDecoderConfig(
            d_model=768,
            n_heads=8,
            layers=12,
            dropout=0.1,
        ),
        train=TrainConfig(
            epochs=5,
            learning_rate=0.001, # INDEPENDENT VAR
            batch_size=64, # INDEPENDENT VAR
            seq_len=1024,
            stride=1024,
            data_fraction=1.0,
        ),
        split=HoldoutSplitConfig(
            train_fraction=0.9,
            seed=42,
            shuffle=False,
        ),
        logging=LoggingConfig(provider="wandb"),
    )
  • Sequence length: 1024

  • Effective tokens per run: ! DEPENDENT VARIABLE ! derived from the independent variables; see the budget in Section 5 below.

  • Optimizer (unless this is the optimizer experiment): AdamW

  • Weight decay: None

  • Dropout: 0.1

  • Scheduler shape: constant

  • Mixed precision setting: BF16

  • Hardware (GPU type): Nvidia A100

  • Seed policy: fixed 42


4. Independent Variable(s)

What are we changing?

Since LR is multiplicative in nature, we use a log-scale grid of values for the sweep.

Variable        Values
learning_rate   [1e-5, 3e-5, 1e-4, 3e-4, 1e-3]
batch_size      [12, 20]

We will directly assume batch_size = 20 for now, based on our assumptions in the Estimates here. This axis will be revisited once we have micro-batching + gradient accumulation.
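The sweep grid implied by the table above can be sketched in a few lines (names like `sweep` are illustrative, not taken from the actual runner):

```python
from itertools import product

# Sweep axes from Section 4: log-spaced LRs, two batch sizes.
learning_rates = [1e-5, 3e-5, 1e-4, 3e-4, 1e-3]
batch_sizes = [12, 20]

# Each (lr, bs) pair becomes one run; all other config fields stay frozen.
sweep = [{"learning_rate": lr, "batch_size": bs}
         for lr, bs in product(learning_rates, batch_sizes)]

print(len(sweep))  # 10 runs in total
```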


5. Experimental Design

Budget

  • Tokens per run: 4.1M on Wikitext2
  • Batch size: 20
  • Seq Len: 1024
  • Tokens per step: 20 * 1024 = 20,480
  • Steps per run: 200 steps

At 20,480 tokens per step, the 4.1M-token budget works out to ~200 steps per run.

  • Expected wall-clock:

    • ~2 min 10 sec * no_of_runs
    • ~130 sec * 10 runs ≈ 21-22 min
    • That is about 2,000 optimizer steps (and ~41M tokens) in total across the 10 runs.
  • Max VRAM allowed:

    • Using an A100 this is capped at the max capacity of 40GB
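The budget numbers above can be sanity-checked with simple arithmetic (constants taken from this section):

```python
# Back-of-envelope budget check for Section 5.
TOKENS_PER_RUN = 4_100_000      # ~4.1M tokens of Wikitext-2
BATCH_SIZE = 20
SEQ_LEN = 1024

tokens_per_step = BATCH_SIZE * SEQ_LEN             # 20,480
steps_per_run = TOKENS_PER_RUN // tokens_per_step  # ~200

# 5 LRs x 2 batch sizes = 10 runs, ~130 s each.
n_runs = 5 * 2
wall_clock_min = n_runs * 130 / 60                 # ~21.7 minutes

print(tokens_per_step, steps_per_run, round(wall_clock_min, 1))
```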

Evaluation Protocol

  • Train loss
  • Validation loss
  • Perplexity
  • Throughput (tokens/sec)
  • Peak VRAM
  • Gradient norm (global)
  • (Optional) Layer-wise grad norms
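A minimal, framework-free sketch of how the throughput metric can be measured (the `measure_throughput` name and `step_fn` callable are hypothetical; peak VRAM would come from `torch.cuda.max_memory_allocated()` after `torch.cuda.reset_peak_memory_stats()` in the real trainer):

```python
import time

def measure_throughput(step_fn, batch_size, seq_len, n_steps=10):
    """Average tokens/sec over n_steps calls of a training-step callable."""
    start = time.perf_counter()
    for _ in range(n_steps):
        step_fn()
    elapsed = time.perf_counter() - start
    return n_steps * batch_size * seq_len / elapsed

# Hypothetical no-op step, just to show the call shape.
tok_per_sec = measure_throughput(lambda: None, batch_size=20, seq_len=1024)
```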

Comparison Axis

  • Equal tokens seen

  • Equal optimizer steps

  • Equal wall-clock time (not enforced; wall-clock is a confounding variable in this case)


6. Metrics to Log

Core Metrics

  • Train loss vs tokens
  • Val loss vs tokens
  • Perplexity
  • Step time
  • Peak memory

Stability Metrics

  • Global gradient norm
  • NaNs / divergence events
  • Loss spikes
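One way the NaN/divergence and loss-spike events could be flagged, as a sketch with arbitrary thresholds (`spike_ratio` and `window` are illustrative choices, not values used in the runs):

```python
import math

def stability_flags(losses, spike_ratio=1.5, window=10):
    """Flag NaN/inf losses and spikes relative to a trailing-window mean."""
    events = []
    for i, loss in enumerate(losses):
        if math.isnan(loss) or math.isinf(loss):
            events.append((i, "nan/inf"))
            continue
        tail = [x for x in losses[max(0, i - window):i]
                if not (math.isnan(x) or math.isinf(x))]
        if tail and loss > spike_ratio * (sum(tail) / len(tail)):
            events.append((i, "spike"))
    return events

print(stability_flags([4.0, 3.5, 3.4, 9.0, float("nan")]))
# [(3, 'spike'), (4, 'nan/inf')]
```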

Optional Diagnostics

  • Activation norms (first/middle/last layer)
  • LN weight/bias gradient norms
  • Attention entropy
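For reference, the attention-entropy diagnostic for a single attention row is plain Shannon entropy (a simplification: in practice it would be averaged over heads, positions, and batches):

```python
import math

def attention_entropy(attn_row):
    """Shannon entropy (nats) of one attention distribution (sums to 1)."""
    return -sum(p * math.log(p) for p in attn_row if p > 0)

uniform = [0.25] * 4                # maximally spread attention
peaked = [0.97, 0.01, 0.01, 0.01]   # near one-hot: an "over-specialized" head

print(attention_entropy(uniform))   # log(4) ≈ 1.386
print(attention_entropy(peaked))    # much lower
```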

7. Expected Outcomes

Write what you expect before running it.

  • Small batch + high LR → unstable

  • Large batch + scaled LR → smoother convergence (probably needs grad-acc)

  • Diminishing returns beyond some batch size X (i.e., finding the critical batch size will surely need grad-acc on a single A100)
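The "scaled LR" expectation refers to the common linear and square-root scaling heuristics; a sketch (the `scale_lr` helper is illustrative, and neither rule is guaranteed to hold for AdamW, which is exactly why we sweep):

```python
import math

def scale_lr(base_lr, base_bs, new_bs, rule="linear"):
    """Common heuristics for scaling LR with batch size."""
    ratio = new_bs / base_bs
    return base_lr * (ratio if rule == "linear" else math.sqrt(ratio))

print(scale_lr(3e-5, 12, 20, "linear"))  # 5e-05
print(scale_lr(3e-5, 12, 20, "sqrt"))
```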

The main takeaway from this experiment for now is an optimal range for LR on AdamW, and a sense of how batch size affects learning.


8. Results & 9. Interpretation

I learned to use wandb reports! I find that to be the most effective way to communicate this information, so please head over to the report here.

What do we adopt going forward?

  • Selected LR: 3e-05
  • Selected batch size: 20
  • Selected optimizer: AdamW (the only one we used for now)
  • Any architectural preference: No changes, retain everything.

I would like to expand on the decision of batch size 20 over 12 at LR 1e-05. Batch size 12 clearly performs better across most graphs, but those gains are largely due to the larger number of optimizer steps a batch-12 run takes compared to batch-20; LR, rather than batch size, is working the main magic. It is also important to note that the attention entropy of the last layer drops off steeply for the smaller batch-12 runs, which is undesirable as it indicates over-specializing heads, while batch-20 tapers off much more slowly, making it the more desirable choice.
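The optimizer-step imbalance behind that argument is easy to quantify at equal token budgets:

```python
# Why batch-12 takes more optimizer steps than batch-20 at equal tokens.
TOKENS = 4_100_000
SEQ_LEN = 1024

steps_bs12 = TOKENS // (12 * SEQ_LEN)  # 333 steps
steps_bs20 = TOKENS // (20 * SEQ_LEN)  # 200 steps
print(steps_bs12, steps_bs20)
```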

Also, picking between 12 and 20 probably will not make much of a difference, and picking the larger one, as the literature suggests, is still a safe bet. We will be working on expanding the effective batch size with gradient accumulation soon.


10. Follow-up Questions

What remains unclear?

While I guess that the sharp drop in loss/perplexity around the 500-700 step mark is due to Adam's momentum, there is no good way to prove this with the current metrics. It could also be that the model found a better, steeper loss surface at the time (step 680), and we do see global grad-norms rising (steps 650-700), but due to the difference in sampling frequency of the two metrics it is hard to tell, since both happen in the same 50-step window.

  • This can be verified in the next optimizer ablation study by swapping the optimizer for SGD; if the drop really did come from Adam's momentum, it shouldn't appear with SGD.
  • We can also track another Adam metric, the "effective step", i.e., the scalar multiplier on the gradient update, and compare it against the grad norm we currently log. If Adam really caused the drop, we should see an increase in the effective step rather than the grad norm.
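A sketch of what "effective step" means for a single scalar weight, reconstructed from the standard Adam update (in PyTorch the `exp_avg`/`exp_avg_sq` moments live in `optimizer.state[param]`; the function name and inputs here are illustrative, no real optimizer is instantiated):

```python
import math

def adam_effective_step(lr, exp_avg, exp_avg_sq, step,
                        beta1=0.9, beta2=0.999, eps=1e-8):
    """Magnitude of Adam's actual parameter update for one scalar weight."""
    m_hat = exp_avg / (1 - beta1 ** step)     # bias-corrected first moment
    v_hat = exp_avg_sq / (1 - beta2 ** step)  # bias-corrected second moment
    return abs(lr * m_hat / (math.sqrt(v_hat) + eps))

# Same first moment, very different second moments -> very different
# effective steps, even with an unchanged gradient-norm reading:
print(adam_effective_step(3e-5, exp_avg=0.1, exp_avg_sq=1.0, step=100))
print(adam_effective_step(3e-5, exp_avg=0.1, exp_avg_sq=1e-4, step=100))
```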

What needs a deeper sweep?

  • Optim Choice