0. Experiment Meta
Experiment Name:
Category: (Hyperparam / Optimizer / Memory / Architecture / Training Recipe)
Date:
Commit Hash / Branch:
Seed(s):
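A minimal seed-policy sketch, assuming PyTorch: record the seed(s) above and call this once at startup so reruns are comparable.

```python
import random

import numpy as np
import torch

def set_seed(seed: int) -> None:
    """Seed every RNG the run touches."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)           # seeds the CPU (and CUDA) global generators
    torch.cuda.manual_seed_all(seed)  # explicit, for multi-GPU safety
```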
1. Objective
What single question is this experiment answering?
Example:
Does increasing batch size allow a proportionally larger stable learning rate?
2. Hypothesis
Mechanism: [what process are we claiming is happening]
Prediction: [what specific metrics will show this, and in what direction]
Falsification: [what would we see if the hypothesis is wrong]
3. Controlled Variables (Frozen)
Everything below must remain identical across runs:
- Dataset: Wikitext-2
- Tokenizer config:
- Model config (layers, d_model, heads, etc.):
- Sequence length:
- Effective tokens per run:
- Optimizer (unless this is the optimizer experiment):
- Weight decay:
- Dropout:
- Scheduler shape:
- Mixed precision setting:
- Hardware (GPU type):
- Seed policy:
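One way to enforce the freeze is to pin every control in a single immutable config and fingerprint it into the run name. A sketch, with placeholder values rather than prescribed settings:

```python
from dataclasses import asdict, dataclass
import hashlib, json

@dataclass(frozen=True)
class FrozenConfig:
    # Everything here must be identical across runs in this experiment.
    dataset: str = "wikitext-2"
    seq_len: int = 1024
    n_layers: int = 12
    d_model: int = 768
    n_heads: int = 12
    optimizer: str = "adamw"
    weight_decay: float = 0.1
    dropout: float = 0.0
    precision: str = "bf16"

def config_hash(cfg: FrozenConfig) -> str:
    """Short fingerprint to verify two runs share identical controls."""
    blob = json.dumps(asdict(cfg), sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:8]
```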
4. Independent Variable(s)
What are we changing?
| Variable | Values |
|---|---|
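A small sketch for expanding that table into a full grid of runs; the variable names and values below are hypothetical:

```python
from itertools import product

# Hypothetical independent variables; mirror the table above.
sweep = {
    "batch_size": [32, 64, 128],
    "lr": [3e-4, 6e-4, 1.2e-3],
}

# One config dict per run, covering the full grid.
runs = [dict(zip(sweep, combo)) for combo in product(*sweep.values())]
```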
5. Experimental Design
Budget
- Tokens per run:
- Steps per run:
- Expected wall-clock:
- Max VRAM allowed:
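Tokens per run is determined by the other budget numbers; a quick consistency check, assuming gradient accumulation is in use (the example figures are arbitrary):

```python
def tokens_per_run(steps: int, micro_batch: int, grad_accum: int, seq_len: int) -> int:
    """Effective tokens = optimizer steps x effective batch x sequence length."""
    return steps * micro_batch * grad_accum * seq_len

# e.g. 2000 steps, micro-batch 16, 4x accumulation, seq len 1024:
assert tokens_per_run(2000, 16, 4, 1024) == 131_072_000  # ~131M tokens
```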
Evaluation Protocol
- Train loss
- Validation loss
- Perplexity
- Throughput (tokens/sec)
- Peak VRAM
- Gradient norm (global)
- (Optional) Layer-wise grad norms
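A minimal evaluation sketch covering val loss and perplexity, assuming the model returns raw logits and loss is mean token-level cross-entropy in nats:

```python
import math

import torch
import torch.nn.functional as F

@torch.no_grad()
def evaluate(model, val_loader, device="cuda"):
    """Mean token-level cross-entropy over the val set; PPL = exp(loss)."""
    model.eval()
    total_loss, total_tokens = 0.0, 0
    for x, y in val_loader:
        x, y = x.to(device), y.to(device)
        logits = model(x)
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), y.reshape(-1))
        total_loss += loss.item() * y.numel()
        total_tokens += y.numel()
    model.train()
    val_loss = total_loss / total_tokens
    return val_loss, math.exp(val_loss)

# Training throughput: tokens/sec = (effective batch * seq_len) / step time.
```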
Comparison Axis
- ☐ Equal tokens seen
- ☐ Equal optimizer steps
- ☐ Equal wall-clock time

(Choose one.)
6. Metrics to Log
Core Metrics
- Train loss vs tokens
- Val loss vs tokens
- Perplexity
- Step time
- Peak memory
Stability Metrics
- Global gradient norm
- NaNs / divergence events
- Loss spikes
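A sketch for the global norm, computed after `backward()` and before the optimizer step:

```python
import math

import torch

def global_grad_norm(model: torch.nn.Module) -> float:
    """L2 norm over all parameter gradients; call between backward() and step()."""
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.detach().float().pow(2).sum().item()
    return math.sqrt(total)
```

If the run already clips, `torch.nn.utils.clip_grad_norm_` returns the pre-clip total norm, so logging its return value costs nothing extra; a non-finite norm counts as a divergence event.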
Optional Diagnostics
- Activation norms (first/middle/last layer)
- LN weight/bias gradient norms
- Attention entropy
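Activation norms can be captured with forward hooks; a sketch that assumes the hooked modules return a single tensor, with `log_fn` standing in for whatever (key, value) logger the run uses:

```python
import torch

def attach_norm_hooks(model, layer_names, log_fn):
    """Log activation RMS at chosen layers via forward hooks.

    `layer_names` come from model.named_modules(); which modules count
    as first/middle/last depends on the model.
    """
    handles = []
    for name, module in model.named_modules():
        if name in layer_names:
            def hook(mod, args, out, name=name):
                rms = out.detach().float().pow(2).mean().sqrt().item()
                log_fn(f"act_rms/{name}", rms)
            handles.append(module.register_forward_hook(hook))
    return handles  # call h.remove() on each handle when done
```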
7. Expected Outcomes
Write what you expect before running it.
Example:
- Small batch + high LR → unstable
- Large batch + scaled LR → smoother convergence
- Diminishing returns beyond batch size X
8. Results
Raw Observations
(Write qualitative observations first.)
Quantitative Summary
| Config | Final Val Loss | Final PPL | Tokens/sec | Peak VRAM | Notes |
|---|---|---|---|---|---|
9. Interpretation
What actually happened?
- Did results match hypothesis?
- Any instability?
- Any unexpected interactions?
- Was performance compute-efficient?
10. Decision
What do we adopt going forward?
- Selected LR:
- Selected batch size:
- Selected optimizer:
- Any architectural preference:
11. Follow-up Questions
- What remains unclear?
- What needs a deeper sweep?
- Does this move to Phase-1 or require refinement?
Overfit Sanity Template (Quick Preflight)
Run this before major architecture/memory changes.
Dataset Slice:
Batch Size:
Steps Run:
Pass Criteria
- Loss → near zero
- No NaNs
- Gradients finite
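A sketch of the preflight loop that checks all three criteria; it assumes a single fixed `(x, y)` batch the model should be able to memorize, and the step count and LR are placeholders:

```python
import torch
import torch.nn.functional as F

def overfit_preflight(model, batch, steps=200, lr=1e-3):
    """Train on one fixed batch; a healthy setup drives loss toward zero."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    x, y = batch
    for step in range(steps):
        logits = model(x)
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), y.reshape(-1))
        assert torch.isfinite(loss), f"non-finite loss at step {step}"
        opt.zero_grad(set_to_none=True)
        loss.backward()
        for p in model.parameters():  # gradients-finite criterion
            if p.grad is not None:
                assert torch.isfinite(p.grad).all(), f"non-finite grad at step {step}"
        opt.step()
    return loss.item()  # pass if this lands near zero
```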
Result
- ☐ Pass
- ☐ Fail
Memory / Systems Experiment Addendum
Use this section only for systems-level changes: SDPA (scaled dot-product attention), activation checkpointing, and gradient accumulation.
Measurement Conditions
- Fixed tokens per run:
- Same seed:
- Same effective batch size:
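A measurement sketch for the step-time and peak-VRAM columns below; `train_step` is a hypothetical zero-arg closure running one full optimizer step on the fixed batch:

```python
import time

import torch

def measure_step(train_step, n_steps=50, device="cuda"):
    """Average step time and peak VRAM under the fixed conditions above."""
    torch.cuda.reset_peak_memory_stats(device)
    torch.cuda.synchronize(device)
    t0 = time.perf_counter()
    for _ in range(n_steps):
        train_step()  # forward + backward + optimizer step
    torch.cuda.synchronize(device)
    step_time = (time.perf_counter() - t0) / n_steps
    peak_gib = torch.cuda.max_memory_allocated(device) / 2**30
    return step_time, peak_gib
```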
Compare
| Config | Step Time | Tokens/sec | Peak VRAM | Final Val Loss |
|---|---|---|---|---|
Conclusion
- Memory saved:
- Speed change:
- Any training behavior difference: