0. Experiment Meta
Experiment Name:
Category: (Hyperparam / Optimizer / Memory / Architecture / Training Recipe)
Date:
Commit Hash / Branch:
Seed(s):
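A minimal seed-policy sketch, assuming PyTorch: record the seed(s) above and call this once at startup so reruns are comparable.

```python
import random

import numpy as np
import torch

def set_seed(seed: int) -> None:
    """Seed every RNG the run touches."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)           # seeds the CPU (and CUDA) global generators
    torch.cuda.manual_seed_all(seed)  # explicit, for multi-GPU safety
```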
1. Objective
What single question is this experiment answering?
Example:
Does increasing batch size allow a proportionally larger stable learning rate?
2. Hypothesis
Mechanism: [what process are we claiming is happening]
Prediction: [what specific metrics will show this, and in what direction]
Falsification: [what would we see if the hypothesis is wrong]
3. Controlled Variables (Frozen)
Everything below must remain identical across runs:
- Dataset: Wikitext-2
- Tokenizer config:
- Model config (layers, d_model, heads, etc.):
- Sequence length:
- Effective tokens per run:
- Optimizer (unless this is the optimizer experiment):
- Weight decay:
- Dropout:
- Scheduler shape:
- Mixed precision setting:
- Hardware (GPU type):
- Seed policy:
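One way to enforce the freeze is to pin every control in a single immutable config and fingerprint it into the run name. A sketch, with placeholder values rather than prescribed settings:

```python
from dataclasses import asdict, dataclass
import hashlib, json

@dataclass(frozen=True)
class FrozenConfig:
    # Everything here must be identical across runs in this experiment.
    dataset: str = "wikitext-2"
    seq_len: int = 1024
    n_layers: int = 12
    d_model: int = 768
    n_heads: int = 12
    optimizer: str = "adamw"
    weight_decay: float = 0.1
    dropout: float = 0.0
    precision: str = "bf16"

def config_hash(cfg: FrozenConfig) -> str:
    """Short fingerprint to verify two runs share identical controls."""
    blob = json.dumps(asdict(cfg), sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:8]
```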
4. Independent Variable(s)
What are we changing?
| Variable | Values |
|---|---|
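A small sketch for expanding that table into a full grid of runs; the variable names and values below are hypothetical:

```python
from itertools import product

# Hypothetical independent variables; mirror the table above.
sweep = {
    "batch_size": [32, 64, 128],
    "lr": [3e-4, 6e-4, 1.2e-3],
}

# One config dict per run, covering the full grid.
runs = [dict(zip(sweep, combo)) for combo in product(*sweep.values())]
```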
5. Experimental Design
Budget
- Tokens per run:
- Steps per run:
- Expected wall-clock:
- Max VRAM allowed:
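Tokens per run is determined by the other budget numbers; a quick consistency check, assuming gradient accumulation is in use (the example figures are arbitrary):

```python
def tokens_per_run(steps: int, micro_batch: int, grad_accum: int, seq_len: int) -> int:
    """Effective tokens = optimizer steps x effective batch x sequence length."""
    return steps * micro_batch * grad_accum * seq_len

# e.g. 2000 steps, micro-batch 16, 4x accumulation, seq len 1024:
assert tokens_per_run(2000, 16, 4, 1024) == 131_072_000  # ~131M tokens
```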
Evaluation Protocol
- Train loss
- Validation loss
- Perplexity
- Throughput (tokens/sec)
- Peak VRAM
- Gradient norm (global)
- (Optional) Layer-wise grad norms
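A minimal evaluation sketch covering val loss and perplexity, assuming the model returns raw logits and loss is mean token-level cross-entropy in nats:

```python
import math

import torch
import torch.nn.functional as F

@torch.no_grad()
def evaluate(model, val_loader, device="cuda"):
    """Mean token-level cross-entropy over the val set; PPL = exp(loss)."""
    model.eval()
    total_loss, total_tokens = 0.0, 0
    for x, y in val_loader:
        x, y = x.to(device), y.to(device)
        logits = model(x)
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), y.reshape(-1))
        total_loss += loss.item() * y.numel()
        total_tokens += y.numel()
    model.train()
    val_loss = total_loss / total_tokens
    return val_loss, math.exp(val_loss)

# Training throughput: tokens/sec = (effective batch * seq_len) / step time.
```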
Comparison Axis
- ☐ Equal tokens seen
- ☐ Equal optimizer steps
- ☐ Equal wall-clock time

(Choose one.)
6. Metrics to Log
Core Metrics
- Train loss vs tokens
- Val loss vs tokens
- Perplexity
- Step time
- Peak memory
Stability Metrics
- Global gradient norm
- NaNs / divergence events
- Loss spikes
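A sketch for the global norm, computed after `backward()` and before the optimizer step:

```python
import math

import torch

def global_grad_norm(model: torch.nn.Module) -> float:
    """L2 norm over all parameter gradients; call between backward() and step()."""
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.detach().float().pow(2).sum().item()
    return math.sqrt(total)
```

If the run already clips, `torch.nn.utils.clip_grad_norm_` returns the pre-clip total norm, so logging its return value costs nothing extra; a non-finite norm counts as a divergence event.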
Optional Diagnostics
- Activation norms (first/middle/last layer)
- LN weight/bias gradient norms
- Attention entropy
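Activation norms can be captured with forward hooks; a sketch that assumes the hooked modules return a single tensor, with `log_fn` standing in for whatever (key, value) logger the run uses:

```python
import torch

def attach_norm_hooks(model, layer_names, log_fn):
    """Log activation RMS at chosen layers via forward hooks.

    `layer_names` come from model.named_modules(); which modules count
    as first/middle/last depends on the model.
    """
    handles = []
    for name, module in model.named_modules():
        if name in layer_names:
            def hook(mod, args, out, name=name):
                rms = out.detach().float().pow(2).mean().sqrt().item()
                log_fn(f"act_rms/{name}", rms)
            handles.append(module.register_forward_hook(hook))
    return handles  # call h.remove() on each handle when done
```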
7. Expected Outcomes
Write what you expect before running it.
Example:
- Small batch + high LR → unstable
- Large batch + scaled LR → smoother convergence
- Diminishing returns beyond batch size X
8. Results
Raw Observations
(Write qualitative observations first.)
Quantitative Summary
| Config | Final Val Loss | Final PPL | Tokens/sec | Peak VRAM | Notes |
|---|---|---|---|---|---|
9. Interpretation
What actually happened?
- Did results match hypothesis?
- Any instability?
- Any unexpected interactions?
- Was performance compute-efficient?
10. Decision
What do we adopt going forward?
- Selected LR:
- Selected batch size:
- Selected optimizer:
- Any architectural preference:
11. Follow-up Questions
- What remains unclear?
- What needs a deeper sweep?
- Does this move to Phase-1 or require refinement?
Overfit Sanity Template (Quick Preflight)
Run this before major architecture/memory changes.
Dataset Slice:
Batch Size:
Steps Run:
Pass Criteria
- Loss → near zero
- No NaNs
- Gradients finite
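A sketch of the preflight loop that checks all three criteria; it assumes a single fixed `(x, y)` batch the model should be able to memorize, and the step count and LR are placeholders:

```python
import torch
import torch.nn.functional as F

def overfit_preflight(model, batch, steps=200, lr=1e-3):
    """Train on one fixed batch; a healthy setup drives loss toward zero."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    x, y = batch
    for step in range(steps):
        logits = model(x)
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), y.reshape(-1))
        assert torch.isfinite(loss), f"non-finite loss at step {step}"
        opt.zero_grad(set_to_none=True)
        loss.backward()
        for p in model.parameters():  # gradients-finite criterion
            if p.grad is not None:
                assert torch.isfinite(p.grad).all(), f"non-finite grad at step {step}"
        opt.step()
    return loss.item()  # pass if this lands near zero
```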
Result
- ☐ Pass
- ☐ Fail
Memory / Systems Experiment Addendum
Use this section only for systems-level changes: SDPA (scaled dot-product attention), activation checkpointing, and gradient accumulation.
Measurement Conditions
- Fixed tokens per run:
- Same seed:
- Same effective batch size:
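A measurement sketch for the step-time and peak-VRAM columns below; `train_step` is a hypothetical zero-arg closure running one full optimizer step on the fixed batch:

```python
import time

import torch

def measure_step(train_step, n_steps=50, device="cuda"):
    """Average step time and peak VRAM under the fixed conditions above."""
    torch.cuda.reset_peak_memory_stats(device)
    torch.cuda.synchronize(device)
    t0 = time.perf_counter()
    for _ in range(n_steps):
        train_step()  # forward + backward + optimizer step
    torch.cuda.synchronize(device)
    step_time = (time.perf_counter() - t0) / n_steps
    peak_gib = torch.cuda.max_memory_allocated(device) / 2**30
    return step_time, peak_gib
```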
Compare
| Config | Step Time | Tokens/sec | Peak VRAM | Final Val Loss |
|---|---|---|---|---|
Conclusion
- Memory saved:
- Speed change:
- Any training behavior difference: