0. Experiment Meta

Experiment Name:
Category: (Hyperparam / Optimizer / Memory / Architecture / Training Recipe)
Date:
Commit Hash / Branch:
Seed(s):


1. Objective

What single question is this experiment answering?

Example:
Does increasing batch size allow a proportionally larger stable learning rate?
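To make the question concrete, the linear LR scaling rule under test can be sketched as follows (function name and numbers are illustrative, not a confirmed recipe):

```python
# Linear scaling rule: scale the learning rate proportionally to batch size.
def scaled_lr(base_lr: float, base_batch: int, batch_size: int) -> float:
    return base_lr * (batch_size / base_batch)

# Doubling the batch doubles the candidate LR.
print(scaled_lr(3e-4, 32, 64))  # 0.0006
```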


2. Hypothesis

Mechanism: [what process are we claiming is happening]
Prediction: [what specific metrics will show this, and in what direction]
Falsification: [what would we see if the hypothesis is wrong]

3. Controlled Variables (Frozen)

Everything below must remain identical across runs:

  • Dataset: Wikitext-2

  • Tokenizer config:

  • Model config (layers, d_model, heads, etc.):

  • Sequence length:

  • Effective tokens per run:

  • Optimizer (unless this is the optimizer experiment):

  • Weight decay:

  • Dropout:

  • Scheduler shape:

  • Mixed precision setting:

  • Hardware (GPU type):

  • Seed policy:
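A minimal seed-policy helper, as a sketch (the torch/numpy calls are guarded so the helper also runs where those libraries are absent):

```python
import os
import random

def set_seed(seed: int) -> None:
    """Pin every RNG the run depends on to one seed."""
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass

# Same seed, same draw:
set_seed(1234)
a = random.random()
set_seed(1234)
b = random.random()
assert a == b
```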


4. Independent Variable(s)

What are we changing?

Variable | Values

5. Experimental Design

Budget

  • Tokens per run:

  • Steps per run:

  • Expected wall-clock:

  • Max VRAM allowed:

Evaluation Protocol

  • Train loss

  • Validation loss

  • Perplexity

  • Throughput (tokens/sec)

  • Peak VRAM

  • Gradient norm (global)

  • (Optional) Layer-wise grad norms
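The derived metrics in the protocol above can be computed as follows (helper names are illustrative; per-parameter gradient norms are assumed to be collected elsewhere):

```python
import math

def perplexity(mean_ce_loss: float) -> float:
    """Perplexity is exp of the mean cross-entropy loss (in nats)."""
    return math.exp(mean_ce_loss)

def tokens_per_sec(tokens: int, wall_seconds: float) -> float:
    return tokens / wall_seconds

def global_grad_norm(per_param_norms: list[float]) -> float:
    """Global L2 norm = sqrt of the sum of squared per-parameter norms."""
    return math.sqrt(sum(n * n for n in per_param_norms))

print(round(perplexity(math.log(20.0)), 6))  # 20.0
print(global_grad_norm([3.0, 4.0]))          # 5.0
```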

Comparison Axis

  • ☐ Equal tokens seen

  • ☐ Equal optimizer steps

  • ☐ Equal wall-clock time

(Choose one.)
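Whichever axis is chosen, the budgets are related by tokens = steps × batch size × sequence length (assuming no padding). A small helper for equating token budgets, as a sketch:

```python
# Convert a fixed token budget into optimizer steps for a given config.
def steps_for_token_budget(token_budget: int, batch_size: int, seq_len: int) -> int:
    return token_budget // (batch_size * seq_len)

print(steps_for_token_budget(10_000_000, 32, 512))  # 610
```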


6. Metrics to Log

Core Metrics

  • Train loss vs tokens

  • Val loss vs tokens

  • Perplexity

  • Step time

  • Peak memory

Stability Metrics

  • Global gradient norm

  • NaNs / divergence events

  • Loss spikes
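The stability metrics above can be flagged automatically; a sketch of divergence and spike detection (the spike threshold is an assumption, tune it per experiment):

```python
import math

def is_divergence(loss: float) -> bool:
    """A NaN or infinite loss counts as a divergence event."""
    return math.isnan(loss) or math.isinf(loss)

def is_spike(loss: float, running_mean: float, tolerance: float = 2.0) -> bool:
    """Flag a step whose loss exceeds the running mean by a fixed factor."""
    return loss > tolerance * running_mean

assert is_divergence(float("nan"))
assert is_spike(9.0, 4.0)       # 9 > 2 * 4
assert not is_spike(4.5, 4.0)   # within tolerance
```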

Optional Diagnostics

  • Activation norms (first/middle/last layer)

  • LN weight/bias gradient norms

  • Attention entropy
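For the attention-entropy diagnostic: the entropy of one query's attention distribution is H = -Σ p·log p. Uniform attention over n keys gives log(n); a collapsed one-hot pattern gives ~0. A minimal sketch:

```python
import math

def attention_entropy(probs: list[float], eps: float = 1e-12) -> float:
    """Shannon entropy (nats) of one attention distribution."""
    return -sum(p * math.log(p + eps) for p in probs)

print(round(attention_entropy([0.25] * 4), 4))  # 1.3863 (= log 4)
```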


7. Expected Outcomes

Write what you expect before running it.

Example:

  • Small batch + high LR → unstable

  • Large batch + scaled LR → smoother convergence

  • Diminishing returns beyond batch X


8. Results

Raw Observations

(Write qualitative observations first.)

Quantitative Summary

Config | Final Val Loss | Final PPL | Tokens/sec | Peak VRAM (GB) | Notes

9. Interpretation

What actually happened?

  • Did results match hypothesis?

  • Any instability?

  • Any unexpected interactions?

  • Was performance compute-efficient?


10. Decision

What do we adopt going forward?

  • Selected LR:

  • Selected batch size:

  • Selected optimizer:

  • Any architectural preference:


11. Follow-up Questions

  • What remains unclear?

  • What needs a deeper sweep?

  • Does this move to Phase-1 or require refinement?



Overfit Sanity Template (Quick Preflight)

Run this before any major architecture or memory change.

Dataset Slice:
Batch Size:
Steps Run:

Pass Criteria

  • Loss → near zero

  • No NaNs

  • Gradients finite
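The pass criteria above can be checked mechanically; a sketch, where the near-zero threshold is an assumption to adjust per setup:

```python
import math

def overfit_pass(final_loss: float, grad_norms: list[float],
                 loss_threshold: float = 0.05) -> bool:
    """Pass iff loss is finite and near zero and all gradient norms are finite."""
    loss_ok = math.isfinite(final_loss) and final_loss < loss_threshold
    grads_ok = all(math.isfinite(g) for g in grad_norms)
    return loss_ok and grads_ok

assert overfit_pass(0.01, [0.3, 0.1])            # pass
assert not overfit_pass(float("nan"), [0.3])     # NaN loss -> fail
assert not overfit_pass(0.5, [0.3])              # loss not near zero -> fail
```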

Result

  • ☐ Pass

  • ☐ Fail



Memory / Systems Experiment Addendum

Use this section only for memory/systems experiments: SDPA (scaled dot-product attention) backends, activation checkpointing, and gradient accumulation.

Measurement Conditions

  • Fixed tokens per run:

  • Same seed:

  • Same effective batch:

Compare

Config | Step Time (ms) | Tokens/sec | Peak VRAM (GB) | Final Val Loss

Conclusion

  • Memory saved:

  • Speed change:

  • Any training behavior difference:
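Memory saved and speed change can be summarized as percentage deltas against the baseline row of the Compare table; a sketch (the sign convention is an assumption: negative means the variant is lower than the baseline):

```python
def pct_change(baseline: float, variant: float) -> float:
    """Percentage change of variant relative to baseline."""
    return 100.0 * (variant - baseline) / baseline

# e.g. a variant using 80 GB-equivalents vs a 100 baseline:
print(pct_change(100.0, 80.0))  # -20.0 (i.e. 20% memory saved)
```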