This is the README for Baseline/phase-0; progress and notes for this phase are documented here.
Goals
- Trace and understand lineage of architectural evolution and design choices.
- Implement a modern-minimal baseline language model
- Train it auto-regressively.
- Track/care about evals.
- Make it generate text!!!
Progress
- 1 Lineage of the Transformer
- 2 Implementing a Baseline Model
- 3 Training GPT-2 with bigger data
- 4 Baseline Experiments
Note: Find links and other useful resources for this phase in Resources.
Completed ✔️
Conclusions & Locked Configs
These are the empirically validated settings from Phase 0 experiments. Carry these forward as Phase 1 defaults unless explicitly ablating them.
Architecture
| Parameter | Value | Source |
|---|---|---|
| norm_placement | pre | pre-vs-post-layer-norm-2 group |
| attention_impl | sdpa | Memory & GPU experiments |
| dtype | bfloat16 | All runs |
| torch_compile | True, mode=default | Activation checkpointing experiments |
Norm placement: Pre-LN with a final decoder LN converges stably without warmup. Post-LN requires warmup and produces ~10 orders of magnitude variation in per-layer gradient norms on deep models without it. Pre-LN gradient norms are O(1) across layers with an expected spike at layer 0 (residual highway from unnormalized embeddings). See Pre vs Post Norms, With & Without Warmups.
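As an illustrative sketch (not the repo's actual module, and all names here are made up), a Pre-LN block applies LayerNorm before each sublayer while the residual stream itself stays unnormalized — which is exactly why layer 0 sees the spike from raw embeddings:

```python
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Illustrative Pre-LN transformer block: norm *before* each sublayer,
    so the residual stream remains an unnormalized highway."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        h = self.ln1(x)                                  # norm only the sublayer input
        a, _ = self.attn(h, h, h, need_weights=False)
        x = x + a                                        # residual add, no norm on the highway
        x = x + self.mlp(self.ln2(x))
        return x

x = torch.randn(2, 8, 16)          # (batch, seq, d_model)
print(PreLNBlock(16, 4)(x).shape)  # torch.Size([2, 8, 16])
```

A final LayerNorm after the last block (the "final decoder LN" in the locked config) then normalizes the stream once before the LM head.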
Training Recipe
| Parameter | Value | Source |
|---|---|---|
| optimizer | AdamW | Optimizer sweep |
| lr_warmup_steps | 0 (with Pre-LN) | Pre vs Post LN experiment |
| weight_decay | 0 (revisit for Phase 1 scale) | Weight decay experiments |
| accumulation_steps | 1 | Gradient quality experiment |
Gradient accumulation: On wikitext-2, gradient coherence is >0.999 at all effective batch sizes tested. The critical batch size is below mb=28, meaning accumulation provides no quality benefit on this dataset. This finding is dataset-specific — revisit on larger, more heterogeneous datasets in Phase 1. Run group: 4-effective-batch-fixed-lr-20260316-151845-575080.
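As a minimal sketch of what "coherence" measures here (not the repo's implementation): cosine similarity between gradients from different microbatches. Values near 1 mean the microbatches point in essentially the same direction, so accumulating them mostly rescales a step you could have taken anyway:

```python
import numpy as np

def grad_coherence(g1: np.ndarray, g2: np.ndarray) -> float:
    """Cosine similarity between two flattened gradient vectors."""
    return float(np.dot(g1, g2) / (np.linalg.norm(g1) * np.linalg.norm(g2)))

# Two microbatch gradients that are nearly parallel -> coherence near 1,
# i.e. we are below the critical batch size and accumulation adds little.
g = np.random.default_rng(0).standard_normal(1_000)
g_noisy = g + 0.01 * np.random.default_rng(1).standard_normal(1_000)
print(grad_coherence(g, g_noisy))  # very close to 1.0
```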
Memory & Compute (A100 40GB, d_model=768, seq_len=1024, n_layers=12)
⚠️ Note on measurement validity: Three frontier sweeps were run. The first (`memory-frontier-step1-compile-budget-20260313-091942-166699`) had `compile_warmup_steps=0` and `data_fraction=0.1`, making timing metrics unreliable due to JIT compile bleed. Groups 3 and 4 below are the valid results.
No-AC frontier (canonical, summary_stage=final):
W&B group: 4-memory-frontier-step1-compile-20260314-143138-930527
| mb | peak_reserved | Notes |
|---|---|---|
| 16 | 22.5 GiB | Safe, lower throughput |
| 28 | 38.1 GiB | Fast but ~2 GiB OOM margin — risky |
| 30 | 38.95 GiB | Effectively at the limit |
| 32+ | OOM | — |
AC budget frontier:
W&B group: 3-memory-frontier-step1-compile-budget-20260314-091315-256427
| AC budget | mb | peak_reserved | Relative throughput |
|---|---|---|---|
| 0.12 | 32 | 25.7 GiB | Fast |
| 0.38 | 32 | 30.2 GiB | Fast |
| 0.75 | 32 | 34.2 GiB | Fastest |
| 0.75 | 40+ | OOM | — |
Recommended operating point for Phase 1: activation_memory_budget=0.75, micro_batch_size=32. Gives the highest throughput with 6 GiB headroom. Validate with a fresh 200-step timing run before committing a long training job, as the AC timing data came from a different host than the final no-AC validation.
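If the AC budget here maps onto the compile-time recomputation budget in recent PyTorch builds, setting the operating point might look like the following config fragment. The `torch._functorch.config.activation_memory_budget` knob is experimental and its exact name/location is an assumption, not something confirmed by this doc:

```python
import torch
import torch._functorch.config as functorch_config

# Experimental knob (assumed name, recent PyTorch): fraction of no-AC
# activation memory the partitioner may keep; 1.0 = keep all, 0.0 = recompute all.
functorch_config.activation_memory_budget = 0.75

# model = torch.compile(model)  # the min-cut partitioner then honors the budget
```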
Note on `avg_step_time_ms_epoch` vs wall-clock: This metric excludes compile warmup steps (the first 3 steps are dramatically slower during JIT compilation). Epoch-level wall-clock time is the more reliable throughput estimate. The ~2–4× discrepancy between step-average and wall-clock is compile overhead, not measurement error.
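The warmup exclusion can be sketched like this (step counts and timings below are illustrative, not measured values from these runs):

```python
def avg_step_time_ms(step_times_ms, warmup_steps=3):
    """Mean step time excluding the first `warmup_steps` (JIT compile) steps."""
    steady = step_times_ms[warmup_steps:]
    return sum(steady) / len(steady)

# First 3 steps dominated by torch.compile; the rest are steady-state.
times = [9500.0, 4200.0, 1800.0, 310.0, 305.0, 308.0, 302.0]
print(avg_step_time_ms(times))  # steady-state average: 306.25 ms
print(sum(times) / len(times))  # naive average, inflated by compile overhead
```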
What Doesn’t Work / Dataset Limitations
- wikitext-2 is too small for architectural ablations. At ~2M tokens, it yields only ~30 optimizer steps per epoch at the Phase 1 config. Architectural differences (RoPE vs ALiBi, MHA vs GQA) will not surface as measurable quality differences at this scale. Move to OpenWebText or FineWeb-Edu for Phase 1 Stage 2+.
- `peak_allocated` is unreliable with torch.compile. Use `peak_reserved` for OOM proximity assessment. `peak_allocated` is non-monotonic due to compile-time scratch buffer inflation before the measurement window. Fix: call `torch.cuda.reset_peak_memory_stats()` after warmup steps.
- Loss-vs-tokens comparisons require LR scaling. Without Goyal et al. linear scaling (or the McCandlish et al. sqrt rule), loss-vs-tokens conflates batch size and update frequency effects. Comparisons across effective batch sizes are only valid when LR is co-scaled.
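The two LR scaling rules mentioned above can be sketched as simple helpers (the base LR and batch sizes below are illustrative, not values from these experiments):

```python
def linear_lr(base_lr: float, base_bs: int, new_bs: int) -> float:
    """Linear scaling (Goyal et al.): LR grows proportionally with batch size."""
    return base_lr * new_bs / base_bs

def sqrt_lr(base_lr: float, base_bs: int, new_bs: int) -> float:
    """Square-root scaling: LR grows with sqrt of the batch-size ratio."""
    return base_lr * (new_bs / base_bs) ** 0.5

# Doubling the effective batch from 16 to 32:
print(linear_lr(3e-4, 16, 32))  # 0.0006
print(sqrt_lr(3e-4, 16, 32))    # ≈ 4.24e-4
```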
Closing Thoughts
Going through this phase has helped me fundamentally understand and experience what it takes to pipeline data, translate research reading into code, train models on cloud GPUs, and really learn to read the learning curves on W&B.
I also set up the repo structure and ran other meta-experiments on how things should be organized along the way, which took some extra time but was well worth it.
I move forward knowing that I can confidently understand and train simple models, implement research ideas from papers, and see them through to the metrics. Having covered the baseline, I'm excited to delve deeper and wider into the optimizations the transformer has gone through 🚀