This is the “Readme” for Baseline/phase-0; progress and notes are documented here.

Goals

  1. Trace and understand the lineage of architectural evolution and design choices.
  2. Implement a modern, minimal baseline language model.
  3. Train it auto-regressively.
  4. Track and care about evals.
  5. Make it generate text!

Progress

Note: Find links and other useful resources for this phase in Resources.

Completed ✔️


Conclusions & Locked Configs

These are the empirically validated settings from Phase 0 experiments. Carry these forward as Phase 1 defaults unless explicitly ablating them.

Architecture

| Parameter | Value | Source |
|---|---|---|
| norm_placement | pre | pre-vs-post-layer-norm-2 group |
| attention_impl | sdpa | Memory & GPU experiments |
| dtype | bfloat16 | All runs |
| torch_compile | True, mode=default | Activation checkpointing experiments |

Norm placement: Pre-LN with a final decoder LN converges stably without warmup. Post-LN requires warmup and produces ~10 orders of magnitude variation in per-layer gradient norms on deep models without it. Pre-LN gradient norms are O(1) across layers with an expected spike at layer 0 (residual highway from unnormalized embeddings). See Pre vs Post Norms, With & Without Warmups.
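The Pre-LN ordering described above can be sketched as follows. This is a minimal illustration, not the repo's actual module: the class and layer names are made up, and the causal mask is omitted for brevity. The key point is that LayerNorm sits *before* each sublayer while the residual path stays unnormalized, which is why gradients flow directly to layer 0.

```python
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Illustrative Pre-LN transformer block (causal mask omitted for brevity).

    Normalize before each sublayer, add the residual after. The residual
    path is never normalized, giving the "residual highway" from the
    unnormalized embeddings noted above.
    """

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # pre-norm attention
        x = x + self.mlp(self.ln2(x))                      # pre-norm MLP
        return x

# The decoder then applies one *final* LayerNorm after the last block,
# i.e. logits = lm_head(final_ln(blocks(x))).
```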

Training Recipe

| Parameter | Value | Source |
|---|---|---|
| optimizer | AdamW | Optimizer sweep |
| lr_warmup_steps | 0 (with Pre-LN) | Pre vs Post LN experiment |
| weight_decay | 0 (revisit for Phase 1 scale) | Weight decay experiments |
| accumulation_steps | 1 | Gradient quality experiment |
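The locked settings from both tables can be collected into a single defaults mapping to carry into Phase 1. The key names and structure here are illustrative, not the repo's actual config schema:

```python
# Phase 0 locked defaults, carried into Phase 1 unless explicitly ablated.
# Key names are illustrative; adapt to the repo's actual config schema.
PHASE1_DEFAULTS = {
    # architecture
    "norm_placement": "pre",        # Pre-LN + final decoder LN; no warmup needed
    "attention_impl": "sdpa",
    "dtype": "bfloat16",
    "torch_compile": {"enabled": True, "mode": "default"},
    # training recipe
    "optimizer": "adamw",
    "lr_warmup_steps": 0,           # valid only together with Pre-LN
    "weight_decay": 0.0,            # revisit at Phase 1 scale
    "accumulation_steps": 1,        # no quality benefit on wikitext-2
}
```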

Gradient accumulation: On wikitext-2, gradient coherence is >0.999 at all effective batch sizes tested. The critical batch size is below mb=28, meaning accumulation provides no quality benefit on this dataset. This finding is dataset-specific — revisit on larger, more heterogeneous datasets in Phase 1. Run group: 4-effective-batch-fixed-lr-20260316-151845-575080.
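Gradient coherence here can be read as the cosine similarity between per-microbatch gradients: values near 1.0 mean the microbatches agree, so accumulating them adds little beyond noise reduction. A minimal pure-Python sketch of that metric (the repo's actual implementation may differ):

```python
import math

def cosine(u, v):
    """Cosine similarity between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def coherence(grads):
    """Mean pairwise cosine similarity across per-microbatch gradients.

    Near 1.0: microbatches point the same way, so a larger effective
    batch (more accumulation) buys no extra signal on this data.
    """
    pairs = [(i, j) for i in range(len(grads)) for j in range(i + 1, len(grads))]
    return sum(cosine(grads[i], grads[j]) for i, j in pairs) / len(pairs)

# Two nearly identical microbatch gradients -> coherence close to 1.
print(coherence([[1.0, 2.0, 3.0], [1.1, 2.0, 2.9]]))
```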

Memory & Compute (A100 40GB, d_model=768, seq_len=1024, n_layers=12)

⚠️ Note on measurement validity: Three frontier sweeps were run. The first (memory-frontier-step1-compile-budget-20260313-091942-166699) had compile_warmup_steps=0 and data_fraction=0.1, making timing metrics unreliable due to JIT compile bleed. Groups 3 and 4 below are the valid results.

No-AC frontier (canonical, summary_stage=final): W&B group: 4-memory-frontier-step1-compile-20260314-143138-930527

| mb | peak_reserved | Notes |
|---|---|---|
| 16 | 22.5 GiB | Safe, lower throughput |
| 28 | 38.1 GiB | Fast, but only ~2 GiB OOM margin (risky) |
| 30 | 38.95 GiB | Effectively at the limit |
| 32+ | OOM | |

AC budget frontier: W&B group: 3-memory-frontier-step1-compile-budget-20260314-091315-256427

| AC budget | mb | peak_reserved | Relative throughput |
|---|---|---|---|
| 0.1 | 32 | 25.7 GiB | Fast |
| 0.38 | 32 | 30.2 GiB | Fast |
| 0.75 | 32 | 34.2 GiB | Fastest |
| 0.75 | 40+ | OOM | |

Recommended operating point for Phase 1: activation_memory_budget=0.75, micro_batch_size=32. Gives the highest throughput with 6 GiB headroom. Validate with a fresh 200-step timing run before committing a long training job, as the AC timing data came from a different host than the final no-AC validation.
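If the AC budget maps onto PyTorch's memory-budget partitioner knob (an assumption; the repo may wire this up differently), the recommended operating point might be set roughly like this. This is a config fragment, not runnable as-is:

```python
import torch

# Recommended Phase 1 operating point (A100 40GB, d_model=768, seq_len=1024).
# Re-validate with a ~200-step timing run on the target host before
# committing a long training job (the AC timing data came from another host).
torch._functorch.config.activation_memory_budget = 0.75  # 1.0 = no checkpointing

model = ...  # the Phase 1 decoder
compiled = torch.compile(model, mode="default")
micro_batch_size = 32
```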

Note on avg_step_time_ms_epoch vs wall-clock: This metric excludes compile warmup steps (the first 3 steps are dramatically slower during JIT compilation). Epoch-level wall-clock time is the more reliable throughput estimate. The ~2–4× discrepancy between step-average and wall-clock is compile overhead, not measurement error.
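One way to keep both numbers honest is to record per-step durations and report the steady-state average alongside total wall-clock, with the warmup steps excluded only from the former. The helper below is an illustrative sketch, not the repo's actual instrumentation:

```python
import time

COMPILE_WARMUP_STEPS = 3  # first steps include JIT compile time

def timed_steps(step_fn, n_steps, warmup=COMPILE_WARMUP_STEPS):
    """Run n_steps of step_fn.

    Returns (steady_avg_ms, wall_ms): the per-step average *excluding*
    the warmup steps, and total wall-clock *including* them. The gap
    between the two is compile overhead, not measurement error.
    """
    durations = []
    t0 = time.perf_counter()
    for _ in range(n_steps):
        t = time.perf_counter()
        step_fn()
        durations.append((time.perf_counter() - t) * 1000)
    wall_ms = (time.perf_counter() - t0) * 1000
    steady = durations[warmup:]
    return sum(steady) / len(steady), wall_ms
```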

What Doesn’t Work / Dataset Limitations

  • wikitext-2 is too small for architectural ablations. At ~2M tokens, it yields only ~30 optimizer steps per epoch at the Phase 1 config. Architectural differences (RoPE vs ALiBi, MHA vs GQA) will not surface at this quality level. Move to OpenWebText or FineWeb-Edu for Phase 1 Stage 2+.
  • peak_allocated is unreliable with torch.compile. Use peak_reserved for OOM proximity assessment. peak_allocated is non-monotonic due to compile-time scratch buffer inflation before the measurement window. Fix: call torch.cuda.reset_peak_memory_stats() after warmup steps.
  • Loss-vs-tokens comparisons require LR scaling. Without Goyal et al. linear scaling (or McCandlish et al. sqrt rule), loss-vs-tokens conflates batch size and update frequency effects. Comparisons across effective batch sizes are only valid when LR is co-scaled.
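The peak_allocated fix above can be sketched as a measurement harness: reset the peak stats only after the compile warmup steps, then read peak_reserved for the OOM-proximity number. This is a GPU-only fragment with assumed names (`train_step`, `measure_steps`), not runnable as-is:

```python
import torch

# Reset peak stats *after* compile warmup so compile-time scratch buffers
# don't inflate the measurement window.
for _ in range(3):              # compile warmup steps
    train_step()                # assumed name for one optimizer step
torch.cuda.synchronize()
torch.cuda.reset_peak_memory_stats()

for _ in range(measure_steps):  # assumed steady-state step count
    train_step()
torch.cuda.synchronize()

# Use peak_reserved (not peak_allocated) to judge OOM proximity.
peak_reserved_gib = torch.cuda.max_memory_reserved() / 2**30
```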

Closing Thoughts

Going through this phase has helped me fundamentally understand, learn, and experience what it feels like to pipeline data, translate research reading into code, train models on cloud GPUs, and really learn to read the “learning curves” on wandb.

I also set up the repo structure and ran other meta-experiments along the way on how things should be organized, which took some extra time but was totally worth it.

I move forward knowing that I can confidently understand and train simple models, implement research ideas from papers, and see them through to the metrics. Having covered the baseline, I’m excited to delve deeper and wider into the different optimizations the transformer has gone through. Onwards 🚀