This is the README for Baseline/phase-0; progress and notes for this phase are documented here.
Goals
- Trace and understand lineage of architectural evolution and design choices.
- Implement a modern-minimal baseline language model
- Train it auto-regressively.
- Track/care about evals.
- Make it generate text!!!
Progress
- 1 Lineage of the Transformer
- 2 Implementing a Baseline Model
- 3 Training GPT-2 with bigger data
- 4 Baseline Experiments
Note: Find links and other useful resources for this phase in Resources.
Completed ✔️
Conclusions & Locked Configs
These are the empirically validated settings from Phase 0 experiments. Carry these forward as Phase 1 defaults unless explicitly ablating them.
Architecture
| Parameter | Value | Source |
|---|---|---|
| norm_placement | pre | pre-vs-post-layer-norm-2 group |
| attention_impl | sdpa | Memory & GPU experiments |
| dtype | bfloat16 | All runs |
| torch_compile | True, mode=default | Activation checkpointing experiments |
Norm placement: Pre-LN with a final decoder LN converges stably without warmup. Post-LN requires warmup and produces ~10 orders of magnitude variation in per-layer gradient norms on deep models without it. Pre-LN gradient norms are O(1) across layers with an expected spike at layer 0 (residual highway from unnormalized embeddings). See Pre vs Post Norms, With & Without Warmups.
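As an illustrative sketch (not the repo's actual module, and all names here are made up), a Pre-LN block applies LayerNorm before each sublayer while the residual stream itself stays unnormalized — which is exactly why layer 0 sees the spike from raw embeddings:

```python
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Illustrative Pre-LN transformer block: norm *before* each sublayer,
    so the residual stream remains an unnormalized highway."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        h = self.ln1(x)                                  # norm only the sublayer input
        a, _ = self.attn(h, h, h, need_weights=False)
        x = x + a                                        # residual add, no norm on the highway
        x = x + self.mlp(self.ln2(x))
        return x

x = torch.randn(2, 8, 16)          # (batch, seq, d_model)
print(PreLNBlock(16, 4)(x).shape)  # torch.Size([2, 8, 16])
```

A final LayerNorm after the last block (the "final decoder LN" in the locked config) then normalizes the stream once before the LM head.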
Training Recipe
| Parameter | Value | Source |
|---|---|---|
| optimizer | AdamW | Optimizer sweep |
| lr_warmup_steps | 0 (with Pre-LN) | Pre vs Post LN experiment |
| weight_decay | 0 (revisit for Phase 1 scale) | Weight decay experiments |
| accumulation_steps | 1 | Gradient quality experiment |
Gradient accumulation: On wikitext-2, gradient coherence is >0.999 at all effective batch sizes tested. The critical batch size is below mb=28, meaning accumulation provides no quality benefit on this dataset. This finding is dataset-specific — revisit on larger, more heterogeneous datasets in Phase 1. Run group: 4-effective-batch-fixed-lr-20260316-151845-575080.
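As a minimal sketch of what "coherence" measures here (not the repo's implementation): cosine similarity between gradients from different microbatches. Values near 1 mean the microbatches point in essentially the same direction, so accumulating them mostly rescales a step you could have taken anyway:

```python
import numpy as np

def grad_coherence(g1: np.ndarray, g2: np.ndarray) -> float:
    """Cosine similarity between two flattened gradient vectors."""
    return float(np.dot(g1, g2) / (np.linalg.norm(g1) * np.linalg.norm(g2)))

# Two microbatch gradients that are nearly parallel -> coherence near 1,
# i.e. we are below the critical batch size and accumulation adds little.
g = np.random.default_rng(0).standard_normal(1_000)
g_noisy = g + 0.01 * np.random.default_rng(1).standard_normal(1_000)
print(grad_coherence(g, g_noisy))  # very close to 1.0
```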
Memory & Compute (A100 40GB, d_model=768, seq_len=1024, n_layers=12)
⚠️ Note on measurement validity: Three frontier sweeps were run. The first (`memory-frontier-step1-compile-budget-20260313-091942-166699`) had `compile_warmup_steps=0` and `data_fraction=0.1`, making timing metrics unreliable due to JIT compile bleed. Groups 3 and 4 below are the valid results.
No-AC frontier (canonical, summary_stage=final):
W&B group: 4-memory-frontier-step1-compile-20260314-143138-930527
| mb | peak_reserved | Notes |
|---|---|---|
| 16 | 22.5 GiB | Safe, lower throughput |
| 28 | 38.1 GiB | Fast but ~2 GiB OOM margin — risky |
| 30 | 38.95 GiB | Effectively at the limit |
| 32+ | OOM | — |
AC budget frontier:
W&B group: 3-memory-frontier-step1-compile-budget-20260314-091315-256427
| AC budget | mb | peak_reserved | Relative throughput |
|---|---|---|---|
| 0.12 | 32 | 25.7 GiB | Fast |
| 0.38 | 32 | 30.2 GiB | Fast |
| 0.75 | 32 | 34.2 GiB | Fastest |
| 0.75 | 40+ | OOM | — |
Recommended operating point for Phase 1: activation_memory_budget=0.75, micro_batch_size=32. Gives the highest throughput with 6 GiB headroom. Validate with a fresh 200-step timing run before committing a long training job, as the AC timing data came from a different host than the final no-AC validation.
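If the AC budget here maps onto the compile-time recomputation budget in recent PyTorch builds, setting the operating point might look like the following config fragment. The `torch._functorch.config.activation_memory_budget` knob is experimental and its exact name/location is an assumption, not something confirmed by this doc:

```python
import torch
import torch._functorch.config as functorch_config

# Experimental knob (assumed name, recent PyTorch): fraction of no-AC
# activation memory the partitioner may keep; 1.0 = keep all, 0.0 = recompute all.
functorch_config.activation_memory_budget = 0.75

# model = torch.compile(model)  # the min-cut partitioner then honors the budget
```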
Note on `avg_step_time_ms_epoch` vs wall-clock: This metric excludes compile warmup steps (the first 3 steps are dramatically slower during JIT compilation). Epoch-level wall-clock time is the more reliable throughput estimate. The ~2–4× discrepancy between step-average and wall-clock is compile overhead, not measurement error.
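The warmup exclusion can be sketched like this (step counts and timings below are illustrative, not measured values from these runs):

```python
def avg_step_time_ms(step_times_ms, warmup_steps=3):
    """Mean step time excluding the first `warmup_steps` (JIT compile) steps."""
    steady = step_times_ms[warmup_steps:]
    return sum(steady) / len(steady)

# First 3 steps dominated by torch.compile; the rest are steady-state.
times = [9500.0, 4200.0, 1800.0, 310.0, 305.0, 308.0, 302.0]
print(avg_step_time_ms(times))  # steady-state average: 306.25 ms
print(sum(times) / len(times))  # naive average, inflated by compile overhead
```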
What Doesn’t Work / Dataset Limitations
- wikitext-2 is too small for architectural ablations. At ~2M tokens, it yields only ~30 optimizer steps per epoch at the Phase 1 config. Architectural differences (RoPE vs ALiBi, MHA vs GQA) will not surface as measurable quality differences at this scale. Move to OpenWebText or FineWeb-Edu for Phase 1 Stage 2+.
- `peak_allocated` is unreliable with torch.compile. Use `peak_reserved` for OOM proximity assessment. `peak_allocated` is non-monotonic due to compile-time scratch buffer inflation before the measurement window. Fix: call `torch.cuda.reset_peak_memory_stats()` after warmup steps.
- Loss-vs-tokens comparisons require LR scaling. Without Goyal et al. linear scaling (or the McCandlish et al. sqrt rule), loss-vs-tokens conflates batch size and update frequency effects. Comparisons across effective batch sizes are only valid when LR is co-scaled.
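The two LR scaling rules mentioned above can be sketched as simple helpers (the base LR and batch sizes below are illustrative, not values from these experiments):

```python
def linear_lr(base_lr: float, base_bs: int, new_bs: int) -> float:
    """Linear scaling (Goyal et al.): LR grows proportionally with batch size."""
    return base_lr * new_bs / base_bs

def sqrt_lr(base_lr: float, base_bs: int, new_bs: int) -> float:
    """Square-root scaling: LR grows with sqrt of the batch-size ratio."""
    return base_lr * (new_bs / base_bs) ** 0.5

# Doubling the effective batch from 16 to 32:
print(linear_lr(3e-4, 16, 32))  # 0.0006
print(sqrt_lr(3e-4, 16, 32))    # ≈ 4.24e-4
```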
Closing Thoughts
Going through this phase has helped me fundamentally understand and experience what it takes to pipeline data, translate research reading into code, train models on cloud GPUs, and really learn to read the learning curves on W&B.
I also set up the repo structure and ran other meta-experiments on how things should be organized along the way, which took some extra time but was well worth it.
I move forward knowing that I can confidently understand and train simple models, implement research ideas from papers, and see them through to the metrics. Having covered the baseline, I'm excited to delve deeper and wider into the optimizations the transformer has gone through 🚀