The below is a set of recommendations by ChatGPT which will walk me through the evolution of the architecture and help recognise the key axes of improvement. Seems like a fair list, let's start reading! I will be dropping notes for each on my Papers
Recommended Reading (In Order)
- Attention Is All You Need (refresh) - Notes ✅
- GPT-2 paper ✅
- “On Layer Normalization in the Transformer Architecture” ✅
- Paper
- Notes - Layer Norm in Transformers
- GLU Variants Improve Transformer ✅
- Notes - GLU Improves Transformers
- RoFormer (RoPE) ✅
- ALiBi paper ✅
- LLaMA paper ✅ (would recommend reading this after Chinchilla)
- Chinchilla paper ✅
That sequence gives you the evolutionary logic.
The Mental Model You Should Walk Away With
Transformer architecture evolved along four axes:
- Stability (post-norm → pre-norm → RMSNorm)
- Activation efficiency (ReLU → GELU → SwiGLU)
- Positional encoding robustness (absolute → RoPE → ALiBi)
- Compute efficiency (scaling laws, systems optimizations)
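Two of these axes are easy to make concrete in code. Below is a minimal numpy sketch (function names are mine, not from any particular codebase) contrasting LayerNorm with RMSNorm, and showing the SwiGLU feed-forward block that replaced plain ReLU/GELU MLPs:

```python
import numpy as np

def layer_norm(x, g, b, eps=1e-5):
    # Classic LayerNorm: centre, normalise by variance, then affine transform.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return g * (x - mu) / np.sqrt(var + eps) + b

def rms_norm(x, g, eps=1e-5):
    # RMSNorm drops the mean-centring and the bias term:
    # cheaper per token, and empirically just as stable.
    rms = np.sqrt((x * x).mean(-1, keepdims=True) + eps)
    return g * x / rms

def swiglu_ffn(x, W, V, W2):
    # SwiGLU feed-forward: a Swish-gated linear unit instead of ReLU/GELU.
    # One projection (x @ W) is passed through Swish and gates the other (x @ V).
    swish = lambda z: z / (1.0 + np.exp(-z))  # SiLU / Swish activation
    return (swish(x @ W) * (x @ V)) @ W2
```

Note how RMSNorm is strictly *less* code than LayerNorm, which fits the broader theme: several of these "improvements" are deletions.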
Post-Thoughts
Reading through the above list gives a fairly sufficient understanding of how the transformer architecture has evolved over the years. The main theme has been algorithmic efficiency and empirical findings on optimal hyperparameters, for both the model and the training recipe. A lot of work also goes into making learning stable, which in turn allows longer and bigger training runs.
It is now fairly obvious that most of the challenges we see today lie in "engineering" these systems: solving systems-type problems like sharded training (a model obviously no longer fits on a single GPU) and moving huge amounts of data to the right nodes, etc. The secret sauce seems to be training large models on insanely large data for a large amount of time on an insanely large number of GPUs.
Also, simplicity seems to be key, as most of the complexity somehow gets handled by scaling up. For example, ALiBi is far simpler than RoPE, sinusoidal encodings, or learned positions. But I doubt many such simple improvements are left. Let's see.
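To make the simplicity point concrete, here is a sketch (in numpy, function name mine) of essentially the entire ALiBi positional scheme: a fixed, head-specific linear penalty added to attention scores before the softmax. No rotations, no learned embedding table:

```python
import numpy as np

def alibi_bias(num_heads, seq_len):
    # Per-head slopes form a geometric sequence 2^(-8h/H), as in the ALiBi paper.
    slopes = np.array([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    # dist[i, j] = j - i: zero on the diagonal, negative for past positions.
    pos = np.arange(seq_len)
    dist = pos[None, :] - pos[:, None]
    # Broadcast to (num_heads, seq_len, seq_len); the causal mask is applied
    # separately, so only the non-positive (past) entries ever matter.
    return slopes[:, None, None] * dist[None, :, :]
```

The bias is simply added to the pre-softmax attention logits, so distant tokens are penalised linearly with distance, and a smaller slope means a head attends further back.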
Modern LLMs are not radically different from 2017. They are stabilized and scaled transformers.