The below is a set of recommendations by ChatGPT which will walk me through the evolution of the architecture and help recognise the key axes of improvement. Seems like a fair list, let's start reading! I will be dropping notes for each on my Papers
Recommended Reading (In Order)
- Attention Is All You Need (refresh) - Notes ✅
- GPT-2 paper ✅
- “On Layer Normalization in the Transformer Architecture” ✅
- Paper
- Notes - Layer Norm in Transformers
- GLU Variants Improve Transformer ✅
- Notes - GLU Improves Transformers
- RoFormer (RoPE) ✅
- ALiBi paper ✅
- LLaMA paper ✅ (would recommend reading this after Chinchilla)
- Chinchilla paper ✅
That sequence gives you the evolutionary logic.
The Mental Model You Should Walk Away With
Transformer architecture evolved along four axes:
- Stability (post-norm → pre-norm → RMSNorm)
- Activation efficiency (ReLU → GELU → SwiGLU)
- Positional encoding robustness (absolute → RoPE → ALiBi)
- Compute efficiency (scaling laws, systems optimizations)
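Two of these axes are easy to make concrete in code. Below is a minimal numpy sketch (function names are mine, not from any particular codebase) contrasting LayerNorm with RMSNorm, and showing the SwiGLU feed-forward block that replaced plain ReLU/GELU MLPs:

```python
import numpy as np

def layer_norm(x, g, b, eps=1e-5):
    # Classic LayerNorm: centre, normalise by variance, then affine transform.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return g * (x - mu) / np.sqrt(var + eps) + b

def rms_norm(x, g, eps=1e-5):
    # RMSNorm drops the mean-centring and the bias term:
    # cheaper per token, and empirically just as stable.
    rms = np.sqrt((x * x).mean(-1, keepdims=True) + eps)
    return g * x / rms

def swiglu_ffn(x, W, V, W2):
    # SwiGLU feed-forward: a Swish-gated linear unit instead of ReLU/GELU.
    # One projection (x @ W) is passed through Swish and gates the other (x @ V).
    swish = lambda z: z / (1.0 + np.exp(-z))  # SiLU / Swish activation
    return (swish(x @ W) * (x @ V)) @ W2
```

Note how RMSNorm is strictly *less* code than LayerNorm, which fits the broader theme: several of these "improvements" are deletions.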
Post-Thoughts
Reading through the above list gives a fairly sufficient understanding of how the transformer architecture has evolved over the years. The main theme has been algorithmic efficiency and empirical findings on optimal hyperparameters, for both the model and the training recipe. A lot of work also goes into making learning stable, which in turn allows longer and bigger training runs.
It is now fairly obvious that most of the challenges we see today lie in "engineering" these systems: solving systems-type problems like sharded training (a model obviously no longer fits on a single GPU) and moving huge amounts of data to the right nodes, etc. The secret sauce seems to be training large models on insanely large data for a large amount of time on an insanely large number of GPUs.
Also, simplicity seems to be key, as most of the complexity somehow gets handled by scaling up. For example, ALiBi is far simpler than RoPE, sinusoidal encodings, or learned positions. But I doubt many such simple improvements are left. Let's see.
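To make the simplicity point concrete, here is a sketch (in numpy, function name mine) of essentially the entire ALiBi positional scheme: a fixed, head-specific linear penalty added to attention scores before the softmax. No rotations, no learned embedding table:

```python
import numpy as np

def alibi_bias(num_heads, seq_len):
    # Per-head slopes form a geometric sequence 2^(-8h/H), as in the ALiBi paper.
    slopes = np.array([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    # dist[i, j] = j - i: zero on the diagonal, negative for past positions.
    pos = np.arange(seq_len)
    dist = pos[None, :] - pos[:, None]
    # Broadcast to (num_heads, seq_len, seq_len); the causal mask is applied
    # separately, so only the non-positive (past) entries ever matter.
    return slopes[:, None, None] * dist[None, :, :]
```

The bias is simply added to the pre-softmax attention logits, so distant tokens are penalised linearly with distance, and a smaller slope means a head attends further back.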
Modern LLMs are not radically different from 2017. They are stabilized and scaled transformers.