GLU Variants Improve Transformer

PDF Link

Summary

Pretty quick read. The basic idea is that GLU-based activations, when used in the FFN layers of the Transformer, tend to reduce perplexity on both short- and long-horizon training runs.

Noam compares different variants of GLU, and SwiGLU comes out on top, both in stability on longer runs and across different tasks/datasets.

But what I loved about this is how it ends, and I quote:

We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence.


Based on some poking around on ChatGPT:

How it’s implemented in practice (why it can look like “only 2”)

Most implementations fuse W and V into a single linear layer that outputs 2*d_ff, then split the result:

  • One big matrix W_in ∈ ℝ^(d_model × 2·d_ff) (W and V concatenated along the output dimension)
  • Compute h = x·W_in
  • Split h into a, b, each of shape d_ff
  • Output Swish(a) ⊙ b (elementwise)
  • Then apply W_out ∈ ℝ^(d_ff × d_model) to project back to d_model

So you’ll see “2 Linear layers” in code (W_in and W_out), but parameter-wise it’s still 3 logical matrices: W and V are just concatenated inside W_in.
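The fused layout above can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular library's implementation; the names W_in and W_out follow the description above, and the weights are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 16

# W and V fused into one matrix with 2*d_ff output columns.
W_in = rng.standard_normal((d_model, 2 * d_ff))   # = concat([W, V], axis=1)
W_out = rng.standard_normal((d_ff, d_model))

def swish(z):
    return z / (1.0 + np.exp(-z))  # Swish / SiLU: z * sigmoid(z)

def swiglu_ffn(x):
    h = x @ W_in                     # one big matmul
    a, b = np.split(h, 2, axis=-1)   # split into the two d_ff halves
    return (swish(a) * b) @ W_out    # gate ⊙ value, then project back

x = rng.standard_normal((4, d_model))
y = swiglu_ffn(x)   # shape (4, 8): back to d_model
```

Doing one wide matmul and splitting is equivalent to applying W and V separately, but it hits the hardware with a single larger GEMM, which is why code tends to look like "only 2" linear layers.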


On GLU

A standard Transformer FFN does:

FFN(x) = f(x·W1)·W2, where f is a pointwise nonlinearity like ReLU or GELU.

It’s expressive, but the “decision” (what to pass through) is entangled with the “content” (what the features are), because both live inside the single hidden vector h = f(x·W1).
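As a concrete point of comparison, here is a minimal NumPy sketch of that activation-only FFN (ReLU chosen as the nonlinearity; weights are random placeholders). Note there is only one hidden vector h doing both jobs:

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, d_ff = 8, 16
W1 = rng.standard_normal((d_model, d_ff))
W2 = rng.standard_normal((d_ff, d_model))

def ffn(x):
    # One hidden vector: ReLU both computes features and decides
    # (by zeroing negatives) what gets through.
    h = np.maximum(0.0, x @ W1)
    return h @ W2

x = rng.standard_normal(d_model)
y = ffn(x)   # shape (d_model,)
```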

GLU explicitly splits those roles:

Two projections:

  • Value branch (x·V): “what information do I have available?”
  • Gate branch (σ(x·W)): “how much of each feature should I let through?”

Then you multiply them elementwise. That multiplication is the key: it creates feature-wise, input-dependent routing.
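The split can be sketched directly. A minimal NumPy version of the original (sigmoid-gated) GLU, with random placeholder weights; each of the gate's entries lands in (0, 1), so each feature of the value branch gets its own input-dependent dimmer switch:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_ff = 8, 16
W = rng.standard_normal((d_model, d_ff))  # gate projection
V = rng.standard_normal((d_model, d_ff))  # value projection

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def glu(x):
    value = x @ V            # "what information do I have?"
    gate = sigmoid(x @ W)    # "how much of each feature passes?"
    return gate * value      # feature-wise, input-dependent routing

x = rng.standard_normal(d_model)
out = glu(x)   # shape (d_ff,)
```

SwiGLU keeps this structure but swaps the sigmoid gate for Swish; the routing intuition is the same.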


A Useful Intuition

  • Activation-only FFN: “I’ll compute features and squish them.”

  • GLU FFN: “I’ll compute features, and separately compute a controller that decides how much of each feature gets through.”

It’s the difference between “nonlinearity as a squashing function” vs “nonlinearity as conditional routing.”