TRAIN SHORT, TEST LONG: ATTENTION WITH LINEAR BIASES ENABLES INPUT LENGTH EXTRAPOLATION
Important definition:
We define extrapolation as a model’s ability to continue performing well as the number of input tokens during validation increases beyond the number of tokens on which the model was trained.
Goal of ALiBi is to facilitate efficient extrapolation.
Core Idea:
ALiBi negatively biases attention scores with a linearly decreasing penalty proportional to the distance between the relevant key and query.
Appealing Advantage:
Using ALiBi, a transformer LM can be trained on short (length-L) sequences, and therefore at much lower cost, yet still be applied reliably to longer sequences at inference time.
“Non-overlapping Inference”
Importantly, their default perplexity evaluation uses non-overlapping inference: the long text is split into independent length-L chunks and each chunk is evaluated on its own, so context is not carried across chunk boundaries. This protocol mostly tests whether the model can operate stably at longer positions, not whether it can exploit long cross-chunk history. ALiBi primarily targets (1) positional extrapolation; it does not by itself demonstrate (2) improved long-history usage.
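A minimal sketch of the chunking behind non-overlapping inference (the helper name `chunk_non_overlapping` is illustrative, not from the paper; each chunk would then be scored independently by the LM):

```python
def chunk_non_overlapping(token_ids, L):
    """Split a long token sequence into independent length-L chunks.
    No context is shared across chunk boundaries."""
    return [token_ids[i:i + L] for i in range(0, len(token_ids), L)]

print(chunk_non_overlapping(list(range(10)), 4))
# → [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```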
Understanding the two separate effects:
1. Ability to process longer input (extrapolate positional behavior)
2. Ability to meaningfully use the extra history (not just survive it)
This paper addresses only effect 1, not effect 2.
Context Length Isn’t a Modelling Constraint
Formally, the functions that define a transformer layer are agnostic to input length; they map from some arbitrary, unfixed number of input vectors to the same number of output vectors.
Remember from transformer implementation that the sequence length gets its own axis on the input dims, [Batch, Seq_Len, Model_Dims]; hence sequence length is not inherently a model restriction but an engineering/use-case-dependent constraint.
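This length-agnosticism can be checked directly: the same attention function maps N input vectors to N output vectors for any N. A toy single-head sketch (identity Q/K/V projections, purely illustrative):

```python
import numpy as np

def self_attention(x):
    """Toy single-head self-attention with identity projections.
    Works for any sequence length: output shape equals input shape."""
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    weights = np.exp(scores - scores.max(-1, keepdims=True))  # stable softmax
    weights /= weights.sum(-1, keepdims=True)
    return weights @ x

for seq_len in (8, 128):
    x = np.random.randn(2, seq_len, 16)   # [Batch, Seq_Len, Model_Dims]
    assert self_attention(x).shape == x.shape
```

The layer never consults a fixed maximum length; the constraint comes from positional encodings and memory, not from the attention math itself.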
Comparative Summary vs other Positional Encodings
Sinusoidal (Vaswani): absolute-ish. You add a fixed position vector to embeddings once at input. Precomputable, no learned params.
RoPE: relative-ish in the sense that dot products depend on relative offset. It’s also fixed/precomputable, with per-position and per-head/dim-pair frequencies, but it’s applied as a rotation of Q/K (not an additive term), typically inside each attention block (conceptually; implementations fold it in efficiently).
T5 relative position bias: relative. You add an additive bias to attention logits based on relative distance bucket. Those biases are learned parameters (per head, per bucket).
Now ALiBi:
- Yes, it “goes back to constants” in the sense that it doesn’t learn position embeddings or a bias table.
- It uses an additive bias in the attention logits (like T5’s bias mechanism), but instead of a learned lookup by distance, it uses a simple linear function of distance.
The canonical ALiBi idea is:

score(i, j) = q_i · k_j + m_h · (i − j)

where i is the query position, j is the key position, and m_h is a head-specific negative slope (so farther-back tokens get a more negative bias; a “recency prior”). In causal attention j ≤ i, so (i − j) ≥ 0 and the bias m_h · (i − j) ≤ 0.
So: distance becomes the bias, scaled by a fixed per-head slope.
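A sketch of the bias matrix and the per-head slopes. Here I use positive slope magnitudes and negate the distance term explicitly, which is equivalent to the negative-slope convention above; the geometric slope sequence starting at 2^(−8/n_heads) is the paper’s default for power-of-two head counts:

```python
import numpy as np

def alibi_slopes(n_heads):
    """Slope magnitudes: geometric sequence starting at 2**(-8/n_heads).
    For 8 heads this gives 1/2, 1/4, ..., 1/256."""
    start = 2.0 ** (-8.0 / n_heads)
    return np.array([start ** (h + 1) for h in range(n_heads)])

def alibi_bias(n_heads, seq_len):
    """bias[h, i, j] = -slope_h * (i - j): zero on the diagonal,
    linearly more negative as the key falls farther behind the query."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return -alibi_slopes(n_heads)[:, None, None] * (i - j)

bias = alibi_bias(8, 4)          # shape (heads, queries, keys)
```

Because the slopes are constants, the whole bias tensor can be precomputed once for the longest sequence seen at inference.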

Important:
ALiBi has an inductive bias towards recency; it penalizes attention scores between distant query-key pairs, with the penalty increasing as the distance between a key and a query grows. Different heads increase their penalties at different rates, depending on their slope magnitudes.
Neat implementation detail!
ALiBi is easy to implement, with all changes accomplished in a few lines of code. We implement it by adding the linear biases to the mask matrix (in practice, when training a transformer LM, query q_i attends only to keys 1 to i; this is implemented by adding a mask matrix to the query-key dot product before the softmax is applied). Because the biases are folded into the existing mask, there is no runtime penalty: no operations are added to the network.
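A sketch of that detail, under the same positive-slope-magnitude convention: the causal −inf mask and the ALiBi penalty are combined into one precomputed additive matrix, so applying ALiBi costs exactly the same add that the causal mask already required:

```python
import numpy as np

def causal_mask_with_alibi(slopes, seq_len):
    """Additive mask: -inf above the diagonal (causal), plus the linear
    ALiBi penalty -slope * (i - j) at and below it. Precomputed once and
    added to q·k^T before softmax, so no extra runtime operations."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    mask = np.where(j > i, -np.inf, 0.0)        # standard causal mask
    bias = -slopes[:, None, None] * (i - j)     # ALiBi penalty per head
    return mask[None, :, :] + bias              # shape (heads, seq, seq)

m = causal_mask_with_alibi(np.array([0.5]), 3)
```

Adding `m` to the attention logits replaces the plain causal mask; future positions stay at −inf, and past positions carry the recency penalty.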