ROFORMER: ENHANCED TRANSFORMER WITH ROTARY POSITION EMBEDDING
Overview
The proposed RoPE encodes the absolute position with a rotation matrix and meanwhile incorporates the explicit relative position dependency in the self-attention formulation.
RoFormer is already integrated into Huggingface here
Background
Note: Has quite a few equations; copying over verbatim and annotating/simplifying the wording as needed.
Let $S_N = \{w_i\}_{i=1}^{N}$ be a sequence of N input tokens with $w_i$ being the i-th element. The corresponding word embedding of $S_N$ is denoted as $E_N = \{x_i\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^d$ is the d-dimensional word embedding vector of token $w_i$ without position information. The self-attention first incorporates position information into the word embeddings and transforms them into query, key, and value representations.
Eq (1) - fancy way of saying we include positional data in the input embedding:

$$q_m = f_q(x_m, m), \quad k_n = f_k(x_n, n), \quad v_n = f_v(x_n, n)$$

Eq (2) - really just performing the “attention mapping” i.e:

$$a_{m,n} = \frac{\exp\left(q_m^\top k_n / \sqrt{d}\right)}{\sum_{j=1}^{N} \exp\left(q_m^\top k_j / \sqrt{d}\right)}, \quad o_m = \sum_{n=1}^{N} a_{m,n} v_n$$

Eq (3) - typically Eq (1) is chosen to be:

$$f_{t:t\in\{q,k,v\}}(x_i, i) = W_{t}(x_i + p_i)$$
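To make Eqs (1)–(3) concrete, here is a minimal numpy sketch (all names like `N`, `d`, `W_q` are illustrative, not from the paper's code): add a positional embedding to each token embedding, project to Q/K/V, then run softmax attention.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 8                      # sequence length, embedding dim
X = rng.normal(size=(N, d))      # token embeddings x_i (no position info)
P = rng.normal(size=(N, d))      # positional embeddings p_i (any scheme)
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

# Eq (3): f_t(x_i, i) = W_t (x_i + p_i)
Q, K, V = (X + P) @ W_q, (X + P) @ W_k, (X + P) @ W_v

# Eq (2): softmax over scaled dot products, then a weighted sum of values
scores = Q @ K.T / np.sqrt(d)
A = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
out = A @ V                      # o_m = sum_n a_{m,n} v_n
print(out.shape)                 # (5, 8)
```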
Absolute Position Embedding
In Vaswani et al. they use a sinusoidal-based $p_i$, i.e: below is how Eq (1) expands for that paper.

Eq (4):

$$p_{i,2t} = \sin\left(i / 10000^{2t/d}\right), \quad p_{i,2t+1} = \cos\left(i / 10000^{2t/d}\right)$$
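A small sketch of Eq (4) in numpy (function name and shapes are my own, not from the paper): even dimensions get the sine, odd dimensions the cosine, with frequency falling as the dimension index grows.

```python
import numpy as np

def sinusoidal_pe(N: int, d: int) -> np.ndarray:
    """Return an (N, d) matrix of sinusoidal positional embeddings (d even)."""
    pos = np.arange(N)[:, None]            # position i
    t = np.arange(d // 2)[None, :]         # dimension-pair index t
    freq = 1.0 / 10000 ** (2 * t / d)      # 1 / 10000^{2t/d}
    pe = np.zeros((N, d))
    pe[:, 0::2] = np.sin(pos * freq)       # p_{i,2t}
    pe[:, 1::2] = np.cos(pos * freq)       # p_{i,2t+1}
    return pe

pe = sinusoidal_pe(10, 8)
print(pe[0])   # position 0: all sin terms are 0, all cos terms are 1
```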
Relative Position Embedding
Instead of expanding Eq (1) as Eq (3), a different setting can be chosen as below (i.e: we omit the pos-enc from the query vec and use trainable relative pos encodings for the keys and values):

Eq (5):

$$f_q(x_m) = W_q x_m, \quad f_k(x_n, n) = W_k\left(x_n + \tilde{p}_r^{\,k}\right), \quad f_v(x_n, n) = W_v\left(x_n + \tilde{p}_r^{\,v}\right)$$

where $\tilde{p}_r^{\,k}, \tilde{p}_r^{\,v} \in \mathbb{R}^d$ are trainable relative position embeddings. Note that r = clip(m − n, r_min, r_max) represents the relative distance between positions m and n. They clipped the relative distance with the hypothesis that precise relative position information is not useful beyond a certain distance.
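A quick sketch of the clipping in Eq (5) (the table name and sizes here are my own stand-ins for the trainable embeddings, not the paper's code): every offset beyond the clip range shares one embedding row.

```python
import numpy as np

r_min, r_max = -4, 4   # assumed clip range for illustration

def rel_index(m: int, n: int) -> int:
    """Clipped relative distance r = clip(m - n, r_min, r_max)."""
    return int(np.clip(m - n, r_min, r_max))

# One trainable embedding row per clipped offset (r_max - r_min + 1 rows),
# standing in for the paper's \tilde{p}^k_r table:
d = 8
table_k = np.random.randn(r_max - r_min + 1, d)
p_k = table_k[rel_index(10, 2) - r_min]   # embedding the key at n=2 sees from m=10

print(rel_index(10, 2), rel_index(2, 10))   # 4 -4  (both offsets clipped)
```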
Now, keeping the form of Eq (3), if we expand the $q_m^\top k_n$ part of Eq (2) we get the below:

Eq (6):

$$q_m^\top k_n = x_m^\top W_q^\top W_k x_n + x_m^\top W_q^\top W_k p_n + p_m^\top W_q^\top W_k x_n + p_m^\top W_q^\top W_k p_n$$
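Eq (6) is nothing more than the algebraic expansion of $q_m^\top k_n$ with $q_m = W_q(x_m + p_m)$ and $k_n = W_k(x_n + p_n)$; a numeric sanity check (my own sketch, not paper code):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6
x_m, p_m, x_n, p_n = (rng.normal(size=d) for _ in range(4))
W_q, W_k = rng.normal(size=(d, d)), rng.normal(size=(d, d))

# Left side: the raw dot product of query and key
lhs = (W_q @ (x_m + p_m)) @ (W_k @ (x_n + p_n))

# Right side: the four terms of Eq (6)
A = W_q.T @ W_k
rhs = x_m @ A @ x_n + x_m @ A @ p_n + p_m @ A @ x_n + p_m @ A @ p_n
print(np.allclose(lhs, rhs))   # True
```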
Not clear at the moment why the below is done, but following along: replace the absolute position embedding $p_n$ with its sinusoid-encoded relative counterpart $\tilde{p}_{m-n}$, while the absolute position $p_m$ in the third and fourth terms is replaced with two trainable vectors $u$ and $v$ independent of the query positions. Further, $W_k$ is distinguished for the content-based and location-based key vectors $x_n$ and $p_n$, denoted as $W_k$ and $\tilde{W}_k$, resulting in:

$$q_m^\top k_n = x_m^\top W_q^\top W_k x_n + x_m^\top W_q^\top \tilde{W}_k \tilde{p}_{m-n} + u^\top W_k x_n + v^\top \tilde{W}_k \tilde{p}_{m-n}$$
After a bunch of removing things and adding things back due to various other works, we arrive at the below Eq (10):

$$q_m^\top k_n = x_m^\top W_q^\top W_k x_n + x_m^\top W_q^\top W_k \tilde{p}_{m-n} + \tilde{p}_{m-n}^\top W_q^\top W_k x_n$$

Note the absolute position embeddings $p_m$ and $p_n$ were simply replaced with the relative position embeddings $\tilde{p}_{m-n}$.
Note that all the previous approaches attempt to modify Eq (6) under the decomposition of Eq (3), while following the rules of self-attn in Eq (2); this paper (RoPE) instead aims to modify Eq (1) under a different set of constraints.
So let's see what they have for us.
Proposed Approach
2D Form

Eq (12) - treating the 2D vector as a complex number, the query and key are rotated by their position times a fixed angle:

$$f_q(x_m, m) = (W_q x_m)e^{im\theta}, \quad f_k(x_n, n) = (W_k x_n)e^{in\theta}$$

$$g(x_m, x_n, m-n) = \mathrm{Re}\left[(W_q x_m)(W_k x_n)^{*} e^{i(m-n)\theta}\right]$$

Notice the (m − n) in the $g$ function: the inner product depends only on the relative position.

Eq (13) - the same thing written in matrix (rotation) form:

$$f_{\{q,k\}}(x_m, m) = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix} \begin{pmatrix} W^{(11)}_{\{q,k\}} & W^{(12)}_{\{q,k\}} \\ W^{(21)}_{\{q,k\}} & W^{(22)}_{\{q,k\}} \end{pmatrix} \begin{pmatrix} x_m^{(1)} \\ x_m^{(2)} \end{pmatrix}$$
Specifically, incorporating the relative position embedding is straightforward: simply rotate the affine-transformed word embedding vector by an angle that is a multiple of its position index; this is the intuition behind Rotary Position Embedding.
Intuition:
- Take two tokens at positions m and n.
- Before RoPE: their projected vectors are just arrows; the dot product measures “semantic alignment.”
- With RoPE: you spin each arrow by an amount tied to its position.
- When you dot them, you’re asking: “after spinning, how aligned are they?”
- The extra alignment/phase difference introduced is governed by (m − n), so attention can learn patterns like “strongly attend to the previous token” or “attend to the token 5 steps back” in a way that’s naturally shift-invariant.
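The “spin then dot” intuition is easy to verify in 2D (a sketch with made-up values; names are illustrative): rotating q by mθ and k by nθ makes their dot product depend only on the offset (m − n).

```python
import numpy as np

def rot(theta: float) -> np.ndarray:
    """2x2 rotation matrix by angle theta."""
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

theta = 0.3
q = np.array([1.0, 2.0])
k = np.array([0.5, -1.0])

s1 = (rot(5 * theta) @ q) @ (rot(3 * theta) @ k)   # positions m=5, n=3
s2 = (rot(9 * theta) @ q) @ (rot(7 * theta) @ k)   # positions m=9, n=7
print(np.allclose(s1, s2))   # True: same offset m - n = 2, same score
```

This works because $(R_m q)^\top (R_n k) = q^\top R_{n-m} k$: the absolute rotations cancel out except for the relative one.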
Generalised Form

In order to generalize the 2D result to any $x_i \in \mathbb{R}^d$ where d is even, we divide the d-dimensional space into d/2 sub-spaces and combine them using the linearity of the inner product, turning $f_{\{q,k\}}$ into:

$$f_{\{q,k\}}(x_m, m) = R^d_{\Theta,m} W_{\{q,k\}} x_m$$

where $R^d_{\Theta,m}$ is the block-diagonal matrix of d/2 independent 2×2 rotations and

$$\Theta = \left\{\theta_i = 10000^{-2(i-1)/d},\ i \in [1, 2, \dots, d/2]\right\}$$
Play around with the RoPE
Overall Flow
- A sequence is made up of tokens: S = (s1, s2, ..., sN).
- Each token s_m is represented (after embedding + linear projection) as a vector in R^d.
- To inject positional information with RoPE, we rotate these projected vectors with a position-dependent rotation operator.
- RoPE forms dimension pairs inside each d-dim vector: (0,1), (2,3), ..., (d-2, d-1). So there are d/2 pairs per token.
- Each pair is treated as its own 2D plane and gets its own frequency θ_i. For token position m, that plane is rotated by the angle m · θ_i.
- The full rotation operator R_{Θ,m} is a d × d block-diagonal matrix made of d/2 independent 2 × 2 rotation blocks (one per dimension-pair).

Example (take d = 4 and sequence (s1, s2, s3, s4)):

- Each token vector v_m ∈ R^4 splits into 2 pairs:
  - Pair 1: (v_m[0], v_m[1]) uses θ_1 and rotates by m * θ_1
  - Pair 2: (v_m[2], v_m[3]) uses θ_2 and rotates by m * θ_2

So concretely:

- For position m = 1 (token s1):
  - (v_1[0], v_1[1]) rotates by 1 * θ_1 = θ_1
  - (v_1[2], v_1[3]) rotates by 1 * θ_2 = θ_2
- For position m = 4 (token s4):
  - (v_4[0], v_4[1]) rotates by 4 * θ_1
  - (v_4[2], v_4[3]) rotates by 4 * θ_2

Relative-position payoff:

- When attention computes $q_m^\top k_n$, the rotations combine so the interaction depends on the relative offset (m − n); the score can be written as:

$$q_m^\top k_n = \left(R^d_{\Theta,m} W_q x_m\right)^\top \left(R^d_{\Theta,n} W_k x_n\right) = x_m^\top W_q^\top R^d_{\Theta,n-m} W_k x_n$$
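The flow above can be sketched end-to-end in a few lines of numpy (my own helper names, not the paper's or any library's code): rotate each (2i, 2i+1) pair of a vector by m · θ_i, then check that the attention score depends only on m − n.

```python
import numpy as np

def rope_rotate(v: np.ndarray, m: int, d: int) -> np.ndarray:
    """Apply the block-diagonal operator R_{Theta,m}: rotate pair i by m * theta_i."""
    i = np.arange(d // 2)
    theta = 10000.0 ** (-2 * i / d)            # theta_i per dimension-pair
    ang = m * theta
    cos, sin = np.cos(ang), np.sin(ang)
    out = np.empty_like(v)
    out[0::2] = v[0::2] * cos - v[1::2] * sin  # x' = x cos - y sin
    out[1::2] = v[0::2] * sin + v[1::2] * cos  # y' = x sin + y cos
    return out

rng = np.random.default_rng(2)
d = 8
q, k = rng.normal(size=d), rng.normal(size=d)  # already projected: W_q x_m, W_k x_n

s_a = rope_rotate(q, 10, d) @ rope_rotate(k, 7, d)   # offset m - n = 3
s_b = rope_rotate(q, 21, d) @ rope_rotate(k, 18, d)  # offset m - n = 3 again
print(np.allclose(s_a, s_b))   # True: score is a function of (m - n) only
```

Real implementations vectorize this over the whole sequence (and often use the “rotate-half” trick instead of explicit pair interleaving), but the math is the same.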
Important Takeaway
In RoPE, the position is encoded not at the beginning like in sinusoidal pos embedding, but at each layer as part of the attention mechanism!
Notice here, in the Hugging Face GPT-J implementation, the pos-encoding is applied inside the attn forward pass.
Also note that even when computing attention, the positional information is only imparted to the Q & K and not V.
Got this insight when reading ALiBi :
The output of a self-attention sublayer is a linearly transformed, weighted sum of the input value vectors; therefore, by not inserting position information into the values, the outputs of each transformer-layer contain no explicit position information.
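Both takeaways fit in a tiny sketch (helper and variable names are my own, for illustration): the position-dependent rotation is applied to Q and K inside the attention computation, while V is left untouched.

```python
import numpy as np

def rope_rows(M: np.ndarray) -> np.ndarray:
    """Rotate row m of an (N, d) matrix by the position-m RoPE rotation."""
    N, d = M.shape
    i = np.arange(d // 2)
    theta = 10000.0 ** (-2 * i / d)
    ang = np.arange(N)[:, None] * theta[None, :]   # m * theta_i
    cos, sin = np.cos(ang), np.sin(ang)
    out = np.empty_like(M)
    out[:, 0::2] = M[:, 0::2] * cos - M[:, 1::2] * sin
    out[:, 1::2] = M[:, 0::2] * sin + M[:, 1::2] * cos
    return out

rng = np.random.default_rng(3)
N, d = 4, 8
Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))

scores = rope_rows(Q) @ rope_rows(K).T / np.sqrt(d)   # positions enter here...
A = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
out = A @ V                                           # ...but V carries none
print(out.shape)   # (4, 8)
```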
In the interest of scope I won't dive deep into the derivation parts of RoPE.
There is also a reference to a “Performer” paper, which uses “Linear Attention”; it might be worth a deep dive at a later point, but good to know it exists.