ROFORMER: ENHANCED TRANSFORMER WITH ROTARY POSITION EMBEDDING
Overview
The proposed RoPE encodes the absolute position with a rotation matrix and meanwhile incorporates the explicit relative position dependency in the self-attention formulation.
RoFormer is already integrated into Huggingface here
Background
Note: Has quite a few equations; copying over verbatim and annotating/simplifying the wording as needed.
Let $S_N = \{w_i\}_{i=1}^{N}$ be a sequence of N input tokens with $w_i$ being the i-th element. The corresponding word embedding of $S_N$ is denoted as $E_N = \{x_i\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^d$ is the d-dimensional word embedding vector of token $w_i$ without position information. The self-attention first incorporates position information into the word embeddings and transforms them into query, key, and value representations.
Eq (1) - fancy way of saying we include positional data in the input embedding:

$$q_m = f_q(x_m, m), \quad k_n = f_k(x_n, n), \quad v_n = f_v(x_n, n)$$

Eq (2) - really just performing the “attention mapping” i.e:

$$a_{m,n} = \frac{\exp\left(q_m^\top k_n / \sqrt{d}\right)}{\sum_{j=1}^{N} \exp\left(q_m^\top k_j / \sqrt{d}\right)}, \quad o_m = \sum_{n=1}^{N} a_{m,n} v_n$$

Eq (3) - typically Eq (1) is chosen to be:

$$f_{t:t\in\{q,k,v\}}(x_i, i) = W_{t}(x_i + p_i)$$
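To make Eqs (1)–(3) concrete, here is a minimal numpy sketch (all names like `N`, `d`, `W_q` are illustrative, not from the paper's code): add a positional embedding to each token embedding, project to Q/K/V, then run softmax attention.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 8                      # sequence length, embedding dim
X = rng.normal(size=(N, d))      # token embeddings x_i (no position info)
P = rng.normal(size=(N, d))      # positional embeddings p_i (any scheme)
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

# Eq (3): f_t(x_i, i) = W_t (x_i + p_i)
Q, K, V = (X + P) @ W_q, (X + P) @ W_k, (X + P) @ W_v

# Eq (2): softmax over scaled dot products, then a weighted sum of values
scores = Q @ K.T / np.sqrt(d)
A = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
out = A @ V                      # o_m = sum_n a_{m,n} v_n
print(out.shape)                 # (5, 8)
```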
Absolute Position Embedding
In Vaswani et al. they use a sinusoidal-based $p_i$, i.e: below is how Eq (1) expands for that paper.

Eq (4):

$$p_{i,2t} = \sin\left(i / 10000^{2t/d}\right), \quad p_{i,2t+1} = \cos\left(i / 10000^{2t/d}\right)$$
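A small sketch of Eq (4) in numpy (function name and shapes are my own, not from the paper): even dimensions get the sine, odd dimensions the cosine, with frequency falling as the dimension index grows.

```python
import numpy as np

def sinusoidal_pe(N: int, d: int) -> np.ndarray:
    """Return an (N, d) matrix of sinusoidal positional embeddings (d even)."""
    pos = np.arange(N)[:, None]            # position i
    t = np.arange(d // 2)[None, :]         # dimension-pair index t
    freq = 1.0 / 10000 ** (2 * t / d)      # 1 / 10000^{2t/d}
    pe = np.zeros((N, d))
    pe[:, 0::2] = np.sin(pos * freq)       # p_{i,2t}
    pe[:, 1::2] = np.cos(pos * freq)       # p_{i,2t+1}
    return pe

pe = sinusoidal_pe(10, 8)
print(pe[0])   # position 0: all sin terms are 0, all cos terms are 1
```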
Relative Position Embedding
Instead of expanding Eq (1) as Eq (3), a different setting can be chosen as below (i.e: we omit the pos-enc from the query vec and use trainable relative pos encodings for the keys and values):

Eq (5):

$$f_q(x_m) = W_q x_m, \quad f_k(x_n, n) = W_k\left(x_n + \tilde{p}_r^{\,k}\right), \quad f_v(x_n, n) = W_v\left(x_n + \tilde{p}_r^{\,v}\right)$$

where $\tilde{p}_r^{\,k}, \tilde{p}_r^{\,v} \in \mathbb{R}^d$ are trainable relative position embeddings. Note that r = clip(m − n, r_min, r_max) represents the relative distance between positions m and n. They clipped the relative distance with the hypothesis that precise relative position information is not useful beyond a certain distance.
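A quick sketch of the clipping in Eq (5) (the table name and sizes here are my own stand-ins for the trainable embeddings, not the paper's code): every offset beyond the clip range shares one embedding row.

```python
import numpy as np

r_min, r_max = -4, 4   # assumed clip range for illustration

def rel_index(m: int, n: int) -> int:
    """Clipped relative distance r = clip(m - n, r_min, r_max)."""
    return int(np.clip(m - n, r_min, r_max))

# One trainable embedding row per clipped offset (r_max - r_min + 1 rows),
# standing in for the paper's \tilde{p}^k_r table:
d = 8
table_k = np.random.randn(r_max - r_min + 1, d)
p_k = table_k[rel_index(10, 2) - r_min]   # embedding the key at n=2 sees from m=10

print(rel_index(10, 2), rel_index(2, 10))   # 4 -4  (both offsets clipped)
```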
Now, keeping the form of Eq (3), if we expand the $q_m^\top k_n$ part of Eq (2) we get the below:

Eq (6):

$$q_m^\top k_n = x_m^\top W_q^\top W_k x_n + x_m^\top W_q^\top W_k p_n + p_m^\top W_q^\top W_k x_n + p_m^\top W_q^\top W_k p_n$$
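Eq (6) is nothing more than the algebraic expansion of $q_m^\top k_n$ with $q_m = W_q(x_m + p_m)$ and $k_n = W_k(x_n + p_n)$; a numeric sanity check (my own sketch, not paper code):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6
x_m, p_m, x_n, p_n = (rng.normal(size=d) for _ in range(4))
W_q, W_k = rng.normal(size=(d, d)), rng.normal(size=(d, d))

# Left side: the raw dot product of query and key
lhs = (W_q @ (x_m + p_m)) @ (W_k @ (x_n + p_n))

# Right side: the four terms of Eq (6)
A = W_q.T @ W_k
rhs = x_m @ A @ x_n + x_m @ A @ p_n + p_m @ A @ x_n + p_m @ A @ p_n
print(np.allclose(lhs, rhs))   # True
```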
Not clear at the moment why the below is done, but following along: replace the absolute position embedding $p_n$ with its sinusoid-encoded relative counterpart $\tilde{p}_{m-n}$, while the absolute position $p_m$ in the third and fourth terms is replaced with two trainable vectors $u$ and $v$ independent of the query positions. Further, $W_k$ is distinguished for the content-based and location-based key vectors $x_n$ and $p_n$, denoted as $W_k$ and $\tilde{W}_k$, resulting in:

$$q_m^\top k_n = x_m^\top W_q^\top W_k x_n + x_m^\top W_q^\top \tilde{W}_k \tilde{p}_{m-n} + u^\top W_k x_n + v^\top \tilde{W}_k \tilde{p}_{m-n}$$
After a bunch of removing things and adding things back due to various other works, we arrive at the below Eq (10):

$$q_m^\top k_n = x_m^\top W_q^\top W_k x_n + x_m^\top W_q^\top W_k \tilde{p}_{m-n} + \tilde{p}_{m-n}^\top W_q^\top W_k x_n$$

Note the absolute position embeddings $p_m$ and $p_n$ were simply replaced with the relative position embeddings $\tilde{p}_{m-n}$.
Note that all the previous approaches attempt to modify Eq (6) under the decomposition of Eq (3), while following the rules of self-attn in Eq (2); this paper (RoPE) instead aims to modify Eq (1) under a different set of constraints.
So let's see what they have for us.
Proposed Approach
2D Form

Eq (12) - treating the 2D vector as a complex number, the query and key are rotated by their position times a fixed angle:

$$f_q(x_m, m) = (W_q x_m)e^{im\theta}, \quad f_k(x_n, n) = (W_k x_n)e^{in\theta}$$

$$g(x_m, x_n, m-n) = \mathrm{Re}\left[(W_q x_m)(W_k x_n)^{*} e^{i(m-n)\theta}\right]$$

Notice the (m − n) in the $g$ function: the inner product depends only on the relative position.

Eq (13) - the same thing written in matrix (rotation) form:

$$f_{\{q,k\}}(x_m, m) = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix} \begin{pmatrix} W^{(11)}_{\{q,k\}} & W^{(12)}_{\{q,k\}} \\ W^{(21)}_{\{q,k\}} & W^{(22)}_{\{q,k\}} \end{pmatrix} \begin{pmatrix} x_m^{(1)} \\ x_m^{(2)} \end{pmatrix}$$
Specifically, incorporating the relative position embedding is straightforward: simply rotate the affine-transformed word embedding vector by an angle that is a multiple of its position index; this is the intuition behind Rotary Position Embedding.
Intuition:
- Take two tokens at positions m and n.
- Before RoPE: their projected vectors are just arrows; the dot product measures “semantic alignment.”
- With RoPE: you spin each arrow by an amount tied to its position.
- When you dot them, you’re asking: “after spinning, how aligned are they?”
- The extra alignment/phase difference introduced is governed by (m − n), so attention can learn patterns like “strongly attend to the previous token” or “attend to the token 5 steps back” in a way that’s naturally shift-invariant.
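The “spin then dot” intuition is easy to verify in 2D (a sketch with made-up values; names are illustrative): rotating q by mθ and k by nθ makes their dot product depend only on the offset (m − n).

```python
import numpy as np

def rot(theta: float) -> np.ndarray:
    """2x2 rotation matrix by angle theta."""
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

theta = 0.3
q = np.array([1.0, 2.0])
k = np.array([0.5, -1.0])

s1 = (rot(5 * theta) @ q) @ (rot(3 * theta) @ k)   # positions m=5, n=3
s2 = (rot(9 * theta) @ q) @ (rot(7 * theta) @ k)   # positions m=9, n=7
print(np.allclose(s1, s2))   # True: same offset m - n = 2, same score
```

This works because $(R_m q)^\top (R_n k) = q^\top R_{n-m} k$: the absolute rotations cancel out except for the relative one.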
Generalised Form

In order to generalize the 2D result to any $x_i \in \mathbb{R}^d$ where d is even, we divide the d-dimensional space into d/2 sub-spaces and combine them using the linearity of the inner product, turning $f_{\{q,k\}}$ into:

$$f_{\{q,k\}}(x_m, m) = R^d_{\Theta,m} W_{\{q,k\}} x_m$$

where $R^d_{\Theta,m}$ is the block-diagonal matrix of d/2 independent 2×2 rotations and

$$\Theta = \left\{\theta_i = 10000^{-2(i-1)/d},\ i \in [1, 2, \dots, d/2]\right\}$$
Play around with the RoPE
Overall Flow
- A sequence is made up of tokens: S = (s1, s2, ..., sN).
- Each token s_m is represented (after embedding + linear projection) as a vector in R^d.
- To inject positional information with RoPE, we rotate these projected vectors with a position-dependent rotation operator.
- RoPE forms dimension pairs inside each d-dim vector: (0,1), (2,3), ..., (d-2, d-1). So there are d/2 pairs per token.
- Each pair is treated as its own 2D plane and gets its own frequency θ_i. For token position m, that plane is rotated by the angle m · θ_i.
- The full rotation operator R_{Θ,m} is a d × d block-diagonal matrix made of d/2 independent 2 × 2 rotation blocks (one per dimension-pair).

Example (take d = 4 and sequence (s1, s2, s3, s4)):

- Each token vector v_m ∈ R^4 splits into 2 pairs:
  - Pair 1: (v_m[0], v_m[1]) uses θ_1 and rotates by m * θ_1
  - Pair 2: (v_m[2], v_m[3]) uses θ_2 and rotates by m * θ_2

So concretely:

- For position m = 1 (token s1):
  - (v_1[0], v_1[1]) rotates by 1 * θ_1 = θ_1
  - (v_1[2], v_1[3]) rotates by 1 * θ_2 = θ_2
- For position m = 4 (token s4):
  - (v_4[0], v_4[1]) rotates by 4 * θ_1
  - (v_4[2], v_4[3]) rotates by 4 * θ_2

Relative-position payoff:

- When attention computes $q_m^\top k_n$, the rotations combine so the interaction depends on the relative offset (m − n); the score can be written as:

$$q_m^\top k_n = \left(R^d_{\Theta,m} W_q x_m\right)^\top \left(R^d_{\Theta,n} W_k x_n\right) = x_m^\top W_q^\top R^d_{\Theta,n-m} W_k x_n$$
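The flow above can be sketched end-to-end in a few lines of numpy (my own helper names, not the paper's or any library's code): rotate each (2i, 2i+1) pair of a vector by m · θ_i, then check that the attention score depends only on m − n.

```python
import numpy as np

def rope_rotate(v: np.ndarray, m: int, d: int) -> np.ndarray:
    """Apply the block-diagonal operator R_{Theta,m}: rotate pair i by m * theta_i."""
    i = np.arange(d // 2)
    theta = 10000.0 ** (-2 * i / d)            # theta_i per dimension-pair
    ang = m * theta
    cos, sin = np.cos(ang), np.sin(ang)
    out = np.empty_like(v)
    out[0::2] = v[0::2] * cos - v[1::2] * sin  # x' = x cos - y sin
    out[1::2] = v[0::2] * sin + v[1::2] * cos  # y' = x sin + y cos
    return out

rng = np.random.default_rng(2)
d = 8
q, k = rng.normal(size=d), rng.normal(size=d)  # already projected: W_q x_m, W_k x_n

s_a = rope_rotate(q, 10, d) @ rope_rotate(k, 7, d)   # offset m - n = 3
s_b = rope_rotate(q, 21, d) @ rope_rotate(k, 18, d)  # offset m - n = 3 again
print(np.allclose(s_a, s_b))   # True: score is a function of (m - n) only
```

Real implementations vectorize this over the whole sequence (and often use the “rotate-half” trick instead of explicit pair interleaving), but the math is the same.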
Important Takeaway
In RoPE, the position is encoded not at the beginning like in sinusoidal pos embedding, but at each layer as part of the attention mechanism!
Notice here, in the Hugging Face GPT-J implementation, the pos-encoding is applied inside the attn forward pass.
Also note that even when computing attention, the positional information is only imparted to the Q & K and not V.
Got this insight when reading ALiBi :
The output of a self-attention sublayer is a linearly transformed, weighted sum of the input value vectors; therefore, by not inserting position information into the values, the outputs of each transformer-layer contain no explicit position information.
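Both takeaways fit in a tiny sketch (helper and variable names are my own, for illustration): the position-dependent rotation is applied to Q and K inside the attention computation, while V is left untouched.

```python
import numpy as np

def rope_rows(M: np.ndarray) -> np.ndarray:
    """Rotate row m of an (N, d) matrix by the position-m RoPE rotation."""
    N, d = M.shape
    i = np.arange(d // 2)
    theta = 10000.0 ** (-2 * i / d)
    ang = np.arange(N)[:, None] * theta[None, :]   # m * theta_i
    cos, sin = np.cos(ang), np.sin(ang)
    out = np.empty_like(M)
    out[:, 0::2] = M[:, 0::2] * cos - M[:, 1::2] * sin
    out[:, 1::2] = M[:, 0::2] * sin + M[:, 1::2] * cos
    return out

rng = np.random.default_rng(3)
N, d = 4, 8
Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))

scores = rope_rows(Q) @ rope_rows(K).T / np.sqrt(d)   # positions enter here...
A = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
out = A @ V                                           # ...but V carries none
print(out.shape)   # (4, 8)
```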
In the interest of scope I won't dive deep into the derivation parts of RoPE.
There is also a reference to a “Performer” paper, which uses “Linear Attention”; it might be worth a deep dive at a later point, but good to know it exists.