ROFORMER: ENHANCED TRANSFORMER WITH ROTARY POSITION EMBEDDING

Overview

The proposed RoPE encodes the absolute position with a rotation matrix and meanwhile incorporates the explicit relative position dependency in the self-attention formulation.

RoFormer is already integrated into Hugging Face Transformers here

Background

Note: Has quite a few equations; I'm copying them over verbatim and annotating/simplifying the wording as needed.

Let S_N = {w_i, i = 1..N} be a sequence of N input tokens with w_i being the i-th element. The corresponding word embedding of S_N is denoted as E_N = {x_i, i = 1..N}, where x_i is the d-dimensional word embedding vector of token w_i without position information. The self-attention first incorporates position information into the word embeddings and transforms them into query, key, and value representations.

Eq (1) - fancy way of saying we include positional data when forming the query/key/value vectors:

q_m = f_q(x_m, m), k_n = f_k(x_n, n), v_n = f_v(x_n, n)

Eq (2) - really just performing the "attention mapping", i.e:

a_{m,n} = exp(q_m · k_n / √d) / Σ_j exp(q_m · k_j / √d), and o_m = Σ_n a_{m,n} v_n
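The attention mapping of Eq (2) is a few lines of NumPy. A minimal single-head, no-mask sketch (function name mine):

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention as in Eq (2).

    q, k, v: arrays of shape (N, d), one row per token.
    Returns o of shape (N, d), where o[m] is a weighted sum of value rows.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                 # (N, N) raw logits
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    a = np.exp(scores)
    a /= a.sum(axis=-1, keepdims=True)            # softmax over key positions
    return a @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(3, 4))
k = rng.normal(size=(3, 4))
v = rng.normal(size=(3, 4))
o = attention(q, k, v)
```

Since each row of attention weights sums to 1, feeding identical value rows returns those rows unchanged, which is a quick sanity check on the softmax.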

Eq (3) - typically Eq (1) is chosen to be:

f_t(x_i, i) = W_t (x_i + p_i), for t ∈ {q, k, v}, where p_i is a d-dimensional vector depending on the position of token x_i.

Absolute Position Embedding

In Vaswani et al. they use a sinusoidal-based p_i, i.e: below is how the p_i of Eq (3) is defined in that paper.

Eq (4)

p_{i,2t} = sin(i / 10000^(2t/d)), p_{i,2t+1} = cos(i / 10000^(2t/d))
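Eq (4) translates directly into code; a small sketch (function name mine):

```python
import numpy as np

def sinusoidal_pos(i, d):
    """Sinusoidal position embedding p_i from Eq (4), for even d."""
    t = np.arange(d // 2)                  # pair index
    freq = 1.0 / (10000.0 ** (2 * t / d))  # one frequency per pair
    p = np.empty(d)
    p[0::2] = np.sin(i * freq)             # even dims: sin
    p[1::2] = np.cos(i * freq)             # odd dims:  cos
    return p

p0 = sinusoidal_pos(0, 8)  # position 0: sins are 0, coses are 1
```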

Relative Position Embedding

Instead of expanding Eq (1) as Eq (3), a different setting can be chosen, as below:

i.e: we omit the pos-enc from the query vector and use trainable relative pos-encodings p̃ for the keys and values

Eq (5)

f_q(x_m) = W_q x_m, f_k(x_n, n) = W_k (x_n + p̃_r^k), f_v(x_n, n) = W_v (x_n + p̃_r^v)

where p̃_r^k and p̃_r^v are trainable relative position embeddings. Note that r = clip(m − n, r_min, r_max) represents the relative distance between positions m and n. They clipped the relative distance with the hypothesis that precise relative position information is not useful beyond a certain distance.

Now, keeping the form of Eq (3), if we expand the Q·K part of the attn in Eq (2) we get the below:

Eq (6)

q_m · k_n = x_m^T W_q^T W_k x_n + x_m^T W_q^T W_k p_n + p_m^T W_q^T W_k x_n + p_m^T W_q^T W_k p_n

Not clear at the moment why the below is done, but following along

In Transformer-XL (Dai et al.), they replace the absolute position embedding p_n with its sinusoid-encoded relative counterpart p̃_{m−n}, while the absolute position p_m in the third and fourth terms is replaced with two trainable vectors u and v independent of the query positions. Further, W_k is distinguished for the content-based and location-based key vectors x_n and p̃_{m−n}, denoted as W_k and W̃_k, resulting in:

Eq (7)

q_m · k_n = x_m^T W_q^T W_k x_n + x_m^T W_q^T W̃_k p̃_{m−n} + u^T W_k x_n + v^T W̃_k p̃_{m−n}

After a bunch of removing things and adding things back due to various other works, we arrive at the below:

Eq (10)

q_m · k_n = x_m^T W_q^T W_k x_n + x_m^T W_q^T W_k p̃_{m−n} + p̃_{m−n}^T W_q^T W_k x_n

Note the absolute position embeddings p_m and p_n were simply replaced with the relative position embedding p̃_{m−n}

Note all the previous approaches attempt to modify Eq (6) under the decomposition of Eq (3) while following the rules of self-attn in Eq (2), but this paper (RoPE) aims to modify Eq (1) directly, under the constraint that the inner product should only encode position relatively, i.e. find f_q and f_k such that q_m · k_n = g(x_m, x_n, m − n).

So lets see what they have for us.

Proposed Approach

2D Form

Eq (12)

f_q(x_m, m) = (W_q x_m) e^{imθ}, f_k(x_n, n) = (W_k x_n) e^{inθ}, g(x_m, x_n, m − n) = Re[ (W_q x_m)(W_k x_n)* e^{i(m−n)θ} ]

Notice the e^{i(m−n)θ} in the g(x_m, x_n, m − n) term: after taking the inner product, only the relative position m − n remains.

Eq (13)

f_{q,k}(x_m, m) = ( cos mθ  −sin mθ ; sin mθ  cos mθ ) W_{q,k} x_m

i.e: in 2D, the projected query/key vector is simply multiplied by a rotation matrix of angle mθ.

Specifically, incorporating the relative position embedding is straightforward: simply rotate the affine-transformed word embedding vector by an angle that is a multiple of its position index, which is the intuition behind the name Rotary Position Embedding.
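A quick numeric check of the 2D form: rotate two stand-in query/key vectors by their position angles, dot them, and the score depends only on the offset m − n (a toy sketch; vectors and θ chosen arbitrarily):

```python
import numpy as np

def rot2d(angle):
    """2x2 rotation matrix, the 2D rotary operator of Eq (13)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

theta = 0.3
q = np.array([1.0, 2.0])   # stand-in for W_q x_m
k = np.array([0.5, -1.0])  # stand-in for W_k x_n

def score(m, n):
    """Dot product of position-rotated q and k; depends only on m - n."""
    return (rot2d(m * theta) @ q) @ (rot2d(n * theta) @ k)

s_a = score(2, 5)
s_b = score(7, 10)  # same offset, different absolute positions
```

Because rot2d(mθ)^T rot2d(nθ) = rot2d((n − m)θ), s_a and s_b are equal up to floating-point error.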

Intuition:

Take two tokens at positions m and n.

  • Before RoPE: their projected vectors are just arrows; dot product measures “semantic alignment.”

  • With RoPE: you spin each arrow by an amount tied to its position.

  • When you dot them, you’re asking: “after spinning, how aligned are they?”

  • The extra alignment/phase difference introduced is governed by (m-n), so attention can learn patterns like “strongly attend to the previous token” or “attend to the token 5 steps back” in a way that’s naturally shift-invariant.

Generalised Form

In order to generalize the 2D result to any x_i in R^d where d is even, we divide the d-dimensional space into d/2 sub-spaces and combine them, in the merit of the linearity of the inner product, turning f_{q,k} into:

Eq (14)

f_{q,k}(x_m, m) = R_{Θ,m} W_{q,k} x_m

where R_{Θ,m} is the d × d block-diagonal matrix whose i-th 2 × 2 block is the rotation ( cos mθ_i  −sin mθ_i ; sin mθ_i  cos mθ_i ), with Θ = {θ_i = 10000^(−2(i−1)/d), i = 1, ..., d/2}.
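In practice the block-diagonal R_{Θ,m} is never materialized: each 2 × 2 block only touches one dimension pair, so it can be applied elementwise. A sketch (function name mine; pairing adjacent dimensions, with the paper's θ_i = 10000^(−2(i−1)/d)):

```python
import numpy as np

def rope(x, m):
    """Apply R_{Theta,m} to a d-dim vector x without building the d x d matrix.

    Pairs (x[0], x[1]), (x[2], x[3]), ... are each rotated by m * theta_i.
    """
    d = x.shape[-1]
    i = np.arange(d // 2)
    theta = 10000.0 ** (-2 * i / d)   # per-pair frequency (0-based i)
    ang = m * theta
    x1, x2 = x[0::2], x[1::2]         # split into pair components
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(ang) - x2 * np.sin(ang)
    out[1::2] = x1 * np.sin(ang) + x2 * np.cos(ang)
    return out

x = np.arange(4.0)
y = rope(x, 3)
```

Since rotations are orthogonal, rope preserves the vector's norm, and rope(x, 0) leaves x unchanged.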

Play around with the RoPE

Overall Flow

  • A sequence is made up of tokens: S = (s1, s2, ..., sN).

  • Each token s_m is represented (after embedding + linear projection) as a vector in R^d, e.g. q_m = W_q x_m and k_n = W_k x_n.

  • To inject positional information with RoPE, we rotate these projected vectors with a position-dependent rotation operator:

  • RoPE forms dimension pairs inside each d-dim vector:

    • (0,1), (2,3), ..., (d-2, d-1) So there are d/2 pairs per token.
  • Each pair is treated as its own 2D plane and gets its own frequency θ_i. For token position m, that plane is rotated by the angle m · θ_i, where θ_i = 10000^(−2(i−1)/d).

  • The full rotation operator R_{Θ,m} is a d × d block-diagonal matrix made of d/2 independent 2 × 2 rotation blocks (one per dimension-pair).

Example (take d = 4 and sequence (s1, s2, s3, s4)):

  • Each token vector v_m ∈ R^4 splits into 2 pairs:
    • Pair 1: (v_m[0], v_m[1]) uses θ_1 and rotates by m * θ_1
    • Pair 2: (v_m[2], v_m[3]) uses θ_2 and rotates by m * θ_2

So concretely:

  • For position m = 1 (token s1):

    • (v_1[0], v_1[1]) rotates by 1 * θ_1 = θ_1
    • (v_1[2], v_1[3]) rotates by 1 * θ_2 = θ_2
  • For position m = 4 (token s4):

    • (v_4[0], v_4[1]) rotates by 4 * θ_1
    • (v_4[2], v_4[3]) rotates by 4 * θ_2
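The d = 4 walkthrough above in a couple of lines (a sketch; with d = 4 the paper's formula gives θ_1 = 1 and θ_2 = 0.01):

```python
import numpy as np

d = 4
# theta_i = 10000 ** (-2 * (i - 1) / d) for i = 1 .. d/2; arange is 0-based
theta = 10000.0 ** (-2 * np.arange(d // 2) / d)  # [theta_1, theta_2]

def pair_angles(m):
    """Angle each dimension pair of the token at position m is rotated by."""
    return m * theta

a1 = pair_angles(1)  # [1 * theta_1, 1 * theta_2]
a4 = pair_angles(4)  # [4 * theta_1, 4 * theta_2]
```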

Relative-position payoff:

  • When attention computes q_m · k_n, the rotations combine so the interaction depends on the relative offset (m − n):
    • (R_{Θ,m} W_q x_m) · (R_{Θ,n} W_k x_n) = x_m^T W_q^T R_{Θ,n−m} W_k x_n, so the score can be written as a function g(x_m, x_n, m − n) of the embeddings and the offset only.
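That shift-invariance is easy to verify numerically: rotate random stand-in q/k vectors at different absolute positions but the same offset and compare scores (helper name mine):

```python
import numpy as np

def rope(x, m):
    """Rotate each (even, odd) dimension pair of x by m * theta_i."""
    d = x.shape[-1]
    theta = 10000.0 ** (-2 * np.arange(d // 2) / d)
    ang = m * theta
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(ang) - x2 * np.sin(ang)
    out[1::2] = x1 * np.sin(ang) + x2 * np.cos(ang)
    return out

rng = np.random.default_rng(1)
q = rng.normal(size=8)  # stand-in for W_q x_m
k = rng.normal(size=8)  # stand-in for W_k x_n

s1 = rope(q, 3) @ rope(k, 7)    # offset 3 - 7 = -4
s2 = rope(q, 10) @ rope(k, 14)  # offset 10 - 14 = -4, same score
```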

Important Takeaway

In RoPE, the position is encoded not at the beginning like in sinusoidal pos embedding, but at each layer as part of the attention mechanism!

Notice here, in the Hugging Face GPT-J implementation, the pos-encoding is applied inside the attn forward pass.

Also note that even when computing attention, the positional information is only imparted to the Q & K and not V.

Got this insight when reading ALiBi :

The output of a self-attention sublayer is a linearly transformed, weighted sum of the input value vectors; therefore, by not inserting position information into the values, the outputs of each transformer-layer contain no explicit position information.
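Putting the takeaway together: a toy attention forward pass where the rotation touches only Q and K, and V is deliberately left position-free (all names mine; single head, no mask):

```python
import numpy as np

def rope(x, m):
    """Rotate each (even, odd) dimension pair of x by m * theta_i."""
    d = x.shape[-1]
    theta = 10000.0 ** (-2 * np.arange(d // 2) / d)
    ang = m * theta
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(ang) - x2 * np.sin(ang)
    out[1::2] = x1 * np.sin(ang) + x2 * np.cos(ang)
    return out

def rope_attention(q, k, v):
    """Attention where RoPE is applied to Q and K only; V stays unrotated."""
    n, d = q.shape
    q_rot = np.stack([rope(q[i], i) for i in range(n)])  # rotate queries
    k_rot = np.stack([rope(k[i], i) for i in range(n)])  # rotate keys
    s = q_rot @ k_rot.T / np.sqrt(d)
    s -= s.max(axis=-1, keepdims=True)
    a = np.exp(s)
    a /= a.sum(axis=-1, keepdims=True)
    return a @ v  # v carries no explicit position information

rng = np.random.default_rng(0)
q = rng.normal(size=(3, 4))
k = rng.normal(size=(3, 4))
v = rng.normal(size=(3, 4))
out = rope_attention(q, k, v)
```

Position only enters through the attention weights, never through the values, matching the ALiBi observation quoted above.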


In the interest of scope I won't dive deep into the derivation parts of RoPE.


Also there is a reference to a "Performer" paper which uses "Linear Attention"; that might be worth deep-diving into at a later point, but good to know.