What is RoPE?

Rotary Position Embedding (RoPE) rotates vectors in 2D subspaces by an angle proportional to their position. The key property: the relative position between two tokens corresponds to the difference between the rotations applied to their embeddings. This enables zero-shot length extrapolation and is used in Llama, PaLM, and GPT-NeoX.

Fig. 1 demo readout (Rotation Angles): Position 1 (Query) = 0.50 rad, Position 2 (Key) = 1.00 rad, relative rotation = 0.50 rad, dot product = 0.877.
Fig. 1 | RoPE rotation in a 2D subspace. The blue vector (Position 1/Query) and the orange vector (Position 2/Key) are rotated by different angles; the angle difference encodes their relative position. The dot product is invariant to a common rotation of both vectors and depends only on their relative angle.
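The figure's numbers can be reproduced with a few lines of Python. This is a minimal sketch; the helper `rot2d` is not part of any library, just the standard 2D rotation applied to a unit vector.

```python
import math

def rot2d(x, angle):
    """Rotate a 2D vector by the given angle (radians)."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * x[0] - s * x[1], s * x[0] + c * x[1])

# A unit vector rotated to the figure's two positions.
v = (1.0, 0.0)
q = rot2d(v, 0.50)   # Position 1 (Query)
k = rot2d(v, 1.00)   # Position 2 (Key)

dot = q[0] * k[0] + q[1] * k[1]
print(f"{dot:.4f}")  # ≈ cos(1.00 - 0.50) = cos(0.50), the figure's 0.877
```

For unit vectors the dot product is exactly the cosine of the relative angle, which is why the readout shows cos(0.50) ≈ 0.877.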

RoPE Formula

RoPE(x, m) = [
cos(m·θ) · x₀ - sin(m·θ) · x₁,
sin(m·θ) · x₀ + cos(m·θ) · x₁
]

m: Position of the token
θ: Rotation frequency (e.g., θᵢ = 1/10000^(2i/d) for the i-th 2D pair)
x₀, x₁: 2D subspace of the vector
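The formula above can be applied pairwise across a full d-dimensional vector. This is a minimal sketch (not the optimized implementation used in production models, which vectorizes the same rotation), using θᵢ = 1/10000^(2i/d) as given above:

```python
import math

def rope(x, m, base=10000.0):
    """Apply RoPE to vector x at position m.
    Each consecutive pair (x[2i], x[2i+1]) is rotated by m * theta_i,
    with theta_i = base ** (-2 * i / d), matching the formula above."""
    d = len(x)
    out = []
    for i in range(d // 2):
        theta = base ** (-2 * i / d)
        angle = m * theta
        c, s = math.cos(angle), math.sin(angle)
        x0, x1 = x[2 * i], x[2 * i + 1]
        out.extend([c * x0 - s * x1,   # cos(m·θ)·x₀ - sin(m·θ)·x₁
                    s * x0 + c * x1])  # sin(m·θ)·x₀ + cos(m·θ)·x₁
    return out

print(rope([1.0, 0.0, 1.0, 0.0], m=2))
```

Note that each pair gets its own frequency: low-index pairs rotate quickly (fine-grained position), high-index pairs rotate slowly (coarse position), analogous to sinusoidal position encodings.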

Why Does RoPE Work?

The dot product between Query at position m and Key at position n:

q_m · k_n = |q| · |k| · cos(α + (m−n)·θ)

where α is the angle between the unrotated q and k.

This depends only on the relative position (m−n), not on the absolute positions. That is what enables length extrapolation: a model trained on 2K-token contexts often still works at 8K+ tokens.
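The shift invariance is easy to check numerically: rotating both query and key by positions that differ by the same offset leaves the dot product unchanged. A minimal sketch (the helper `rot2d` is just the standard 2D rotation, not a library function):

```python
import math

def rot2d(x, angle):
    """Rotate a 2D vector by the given angle (radians)."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * x[0] - s * x[1], s * x[0] + c * x[1])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

theta = 0.1
q, k = (0.3, 0.9), (-0.5, 0.4)

# Same relative offset m - n = 4 at different absolute positions:
d1 = dot(rot2d(q, 7 * theta), rot2d(k, 3 * theta))      # positions 7 and 3
d2 = dot(rot2d(q, 104 * theta), rot2d(k, 100 * theta))  # positions 104 and 100
print(abs(d1 - d2) < 1e-12)  # True: only m - n matters
```

Shifting both positions by the same amount composes to the identity rotation inside the dot product, so the score is a function of (m−n) alone.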