The Context Extension Problem

LLMs are typically trained on sequences of around 4K tokens. The question is: how can they extrapolate to 100K+ tokens at inference time?

The problem: position information is tightly coupled to the training sequence length. Sinusoidal positional encodings (from the original Transformer paper) extrapolate poorly to sequences longer than those seen in training.

There are two modern solutions:

  • RoPE (Rotary Position Embedding): Position information through rotation. Requires Position Interpolation fine-tuning
  • ALiBi (Attention with Linear Biases): Linear biases on attention scores. Zero-shot extrapolation possible

RoPE – Rotary Position Embedding

RoPE encodes position by rotating each 2D pair of query/key dimensions. The critical point: the attention score between positions m and n depends only on the relative distance (m − n), not on the absolute positions.

RoPE Attention Score:

score(m, n) = ⟨R(m)·q_m, R(n)·k_n⟩ = ⟨q_m, R(n−m)·k_n⟩

where R(pos) rotates each 2D dimension pair i by the angle pos·θ_i, with θ_i = 10000^(−2i/d)

Attention therefore depends only on (m−n), not on m or n alone
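
A minimal numerical sketch of this property (assuming NumPy; rotate_2d and rope_score are illustrative names, not library functions): rotate a query/key pair by their positions and check that the dot product depends only on m − n.

import numpy as np

def rotate_2d(vec, angle):
    # Rotate a 2D vector by the given angle (RoPE applies this to each dimension pair)
    c, s = np.cos(angle), np.sin(angle)
    return np.array([c * vec[0] - s * vec[1],
                     s * vec[0] + c * vec[1]])

def rope_score(q, k, m, n, theta=0.01):
    # Attention score for one 2D pair: rotate q by m*theta, k by n*theta, then dot product
    return rotate_2d(q, m * theta) @ rotate_2d(k, n * theta)

q, k = np.array([1.0, 0.5]), np.array([0.3, 2.0])

# Same relative distance (n - m = 3) at very different absolute positions -> same score
print(rope_score(q, k, m=0, n=3))
print(rope_score(q, k, m=100, n=103))   # identical up to floating-point error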

Position Interpolation for Extrapolation:
By scaling positions, a model trained on 4K can be extended to 100K+ context (a minimal sketch follows Fig. 1):

  • The trained position range 1-4,000 is stretched to cover tokens 1-100,000 (equivalently, new positions are compressed into the trained range)
  • Only ~1,000 fine-tuning steps are needed
  • Little to no quality loss on longer sequences
Fig. 1 | Position Interpolation: the original position range 1-4,000 is stretched over 1-100,000. The scaling enables context extension without massive fine-tuning.
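
A minimal sketch of the scaling step, assuming the θ_i schedule above; scale_positions is an illustrative name.

import numpy as np

def scale_positions(positions, trained_len=4096, target_len=100_000):
    # Position Interpolation: compress target positions back into the trained range
    return np.asarray(positions, dtype=float) * (trained_len / target_len)

# Token 100,000 is mapped to position ~4096, which the model has already seen in training
print(scale_positions([0, 50_000, 100_000]))   # [0.0, 2048.0, 4096.0]
# The scaled position is then used for the RoPE angle: angle_i = scaled_pos * θ_i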

ALiBi – Attention with Linear Biases

ALiBi is conceptually simpler: Instead of trainable position embeddings, it adds static linear biases directly to attention scores based on token distance.

ALiBi Attention:

Attention(Q, K, V) = softmax((QKT) / √dk + bias_matrix) · V

bias[i,j] = -m · |i - j|

where m is a head-specific slope (not trained)
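
A minimal sketch of this bias matrix for a single head (assuming NumPy; build_alibi_bias is an illustrative name):

import numpy as np

def build_alibi_bias(seq_len, slope):
    # ALiBi bias for one head: 0 on the diagonal, -slope * |i - j| elsewhere
    pos = np.arange(seq_len)
    distance = np.abs(pos[None, :] - pos[:, None])
    return -slope * distance

print(build_alibi_bias(seq_len=4, slope=0.5))
# [[ 0.  -0.5 -1.  -1.5]
#  [-0.5  0.  -0.5 -1. ]
#  [-1.  -0.5  0.  -0.5]
#  [-1.5 -1.  -0.5  0. ]]

This bias is simply added to QKT / √dk before the softmax, as in the formula above.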

Major Advantages:

  • Zero-Shot Extrapolation: Model trained on 1K, works on 2K+, 100K+, 1M+ without adjustment
  • No trainable parameters: m is fixed, not trained
  • Head Specialization: Each head has its own slope m, learns different distance sensitivities
  • Easy to implement: a fixed additive bias on the attention scores
Fig. 2 | ALiBi Bias Matrix: Diagonal is 0 (self), then linearly more negative with distance. Different heads show different slopes m (different colors = different m values).

Comparison: RoPE vs ALiBi vs Sinusoidal

Feature                   | Sinusoidal (Original) | RoPE (Llama, Mistral) | ALiBi (MPT, BLOOM)
Mechanism                 | sin/cos functions     | Vector rotation       | Linear biases
Relative Position         | Implicit              | Explicit ✓            | Explicit ✓
Trainable Parameters      | None                  | None                  | None (only fixed m)
Extrapolation without FT  | ❌ Very poor          | ❌ Needs FT           | ✓ Works directly
Extrapolation with FT     | ⚠ Poor                | ✓ Excellent           | ✓ Excellent
Computational Cost        | Low                   | Higher (rotations)    | Very low
Fig. 3 | Extrapolation test: perplexity over sequence length after training on 4K tokens. Sinusoidal degrades quickly, RoPE needs fine-tuning, ALiBi works zero-shot even at 100K+.

Practical Implementation Details

RoPE Implementation:
For each dimension pair i, both Q and K are rotated before the attention product:

q[d_i:d_i+2] = Rotate2D(q[d_i:d_i+2], θ_i · pos)
k[d_i:d_i+2] = Rotate2D(k[d_i:d_i+2], θ_i · pos)

Rotate2D([x, y], θ) = [x·cos(θ) - y·sin(θ), x·sin(θ) + y·cos(θ)]
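
A runnable sketch of this rotation over a full vector, assuming NumPy and the θ_i schedule defined earlier; apply_rope is an illustrative name, not a specific library API.

import numpy as np

def apply_rope(x, pos, base=10000.0):
    # Rotate each consecutive 2D pair of x by the angle pos * θ_i (d must be even)
    d = x.shape[-1]
    theta = base ** (-2.0 * np.arange(d // 2) / d)   # θ_i = 10000^(-2i/d)
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]              # the (x, y) of each pair
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Applied to both queries and keys before the attention dot product
q_rot = apply_rope(np.random.randn(64), pos=7)
k_rot = apply_rope(np.random.randn(64), pos=3)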

ALiBi Implementation:
After computing QKT, simply add the bias:

scores = (Q @ KT) / √dk
bias = -m · absolute_distance_matrix
scores += bias

The slopes m for the different heads are often set to m_h = 2^(−8h/H) for h ∈ [1, H], where H is the number of heads (for H = 8: 1/2, 1/4, …, 1/256).
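
A small sketch of this slope schedule (alibi_slopes is an illustrative name), assuming H is a power of two as in the simplest case:

def alibi_slopes(num_heads):
    # Geometric schedule m_h = 2^(-8h/H) for h = 1..H
    return [2.0 ** (-8.0 * h / num_heads) for h in range(1, num_heads + 1)]

print(alibi_slopes(8))
# [0.5, 0.25, 0.125, 0.0625, 0.03125, 0.015625, 0.0078125, 0.00390625]

Each head thus penalizes distance at a different rate, which produces the head specialization mentioned above.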

Key Insights

🔄 Relative Positions

Both RoPE and ALiBi encode relative position (m-n), not absolute. That's the key to extrapolation.

📏 Context Window Limits

Trained on 4K doesn't mean "max 4K". With the right position encodings, extrapolation to 1M+ is possible.

⚡ Zero-Shot vs Fine-Tuning

ALiBi allows zero-shot extrapolation. RoPE needs ~1000 FT steps, but is more stable after FT.

🎯 Head Specialization

With ALiBi: Each head has its own slope m. Allows different distance sensitivities.

🚀 Modern Standard

RoPE is standard in modern models (Llama 3, Mistral). ALiBi is also increasingly used due to zero-shot advantages.

💰 Computational Costs

ALiBi is easier to implement. RoPE needs more rotation operations, but isn't significantly more expensive.