The Context Extension Problem

LLMs are typically trained on sequences of around 4K tokens. The question is: how can they extrapolate to 100K+ tokens at inference time?

The problem: position information is tightly coupled to the training sequence length. Sinusoidal positional encodings (from the original Transformer paper) extrapolate poorly to sequences longer than those seen in training.

There are two modern solutions:

  • RoPE (Rotary Position Embedding): Position information through rotation. Requires Position Interpolation fine-tuning
  • ALiBi (Attention with Linear Biases): Linear biases on attention scores. Zero-shot extrapolation possible

RoPE – Rotary Position Embedding

RoPE encodes position by rotating each 2D pair of query/key dimensions. The critical point: the attention score between positions m and n depends only on the relative distance (m − n), not on the absolute positions.

RoPE Attention Score:

score(m, n) = ⟨R(m)·q_m, R(n)·k_n⟩ = ⟨q_m, R(n−m)·k_n⟩

where R(pos) rotates each 2D dimension pair i by the angle pos·θ_i, with θ_i = 10000^(−2i/d)

Attention therefore depends only on (m−n), not on m or n alone
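
A minimal numerical sketch of this property (assuming NumPy; rotate_2d and rope_score are illustrative names, not library functions): rotate a query/key pair by their positions and check that the dot product depends only on m − n.

import numpy as np

def rotate_2d(vec, angle):
    # Rotate a 2D vector by the given angle (RoPE applies this to each dimension pair)
    c, s = np.cos(angle), np.sin(angle)
    return np.array([c * vec[0] - s * vec[1],
                     s * vec[0] + c * vec[1]])

def rope_score(q, k, m, n, theta=0.01):
    # Attention score for one 2D pair: rotate q by m*theta, k by n*theta, then dot product
    return rotate_2d(q, m * theta) @ rotate_2d(k, n * theta)

q, k = np.array([1.0, 0.5]), np.array([0.3, 2.0])

# Same relative distance (n - m = 3) at very different absolute positions -> same score
print(rope_score(q, k, m=0, n=3))
print(rope_score(q, k, m=100, n=103))   # identical up to floating-point error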

Position Interpolation for Extrapolation:
By scaling positions, a model trained on 4K can be extended to 100K+ context (a minimal sketch follows Fig. 1):

  • The trained position range 1-4,000 is stretched to cover tokens 1-100,000 (equivalently, new positions are compressed into the trained range)
  • Only ~1,000 fine-tuning steps are needed
  • Little to no quality loss on longer sequences
Fig. 1 | Position Interpolation: the original position range 1-4,000 is stretched over 1-100,000. The scaling enables context extension without massive fine-tuning.
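
A minimal sketch of the scaling step, assuming the θ_i schedule above; scale_positions is an illustrative name.

import numpy as np

def scale_positions(positions, trained_len=4096, target_len=100_000):
    # Position Interpolation: compress target positions back into the trained range
    return np.asarray(positions, dtype=float) * (trained_len / target_len)

# Token 100,000 is mapped to position ~4096, which the model has already seen in training
print(scale_positions([0, 50_000, 100_000]))   # [0.0, 2048.0, 4096.0]
# The scaled position is then used for the RoPE angle: angle_i = scaled_pos * θ_i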

ALiBi – Attention with Linear Biases

ALiBi is conceptually simpler: Instead of trainable position embeddings, it adds static linear biases directly to attention scores based on token distance.

ALiBi Attention:

Attention(Q, K, V) = softmax((QKT) / √dk + bias_matrix) · V

bias[i,j] = -m · |i - j|

where m is a head-specific slope (not trained)
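
A minimal sketch of this bias matrix for a single head (assuming NumPy; build_alibi_bias is an illustrative name):

import numpy as np

def build_alibi_bias(seq_len, slope):
    # ALiBi bias for one head: 0 on the diagonal, -slope * |i - j| elsewhere
    pos = np.arange(seq_len)
    distance = np.abs(pos[None, :] - pos[:, None])
    return -slope * distance

print(build_alibi_bias(seq_len=4, slope=0.5))
# [[ 0.  -0.5 -1.  -1.5]
#  [-0.5  0.  -0.5 -1. ]
#  [-1.  -0.5  0.  -0.5]
#  [-1.5 -1.  -0.5  0. ]]

This bias is simply added to QKT / √dk before the softmax, as in the formula above.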

Major Advantages:

  • Zero-Shot Extrapolation: Model trained on 1K, works on 2K+, 100K+, 1M+ without adjustment
  • No trainable parameters: m is fixed, not trained
  • Head Specialization: Each head has its own slope m, learns different distance sensitivities
  • Easy to implement: a fixed additive bias on the attention scores
Fig. 2 | ALiBi Bias Matrix: Diagonal is 0 (self), then linearly more negative with distance. Different heads show different slopes m (different colors = different m values).

Comparison: RoPE vs ALiBi vs Sinusoidal

Feature                   | Sinusoidal (Original) | RoPE (Llama, Mistral) | ALiBi (MPT, BLOOM)
Mechanism                 | sin/cos functions     | Vector rotation       | Linear biases
Relative Position         | Implicit              | Explicit ✓            | Explicit ✓
Trainable Parameters      | None                  | None                  | None (only fixed m)
Extrapolation without FT  | ❌ Very poor          | ❌ Needs FT           | ✓ Works directly
Extrapolation with FT     | ⚠ Poor                | ✓ Excellent           | ✓ Excellent
Computational Cost        | Low                   | Higher (rotations)    | Very low
Fig. 3 | Extrapolation test: perplexity over sequence length after training on 4K tokens. Sinusoidal degrades quickly, RoPE needs fine-tuning, ALiBi works zero-shot even at 100K+.

Practical Implementation Details

RoPE Implementation:
For each dimension pair i, both Q and K are rotated before the attention product:

q[d_i:d_i+2] = Rotate2D(q[d_i:d_i+2], θ_i · pos)
k[d_i:d_i+2] = Rotate2D(k[d_i:d_i+2], θ_i · pos)

Rotate2D([x, y], θ) = [x·cos(θ) - y·sin(θ), x·sin(θ) + y·cos(θ)]
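
A runnable sketch of this rotation over a full vector, assuming NumPy and the θ_i schedule defined earlier; apply_rope is an illustrative name, not a specific library API.

import numpy as np

def apply_rope(x, pos, base=10000.0):
    # Rotate each consecutive 2D pair of x by the angle pos * θ_i (d must be even)
    d = x.shape[-1]
    theta = base ** (-2.0 * np.arange(d // 2) / d)   # θ_i = 10000^(-2i/d)
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]              # the (x, y) of each pair
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Applied to both queries and keys before the attention dot product
q_rot = apply_rope(np.random.randn(64), pos=7)
k_rot = apply_rope(np.random.randn(64), pos=3)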

ALiBi Implementation:
After computing QKT, simply add the bias:

scores = (Q @ KT) / √dk
bias = -m · absolute_distance_matrix
scores += bias

The slopes m for the different heads are often set to m_h = 2^(−8h/H) for h ∈ [1, H], where H is the number of heads (for H = 8: 1/2, 1/4, …, 1/256).
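
A small sketch of this slope schedule (alibi_slopes is an illustrative name), assuming H is a power of two as in the simplest case:

def alibi_slopes(num_heads):
    # Geometric schedule m_h = 2^(-8h/H) for h = 1..H
    return [2.0 ** (-8.0 * h / num_heads) for h in range(1, num_heads + 1)]

print(alibi_slopes(8))
# [0.5, 0.25, 0.125, 0.0625, 0.03125, 0.015625, 0.0078125, 0.00390625]

Each head thus penalizes distance at a different rate, which produces the head specialization mentioned above.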

Key Insights

🔄 Relative Positions

Both RoPE and ALiBi encode relative position (m-n), not absolute. That's the key to extrapolation.

📏 Context Window Limits

Trained on 4K doesn't mean "max 4K". With the right position encodings, extrapolation to 1M+ is possible.

⚡ Zero-Shot vs Fine-Tuning

ALiBi allows zero-shot extrapolation. RoPE needs ~1000 FT steps, but is more stable after FT.

🎯 Head Specialization

With ALiBi: Each head has its own slope m. Allows different distance sensitivities.

🚀 Modern Standard

RoPE is standard in modern models (Llama 3, Mistral). ALiBi is also increasingly used due to zero-shot advantages.

💰 Computational Costs

ALiBi is easier to implement. RoPE needs more rotation operations, but isn't significantly more expensive.