How modern models extend their context windows: Rotary and Linear Bias Position Encoding
RoPE vs. ALiBi – two paths to the same goal: context extension beyond the training length. RoPE rotates query/key vectors based on position, ALiBi adds linear biases to the attention scores. Llama, GPT-NeoX, and most modern models use RoPE; BLOOM and some older models use ALiBi.
📖 Learning Context
🎯 Learning Objectives
Understand the fundamental differences between RoPE and ALiBi
Recognize why RoPE dominates today (interpolatable rotations vs. fixed linear biases)
Follow the context extension problem: Why 4K → 100K?
🗺️ Context: Where are we?
Step 2/6: Optimizations & Memory
This overview connects the detail visualizations (RoPE Rotation, ALiBi Heatmap) and explains when each method is appropriate.
💡 Why It Matters
The choice between RoPE and ALiBi affects how a model can be extended to longer contexts: RoPE allows Position Interpolation (4K → 32K with only light fine-tuning rather than retraining), while ALiBi extrapolates natively but loses precision at long range.
📌 Key Takeaways
RoPE: Rotates Q/K vectors, relative position encoded through rotation
ALiBi: Adds fixed negative bias proportional to distance
Practice: RoPE with NTK-aware scaling or YaRN is today's standard for 128K+ contexts
The Context Extension Problem
LLMs are typically trained on 4K tokens. The question is: How can they extrapolate to 100K+ tokens?
The problem: Position information is tightly coupled to training sequence length.
Sinusoidal Positional Encodings (original Transformer paper) extrapolate poorly to sequences longer than those seen in training.
There are two modern solutions:
RoPE (Rotary Position Embedding): Position information through rotation. Requires Position Interpolation fine-tuning
ALiBi (Attention with Linear Biases): Linear biases on attention scores. Zero-shot extrapolation possible
RoPE – Rotary Position Embedding
RoPE encodes position by rotating pairs of query/key dimensions as 2D vectors.
The critical point: the attention score between positions m and n depends only on the relative distance (m − n), not on the absolute positions.
RoPE rotation:
$\langle R_{\Theta,m}\,q_m,\; R_{\Theta,n}\,k_n\rangle \;=\; q_m^{\top}\,R_{\Theta,\,n-m}\,k_n$
where $R_{\Theta,m}$ rotates each 2D pair $i$ of the vector by the angle $m\,\theta_i$, with $\theta_i = 10000^{-2i/d}$, $i = 0,\dots,d/2-1$
The attention score depends only on (m − n), not on m or n alone
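For concreteness, here is a minimal sketch of that rotation in PyTorch (the function name rope_rotate and the (seq_len, d) layout are illustrative, not any particular library's API):

```python
import torch

def rope_rotate(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate each consecutive pair of dimensions of x by a position-dependent angle.

    x:         (seq_len, d) query or key vectors, d even
    positions: (seq_len,) token positions
    """
    d = x.shape[-1]
    # Frequencies theta_i = base^(-2i/d) for each 2D pair, matching the formula above.
    theta = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)      # (d/2,)
    angles = positions.float()[:, None] * theta[None, :]                   # (seq_len, d/2)
    cos, sin = angles.cos(), angles.sin()

    x1, x2 = x[..., 0::2], x[..., 1::2]                                    # the 2D pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin                                   # standard 2D rotation
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Because both q and k are rotated, the dot product between q at position m and
# k at position n depends only on the relative offset (m - n):
# scores = rope_rotate(q, pos) @ rope_rotate(k, pos).T
```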
Position Interpolation for Extrapolation:
By rescaling position indices, a model trained on 4K can be extended to 100K+ (see the sketch after Fig. 1):
New positions 1-100000 are scaled down into the trained range 1-4000
Only ~1000 fine-tuning steps needed
Minimal quality loss on longer sequences
Fig. 1 | Position Interpolation: the trained position range 1-4000 is stretched to cover positions 1-100000, i.e. new position indices are scaled down into the trained range. The rescaling enables extension without massive fine-tuning.
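Position Interpolation itself is just a rescaling of the position indices fed into the RoPE rotation. A minimal sketch under that assumption (the 4K/100K numbers mirror the example above; the helper name is illustrative):

```python
import torch

def interpolated_positions(seq_len: int, trained_len: int = 4_000, target_len: int = 100_000) -> torch.Tensor:
    """Scale positions of a long sequence back into the range seen during training."""
    scale = trained_len / target_len               # e.g. 4000 / 100000 = 0.04
    return torch.arange(seq_len, dtype=torch.float32) * scale

# Usage: pass the scaled (now fractional) positions to the RoPE rotation instead of
# raw indices, then run ~1000 fine-tuning steps so the model adapts to the denser angles.
# q_rot = rope_rotate(q, interpolated_positions(seq_len))
```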
ALiBi – Attention with Linear Biases
ALiBi is conceptually simpler: instead of position embeddings, it adds static linear biases directly to the attention scores based on token distance.
Zero-shot extrapolation: a model trained on 1K tokens runs on 2K and far longer sequences without adjustment
No trainable parameters: the slopes m are fixed, not learned
Head specialization: each head has its own fixed slope m, so different heads are sensitive to different distance ranges
Easy to implement: just a constant bias subtracted from the attention scores before the softmax (see the sketch after Fig. 2)
Fig. 2 | ALiBi bias matrix: the diagonal is 0 (a token attending to itself), and the bias grows linearly more negative with distance. Different heads have different slopes m (different colors = different m values).
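A minimal sketch of how such a bias matrix could be built (the helper name alibi_bias is illustrative; the slope schedule assumes the number of heads is a power of two, as in the ALiBi paper):

```python
import torch

def alibi_bias(seq_len: int, num_heads: int) -> torch.Tensor:
    """Build the fixed (num_heads, seq_len, seq_len) bias added to attention logits."""
    # Per-head slopes m_h: the geometric sequence 2^(-8/n), 2^(-16/n), ..., 2^(-8)
    # from the ALiBi paper (assuming num_heads is a power of two). Nothing is trained.
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    pos = torch.arange(seq_len)
    # distance[i, j] = how far key j lies behind query i; 0 on the diagonal,
    # future positions clamped to 0 (they are removed by the causal mask anyway).
    distance = (pos[:, None] - pos[None, :]).clamp(min=0).float()          # (seq_len, seq_len)
    return -slopes[:, None, None] * distance[None, :, :]                   # (heads, seq, seq)

# scores = q @ k.transpose(-2, -1) / d**0.5 + alibi_bias(seq_len, num_heads)
```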
Comparison: RoPE vs ALiBi vs Sinusoidal
| Feature | Sinusoidal (Original) | RoPE (Llama, Mistral) | ALiBi (MPT, BLOOM) |
|---|---|---|---|
| Mechanism | sin/cos functions | Vector rotation | Linear biases |
| Relative position | Implicit | Explicit ✓ | Explicit ✓ |
| Trainable parameters | None | None | None (only fixed slopes m) |
| Extrapolation without FT | ❌ Very poor | ❌ Needs fine-tuning | ✓ Works directly |
| Extrapolation with FT | ⚠ Poor | ✓ Excellent | ✓ Excellent |
| Computational cost | Low | Higher (rotations) | Very low |
Fig. 3 | Extrapolation test: perplexity over sequence length for models trained on 4K tokens. Sinusoidal degrades quickly, RoPE needs fine-tuning, ALiBi works zero-shot at 100K+.