Scenario: Model trained on 4K tokens, tested on longer sequences (without fine-tuning)
[Figure: extrapolation performance heatmap (legend: good >95%, moderate <85%, poor <70% quality) plus an accuracy-drop vs. extrapolation-ratio curve for sinusoidal, RoPE, and ALiBi]
Fig. 1 | The heatmap shows how three position-encoding methods handle length extrapolation. ALiBi (bottom) remains stable across all extrapolation ratios, RoPE degrades moderately, and sinusoidal encoding collapses quickly beyond 2× the training length.
🔴 Sinusoidal (Original Transformer)
Formula: PE(pos, 2i) = sin(pos / 10000^(2i/d))
Positions are encoded as fixed Fourier (sine/cosine) frequencies and added directly to the token embeddings.
Max Extrapolation: 1.5-2.0×
Accuracy at 4×: 45%
Fine-tuning needed: Yes (1K+ steps)
Models: Original Transformer (Vaswani et al., 2017)
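A minimal NumPy sketch of the sinusoidal formula above (the `sinusoidal_positions` helper name and the example sizes are illustrative):

```python
import numpy as np

def sinusoidal_positions(seq_len: int, d_model: int) -> np.ndarray:
    """Fixed sinusoidal encodings: PE(pos, 2i) = sin(pos / 10000^(2i/d))."""
    positions = np.arange(seq_len)[:, None]              # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]              # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                           # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                           # odd dimensions: cosine
    return pe

# Positions beyond the training length (e.g. 4K) still produce valid values,
# but the model has never seen those phase combinations -> poor extrapolation.
pe = sinusoidal_positions(seq_len=8192, d_model=512)
```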
🟡 RoPE (Rotary Position Embedding)
Formula: position pos applied as a rotation of the query/key vectors in 2-D subspaces
For longer contexts, Position Interpolation (PI) rescales the position indices to fit within the trained range.
Max Extrapolation: 4-8×
Accuracy at 8×: 82%
Fine-tuning needed: 1000 steps recommended
Models: Llama, Mistral, PaLM
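A minimal sketch of the rotation idea, assuming a hypothetical `apply_rope` helper that pairs adjacent dimensions into 2-D subspaces (names and shapes are illustrative, not any specific library's API):

```python
import numpy as np

def apply_rope(x: np.ndarray, positions: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotate each pair of dimensions of x (seq_len, d) by a position-dependent angle."""
    seq_len, d = x.shape
    freqs = 1.0 / (base ** (np.arange(0, d, 2) / d))     # one frequency per 2-D subspace
    angles = positions[:, None] * freqs[None, :]         # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                      # split vector into pairs
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                   # standard 2-D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = np.random.randn(8192, 128)                           # e.g. one attention head's queries
q_rot = apply_rope(q, positions=np.arange(8192, dtype=np.float64))
```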
🟢 ALiBi (Attention with Linear Biases)
Formula: softmax(Q·K^T + m·[-(i-1), ..., -2, -1, 0])
Linear biases directly in attention. Head-specific slopes.
Max Extrapolation: 32×+ without drop!
Accuracy at 16×: 91%
Fine-tuning needed: No!
Models: BLOOM, MPT
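A rough sketch of the ALiBi bias from the formula above, using the standard geometric head slopes (the `alibi_slopes`/`alibi_bias` helpers are illustrative, not taken from a particular library):

```python
import numpy as np

def alibi_slopes(n_heads: int) -> np.ndarray:
    """Fixed head-specific slopes m_i = 1 / 2^(8*i / n_heads), i = 1..n_heads."""
    return 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)

def alibi_bias(seq_len: int, n_heads: int) -> np.ndarray:
    """Per-head bias added to Q·K^T before softmax: -m * (query_pos - key_pos)."""
    i = np.arange(seq_len)[:, None]                      # query positions
    j = np.arange(seq_len)[None, :]                      # key positions
    distance = np.maximum(i - j, 0)                      # 0 on and above the diagonal
    slopes = alibi_slopes(n_heads)[:, None, None]        # (heads, 1, 1)
    return -slopes * distance                            # (heads, seq_len, seq_len)

# scores: (heads, seq_len, seq_len) raw attention logits
# scores = scores + alibi_bias(seq_len, n_heads)   # then apply causal mask + softmax
bias = alibi_bias(seq_len=2048, n_heads=8)
```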
| Metric | Sinusoidal | RoPE | ALiBi |
| --- | --- | --- | --- |
| Training sequence length | 4K | 4K | 1024 (BLOOM) |
| Safe up to (no drop) | 4K (1.0×) | 32K with PI (8×) | 128K+ (128×!) |
| Accuracy at 8K (2×) | 78% | 88% | 96% |
| Accuracy at 32K (8×) | 52% | 82% | 91% |
| Accuracy at 128K (32×) | 25% (collapses) | 65% (without PI) | 88% |
| Fine-tuning (1K steps) | +20 pp possible | +12 pp possible | +2 pp (already near optimal) |
| Computational cost | Low | Medium (rotations) | Very low |
| Memory overhead | None | None | None |
| Head-dependent slopes? | No | No | Yes (different m per head) |
| Recommended for | Short sequences (<4K) | Medium (<128K) with PI | Very long (>128K) contexts |
📏
ALiBi extrapolates 32× without fine-tuning
BLOOM was trained on 1024 tokens and runs on 32K+. Sinusoidal collapses at 2×; RoPE with interpolation reaches roughly 8×. ALiBi stands out for length generalization.
🔧
RoPE with Position Interpolation is practical
Scale position indices (pos_new = pos_old × (training_len / target_len)). 1K fine-tuning steps are enough for 8-32× extrapolation. Modern standard for large models.
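A minimal sketch of that index scaling (the `interpolated_positions` helper is hypothetical; the scaled positions would then be fed into the RoPE rotation):

```python
import numpy as np

def interpolated_positions(target_len: int, training_len: int = 4096) -> np.ndarray:
    """Position Interpolation: pos_new = pos_old * (training_len / target_len)."""
    scale = min(1.0, training_len / target_len)   # only compress, never stretch
    return np.arange(target_len) * scale

# A 32K context is squeezed onto positions 0 .. ~4095.9 before applying RoPE.
positions = interpolated_positions(target_len=32768, training_len=4096)
```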
⚠️
Sinusoidal collapses quickly
At 8× the training length: 52% accuracy; at 32×: only 25%. The Fourier frequencies are not designed for extrapolation, so all modern long-context models use RoPE or ALiBi.
📊
Accuracy drop is position-dependent
Early tokens (start of context) remain largely stable; middle and late positions degrade more sharply. The U-curve effect (Lost in the Middle) gets worse under extrapolation.
💡
Head-specific slopes make ALiBi robust
ALiBi uses fixed slopes m = 1/2^(8i/h) for head i of h heads. Each head thereby gets a different distance sensitivity, which lets the linear biases extrapolate to arbitrary lengths.
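Evaluating that slope formula for h = 8 heads, as a quick sanity check:

```python
# Slopes m_i = 1 / 2^(8*i / h) for h = 8 heads, i = 1..8:
slopes = [1 / 2 ** (8 * i / 8) for i in range(1, 9)]
# -> [0.5, 0.25, 0.125, 0.0625, 0.03125, 0.015625, 0.0078125, 0.00390625]
```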
🎯
Choice depends on use case
RAG/Long-Context: ALiBi or RoPE. Training efficiency: RoPE (mainstream). Maximum length: ALiBi. All three are in production models in 2025.