Scenario: Model trained on 4K tokens, tested on longer sequences (without fine-tuning)
[Figure: extrapolation performance heatmap (legend: good >95%, moderate <85%, poor <70% quality) plus an accuracy-drop vs. extrapolation-ratio curve for sinusoidal, RoPE, and ALiBi]
Fig. 1 | The heatmap shows how three position-encoding methods handle length extrapolation. ALiBi (bottom) remains stable across all extrapolation ratios, RoPE degrades moderately, and sinusoidal encoding collapses quickly beyond 2× the training length.
🔴 Sinusoidal (Original Transformer)
Formula: PE(pos, 2i) = sin(pos / 10000^(2i/d))
Positions are encoded as fixed Fourier (sine/cosine) frequencies and added directly to the token embeddings.
Max Extrapolation: 1.5-2.0×
Accuracy at 4×: 45%
Fine-tuning needed: Yes (1K+ steps)
Models: Original Transformer (Vaswani et al., 2017)
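A minimal NumPy sketch of the sinusoidal formula above (the `sinusoidal_positions` helper name and the example sizes are illustrative):

```python
import numpy as np

def sinusoidal_positions(seq_len: int, d_model: int) -> np.ndarray:
    """Fixed sinusoidal encodings: PE(pos, 2i) = sin(pos / 10000^(2i/d))."""
    positions = np.arange(seq_len)[:, None]              # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]              # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                           # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                           # odd dimensions: cosine
    return pe

# Positions beyond the training length (e.g. 4K) still produce valid values,
# but the model has never seen those phase combinations -> poor extrapolation.
pe = sinusoidal_positions(seq_len=8192, d_model=512)
```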
🟡 RoPE (Rotary Position Embedding)
Formula: position pos applied as a rotation of the query/key vectors in 2-D subspaces
For longer contexts, Position Interpolation (PI) rescales the position indices to fit within the trained range.
Max Extrapolation: 4-8×
Accuracy at 8×: 82%
Fine-tuning needed: 1000 steps recommended
Models: Llama, Mistral, PaLM
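A minimal sketch of the rotation idea, assuming a hypothetical `apply_rope` helper that pairs adjacent dimensions into 2-D subspaces (names and shapes are illustrative, not any specific library's API):

```python
import numpy as np

def apply_rope(x: np.ndarray, positions: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotate each pair of dimensions of x (seq_len, d) by a position-dependent angle."""
    seq_len, d = x.shape
    freqs = 1.0 / (base ** (np.arange(0, d, 2) / d))     # one frequency per 2-D subspace
    angles = positions[:, None] * freqs[None, :]         # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                      # split vector into pairs
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                   # standard 2-D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = np.random.randn(8192, 128)                           # e.g. one attention head's queries
q_rot = apply_rope(q, positions=np.arange(8192, dtype=np.float64))
```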
🟢 ALiBi (Attention with Linear Biases)
Formula: softmax(Q·K^T + m·[-(i-1), ..., -2, -1, 0])
Linear biases directly in attention. Head-specific slopes.
Max Extrapolation: 32×+ without drop!
Accuracy at 16×: 91%
Fine-tuning needed: No!
Models: BLOOM, MPT
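A rough sketch of the ALiBi bias from the formula above, using the standard geometric head slopes (the `alibi_slopes`/`alibi_bias` helpers are illustrative, not taken from a particular library):

```python
import numpy as np

def alibi_slopes(n_heads: int) -> np.ndarray:
    """Fixed head-specific slopes m_i = 1 / 2^(8*i / n_heads), i = 1..n_heads."""
    return 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)

def alibi_bias(seq_len: int, n_heads: int) -> np.ndarray:
    """Per-head bias added to Q·K^T before softmax: -m * (query_pos - key_pos)."""
    i = np.arange(seq_len)[:, None]                      # query positions
    j = np.arange(seq_len)[None, :]                      # key positions
    distance = np.maximum(i - j, 0)                      # 0 on and above the diagonal
    slopes = alibi_slopes(n_heads)[:, None, None]        # (heads, 1, 1)
    return -slopes * distance                            # (heads, seq_len, seq_len)

# scores: (heads, seq_len, seq_len) raw attention logits
# scores = scores + alibi_bias(seq_len, n_heads)   # then apply causal mask + softmax
bias = alibi_bias(seq_len=2048, n_heads=8)
```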
| Metric | Sinusoidal | RoPE | ALiBi |
| --- | --- | --- | --- |
| Training sequence length | 4K | 4K | 1024 (BLOOM) |
| Safe up to (no drop) | 4K (1.0×) | 32K with PI (8×) | 128K+ (128×!) |
| Accuracy at 8K (2×) | 78% | 88% | 96% |
| Accuracy at 32K (8×) | 52% | 82% | 91% |
| Accuracy at 128K (32×) | 25% (collapses) | 65% (without PI) | 88% |
| Fine-tuning (1K steps) | +20 pp possible | +12 pp possible | +2 pp (already near optimal) |
| Computational cost | Low | Medium (rotations) | Very low |
| Memory overhead | None | None | None |
| Head-dependent slopes? | No | No | Yes (different m per head) |
| Recommended for | Short sequences (<4K) | Medium (<128K) with PI | Very long (>128K) contexts |
📏
ALiBi extrapolates 32× without fine-tuning
BLOOM was trained on 1024 tokens and runs on 32K+. Sinusoidal collapses at 2×; RoPE with interpolation reaches roughly 8×. ALiBi stands out for length generalization.
🔧
RoPE with Position Interpolation is practical
Scale position indices (pos_new = pos_old × (training_len / target_len)). 1K fine-tuning steps are enough for 8-32× extrapolation. Modern standard for large models.
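A minimal sketch of that index scaling (the `interpolated_positions` helper is hypothetical; the scaled positions would then be fed into the RoPE rotation):

```python
import numpy as np

def interpolated_positions(target_len: int, training_len: int = 4096) -> np.ndarray:
    """Position Interpolation: pos_new = pos_old * (training_len / target_len)."""
    scale = min(1.0, training_len / target_len)   # only compress, never stretch
    return np.arange(target_len) * scale

# A 32K context is squeezed onto positions 0 .. ~4095.9 before applying RoPE.
positions = interpolated_positions(target_len=32768, training_len=4096)
```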
⚠️
Sinusoidal collapses quickly
At 8× the training length: 52% accuracy; at 32×: only 25%. The Fourier frequencies are not designed for extrapolation, so all modern long-context models use RoPE or ALiBi.
📊
Accuracy drop is position-dependent
Early tokens (start of context) remain largely stable; middle and late positions degrade more sharply. The U-curve effect (Lost in the Middle) gets worse under extrapolation.
💡
Head-specific slopes make ALiBi robust
ALiBi uses fixed slopes m = 1/2^(8i/h) for head i of h heads. Each head thereby gets a different distance sensitivity, which lets the linear biases extrapolate to arbitrary lengths.
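Evaluating that slope formula for h = 8 heads, as a quick sanity check:

```python
# Slopes m_i = 1 / 2^(8*i / h) for h = 8 heads, i = 1..8:
slopes = [1 / 2 ** (8 * i / 8) for i in range(1, 9)]
# -> [0.5, 0.25, 0.125, 0.0625, 0.03125, 0.015625, 0.0078125, 0.00390625]
```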
🎯
Choice depends on use case
RAG/Long-Context: ALiBi or RoPE. Training efficiency: RoPE (mainstream). Maximum length: ALiBi. All three are in production models in 2025.