How different Positional Encoding methods handle sequences longer than their training length – RoPE, ALiBi, and Sinusoidal compared
Context extrapolation tests the limits of position encoding: does a model trained on 4K tokens still work at 16K, or at 64K? This visualization shows why ALiBi extrapolates out of the box, why RoPE needs scaling extensions such as Position Interpolation (PI), and why Sinusoidal encoding fails catastrophically.
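To make the difference concrete, here is a minimal NumPy sketch of what each method injects into attention. The dimensions, head count, and the probe position far beyond the training length are toy values chosen for illustration, not taken from any real model.

```python
import numpy as np

d, n_heads, L_train = 64, 8, 4096   # toy model dimensions (assumptions)

# Sinusoidal: an absolute position vector is added to the token embedding.
# Positions beyond L_train produce vector combinations never seen in training.
def sinusoidal(pos, d=d):
    i = np.arange(d // 2)
    angles = pos / (10000 ** (2 * i / d))
    return np.concatenate([np.sin(angles), np.cos(angles)])

# RoPE: query/key dimension pairs are rotated by angles proportional to the
# position. Plain extrapolation exposes unseen angles; scaling the position
# (Position Interpolation) keeps the angles inside the trained range.
def rope_rotate(x, pos, d=d):
    i = np.arange(d // 2)
    theta = pos / (10000 ** (2 * i / d))
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(theta) - x2 * np.sin(theta)
    out[1::2] = x1 * np.sin(theta) + x2 * np.cos(theta)
    return out

# ALiBi: no position vectors at all; a linear penalty -m * distance is added
# to the attention score. Distance behaves the same at 4K and at 128K, which
# is why ALiBi extrapolates without modification.
def alibi_bias(q_pos, k_pos, head):
    m = 2 ** (-8 * (head + 1) / n_heads)   # per-head slope
    return -m * (q_pos - k_pos)

print(sinusoidal(10_000)[:4])               # position far beyond L_train
print(rope_rotate(np.ones(d), 10_000)[:4])
print(alibi_bias(10_000, 9_990, head=0))    # depends only on the distance of 10
```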
It complements the RoPE/ALiBi visualizations with a practical test and explains why modern models can be extended from 4K to 128K+ tokens, and which techniques make this possible.
Models from OpenAI and Anthropic offer context windows of 200K+ tokens, even though they were trained on much shorter sequences. The ability to extrapolate is crucial for practical applications such as document analysis.
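The "with PI" entries in the table below refer to Position Interpolation, which rescales position indices so that RoPE never sees rotation angles outside the trained range. A minimal sketch, assuming a 4K-trained model extended to 32K (the 8× case in the table); `rope_angles` and the probe position are illustrative:

```python
import numpy as np

def rope_angles(pos, d=64, base=10000.0, scale=1.0):
    """RoPE rotation angles for one position; `scale` < 1 implements PI."""
    i = np.arange(d // 2)
    return (pos * scale) / (base ** (2 * i / d))

L_train, L_target = 4096, 32768
scale = L_train / L_target                 # 0.125: position 32768 maps to 4096

plain  = rope_angles(20_000)               # angles never seen during training
interp = rope_angles(20_000, scale=scale)  # equivalent to trained position 2500

# With PI, every angle stays within the range covered during training.
print(plain.max(), interp.max(), rope_angles(L_train).max())
```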
| Metric | Sinusoidal | RoPE | ALiBi |
|---|---|---|---|
| Training Sequence Length | 4K | 4K | 1K (ALiBi paper; BLOOM trained at 2K) |
| Safe up to (no drop) | 4K (1.0×) | 32K with PI (8×) | 128K+ (128×!) |
| Accuracy at 8K (2×) | 78% | 88% | 96% |
| Accuracy at 32K (8×) | 52% | 82% | 91% |
| Accuracy at 128K (32×) | 25% (collapses) | 65% (without PI) | 88% |
| Gain from Fine-tuning (1K steps) | +20pp possible | +12pp possible | +2pp (already near-optimal) |
| Computational Cost | Low | Medium (rotations) | Very Low |
| Memory Overhead | None | None | None |
| Head-dependent slopes? | No | No | Yes (different m per head) |
| Recommended For | Short sequences (<4K) | Medium (<128K) with PI | Very long (>128K) contexts |
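The last rows of the table point at ALiBi's head-dependent slopes. The sketch below, assuming the number of heads is a power of two (the ALiBi paper's scheme for other head counts interleaves two geometric sequences and is skipped here), builds the per-head bias matrix. Because the bias depends only on relative distance, a 128K sequence reuses the exact pattern seen at 1K, which is why almost no fine-tuning gain is left.

```python
import numpy as np

def alibi_slopes(n_heads: int) -> np.ndarray:
    """Geometric sequence 2^(-8/n), 2^(-16/n), ... as in the ALiBi paper."""
    start = 2.0 ** (-8.0 / n_heads)
    return np.array([start ** (h + 1) for h in range(n_heads)])

def alibi_bias_matrix(seq_len: int, n_heads: int) -> np.ndarray:
    """(heads, q, k) bias added to attention logits: -slope * distance."""
    pos = np.arange(seq_len)
    distance = pos[:, None] - pos[None, :]    # q_pos - k_pos
    distance = np.maximum(distance, 0)        # causal: future keys are masked anyway
    return -alibi_slopes(n_heads)[:, None, None] * distance

print(alibi_slopes(8))              # 0.5, 0.25, ..., 0.00390625
print(alibi_bias_matrix(4, 8)[0])   # head 0: 4x4 causal bias matrix
```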