How modern LLMs enable unlimited contexts with local attention and distributed compute
Sliding Window Attention breaks the O(n²) barrier with a simple observation: not every token needs to attend to every other token. A local window of 4K-8K tokens covers most dependencies, and across 32+ layers an effective context of 128K+ tokens is built up.
Sliding Window Attention is the bridge between KV-cache optimization and distributed computing. Models like Mistral 7B use a 4K window to reach a 32K effective context while the rolling KV cache stays bounded instead of growing with the sequence.
Mistral 7B beats Llama 2 13B – partly because Sliding Window Attention makes more efficient use of the same GPU. The technique is one of the building blocks behind practical 1M+ token context windows.
Standard Self-Attention has quadratic complexity O(n²) in memory and time. For long sequences this quickly becomes infeasible:
At 128K tokens, the attention matrix has 128K × 128K ≈ 16.4 billion entries; at 32-bit precision that is roughly 64 GB of RAM for a single attention matrix!
During inference, the model must keep the K/V vectors of every previous token. This cache grows linearly with the sequence length, without any bound.
Each new token must be compared with all previous ones: O(n) time per token, O(n²) for the whole sequence.
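A quick back-of-the-envelope check makes these numbers concrete. The sketch below is plain Python; the layer count (32) and hidden size (4,096) are assumptions for a 7B-class model, the attention matrix is counted per head and per layer in fp32, and the KV cache in fp16.

```python
def attention_matrix_bytes(seq_len: int, bytes_per_value: int = 4) -> int:
    """Size of one dense seq_len x seq_len attention matrix (per head, per layer)."""
    return seq_len * seq_len * bytes_per_value

def kv_cache_bytes(seq_len: int, n_layers: int = 32, hidden: int = 4096,
                   bytes_per_value: int = 2) -> int:
    """Naive fp16 KV cache: 2 tensors (K and V) * layers * hidden per token."""
    return 2 * n_layers * hidden * bytes_per_value * seq_len

for n in (4_096, 32_768, 131_072):
    print(f"{n:>8} tokens | attn matrix: {attention_matrix_bytes(n)/2**30:6.1f} GiB"
          f" | KV cache: {kv_cache_bytes(n)/2**30:6.1f} GiB")
```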
The idea: tokens don't need to attend to the entire history. Instead, each token attends only to the last W tokens, which reduces the complexity from O(n²) to O(n×W).
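To make the window idea concrete, here is a minimal sketch of a sliding-window mask in PyTorch (the function name and shapes are illustrative; production implementations such as FlashAttention fuse this constraint into the kernel instead of materializing a mask):

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """True where query i may attend to key j: causal AND within the last `window` tokens."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    causal = j <= i
    local = (i - j) < window
    return causal & local

mask = sliding_window_mask(seq_len=8, window=3)
print(mask.int())
# Row 5 attends only to positions 3, 4, 5 -- the last W=3 tokens.
```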
Example: Mistral 7B with W=4,096
| Metric (100K-token sequence) | Standard Attention | Sliding Window (W=4K) | Savings |
|---|---|---|---|
| Attention ops | ~10 billion | ~409 million | ~24× |
| KV-cache memory | 20 GB (grows with n) | 800 MB (bounded) | ~25× |
| Latency per token | O(n) | O(W) = constant | grows linearly with n |
In a single layer, a token sees only the previous W tokens. But information propagates upward through the network, and that is the crucial point: the effective receptive field (ERF) = W × number of layers (a small calculation follows the table below).
| Model | Window Size W | Layers L | Effective Reach (W × L) | Practical use |
|---|---|---|---|---|
| Mistral 7B | 4,096 | 32 | 131,072 | Long docs OK |
| Llama 2 70B | Full (no SWA) | 80 | 4,096 (training limit) | Pre-trained limit |
| Llama 3 | 8,192 (hypothetical SWA variant) | 80 | ~655K | Very long docs |
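The effective-reach column is simply W × L; a tiny sketch of that arithmetic (the second config is the hypothetical 8K-window, 80-layer variant from the table, not an official model spec):

```python
# Effective receptive field (ERF) = window size * number of layers.
configs = {
    "Mistral 7B (W=4096, L=32)": (4_096, 32),
    "Hypothetical 8K-window, 80-layer model": (8_192, 80),
}
for name, (window, layers) in configs.items():
    print(f"{name}: ERF = {window * layers:,} tokens")
```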
Sliding Window helps on a single GPU. But for 1M+ token contexts, where full attention (not just local) is needed, researchers use Ring Attention.
The idea: distribute the sequence across N GPUs in a ring topology. K/V blocks circulate around the ring while each GPU processes its query tokens against every K/V block.
Context length scales with the number of GPUs: with N GPUs, context = (single-GPU context) × N, so 2 GPUs give 2× the context, 4 GPUs 4×, and so on.
Each GPU stores only 1/N of the sequence, so memory per GPU stays constant. This enables 1M-token contexts with bounded per-device memory.
The ring topology needs only neighbor-to-neighbor communication, which can be overlapped with computation, and the result is exact full attention (no approximation).
The cost: roughly 10-30% latency overhead from communication.
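Below is a single-process sketch of the ring schedule in NumPy. It is purely illustrative: there are no real devices or collectives, causal masking is omitted, and instead of the online-softmax accumulation a real Ring Attention implementation uses, each simulated device simply gathers the K/V blocks it receives over the ring and computes softmax at the end.

```python
import numpy as np

def ring_attention(q_shards, k_shards, v_shards):
    """Illustrative ring schedule: each 'device' keeps its Q shard while K/V
    blocks rotate around the ring, one hop per step."""
    n_dev = len(q_shards)
    k_blocks = list(k_shards)          # each device starts with its own K/V block
    v_blocks = list(v_shards)
    gathered_k = [[] for _ in range(n_dev)]
    gathered_v = [[] for _ in range(n_dev)]
    for step in range(n_dev):
        for dev in range(n_dev):
            gathered_k[dev].append(k_blocks[dev])
            gathered_v[dev].append(v_blocks[dev])
        # Rotate: every device passes its current K/V block to the next device.
        k_blocks = k_blocks[-1:] + k_blocks[:-1]
        v_blocks = v_blocks[-1:] + v_blocks[:-1]
    outputs = []
    for dev in range(n_dev):
        k = np.concatenate(gathered_k[dev])   # all keys, seen via the ring
        v = np.concatenate(gathered_v[dev])
        scores = q_shards[dev] @ k.T / np.sqrt(k.shape[-1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        outputs.append(weights @ v)           # full attention for this Q shard
    return np.concatenate(outputs)

# 4 "GPUs", each holding 1/4 of a 16-token sequence with dimension 8.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
shards = lambda x: np.split(x, 4)
out = ring_attention(shards(q), shards(k), shards(v))
print(out.shape)  # (16, 8): same result as full (non-causal) attention
```

Because softmax attention is permutation-invariant over the keys, the order in which blocks arrive around the ring does not change the result; what matters is that each device only ever needs to hold one extra K/V block at a time.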
To truly enable long contexts, multiple techniques are combined:
| Optimization | Savings | Technique |
|---|---|---|
| Flash Attention 2/3 | ~4× Memory | IO-Aware Attention Computation |
| GQA (8 KV-Heads) | ~8× KV-Cache | Grouped Query Attention (Llama 2 70B uses this) |
| INT8 Quantization | ~2× KV-Cache | KV-Cache at 8-bit instead of 16-bit |
| Combined Effect | ~64× total | Multiplies: 4 × 8 × 2 = 64 |
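A sketch of how two of these factors compound on the KV cache for a Llama-2-70B-class model (the dimensions are assumptions based on its public shape: 80 layers, 64 attention heads of dim 128, 8 KV heads with GQA; Flash Attention's ~4× applies to the attention computation itself, not to the KV cache, so it is not shown here):

```python
def kv_cache_gb(tokens: int, layers: int = 80, kv_heads: int = 64,
                head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """KV cache size in GB: 2 tensors (K and V) per layer per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens / 1e9

tokens = 128_000
baseline = kv_cache_gb(tokens)                                       # MHA, fp16
with_gqa = kv_cache_gb(tokens, kv_heads=8)                           # 8 KV heads -> ~8x
with_gqa_int8 = kv_cache_gb(tokens, kv_heads=8, bytes_per_value=1)   # + INT8 -> ~16x
print(f"baseline {baseline:.0f} GB, +GQA {with_gqa:.0f} GB, +INT8 {with_gqa_int8:.0f} GB")
```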
1M+ tokens: proprietary optimizations, likely based on Flash Attention + MQA + distributed attention.
200K tokens: efficient attention mechanisms plus context-window design for practical workloads.
128K tokens: optimized inference with efficient attention mechanisms for long prompts.
| Technique | Formula | Complexity | Quality loss |
|---|---|---|---|
| Full Attention | Attn(Q, K, V) | O(n²) | Baseline |
| Sliding Window | Attn(Q, K[-W:], V[-W:]) | O(n×W) | ~0-3% |
| Strided Attention | Local + every k-th token | O(n×W + n²/k) | ~2-5% |
| Blockwise | Blocks attend only to blocks | O(n²/B²) | ~5-10% |
| Longformer | Local + global tokens | O(n×W + n×G) | ~1-4% |
A newer variant (attention sinks): keep the initial tokens forever. The observation: the first ~4 tokens receive disproportionately high attention from all positions; these are "sink tokens".
Dynamic eviction: keep the tokens with the highest attention scores, so eviction is based on content, not just position.
Compression: compress distant tokens (e.g. via pooling) and attend to the compressed representations for the older parts of the context.
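A minimal sketch of a sink-plus-window cache policy in Python (the class and its fields are hypothetical; real implementations store tensors and also have to adjust position encodings when tokens are evicted):

```python
from collections import deque

class SinkWindowCache:
    """Keeps the first `n_sink` tokens forever plus the most recent `window` tokens.
    Entries are opaque (K, V) records; everything else about the model is omitted."""

    def __init__(self, n_sink: int = 4, window: int = 4096):
        self.n_sink = n_sink
        self.sink = []                      # never evicted
        self.recent = deque(maxlen=window)  # oldest non-sink entry drops automatically

    def append(self, kv_entry):
        if len(self.sink) < self.n_sink:
            self.sink.append(kv_entry)
        else:
            self.recent.append(kv_entry)

    def entries(self):
        return self.sink + list(self.recent)

cache = SinkWindowCache(n_sink=4, window=8)
for pos in range(20):
    cache.append(("K", "V", pos))
print([e[2] for e in cache.entries()])  # [0, 1, 2, 3, 12, 13, ..., 19]
```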
Not every token needs to attend to every other token; local patterns are sufficient for many tasks.
Information propagates through the layers: ERF = W × L yields effectively long contexts.
Sliding Window cuts attention cost to O(n×W) instead of O(n²) and keeps the KV cache bounded. Simple but powerful for practical scenarios.
Distributed attention via a ring topology enables true 1M+ token contexts with full attention.
Flash Attention + GQA + quantization multiply to roughly 64× savings (4 × 8 × 2); position-encoding extensions complete the stack. The full stack is needed.
Some tokens are "more important" (they receive more attention). Content-aware eviction beats simple position-based windows.