The Problem: O(n²) Complexity

Standard Self-Attention has quadratic complexity, O(n²), in both memory and time. For long sequences this quickly becomes impractical:

Fig. 1 | Memory and Time Complexity: Standard Attention (red) grows quadratically. Sliding Window (blue) remains linear. The slider shows memory requirements at different context lengths.

The Attention Matrix Problem

At 128K tokens, the attention matrix has 128K × 128K ≈ 17 billion entries. At 32-bit precision (4 bytes per entry) that is about 64 GiB of memory just for Attention!
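A quick back-of-the-envelope check of these numbers (a minimal sketch; it assumes a single dense n×n score matrix in float32 and ignores heads and batching):

```python
def attn_matrix_bytes(n_tokens: int, bytes_per_entry: int = 4) -> int:
    # One dense n x n attention score matrix at 32-bit precision.
    return n_tokens * n_tokens * bytes_per_entry

for n in (4_096, 32_768, 131_072):  # 4K, 32K, 128K tokens
    print(f"{n:>7} tokens -> {attn_matrix_bytes(n) / 2**30:7.2f} GiB")
# 131072 tokens -> 64.00 GiB, matching the figure above.
```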

KV-Cache Overhead

During inference, the model must cache the K/V vectors of all previous tokens. Without a cap, this cache grows linearly and without bound with sequence length.
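A rough sketch of how fast the KV-Cache grows, assuming a Mistral-7B-like configuration (32 layers, 8 KV heads of dimension 128, fp16); these parameter values are illustrative assumptions, not official figures:

```python
def kv_cache_bytes(n_tokens: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    # Each token stores one K and one V vector per layer and KV head (fp16 = 2 bytes).
    per_token = n_layers * n_kv_heads * head_dim * 2 * bytes_per_value
    return n_tokens * per_token

for n in (4_096, 32_768, 131_072):
    print(f"{n:>7} tokens -> {kv_cache_bytes(n) / 1e9:6.2f} GB")
# Without a cap the cache grows linearly; a sliding window bounds it at W tokens.
```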

Latency Problem

Each new token must be compared against all previous ones: O(n) time per token, O(n²) for the whole sequence.

Sliding Window Attention: The Solution

Idea: Tokens don't need to attend to the entire history. Instead, each token attends only to the last W tokens. This reduces complexity from O(n²) to O(n×W).

Fig. 2 | Attention Pattern: Left: Standard Attention (full lower triangle). Right: Sliding Window with W=4 (diagonal band only). The window "rolls" to the right with each new sequence position.

Mathematical Definition

Sliding Window Attention Calculation
Position i attends to positions max(0, i−W) through i, i.e. to itself and at most the W preceding tokens.

Attention(i) = softmax( Q[i] · K[max(0,i−W) : i+1]ᵀ / √d ) · V[max(0,i−W) : i+1]

Window "slides": Old tokens outside of W are ignored.

Memory and Speed Savings

Example: Mistral 7B with W=4,096

Metric | Standard Attention | Sliding Window (W=4K) | Savings
Attention ops (100K tokens) | ~10 billion | ~409 million | ~24×
KV-Cache memory | 20 GB (unbounded) | 800 MB (bounded) | ~25×
Latency per token | O(n) | O(W) = O(1) | total O(n) instead of O(n²)
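The operation counts in the first row follow directly from n and W; a one-line check (pure arithmetic, no model assumed):

```python
n, W = 100_000, 4_096
print(n * n, n * W, round((n * n) / (n * W), 1))  # 10,000,000,000 vs 409,600,000 ops -> ~24.4x
```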

✓ Advantages

  • Dramatic speedup (24-100×)
  • Bounded memory usage
  • True streaming possible
  • Easy to implement

⚠ Limitations

  • Tokens more than W positions back are not directly visible
  • No direct long-range dependencies
  • Potential quality loss
  • Fixed window (not adaptive)

Effective Receptive Field: Information Propagates Through Layers

Within a single layer, a token can only see W tokens. But information propagates upward through the network! This is the crucial point: the effective receptive field = W × number of layers.

Fig. 3 | Effective Receptive Field grows with depth. Layer 1: W tokens directly. Layer 2: 2×W (information hops W tokens per layer). Mistral 7B: 32 layers × 4K = 131K effective reach.
Effective Receptive Field Formula
ERF(L) = W × L

Example Mistral 7B:
W = 4,096 tokens
L = 32 layers
ERF = 4,096 × 32 = 131,072 tokens

→ Information from up to 131K tokens back can, in theory, reach a token at the end of the sequence!


Comparison: Modern Models and Their Reach

Model | Window Size W | Layers L | Effective Reach | Practical
Mistral 7B | 4,096 | 32 | 131,072 | Long-Docs OK
Llama 2 70B | Full (no SWA) | 80 | 4,096 (training) | Pre-trained limit
Llama 3 | 8,192 (variant) | 80 | ~655K | Very Long Docs

Ring Attention: For Even Longer Contexts (Multi-GPU)

Sliding Window helps on a single GPU. But for 1M+ token contexts where you need full attention (not just local), researchers use Ring Attention.

The idea: Distribute the sequence across N GPUs in a ring topology. K/V blocks circulate in the ring, while each GPU processes its Query tokens with all K/V blocks.

Fig. 4 | Ring Attention Topology: N GPUs in a ring. K/V blocks circulate, each GPU computes attention with all K/V (sequentially). Context length scales linearly with GPU count.

How Ring Attention Works

Ring Attention Workflow
Step 1: GPU i has Q[i], K[i], V[i]
Step 2: GPU i computes Attention(Q[i], K[i], V[i])
Step 3: K[i], V[i] are sent to GPU (i+1)
Step 4: GPU (i+1) receives the forwarded K/V and computes
Attention(Q[i+1], K[just received], V[just received])
Step 5: Steps 3-4 repeat until every K/V block has visited every GPU (N rounds in total)

After N rounds, every GPU's queries have attended to all K/V blocks.
The result equals full attention; the attention matrix only ever exists as distributed blocks across the GPUs, merged via running softmax statistics.
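A single-process simulation of this workflow (a sketch, not a distributed implementation): the N "devices" are just array chunks, the ring send is a loop index, causal masking is omitted for brevity, and exactness comes from merging per-block results with running softmax statistics.

```python
import numpy as np

def ring_attention_simulated(Q, K, V, n_devices: int):
    """Simulate Ring Attention for one head: split the sequence into n_devices
    blocks, let K/V blocks 'circulate', and merge partial results per query
    block via running log-sum-exp statistics (non-causal for brevity)."""
    d = Q.shape[-1]
    Qs, Ks, Vs = (np.array_split(X, n_devices) for X in (Q, K, V))
    outputs = []
    for i, Qi in enumerate(Qs):                        # "GPU" i holds query block i
        num = np.zeros_like(Qi)                        # running numerator
        den = np.zeros((Qi.shape[0], 1))               # running denominator
        m = np.full((Qi.shape[0], 1), -np.inf)         # running row max (stability)
        for step in range(n_devices):                  # which K/V block has arrived this round
            j = (i + step) % n_devices
            S = Qi @ Ks[j].T / np.sqrt(d)              # scores vs. the arriving block
            m_new = np.maximum(m, S.max(axis=-1, keepdims=True))
            scale, P = np.exp(m - m_new), np.exp(S - m_new)
            num = num * scale + P @ Vs[j]
            den = den * scale + P.sum(axis=-1, keepdims=True)
            m = m_new
        outputs.append(num / den)                      # exact softmax attention for this block
    return np.concatenate(outputs)
```

The output matches ordinary full attention up to floating-point error, which is the point: no approximation, only a different schedule of the same computation.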

Scaling with Ring Attention

📈 Scaling Laws

Context Length ∝ Number of GPUs

With N GPUs: Context = (Single GPU Context) × N

2 GPUs = 2× context, 4 GPUs = 4× context, etc.

💾 Memory Efficiency

Each GPU stores only 1/N of the sequence.

Memory per GPU stays constant as context and GPU count grow together!

Enables 1M tokens with bounded memory.

🔄 Communication

The ring topology needs only neighbor-to-neighbor transfers, keeping bandwidth demands low.

Full attention achieved (no approximation).

But: ~10-30% latency overhead.

Scaling to 1M+ Token Contexts

To truly enable long contexts, multiple techniques are combined:

Fig. 5 | Stacked Optimizations for Long-Context: Flash Attention 2/3 (Memory Efficiency), GQA/MQA (KV-Cache Reduction), Quantization, Position Interpolation, Ring Attention (Multi-GPU), yields ~16× memory efficiency total.

Combined Optimization Pipeline

Optimization | Savings | Technique
Flash Attention 2/3 | ~4× memory | IO-Aware Attention Computation
GQA (8 KV-Heads) | ~8× KV-Cache | Grouped Query Attention (Llama 2 70B uses this)
INT8 Quantization | ~2× KV-Cache | KV-Cache at 8-bit instead of 16-bit
Combined Effect | ~64× total | Multiplies: 4 × 8 × 2 = 64

Practical Examples

Gemini 1.5

1M+ tokens

Proprietary optimizations likely based on Flash Attention + MQA + Distributed Attention.

Claude 3

200K tokens

Efficient attention mechanisms + context window design for practical workloads.

GPT-4 Turbo

128K tokens

Optimized inference with efficient attention mechanisms for long prompts.

Implementation Details and Variants

Sliding Window vs. Alternatives

Technique | Formula | Complexity | Quality
Full Attention | Attn(Q, K, V) | O(n²) | Baseline
Sliding Window | Attn(Q, K[-W:], V[-W:]) | O(n×W) | -0-3%
Strided Attn | Local + every k-th token | O(n×W + n²/k) | -2-5%
Blockwise | Blocks only talk to blocks | O(n²/B²) | -5-10%
Longformer | Local + global tokens | O(n×W + n×G) | -1-4%
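To make the patterns in the table concrete, here is a small sketch that builds a dense boolean mask for a causal sliding window plus optional Longformer-style global tokens (the dense mask and the function name are illustrative; real implementations use sparse kernels instead):

```python
import numpy as np

def local_global_mask(n: int, window: int, global_idx=()) -> np.ndarray:
    """True where query i may attend to key j: causal sliding-window band,
    plus a few global tokens that see and are seen by all (causal) positions."""
    i, j = np.indices((n, n))
    causal = j <= i
    mask = causal & (j >= i - window)        # diagonal band: self + `window` predecessors
    for g in global_idx:
        mask[g, :] |= causal[g, :]           # global token attends to all earlier tokens
        mask[:, g] |= causal[:, g]           # all later tokens attend to the global token
    return mask

print(local_global_mask(8, window=2, global_idx=(0,)).astype(int))
```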

StreamingLLM: Attention Sink Tokens

A newer variant: keep the initial tokens forever. Observation: the first ~4 tokens receive a disproportionate share of attention across all positions. These are "Sink Tokens".

StreamingLLM Mechanics
KV-Cache Strategy:
Keep: tokens 0-3 (Sink) + the last W tokens (Window)
Discard: everything in between (tokens 4 through n−W−1)

Result: Unbounded streams with only (4 + W) tokens in the cache.
Quality while streaming: >95% (almost full quality despite eviction!)
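A minimal sketch of this eviction policy, tracking only token positions (a real cache would hold per-layer K/V tensors; class and parameter names are illustrative):

```python
from collections import deque

class StreamingKVCache:
    """StreamingLLM-style policy: always keep the first n_sink positions,
    plus a rolling window of the most recent tokens."""
    def __init__(self, n_sink: int = 4, window: int = 1024):
        self.n_sink = n_sink
        self.sinks = []                        # positions 0 .. n_sink-1, never evicted
        self.recent = deque(maxlen=window)     # rolling window, oldest evicted first

    def add(self, position: int):
        if len(self.sinks) < self.n_sink:
            self.sinks.append(position)
        else:
            self.recent.append(position)       # deque drops the oldest automatically

    def cached_positions(self):
        return self.sinks + list(self.recent)  # at most n_sink + window entries

cache = StreamingKVCache(n_sink=4, window=8)
for pos in range(100):
    cache.add(pos)
print(cache.cached_positions())                # [0, 1, 2, 3, 92, 93, ..., 99]
```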

H2O (Heavy Hitter Oracle)

Dynamic: keep the tokens with the highest accumulated attention scores. Eviction is based on content, not just position.
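An illustrative sketch of the selection step, assuming the accumulated attention weights are available; this simplifies the actual H2O algorithm to "keep the heaviest hitters plus the most recent tokens":

```python
import numpy as np

def h2o_keep_indices(attn_weights: np.ndarray, budget: int, n_recent: int = 8):
    """Heavy-hitter-style selection: score each cached key by its accumulated
    attention mass, then keep the top scorers plus the most recent positions.
    attn_weights: (queries, keys) matrix of attention probabilities."""
    n_keys = attn_weights.shape[1]
    scores = attn_weights.sum(axis=0)                    # accumulated attention per key
    recent = set(range(max(0, n_keys - n_recent), n_keys))
    heavy = [k for k in np.argsort(-scores) if k not in recent][: max(0, budget - len(recent))]
    return sorted(recent | set(heavy))                   # positions to keep in the cache
```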

Compressive Transformer

Distant tokens are compressed (e.g. via pooling) into a shorter memory; attention over these compressed representations covers the old parts of the sequence.

Key Insights

1️⃣ Local Attention

Not every token needs to attend to every other token. Local patterns are sufficient for many tasks.

2️⃣ Information Propagation

Information spreads through layers. ERF = W × L enables effectively long contexts.

3️⃣ Memory Wins

Sliding Window reduces attention memory from O(n²) to O(n×W), i.e. linear in n. Simple but powerful for practical scenarios.

4️⃣ Ring for Multi-GPU

Distributed attention via Ring Topology enables true 1M+ tokens with full attention.

5️⃣ Combine, Don't Replace

Flash Attention + GQA + Quantization + Position Encoding = 64× better. Full stack needed.

6️⃣ Sink Tokens Are Clever

Some tokens are "more important" (higher attention). Conscious eviction >> simple windows.