The Problem: O(n²) Complexity

Standard Self-Attention has quadratic complexity, O(n²), in both memory and time. For long sequences this quickly becomes impractical:

Fig. 1 | Memory and Time Complexity: Standard Attention (red) grows quadratically. Sliding Window (blue) remains linear. The slider shows memory requirements at different context lengths.

The Attention Matrix Problem

At 128K tokens, the attention matrix has 128K × 128K ≈ 17 billion entries. At 32-bit precision (4 bytes per entry) that is about 64 GiB of memory just for Attention!
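A quick back-of-the-envelope check of these numbers (a minimal sketch; it assumes a single dense n×n score matrix in float32 and ignores heads and batching):

```python
def attn_matrix_bytes(n_tokens: int, bytes_per_entry: int = 4) -> int:
    # One dense n x n attention score matrix at 32-bit precision.
    return n_tokens * n_tokens * bytes_per_entry

for n in (4_096, 32_768, 131_072):  # 4K, 32K, 128K tokens
    print(f"{n:>7} tokens -> {attn_matrix_bytes(n) / 2**30:7.2f} GiB")
# 131072 tokens -> 64.00 GiB, matching the figure above.
```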

KV-Cache Overhead

During inference, the model must cache the K/V vectors of all previous tokens. Without a cap, this cache grows linearly and without bound with sequence length.
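A rough sketch of how fast the KV-Cache grows, assuming a Mistral-7B-like configuration (32 layers, 8 KV heads of dimension 128, fp16); these parameter values are illustrative assumptions, not official figures:

```python
def kv_cache_bytes(n_tokens: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    # Each token stores one K and one V vector per layer and KV head (fp16 = 2 bytes).
    per_token = n_layers * n_kv_heads * head_dim * 2 * bytes_per_value
    return n_tokens * per_token

for n in (4_096, 32_768, 131_072):
    print(f"{n:>7} tokens -> {kv_cache_bytes(n) / 1e9:6.2f} GB")
# Without a cap the cache grows linearly; a sliding window bounds it at W tokens.
```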

Latency Problem

Each new token must be compared against all previous ones: O(n) time per token, O(n²) for the whole sequence.

Sliding Window Attention: The Solution

Idea: Tokens don't need to attend to the entire history. Instead, each token attends only to the last W tokens. This reduces complexity from O(n²) to O(n×W).

Fig. 2 | Attention Pattern: Left: Standard Attention (full lower triangle). Right: Sliding Window with W=4 (diagonal band only). The window "rolls" to the right with each new sequence position.

Mathematical Definition

Sliding Window Attention Calculation
Position i attends to positions max(0, i−W) through i, i.e. to itself and at most the W preceding tokens.

Attention(i) = softmax( Q[i] · K[max(0,i−W) : i+1]ᵀ / √d ) · V[max(0,i−W) : i+1]

Window "slides": Old tokens outside of W are ignored.

Memory and Speed Savings

Example: Mistral 7B with W=4,096

Metric | Standard Attention | Sliding Window (W=4K) | Savings
Attention ops (100K tokens) | ~10 billion | ~409 million | ~24×
KV-Cache memory | 20 GB (unbounded) | 800 MB (bounded) | ~25×
Latency per token | O(n) | O(W) = O(1) | total O(n) instead of O(n²)
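The operation counts in the first row follow directly from n and W; a one-line check (pure arithmetic, no model assumed):

```python
n, W = 100_000, 4_096
print(n * n, n * W, round((n * n) / (n * W), 1))  # 10,000,000,000 vs 409,600,000 ops -> ~24.4x
```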

✓ Advantages

  • Dramatic speedup (24-100×)
  • Bounded memory usage
  • True streaming possible
  • Easy to implement

⚠ Limitations

  • Tokens more than W positions back are not directly visible
  • No direct long-range dependencies
  • Potential quality loss
  • Fixed window (not adaptive)

Effective Receptive Field: Information Propagates Through Layers

Within a single layer, a token can only see W tokens. But information propagates upward through the network! This is the crucial point: the effective receptive field = W × number of layers.

Fig. 3 | Effective Receptive Field grows with depth. Layer 1: W tokens directly. Layer 2: 2×W (information hops W tokens per layer). Mistral 7B: 32 layers × 4K = 131K effective reach.
Effective Receptive Field Formula
ERF(L) = W × L

Example Mistral 7B:
W = 4,096 tokens
L = 32 layers
ERF = 4,096 × 32 = 131,072 tokens

→ Information from up to 131K tokens back can, in theory, reach a token at the end of the sequence!


Comparison: Modern Models and Their Reach

Model | Window Size W | Layers L | Effective Reach | Practical
Mistral 7B | 4,096 | 32 | 131,072 | Long-Docs OK
Llama 2 70B | Full (no SWA) | 80 | 4,096 (training) | Pre-trained limit
Llama 3 | 8,192 (variant) | 80 | ~655K | Very Long Docs

Ring Attention: For Even Longer Contexts (Multi-GPU)

Sliding Window helps on a single GPU. But for 1M+ token contexts where you need full attention (not just local), researchers use Ring Attention.

The idea: Distribute the sequence across N GPUs in a ring topology. K/V blocks circulate in the ring, while each GPU processes its Query tokens with all K/V blocks.

Fig. 4 | Ring Attention Topology: N GPUs in a ring. K/V blocks circulate, each GPU computes attention with all K/V (sequentially). Context length scales linearly with GPU count.

How Ring Attention Works

Ring Attention Workflow
Step 1: GPU i has Q[i], K[i], V[i]
Step 2: GPU i computes Attention(Q[i], K[i], V[i])
Step 3: K[i], V[i] are sent to GPU (i+1)
Step 4: GPU (i+1) receives the forwarded K/V and computes
Attention(Q[i+1], K[just received], V[just received])
Step 5: Steps 3-4 repeat until every K/V block has visited every GPU (N rounds in total)

After N rounds, every GPU's queries have attended to all K/V blocks.
The result equals full attention; the attention matrix only ever exists as distributed blocks across the GPUs, merged via running softmax statistics.
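A single-process simulation of this workflow (a sketch, not a distributed implementation): the N "devices" are just array chunks, the ring send is a loop index, causal masking is omitted for brevity, and exactness comes from merging per-block results with running softmax statistics.

```python
import numpy as np

def ring_attention_simulated(Q, K, V, n_devices: int):
    """Simulate Ring Attention for one head: split the sequence into n_devices
    blocks, let K/V blocks 'circulate', and merge partial results per query
    block via running log-sum-exp statistics (non-causal for brevity)."""
    d = Q.shape[-1]
    Qs, Ks, Vs = (np.array_split(X, n_devices) for X in (Q, K, V))
    outputs = []
    for i, Qi in enumerate(Qs):                        # "GPU" i holds query block i
        num = np.zeros_like(Qi)                        # running numerator
        den = np.zeros((Qi.shape[0], 1))               # running denominator
        m = np.full((Qi.shape[0], 1), -np.inf)         # running row max (stability)
        for step in range(n_devices):                  # which K/V block has arrived this round
            j = (i + step) % n_devices
            S = Qi @ Ks[j].T / np.sqrt(d)              # scores vs. the arriving block
            m_new = np.maximum(m, S.max(axis=-1, keepdims=True))
            scale, P = np.exp(m - m_new), np.exp(S - m_new)
            num = num * scale + P @ Vs[j]
            den = den * scale + P.sum(axis=-1, keepdims=True)
            m = m_new
        outputs.append(num / den)                      # exact softmax attention for this block
    return np.concatenate(outputs)
```

The output matches ordinary full attention up to floating-point error, which is the point: no approximation, only a different schedule of the same computation.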

Scaling with Ring Attention

📈 Scaling Laws

Context Length ∝ Number of GPUs

With N GPUs: Context = (Single GPU Context) × N

2 GPUs = 2× context, 4 GPUs = 4× context, etc.

💾 Memory Efficiency

Each GPU stores only 1/N of the sequence.

Memory per GPU stays constant as context and GPU count grow together!

Enables 1M tokens with bounded memory.

🔄 Communication

The ring topology needs only neighbor-to-neighbor transfers, keeping bandwidth demands low.

Full attention achieved (no approximation).

But: ~10-30% latency overhead.

Scaling to 1M+ Token Contexts

To truly enable long contexts, multiple techniques are combined:

Fig. 5 | Stacked Optimizations for Long-Context: Flash Attention 2/3 (Memory Efficiency), GQA/MQA (KV-Cache Reduction), Quantization, Position Interpolation, Ring Attention (Multi-GPU), yields ~16× memory efficiency total.

Combined Optimization Pipeline

Optimization | Savings | Technique
Flash Attention 2/3 | ~4× memory | IO-Aware Attention Computation
GQA (8 KV-Heads) | ~8× KV-Cache | Grouped Query Attention (Llama 2 70B uses this)
INT8 Quantization | ~2× KV-Cache | KV-Cache at 8-bit instead of 16-bit
Combined Effect | ~64× total | Multiplies: 4 × 8 × 2 = 64

Practical Examples

Gemini 1.5

1M+ tokens

Proprietary optimizations likely based on Flash Attention + MQA + Distributed Attention.

Claude 3

200K tokens

Efficient attention mechanisms + context window design for practical workloads.

GPT-4 Turbo

128K tokens

Optimized inference with efficient attention mechanisms for long prompts.

Implementation Details and Variants

Sliding Window vs. Alternatives

Technique | Formula | Complexity | Quality
Full Attention | Attn(Q, K, V) | O(n²) | Baseline
Sliding Window | Attn(Q, K[-W:], V[-W:]) | O(n×W) | -0-3%
Strided Attn | Local + every k-th token | O(n×W + n²/k) | -2-5%
Blockwise | Blocks only talk to blocks | O(n²/B²) | -5-10%
Longformer | Local + global tokens | O(n×W + n×G) | -1-4%
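To make the patterns in the table concrete, here is a small sketch that builds a dense boolean mask for a causal sliding window plus optional Longformer-style global tokens (the dense mask and the function name are illustrative; real implementations use sparse kernels instead):

```python
import numpy as np

def local_global_mask(n: int, window: int, global_idx=()) -> np.ndarray:
    """True where query i may attend to key j: causal sliding-window band,
    plus a few global tokens that see and are seen by all (causal) positions."""
    i, j = np.indices((n, n))
    causal = j <= i
    mask = causal & (j >= i - window)        # diagonal band: self + `window` predecessors
    for g in global_idx:
        mask[g, :] |= causal[g, :]           # global token attends to all earlier tokens
        mask[:, g] |= causal[:, g]           # all later tokens attend to the global token
    return mask

print(local_global_mask(8, window=2, global_idx=(0,)).astype(int))
```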

StreamingLLM: Attention Sink Tokens

A newer variant: keep the initial tokens forever. Observation: the first ~4 tokens receive a disproportionate share of attention across all positions. These are "Sink Tokens".

StreamingLLM Mechanics
KV-Cache Strategy:
Keep: tokens 0-3 (Sink) + the last W tokens (Window)
Discard: everything in between (tokens 4 through n−W−1)

Result: Unbounded streams with only (4 + W) tokens in the cache.
Quality while streaming: >95% (almost full quality despite eviction!)
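A minimal sketch of this eviction policy, tracking only token positions (a real cache would hold per-layer K/V tensors; class and parameter names are illustrative):

```python
from collections import deque

class StreamingKVCache:
    """StreamingLLM-style policy: always keep the first n_sink positions,
    plus a rolling window of the most recent tokens."""
    def __init__(self, n_sink: int = 4, window: int = 1024):
        self.n_sink = n_sink
        self.sinks = []                        # positions 0 .. n_sink-1, never evicted
        self.recent = deque(maxlen=window)     # rolling window, oldest evicted first

    def add(self, position: int):
        if len(self.sinks) < self.n_sink:
            self.sinks.append(position)
        else:
            self.recent.append(position)       # deque drops the oldest automatically

    def cached_positions(self):
        return self.sinks + list(self.recent)  # at most n_sink + window entries

cache = StreamingKVCache(n_sink=4, window=8)
for pos in range(100):
    cache.add(pos)
print(cache.cached_positions())                # [0, 1, 2, 3, 92, 93, ..., 99]
```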

H2O (Heavy Hitter Oracle)

Dynamic: keep the tokens with the highest accumulated attention scores. Eviction is based on content, not just position.
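An illustrative sketch of the selection step, assuming the accumulated attention weights are available; this simplifies the actual H2O algorithm to "keep the heaviest hitters plus the most recent tokens":

```python
import numpy as np

def h2o_keep_indices(attn_weights: np.ndarray, budget: int, n_recent: int = 8):
    """Heavy-hitter-style selection: score each cached key by its accumulated
    attention mass, then keep the top scorers plus the most recent positions.
    attn_weights: (queries, keys) matrix of attention probabilities."""
    n_keys = attn_weights.shape[1]
    scores = attn_weights.sum(axis=0)                    # accumulated attention per key
    recent = set(range(max(0, n_keys - n_recent), n_keys))
    heavy = [k for k in np.argsort(-scores) if k not in recent][: max(0, budget - len(recent))]
    return sorted(recent | set(heavy))                   # positions to keep in the cache
```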

Compressive Transformer

Distant tokens are compressed (e.g. via pooling) into a shorter memory; attention over these compressed representations covers the old parts of the sequence.

Key Insights

1️⃣ Local Attention

Not every token needs to attend to every other token. Local patterns are sufficient for many tasks.

2️⃣ Information Propagation

Information spreads through layers. ERF = W × L enables effectively long contexts.

3️⃣ Memory Wins

Sliding Window reduces attention memory from O(n²) to O(n×W), i.e. linear in n. Simple but powerful for practical scenarios.

4️⃣ Ring for Multi-GPU

Distributed attention via Ring Topology enables true 1M+ tokens with full attention.

5️⃣ Combine, Don't Replace

Flash Attention + GQA + Quantization + Position Encoding = 64× better. Full stack needed.

6️⃣ Sink Tokens Are Clever

Some tokens are "more important" (higher attention). Conscious eviction >> simple windows.