[Interactive demo, set to 8K tokens: two charts plot memory required (GB, log scale) for standard attention O(n²) vs. GQA + sparse O(n×k), and relative compute time (normalized to 2K) for standard attention, sliding window, and DSA (sparse); live readouts show the attention-matrix size, the Llama 3 70B KV-cache, the relative time vs. the 2K-token baseline, and the speedup from DSA.]
📐
Why O(n²)?
Each token queries every other token in the sequence for similarity. That's n × n comparisons. With n = 128K, that's roughly 16 billion query-key comparisons per head, per layer!
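To make the quadratic term concrete, here is a minimal NumPy sketch of dense single-head attention together with the score-matrix size at a few sequence lengths; the shapes, the single head, and the fp16 sizing are illustrative assumptions, not tied to any particular model.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Dense single-head attention; the (n, n) score matrix is the O(n^2) part."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                        # (n, n) comparisons
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                   # softmax over keys
    return w @ V                                         # (n, d)

for n in (2_048, 8_192, 128_000):
    entries = n * n
    print(f"n={n:>7,}: {entries:>17,} score entries "
          f"(~{entries * 2 / 1e9:.2f} GB in fp16, per head per layer)")
```

The 128K row corresponds to the ~16 billion comparisons mentioned above; at 1M tokens the same matrix would hold a trillion entries.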
📦
KV-Cache Bottleneck
Key and Value vectors must be stored for every token. With d = 8192 and 80 layers that is roughly 2.6 MB per token in fp16, i.e. several TB for 1M tokens without any sharing, and still well over 100 GB even with GQA.
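A back-of-the-envelope helper makes this concrete. It is a sketch under the assumptions stated above (80 layers, d = 8192, fp16); the `kv_groups` parameter is a hypothetical knob added here to preview the GQA effect discussed below, not a real API.

```python
def kv_cache_bytes(n_tokens, n_layers=80, d_model=8192, bytes_per_val=2, kv_groups=1):
    """Factor 2 covers Key and Value; kv_groups > 1 models GQA-style head sharing."""
    return 2 * n_layers * (d_model // kv_groups) * bytes_per_val * n_tokens

for n in (8_192, 131_072, 1_000_000):
    full = kv_cache_bytes(n) / 1e9
    gqa = kv_cache_bytes(n, kv_groups=8) / 1e9
    print(f"{n:>9,} tokens: {full:7,.0f} GB with full KV heads, {gqa:6,.0f} GB with 8x GQA")
```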
🔧
Flash Attention (2022)
IO-aware algorithm: compute attention tile by tile in fast GPU SRAM instead of materializing the full score matrix in HBM. Exactly the same results, but O(n) extra memory instead of O(n²). The real speedup is only ~2-3x, but the memory savings are enormous.
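Below is a minimal sketch of the online-softmax idea that FlashAttention builds on, not the actual fused CUDA kernel: keys and values are streamed in tiles, a running max and sum are kept per query, and the full n × n matrix is never materialized. The tile size and the dense reference check are illustrative choices.

```python
import numpy as np

def tiled_attention(Q, K, V, tile=1024):
    n, d = Q.shape
    out = np.zeros_like(Q)                 # un-normalized output accumulator
    row_max = np.full(n, -np.inf)          # running max of scores per query
    row_sum = np.zeros(n)                  # running softmax denominator
    for start in range(0, n, tile):
        Kt, Vt = K[start:start + tile], V[start:start + tile]
        s = Q @ Kt.T / np.sqrt(d)          # (n, tile): only one tile in memory
        new_max = np.maximum(row_max, s.max(axis=1))
        corr = np.exp(row_max - new_max)   # rescale older partial results
        p = np.exp(s - new_max[:, None])
        row_sum = row_sum * corr + p.sum(axis=1)
        out = out * corr[:, None] + p @ Vt
        row_max = new_max
    return out / row_sum[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4096, 64)) for _ in range(3))

# Dense reference for comparison (same math as the O(n^2) sketch above)
s_full = Q @ K.T / np.sqrt(Q.shape[1])
w_full = np.exp(s_full - s_full.max(axis=1, keepdims=True))
ref = (w_full / w_full.sum(axis=1, keepdims=True)) @ V
assert np.allclose(tiled_attention(Q, K, V), ref, atol=1e-6)
```

The output matches the dense version up to floating-point rounding; only the peak memory changes, which is exactly the FlashAttention trade-off described above.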
🎯
GQA Reduction
Grouped Query Attention: groups of query heads share one KV head (64 query heads / 8 KV heads = 8x reduction). Llama 3 70B uses this. The KV-cache becomes 8x smaller without major quality loss.
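The sketch below shows the GQA bookkeeping with the head counts quoted above (64 query heads, 8 KV heads); the head dimension, sequence length, and the repeat-based expansion are illustrative simplifications (real kernels index into the shared heads instead of copying them).

```python
import numpy as np

n, head_dim = 512, 128
n_q_heads, n_kv_heads = 64, 8
group = n_q_heads // n_kv_heads                       # 8 query heads per KV head

rng = np.random.default_rng(0)
Q = rng.standard_normal((n_q_heads, n, head_dim))
K = rng.standard_normal((n_kv_heads, n, head_dim))    # only these 8 heads are cached
V = rng.standard_normal((n_kv_heads, n, head_dim))

# Expand the KV heads so each group of 8 query heads reads the same K/V
K_full = np.repeat(K, group, axis=0)                  # (64, n, head_dim)
V_full = np.repeat(V, group, axis=0)

scores = Q @ K_full.transpose(0, 2, 1) / np.sqrt(head_dim)   # (64, n, n)
w = np.exp(scores - scores.max(-1, keepdims=True))
w /= w.sum(-1, keepdims=True)
out = w @ V_full                                      # (64, n, head_dim)

print("cached KV floats per token:", 2 * n_kv_heads * head_dim,   # 2,048
      "vs. full MHA:", 2 * n_q_heads * head_dim)                   # 16,384 -> 8x smaller
```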
Sliding Window (2023)
Local attention only: each token attends only to the last W tokens (e.g., W = 4096). Complexity becomes O(n×W) instead of O(n²). With a large window the quality is practically similar, but VRAM usage is much smaller.
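A toy reference of causal local attention; the per-query Python loop is for clarity only (real implementations use banded or blocked kernels), and the small window here is an illustrative stand-in for the 4096 used in practice.

```python
import numpy as np

def sliding_window_attention(Q, K, V, window=256):
    n, d = Q.shape
    out = np.empty_like(Q)
    for i in range(n):                          # causal: token i sees [i-window+1, i]
        lo = max(0, i - window + 1)
        s = Q[i] @ K[lo:i + 1].T / np.sqrt(d)   # at most `window` scores per query
        w = np.exp(s - s.max())
        w /= w.sum()
        out[i] = w @ V[lo:i + 1]
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((2048, 64)) for _ in range(3))
print(sliding_window_attention(Q, K, V).shape)  # (2048, 64); total work ~ n * window
```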
🌟
DSA / Sparse (2025)
DeepSeek Sparse Attention: a lightweight indexer selects only the top-k most relevant tokens per query (e.g., k = 2048), and full attention is computed over just those. Complexity becomes O(n×k). DeepSeek-V3.2 (2025) ships this in production, and top-k selection of this kind is the main practical route to 1M+ contexts.
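A toy sketch of the top-k selection idea, with a low-dimensional indexer projection standing in for DeepSeek's lightning indexer; `Wq_idx`, `Wk_idx`, and all sizes here are illustrative assumptions, not the real architecture. The selection scores are still computed for every pair, but in a tiny dimension, so they stay cheap; the expensive full attention then touches only k keys per query, which is where the O(n×k) term comes from.

```python
import numpy as np

def topk_sparse_attention(Q, K, V, Wq_idx, Wk_idx, k=256):
    n, d = Q.shape
    # Cheap selection pass in a small indexer dimension (here d_idx = 8)
    idx_scores = (Q @ Wq_idx) @ (K @ Wk_idx).T                 # (n, n), but low-cost entries
    topk = np.argpartition(idx_scores, -k, axis=1)[:, -k:]     # k key ids per query
    out = np.empty_like(Q)
    for i in range(n):
        sel = topk[i]
        s = Q[i] @ K[sel].T / np.sqrt(d)                       # only k real attention scores
        w = np.exp(s - s.max())
        w /= w.sum()
        out[i] = w @ V[sel]
    return out

rng = np.random.default_rng(0)
n, d, d_idx = 4096, 64, 8
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
Wq_idx, Wk_idx = rng.standard_normal((d, d_idx)), rng.standard_normal((d, d_idx))
print(topk_sparse_attention(Q, K, V, Wq_idx, Wk_idx, k=256).shape)   # (4096, 64)
```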

Key Insights

1. 2x sequence length = 4x more resources: this scaling is quadratic, not linear. If 2K tokens need, say, 4 GB for the attention matrices, 8K tokens (4x the length) already need 64 GB (16x). That's why GPT-3 stopped at 2K tokens.
2. Memory is the real limit: not compute, but VRAM. Modern data-center GPUs have 40-80 GB (A100/H100); 1M tokens with O(n²) attention would require petabytes of score matrices (see the quick calculation after this list).
3. Flash Attention is not asymptotically "faster", just more memory-efficient: same number of operations, but much better use of the GPU memory hierarchy. Practical speedup: 2-3x; memory savings: up to 10x.
4. GQA is the "low-hanging fruit": 8x KV-cache reduction without major quality loss. Almost all modern open models use it now (Llama 3, Mistral, Mixtral, Qwen 2).
5. Sliding windows work surprisingly well: with a 4K window you lose little quality, but memory/compute becomes O(n×W) instead of O(n²). Mistral 7B used this (W = 4096), and Gemma 2 interleaves sliding-window and global layers.
6. Sparse attention is the game-changer for 1M+: instead of attending to all n tokens, each query attends only to the top-k selected ones (DeepSeek-V3.2 uses k = 2048). That is what makes very long contexts practical; DeepSeek-V3.2 (2025) demonstrates it live with DSA.