[Interactive demo, set to 8K tokens: two charts plot memory required (GB, log scale) for standard attention O(n²) vs. GQA + sparse O(n×k), and relative compute time (normalized to 2K) for standard attention, sliding window, and DSA (sparse); live readouts show the attention-matrix size, the Llama 3 70B KV-cache, the relative time vs. the 2K-token baseline, and the speedup from DSA.]
📐
Why O(n²)?
Each token queries every other token in the sequence for similarity. That's n × n comparisons. With n = 128K, that's roughly 16 billion query-key comparisons per head, per layer!
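To make the quadratic term concrete, here is a minimal NumPy sketch of dense single-head attention together with the score-matrix size at a few sequence lengths; the shapes, the single head, and the fp16 sizing are illustrative assumptions, not tied to any particular model.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Dense single-head attention; the (n, n) score matrix is the O(n^2) part."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                        # (n, n) comparisons
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                   # softmax over keys
    return w @ V                                         # (n, d)

for n in (2_048, 8_192, 128_000):
    entries = n * n
    print(f"n={n:>7,}: {entries:>17,} score entries "
          f"(~{entries * 2 / 1e9:.2f} GB in fp16, per head per layer)")
```

The 128K row corresponds to the ~16 billion comparisons mentioned above; at 1M tokens the same matrix would hold a trillion entries.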
📦
KV-Cache Bottleneck
Key and Value vectors must be stored for every token. With d = 8192 and 80 layers that is roughly 2.6 MB per token in fp16, i.e. several TB for 1M tokens without any sharing, and still well over 100 GB even with GQA.
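A back-of-the-envelope helper makes this concrete. It is a sketch under the assumptions stated above (80 layers, d = 8192, fp16); the `kv_groups` parameter is a hypothetical knob added here to preview the GQA effect discussed below, not a real API.

```python
def kv_cache_bytes(n_tokens, n_layers=80, d_model=8192, bytes_per_val=2, kv_groups=1):
    """Factor 2 covers Key and Value; kv_groups > 1 models GQA-style head sharing."""
    return 2 * n_layers * (d_model // kv_groups) * bytes_per_val * n_tokens

for n in (8_192, 131_072, 1_000_000):
    full = kv_cache_bytes(n) / 1e9
    gqa = kv_cache_bytes(n, kv_groups=8) / 1e9
    print(f"{n:>9,} tokens: {full:7,.0f} GB with full KV heads, {gqa:6,.0f} GB with 8x GQA")
```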
🔧
Flash Attention (2022)
IO-aware algorithm: compute attention tile by tile in fast GPU SRAM instead of materializing the full score matrix in HBM. Exactly the same results, but O(n) extra memory instead of O(n²). The real speedup is only ~2-3x, but the memory savings are enormous.
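Below is a minimal sketch of the online-softmax idea that FlashAttention builds on, not the actual fused CUDA kernel: keys and values are streamed in tiles, a running max and sum are kept per query, and the full n × n matrix is never materialized. The tile size and the dense reference check are illustrative choices.

```python
import numpy as np

def tiled_attention(Q, K, V, tile=1024):
    n, d = Q.shape
    out = np.zeros_like(Q)                 # un-normalized output accumulator
    row_max = np.full(n, -np.inf)          # running max of scores per query
    row_sum = np.zeros(n)                  # running softmax denominator
    for start in range(0, n, tile):
        Kt, Vt = K[start:start + tile], V[start:start + tile]
        s = Q @ Kt.T / np.sqrt(d)          # (n, tile): only one tile in memory
        new_max = np.maximum(row_max, s.max(axis=1))
        corr = np.exp(row_max - new_max)   # rescale older partial results
        p = np.exp(s - new_max[:, None])
        row_sum = row_sum * corr + p.sum(axis=1)
        out = out * corr[:, None] + p @ Vt
        row_max = new_max
    return out / row_sum[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4096, 64)) for _ in range(3))

# Dense reference for comparison (same math as the O(n^2) sketch above)
s_full = Q @ K.T / np.sqrt(Q.shape[1])
w_full = np.exp(s_full - s_full.max(axis=1, keepdims=True))
ref = (w_full / w_full.sum(axis=1, keepdims=True)) @ V
assert np.allclose(tiled_attention(Q, K, V), ref, atol=1e-6)
```

The output matches the dense version up to floating-point rounding; only the peak memory changes, which is exactly the FlashAttention trade-off described above.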
🎯
GQA Reduction
Grouped Query Attention: groups of query heads share one KV head (64 query heads / 8 KV heads = 8x reduction). Llama 3 70B uses this. The KV-cache becomes 8x smaller without major quality loss.
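The sketch below shows the GQA bookkeeping with the head counts quoted above (64 query heads, 8 KV heads); the head dimension, sequence length, and the repeat-based expansion are illustrative simplifications (real kernels index into the shared heads instead of copying them).

```python
import numpy as np

n, head_dim = 512, 128
n_q_heads, n_kv_heads = 64, 8
group = n_q_heads // n_kv_heads                       # 8 query heads per KV head

rng = np.random.default_rng(0)
Q = rng.standard_normal((n_q_heads, n, head_dim))
K = rng.standard_normal((n_kv_heads, n, head_dim))    # only these 8 heads are cached
V = rng.standard_normal((n_kv_heads, n, head_dim))

# Expand the KV heads so each group of 8 query heads reads the same K/V
K_full = np.repeat(K, group, axis=0)                  # (64, n, head_dim)
V_full = np.repeat(V, group, axis=0)

scores = Q @ K_full.transpose(0, 2, 1) / np.sqrt(head_dim)   # (64, n, n)
w = np.exp(scores - scores.max(-1, keepdims=True))
w /= w.sum(-1, keepdims=True)
out = w @ V_full                                      # (64, n, head_dim)

print("cached KV floats per token:", 2 * n_kv_heads * head_dim,   # 2,048
      "vs. full MHA:", 2 * n_q_heads * head_dim)                   # 16,384 -> 8x smaller
```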
Sliding Window (2023)
Local attention only: each token attends only to the last W tokens (e.g., W = 4096). Complexity becomes O(n×W) instead of O(n²). With a large window the quality is practically similar, but VRAM usage is much smaller.
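A toy reference of causal local attention; the per-query Python loop is for clarity only (real implementations use banded or blocked kernels), and the small window here is an illustrative stand-in for the 4096 used in practice.

```python
import numpy as np

def sliding_window_attention(Q, K, V, window=256):
    n, d = Q.shape
    out = np.empty_like(Q)
    for i in range(n):                          # causal: token i sees [i-window+1, i]
        lo = max(0, i - window + 1)
        s = Q[i] @ K[lo:i + 1].T / np.sqrt(d)   # at most `window` scores per query
        w = np.exp(s - s.max())
        w /= w.sum()
        out[i] = w @ V[lo:i + 1]
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((2048, 64)) for _ in range(3))
print(sliding_window_attention(Q, K, V).shape)  # (2048, 64); total work ~ n * window
```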
🌟
DSA / Sparse (2025)
DeepSeek Sparse Attention: a lightweight indexer selects only the top-k most relevant tokens per query (e.g., k = 2048), and full attention is computed over just those. Complexity becomes O(n×k). DeepSeek-V3.2 (2025) ships this in production, and top-k selection of this kind is the main practical route to 1M+ contexts.
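A toy sketch of the top-k selection idea, with a low-dimensional indexer projection standing in for DeepSeek's lightning indexer; `Wq_idx`, `Wk_idx`, and all sizes here are illustrative assumptions, not the real architecture. The selection scores are still computed for every pair, but in a tiny dimension, so they stay cheap; the expensive full attention then touches only k keys per query, which is where the O(n×k) term comes from.

```python
import numpy as np

def topk_sparse_attention(Q, K, V, Wq_idx, Wk_idx, k=256):
    n, d = Q.shape
    # Cheap selection pass in a small indexer dimension (here d_idx = 8)
    idx_scores = (Q @ Wq_idx) @ (K @ Wk_idx).T                 # (n, n), but low-cost entries
    topk = np.argpartition(idx_scores, -k, axis=1)[:, -k:]     # k key ids per query
    out = np.empty_like(Q)
    for i in range(n):
        sel = topk[i]
        s = Q[i] @ K[sel].T / np.sqrt(d)                       # only k real attention scores
        w = np.exp(s - s.max())
        w /= w.sum()
        out[i] = w @ V[sel]
    return out

rng = np.random.default_rng(0)
n, d, d_idx = 4096, 64, 8
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
Wq_idx, Wk_idx = rng.standard_normal((d, d_idx)), rng.standard_normal((d, d_idx))
print(topk_sparse_attention(Q, K, V, Wq_idx, Wk_idx, k=256).shape)   # (4096, 64)
```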

Key Insights

1. 2x sequence length = 4x more resources: this scaling is quadratic, not linear. If 2K tokens need, say, 4 GB for the attention matrices, 8K tokens (4x the length) already need 64 GB (16x). That's why GPT-3 stopped at 2K tokens.
2. Memory is the real limit: not compute, but VRAM. Modern data-center GPUs have 40-80 GB (A100/H100); 1M tokens with O(n²) attention would require petabytes of score matrices (see the quick calculation after this list).
3. Flash Attention is not asymptotically "faster", just more memory-efficient: same number of operations, but much better use of the GPU memory hierarchy. Practical speedup: 2-3x; memory savings: up to 10x.
4. GQA is the "low-hanging fruit": 8x KV-cache reduction without major quality loss. Almost all modern open models use it now (Llama 3, Mistral, Mixtral, Qwen 2).
5. Sliding windows work surprisingly well: with a 4K window you lose little quality, but memory/compute becomes O(n×W) instead of O(n²). Mistral 7B used this (W = 4096), and Gemma 2 interleaves sliding-window and global layers.
6. Sparse attention is the game-changer for 1M+: instead of attending to all n tokens, each query attends only to the top-k selected ones (DeepSeek-V3.2 uses k = 2048). That is what makes very long contexts practical; DeepSeek-V3.2 (2025) demonstrates it live with DSA.