How Key-Value vectors are cached during token generation to accelerate inference
The KV-Cache is the key to fast token generation. Without caching, every new token would have to repeat the full attention computation over all previous tokens, an O(n²) problem. The KV-Cache stores the already computed Key and Value vectors, so each decoding step only has to compute the projections for the new token and attend to the cached entries.
The KV-Cache is the foundation for all further memory optimizations — from Grouped Query Attention to Paged Attention. It explains why context windows cost memory.
Without the KV-Cache, a chat model like ChatGPT would be orders of magnitude slower: at a 4K-token context, each new token would require recomputing Keys and Values for roughly 4,000 previous positions instead of processing only the one new token. The cache is what makes interactive conversations possible.
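To make the mechanism concrete, here is a minimal sketch of one cached decoding step (NumPy, single attention head, hypothetical dimensions and random weights, no batching, masking, or positional encoding): the Keys and Values of earlier tokens come from the cache, and only the new token's projections are computed.

```python
import numpy as np

d_head = 64  # hypothetical head dimension for this sketch

def attention_step(x_new, W_q, W_k, W_v, kv_cache):
    """Attention output for one new token, reusing cached Keys/Values."""
    q = x_new @ W_q                      # Query only for the new token
    k = x_new @ W_k                      # new Key ...
    v = x_new @ W_v                      # ... and new Value
    kv_cache["K"].append(k)              # extend the cache instead of
    kv_cache["V"].append(v)              # recomputing all previous K/V
    K = np.stack(kv_cache["K"])          # (seq_len, d_head)
    V = np.stack(kv_cache["V"])
    scores = (K @ q) / np.sqrt(d_head)   # O(seq_len), not O(seq_len²)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()             # softmax over cached positions
    return weights @ V                   # weighted sum of cached Values

# Usage: the cache grows by one Key/Value pair per decoding step
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d_head, d_head)) for _ in range(3))
cache = {"K": [], "V": []}
for _ in range(5):                       # five decoding steps
    out = attention_step(rng.standard_normal(d_head), W_q, W_k, W_v, cache)
print(len(cache["K"]))                   # -> 5 cached Key vectors
```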
Example (Llama 2 7B):
Per token: 2 (K and V) × 32 Layers × 32 Heads × 128 Dim × 2 Bytes (FP16) = 524,288 Bytes ≈ 512 KB
Total: ~512 KB × Seq_Len → at 32K Tokens ≈ 16 GB
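A quick sanity check of these numbers in plain Python (the shape values are the Llama 2 7B figures from the example; the leading factor 2 covers one Key and one Value vector per head and layer):

```python
def kv_cache_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    # 2 = one Key and one Value vector per head and layer; 2 bytes = FP16
    return 2 * layers * kv_heads * head_dim * bytes_per_value

per_token = kv_cache_bytes_per_token(layers=32, kv_heads=32, head_dim=128)
print(per_token)                     # 524288 bytes  ->  ~512 KB per token
print(per_token * 32_768 / 2**30)    # 16.0 GiB at 32K tokens
```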
| Feature | Without KV-Cache | With KV-Cache |
|---|---|---|
| Attention per Token | O(n²) for n tokens | O(n): new Query × cached Keys |
| Memory Usage | Low (no cache held) | O(Layers × Heads × Head_Dim × Seq_Len) |
| Generation Speed | Slow (recalculate everything) | ~5-10x faster |
| Practical Application | Rarely used | Standard in all LLMs |
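The first row of the table can be illustrated with a rough count of how many token positions the model must run through per generated token (an illustrative toy count, not a benchmark; "forward" here means pushing one token position through the model):

```python
def token_forwards(prompt_len, new_tokens, use_cache):
    """Count token positions processed while generating new_tokens tokens."""
    total, seq_len = 0, prompt_len
    for _ in range(new_tokens):
        # without a cache the whole prefix is re-encoded at every step,
        # with a cache only the single new token is processed
        total += 1 if use_cache else seq_len
        seq_len += 1
    return total

print(token_forwards(4096, 100, use_cache=False))  # 414550 positions
print(token_forwards(4096, 100, use_cache=True))   # 100 positions
```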
| Model | Size | Layers | Heads | KV-Cache / Token (FP16) | At 32K Tokens (model max ctx) |
|---|---|---|---|---|---|
| Claude 4.5 | ~100B | 80 | 64 | ~1.5 MB | ~48 GB (200K ctx) ✓ |
| Qwen 3 | ~100B | 96 | 32 | ~1 MB | ~32 GB (256K ctx) ✓ |
| Llama 4 Maverick | 400B (17B active) | 128 | 64 | ~2.5 MB | ~80 GB (1M ctx) ✓ |
| Gemini 3 | ~600B | 256 | 128 | ~4 MB | ~128 GB (10M ctx) ✓ |
Note: The KV-Cache becomes the bottleneck for long contexts. That is why optimizations such as GQA (Chapter 2.2), quantization, and Sparse Attention (Chapter 2.4) are essential. Sparse Attention shrinks the cache by keeping Key/Value entries only for the tokens that are actually attended to, enabling 1M+ token contexts at practical memory requirements.
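As a rough illustration of how these optimizations act on the same size formula (assumed values only: 8 KV heads stand in for GQA, 1 byte per value stands in for an 8-bit quantized cache):

```python
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, bytes_per_value):
    # 2 = Key and Value; result in GiB
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value / 2**30

print(kv_cache_gb(32, 32, 128, 32_768, 2))  # MHA, FP16           -> 16.0 GB
print(kv_cache_gb(32,  8, 128, 32_768, 2))  # GQA with 8 KV heads -> 4.0 GB
print(kv_cache_gb(32,  8, 128, 32_768, 1))  # GQA + 8-bit cache   -> 2.0 GB
```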