The Problem: Without a KV-Cache, attention must be recomputed over ALL previous tokens for every new token, because their Key and Value projections are discarded after each step. That's inefficient!

The Solution: Store the Key and Value vectors of all previous tokens. For each new token, compute only its Query, Key, and Value, append the new K and V to the cache, and combine the Query with the cached K and V.
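A minimal sketch of one decoding step with a KV-Cache (NumPy, single attention head; all names and shapes are illustrative, not any particular library's API):

```python
import numpy as np

def decode_step(x_new, W_q, W_k, W_v, k_cache, v_cache):
    """One generation step for a single attention head with a KV-cache."""
    q = x_new @ W_q                          # Query for the new token only
    k = x_new @ W_k                          # Key and Value for the new token
    v = x_new @ W_v
    k_cache = np.vstack([k_cache, k])        # caches grow by one entry per step
    v_cache = np.vstack([v_cache, v])
    scores = (q @ k_cache.T) / np.sqrt(q.shape[-1])   # new Query vs. ALL cached Keys: O(n)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    out = weights @ v_cache                  # weighted sum of cached Values
    return out, k_cache, v_cache

# Usage: caches start empty and grow by one K/V row per generated token.
d_model, d_head = 8, 4
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d_model, d_head)) for _ in range(3))
k_cache = np.zeros((0, d_head))
v_cache = np.zeros((0, d_head))
for _ in range(5):
    x_new = rng.standard_normal((1, d_model))        # embedding of the newest token
    out, k_cache, v_cache = decode_step(x_new, W_q, W_k, W_v, k_cache, v_cache)
print(k_cache.shape, v_cache.shape)                  # (5, 4) (5, 4)
```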
[Interactive demo: generation is animated step by step. Step 1 (initialization) embeds the input tokens; the Query (Q) is newly computed while the Key (K) and Value (V) caches still hold 0 tokens. A "Cache Growth" panel tracks sequence length, K and V cache size in KB, total KV-Cache size, and cache growth per token.]

Computational Efficiency

Attention per new token: O(n) — one Query against n cached K/V pairs
Query projection for the new token: O(d²)
K and V of past tokens: O(1) — read from the cache instead of recomputed
Speedup vs. no cache: ~10x
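To see where these figures come from, here is a rough, illustrative operation count for the attention scores alone (one head, head dimension d). It counts only score multiply-accumulates; real end-to-end speedups are far smaller (the ~5-10x quoted here), because projections, MLP layers, and memory bandwidth dominate wall-clock time:

```python
# Rough count of attention-score multiply-accumulates (MACs) for one head.
# Illustrative only: projections, MLPs, and memory traffic are ignored.
def score_macs_without_cache(n, d):
    # step t recomputes scores for all t positions against t keys: t * t * d
    return sum(t * t * d for t in range(1, n + 1))

def score_macs_with_cache(n, d):
    # step t computes one query row against t cached keys: t * d
    return sum(t * d for t in range(1, n + 1))

n, d = 4096, 128
print(score_macs_without_cache(n, d) / score_macs_with_cache(n, d))  # ratio grows ~ 2n/3
```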

Memory Formula for KV-Cache

Cache Bytes = 2 × Layers × Heads × Head_Dim × Seq_Len × Bytes_Per_Element

(the leading 2 accounts for storing both K and V)

Example (Llama 2 7B):

2 × 32 Layers × 32 Heads × 128 Head_Dim × 2 Bytes (FP16)

= 524,288 Bytes per Token = ~512 KB per Token

At 32K Tokens: ~512 KB × 32,768 ≈ 16 GB

Fig. 1 | The memory size of the KV-Cache grows linearly with sequence length. Attention cost per generated token: O(n²) without cache, O(n) with cache.
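The formula and the Llama 2 7B example translate directly into a few lines of Python (a sketch; the function name and defaults are my own, with FP16 assumed at 2 bytes per element):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Leading 2: both K and V are stored per layer, head, and position.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Llama 2 7B: 32 layers, 32 KV heads, head_dim 128, FP16
print(kv_cache_bytes(32, 32, 128, seq_len=1))               # 524288 bytes ≈ 512 KB per token
print(kv_cache_bytes(32, 32, 128, seq_len=32_768) / 2**30)  # ≈ 16 GB at 32K tokens
```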

With vs. Without KV-Cache

| Feature | Without KV-Cache | With KV-Cache |
|---|---|---|
| Attention per token | O(n²) at n tokens | O(n): only the new Query × cached K |
| Memory usage | Low (only the input) | O(Layers × Heads × Head_Dim × Seq_Len) |
| Generation speed | Slow (everything recomputed) | ~5-10x faster |
| Practical application | Rarely used | Standard in all LLMs |
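The "with vs. without" distinction is only about cost, not about results. A small NumPy check (single head, illustrative names) confirms that the cached path produces exactly the same attention output as recomputing everything at every step:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_head, n = 8, 4, 6
W_q, W_k, W_v = (rng.standard_normal((d_model, d_head)) for _ in range(3))
X = rng.standard_normal((n, d_model))            # embeddings of n tokens

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def last_token_attention_no_cache(X_t):
    # Without cache: K and V for ALL t tokens are recomputed at every step.
    q = X_t[-1:] @ W_q
    K, V = X_t @ W_k, X_t @ W_v
    return softmax(q @ K.T / np.sqrt(d_head)) @ V

# With cache: K and V are computed once per token, appended, and reused.
k_cache = np.zeros((0, d_head))
v_cache = np.zeros((0, d_head))
for t in range(1, n + 1):
    x_new = X[t - 1:t]
    k_cache = np.vstack([k_cache, x_new @ W_k])
    v_cache = np.vstack([v_cache, x_new @ W_v])
    q = x_new @ W_q
    cached = softmax(q @ k_cache.T / np.sqrt(d_head)) @ v_cache
    assert np.allclose(cached, last_token_attention_no_cache(X[:t]))  # identical output
```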

KV-Cache Size in Real Models

| Model | Size | Layers | Heads | KV-Cache / Token (FP16) | At 32K Tokens | Max Context |
|---|---|---|---|---|---|---|
| Claude 4.5 | ~100B | 80 | 64 | ~1.5 MB | ~48 GB | 200K |
| Qwen 3 | ~100B | 96 | 32 | ~1 MB | ~32 GB | 256K |
| Llama 4 Maverick | 400B (17B active) | 128 | 64 | ~2.5 MB | ~80 GB | 1M |
| Gemini 3 | ~600B | 256 | 128 | ~4 MB | ~128 GB | 10M |

Note: The KV-Cache becomes the bottleneck for long contexts. That's why optimizations like GQA (Chapter 2.2), Quantization, and Sparse Attention (Chapter 2.4) are essential. Sparse Attention dramatically reduces cache size by storing only the relevant token pairs, enabling 1M+ token contexts with practical memory requirements.
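Building on the memory formula above, here is a hedged sketch of how GQA and cache quantization shrink the KV-Cache. The baseline uses the Llama 2 7B numbers from this chapter; the GQA head count and 8-bit setting are illustrative choices, not the specs of any particular model:

```python
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, bytes_per_elem):
    # Same formula as above, returned in GB (2**30 bytes).
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem / 2**30

seq_len = 32_768
# Baseline: full multi-head attention, FP16 (32 layers, 32 KV heads, head_dim 128)
print(kv_cache_gb(32, 32, 128, seq_len, 2))   # ~16 GB
# GQA: 8 KV heads shared across all query heads -> 4x smaller cache
print(kv_cache_gb(32, 8, 128, seq_len, 2))    # ~4 GB
# GQA + 8-bit cache quantization -> 8x smaller than the baseline
print(kv_cache_gb(32, 8, 128, seq_len, 1))    # ~2 GB
```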