How Key-Value vectors are cached during token generation to accelerate inference
The KV-Cache is the key to fast token generation. Without caching, every new token would have to repeat the full attention computation over all previous tokens, an O(n²) problem. The KV-Cache stores the already computed Key and Value vectors, so each decoding step only has to compute the projections for the new token and attend to the cached entries.
The KV-Cache is the foundation for all further memory optimizations — from Grouped Query Attention to Paged Attention. It explains why context windows cost memory.
Without the KV-Cache, a chat model like ChatGPT would be orders of magnitude slower: at a 4K-token context, each new token would require recomputing Keys and Values for roughly 4,000 previous positions instead of processing only the one new token. The cache is what makes interactive conversations possible.
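To make the mechanism concrete, here is a minimal sketch of one cached decoding step (NumPy, single attention head, hypothetical dimensions and random weights, no batching, masking, or positional encoding): the Keys and Values of earlier tokens come from the cache, and only the new token's projections are computed.

```python
import numpy as np

d_head = 64  # hypothetical head dimension for this sketch

def attention_step(x_new, W_q, W_k, W_v, kv_cache):
    """Attention output for one new token, reusing cached Keys/Values."""
    q = x_new @ W_q                      # Query only for the new token
    k = x_new @ W_k                      # new Key ...
    v = x_new @ W_v                      # ... and new Value
    kv_cache["K"].append(k)              # extend the cache instead of
    kv_cache["V"].append(v)              # recomputing all previous K/V
    K = np.stack(kv_cache["K"])          # (seq_len, d_head)
    V = np.stack(kv_cache["V"])
    scores = (K @ q) / np.sqrt(d_head)   # O(seq_len), not O(seq_len²)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()             # softmax over cached positions
    return weights @ V                   # weighted sum of cached Values

# Usage: the cache grows by one Key/Value pair per decoding step
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d_head, d_head)) for _ in range(3))
cache = {"K": [], "V": []}
for _ in range(5):                       # five decoding steps
    out = attention_step(rng.standard_normal(d_head), W_q, W_k, W_v, cache)
print(len(cache["K"]))                   # -> 5 cached Key vectors
```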
Example (Llama 2 7B):
Per token: 2 (K and V) × 32 Layers × 32 Heads × 128 Dim × 2 Bytes (FP16) = 524,288 Bytes ≈ 512 KB
Total: ~512 KB × Seq_Len → at 32K Tokens ≈ 16 GB
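A quick sanity check of these numbers in plain Python (the shape values are the Llama 2 7B figures from the example; the leading factor 2 covers one Key and one Value vector per head and layer):

```python
def kv_cache_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    # 2 = one Key and one Value vector per head and layer; 2 bytes = FP16
    return 2 * layers * kv_heads * head_dim * bytes_per_value

per_token = kv_cache_bytes_per_token(layers=32, kv_heads=32, head_dim=128)
print(per_token)                     # 524288 bytes  ->  ~512 KB per token
print(per_token * 32_768 / 2**30)    # 16.0 GiB at 32K tokens
```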
| Feature | Without KV-Cache | With KV-Cache |
|---|---|---|
| Attention per Token | O(n²) for n tokens | O(n): new Query × cached Keys |
| Memory Usage | Low (no cache held) | O(Layers × Heads × Head_Dim × Seq_Len) |
| Generation Speed | Slow (recalculate everything) | ~5-10x faster |
| Practical Application | Rarely used | Standard in all LLMs |
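The first row of the table can be illustrated with a rough count of how many token positions the model must run through per generated token (an illustrative toy count, not a benchmark; "forward" here means pushing one token position through the model):

```python
def token_forwards(prompt_len, new_tokens, use_cache):
    """Count token positions processed while generating new_tokens tokens."""
    total, seq_len = 0, prompt_len
    for _ in range(new_tokens):
        # without a cache the whole prefix is re-encoded at every step,
        # with a cache only the single new token is processed
        total += 1 if use_cache else seq_len
        seq_len += 1
    return total

print(token_forwards(4096, 100, use_cache=False))  # 414550 positions
print(token_forwards(4096, 100, use_cache=True))   # 100 positions
```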
| Model | Size | Layers | Heads | KV-Cache / Token (FP16) | At 32K Tokens (model max ctx) |
|---|---|---|---|---|---|
| Claude 4.5 | ~100B | 80 | 64 | ~1.5 MB | ~48 GB (200K ctx) ✓ |
| Qwen 3 | ~100B | 96 | 32 | ~1 MB | ~32 GB (256K ctx) ✓ |
| Llama 4 Maverick | 400B (17B active) | 128 | 64 | ~2.5 MB | ~80 GB (1M ctx) ✓ |
| Gemini 3 | ~600B | 256 | 128 | ~4 MB | ~128 GB (10M ctx) ✓ |
Note: The KV-Cache becomes the bottleneck for long contexts. That is why optimizations such as GQA (Chapter 2.2), quantization, and Sparse Attention (Chapter 2.4) are essential. Sparse Attention shrinks the cache by keeping Key/Value entries only for the tokens that are actually attended to, enabling 1M+ token contexts at practical memory requirements.
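As a rough illustration of how these optimizations act on the same size formula (assumed values only: 8 KV heads stand in for GQA, 1 byte per value stands in for an 8-bit quantized cache):

```python
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, bytes_per_value):
    # 2 = Key and Value; result in GiB
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value / 2**30

print(kv_cache_gb(32, 32, 128, 32_768, 2))  # MHA, FP16           -> 16.0 GB
print(kv_cache_gb(32,  8, 128, 32_768, 2))  # GQA with 8 KV heads -> 4.0 GB
print(kv_cache_gb(32,  8, 128, 32_768, 1))  # GQA + 8-bit cache   -> 2.0 GB
```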