Fig. 1 | Memory growth of the KV-Cache during token generation. X-axis: generated tokens. Y-axis: memory consumption in GB. The curves with and without GQA show the effect of head-sharing.
📊 Example: Llama 2 70B
Layers: 80
Q-Heads: 64
KV-Heads (MHA): 64
KV-Heads (GQA): 8 (8× smaller!)
Head Dimension: 128
Memory per Token (MHA): ~2.5 MB (FP16)
Memory per Token (GQA): ~320 KB (FP16)
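
These per-token numbers follow directly from the config above: 2 values (K and V) per layer and per KV head, each of size head_dim, at 2 bytes in FP16. A minimal Python sanity check – the helper name is just for illustration:

```python
# Per-token KV-cache cost: K and V, for every layer and every KV head.
# bytes_per_value=2 assumes FP16; the function name is only for this example.
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

# Llama 2 70B: 80 layers, head_dim 128, 64 heads (MHA) vs. 8 KV heads (GQA)
mha = kv_cache_bytes_per_token(80, 64, 128)   # 2,621,440 B ≈ 2.5 MB
gqa = kv_cache_bytes_per_token(80, 8, 128)    #   327,680 B ≈ 320 KB

# Linear growth: total cache = per-token cost × number of cached tokens
print(f"MHA @ 32K tokens:  {mha * 32_768 / 1e9:.1f} GB")   # ~85.9 GB
print(f"GQA @ 100K tokens: {gqa * 100_000 / 1e9:.1f} GB")  # ~32.8 GB
```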

The KV-Cache Growth Problem

1. Linear Growth: The KV-Cache grows linearly with the number of generated tokens: after 128 tokens it is already 128× as large as after the first token. This is the main memory bottleneck for long sequences.
2. Practical Problem: At a 32K-token context, the MHA KV-Cache of Llama 2 70B alone is roughly 85 GB in FP16 – more than a single 80 GB A100 holds, before the model weights are even loaded. With GQA, the same context needs only about 11 GB of KV-Cache.
3. GQA is a Game-Changer: Through head-sharing, we reduce the KV-Cache by up to 8× (8 KV-Heads instead of 64). This enables longer context windows on the same hardware (see the attention sketch after this list).
4. Specific Numbers: Each additional token costs about 320 KB of KV-Cache for Llama 2 70B with GQA (FP16). At a 100K-token context, that is already more than 30 GB just for the KV-Cache – plus weights, activations, and other overheads.
5. Scaling Strategies: To handle this, we combine: (1) GQA for cache reduction, (2) Flash Attention for memory-efficient computation, (3) Ring Attention for distribution across GPUs, (4) KV-Quantization to save bits per cached value (a minimal quantization sketch follows after this list).
6. Scaling Laws (later): Longer contexts × larger models multiply the memory requirements. This is one of the biggest drivers for specialized hardware and inference frameworks (vLLM, TensorRT, etc.).
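
To make point 3 concrete, here is a minimal PyTorch sketch of grouped-query attention: 64 query heads share 8 cached KV heads, which are expanded on the fly. Shapes, names, and the repeat_interleave expansion are illustrative; this is not Llama's actual implementation.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    """q: [batch, n_q_heads, seq, head_dim]; k, v: [batch, n_kv_heads, seq, head_dim]."""
    group = q.shape[1] // k.shape[1]          # queries per KV head (8 for Llama 2 70B)
    k = k.repeat_interleave(group, dim=1)     # expand only at compute time ...
    v = v.repeat_interleave(group, dim=1)     # ... the cache stores just the 8 KV heads
    return F.scaled_dot_product_attention(q, k, v)

q = torch.randn(1, 64, 16, 128)               # 64 query heads
k = torch.randn(1, 8, 16, 128)                # 8 KV heads -> 8x smaller KV-cache
v = torch.randn(1, 8, 16, 128)
out = grouped_query_attention(q, k, v)        # [1, 64, 16, 128]
```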
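
And for strategy (4), the simplest possible KV-quantization: symmetric per-tensor int8, i.e. 1 byte per cached value instead of 2 in FP16. Real systems use per-channel or per-group scales and keep the cache quantized inside the attention kernel; this sketch only illustrates the bit saving.

```python
import torch

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization of a K or V tensor."""
    scale = x.abs().amax() / 127.0
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.float() * scale

k = torch.randn(8, 1024, 128)                 # kv_heads x cached tokens x head_dim (one layer)
k_q, scale = quantize_int8(k)
err = (k - dequantize_int8(k_q, scale)).abs().max()
print(k_q.element_size(), "byte/value, max abs error:", err.item())
```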