Fig. 1 | Memory growth of the KV-Cache during token generation. X-axis: generated tokens. Y-axis: memory consumption in GB. The curves with and without GQA show the effect of head-sharing.
📊 Example: Llama 2 70B
Layers: 80
Q-Heads: 64
KV-Heads (MHA): 64
KV-Heads (GQA): 8 (8× smaller!)
Head Dimension: 128
Memory per Token (MHA): ~2.5 MB (FP16)
Memory per Token (GQA): ~320 KB (FP16)
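
These per-token numbers follow directly from the config above: 2 values (K and V) per layer and per KV head, each of size head_dim, at 2 bytes in FP16. A minimal Python sanity check – the helper name is just for illustration:

```python
# Per-token KV-cache cost: K and V, for every layer and every KV head.
# bytes_per_value=2 assumes FP16; the function name is only for this example.
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

# Llama 2 70B: 80 layers, head_dim 128, 64 heads (MHA) vs. 8 KV heads (GQA)
mha = kv_cache_bytes_per_token(80, 64, 128)   # 2,621,440 B ≈ 2.5 MB
gqa = kv_cache_bytes_per_token(80, 8, 128)    #   327,680 B ≈ 320 KB

# Linear growth: total cache = per-token cost × number of cached tokens
print(f"MHA @ 32K tokens:  {mha * 32_768 / 1e9:.1f} GB")   # ~85.9 GB
print(f"GQA @ 100K tokens: {gqa * 100_000 / 1e9:.1f} GB")  # ~32.8 GB
```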

The KV-Cache Growth Problem

1. Linear Growth: The KV-Cache grows linearly with the number of generated tokens: after 128 tokens it is already 128× as large as after the first token. This is the main memory bottleneck for long sequences.
2. Practical Problem: At a 32K-token context, the MHA KV-Cache of Llama 2 70B alone is roughly 85 GB in FP16 – more than a single 80 GB A100 holds, before the model weights are even loaded. With GQA, the same context needs only about 11 GB of KV-Cache.
3. GQA is a Game-Changer: Through head-sharing, we reduce the KV-Cache by up to 8× (8 KV-Heads instead of 64). This enables longer context windows on the same hardware (see the attention sketch after this list).
4. Specific Numbers: Each additional token costs about 320 KB of KV-Cache for Llama 2 70B with GQA (FP16). At a 100K-token context, that is already more than 30 GB just for the KV-Cache – plus weights, activations, and other overheads.
5. Scaling Strategies: To handle this, we combine: (1) GQA for cache reduction, (2) Flash Attention for memory-efficient computation, (3) Ring Attention for distribution across GPUs, (4) KV-Quantization to save bits per cached value (a minimal quantization sketch follows after this list).
6. Scaling Laws (later): Longer contexts × larger models multiply the memory requirements. This is one of the biggest drivers for specialized hardware and inference frameworks (vLLM, TensorRT, etc.).
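
To make point 3 concrete, here is a minimal PyTorch sketch of grouped-query attention: 64 query heads share 8 cached KV heads, which are expanded on the fly. Shapes, names, and the repeat_interleave expansion are illustrative; this is not Llama's actual implementation.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    """q: [batch, n_q_heads, seq, head_dim]; k, v: [batch, n_kv_heads, seq, head_dim]."""
    group = q.shape[1] // k.shape[1]          # queries per KV head (8 for Llama 2 70B)
    k = k.repeat_interleave(group, dim=1)     # expand only at compute time ...
    v = v.repeat_interleave(group, dim=1)     # ... the cache stores just the 8 KV heads
    return F.scaled_dot_product_attention(q, k, v)

q = torch.randn(1, 64, 16, 128)               # 64 query heads
k = torch.randn(1, 8, 16, 128)                # 8 KV heads -> 8x smaller KV-cache
v = torch.randn(1, 8, 16, 128)
out = grouped_query_attention(q, k, v)        # [1, 64, 16, 128]
```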
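
And for strategy (4), the simplest possible KV-quantization: symmetric per-tensor int8, i.e. 1 byte per cached value instead of 2 in FP16. Real systems use per-channel or per-group scales and keep the cache quantized inside the attention kernel; this sketch only illustrates the bit saving.

```python
import torch

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization of a K or V tensor."""
    scale = x.abs().amax() / 127.0
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.float() * scale

k = torch.randn(8, 1024, 128)                 # kv_heads x cached tokens x head_dim (one layer)
k_q, scale = quantize_int8(k)
err = (k - dequantize_int8(k_q, scale)).abs().max()
print(k_q.element_size(), "byte/value, max abs error:", err.item())
```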