How sharing Key-Value heads drastically reduces KV-Cache memory requirements without significantly compromising model quality.
Grouped Query Attention (GQA) reduces KV-Cache memory by sharing Key-Value heads among groups of Query heads: instead of n K/V head pairs (one per Query head), there are only n/g shared pairs, a practical compromise between MHA (maximum capacity) and MQA (minimum memory).
After MoE (efficiency through expert selection), we now examine GQA: another efficiency optimization, this time at the attention level. GQA reduces the KV-Cache, which becomes a bottleneck with long contexts.
Llama 2 70B uses GQA with 8 KV heads shared across 64 Query heads, an 8x reduction in KV-Cache size (roughly 87.5% savings). At long contexts this saves hundreds of gigabytes (see the calculation below). GQA is now standard in Llama, Mistral, Gemma, and virtually all modern LLMs.
During autoregressive generation, the Key and Value vectors of all previous tokens must be kept in memory. With long contexts, this cache grows enormously.
For Llama 2 70B at a hypothetical 128K-token context, a full multi-head (MHA) cache in fp16 would take well over 300 GB; even with GQA it is still around 40 GB, just for the KV-Cache!
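To make these numbers concrete, here is a back-of-the-envelope sketch. The layer and head counts are the published Llama 2 70B dimensions; fp16 storage, batch size 1, and the 128K-token context are assumptions chosen for illustration.

```python
# Back-of-the-envelope KV-Cache size for a Llama-2-70B-like model.
# Assumptions: fp16 cache (2 bytes/value), batch size 1, 128K-token context.

N_LAYERS = 80          # transformer layers
HEAD_DIM = 128         # dimension per attention head
N_KV_HEADS_MHA = 64    # MHA: one K/V head per Query head
N_KV_HEADS_GQA = 8     # GQA: 8 shared K/V heads
BYTES_PER_VALUE = 2    # fp16
CONTEXT = 128 * 1024   # tokens

def kv_cache_bytes(n_kv_heads: int) -> int:
    # 2x for Key and Value, per layer, per token, per KV head
    per_token = 2 * N_LAYERS * n_kv_heads * HEAD_DIM * BYTES_PER_VALUE
    return per_token * CONTEXT

mha = kv_cache_bytes(N_KV_HEADS_MHA)
gqa = kv_cache_bytes(N_KV_HEADS_GQA)
print(f"MHA cache: {mha / 2**30:.0f} GiB")   # ~320 GiB
print(f"GQA cache: {gqa / 2**30:.0f} GiB")   # ~40 GiB
print(f"Reduction: {mha / gqa:.0f}x")        # 8x
```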
GQA partitions the Query heads into groups, and each group shares a single Key-Value head. Llama 2 70B uses 64 Query heads with only 8 KV heads, so 8 Query heads share each K/V pair.
Result: 8× less KV-Cache memory with nearly unchanged model quality. An ideal trade-off between MHA and MQA.
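The grouping itself is a small change to standard multi-head attention: the K and V projections simply produce fewer heads, and each K/V head is repeated so that its group of Query heads can attend to it. Below is a minimal PyTorch sketch; the class name, the tiny dimensions in the shape check, and the use of repeat_interleave are illustrative choices, not any particular model's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    """Minimal GQA: n_q_heads Query heads share n_kv_heads K/V heads."""

    def __init__(self, d_model: int, n_q_heads: int, n_kv_heads: int):
        super().__init__()
        assert n_q_heads % n_kv_heads == 0
        self.n_q_heads = n_q_heads
        self.n_kv_heads = n_kv_heads
        self.head_dim = d_model // n_q_heads
        self.group_size = n_q_heads // n_kv_heads  # Query heads per KV head

        self.q_proj = nn.Linear(d_model, n_q_heads * self.head_dim, bias=False)
        # K/V projections produce only n_kv_heads heads; this smaller
        # K/V representation is exactly what shrinks the KV-Cache.
        self.k_proj = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_q_heads * self.head_dim, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_q_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)

        # Only the small (n_kv_heads) k and v would go into the KV-Cache.
        # For the attention itself, each KV head is repeated for its Query group.
        k = k.repeat_interleave(self.group_size, dim=1)
        v = v.repeat_interleave(self.group_size, dim=1)

        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(b, t, -1)
        return self.o_proj(out)

# Llama-2-70B-like head counts, tiny d_model just for a quick shape check:
attn = GroupedQueryAttention(d_model=512, n_q_heads=64, n_kv_heads=8)
print(attn(torch.randn(1, 4, 512)).shape)  # torch.Size([1, 4, 512])
```

The table below shows where popular models sit on this MHA-GQA-MQA spectrum.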
| Model | Q Heads | KV Heads | Type |
|---|---|---|---|
| GPT-3 175B | 96 | 96 | MHA |
| Llama 2 70B | 64 | 8 | GQA |
| Llama 3 70B | 64 | 8 | GQA |
| Mistral 7B | 32 | 8 | GQA |
| Falcon 7B | 71 | 1 | MQA |
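The ratio of Query heads to KV heads is directly the KV-Cache reduction factor relative to plain MHA. A quick check over the table rows (head counts taken from the table above):

```python
# KV-Cache reduction relative to MHA is simply q_heads / kv_heads.
models = {
    "GPT-3 175B":  (96, 96),  # MHA
    "Llama 2 70B": (64, 8),   # GQA
    "Llama 3 70B": (64, 8),   # GQA
    "Mistral 7B":  (32, 8),   # GQA
    "Falcon 7B":   (71, 1),   # MQA
}

for name, (q_heads, kv_heads) in models.items():
    print(f"{name:12s}: {q_heads / kv_heads:.0f}x smaller KV-Cache than MHA")
```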