[Interactive widget: Attention Type Comparison — vary the number of Query and KV heads (Q/K/V) and see the resulting KV-Cache size for Llama 2 70B (80 layers, 4,096-token context).]

With MHA, each Query head has its own Key-Value pair. This gives maximum expressiveness, but also the maximum memory footprint for the KV-Cache during inference.
Fig. 1 | Comparison of attention variants. MHA: Each Query head (blue) has dedicated Key-Value heads (orange/green). GQA: Multiple Query heads share a KV pair. MQA: All Query heads share a single KV pair.
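
To make the bookkeeping concrete, here is a tiny PyTorch sketch of MHA-shaped attention (dimensions are scaled down for illustration, and the variable names are mine, not from any particular codebase). The point is the shapes: with MHA, the cached K and V tensors carry one head per Query head.

```python
import torch

# Tiny illustrative dimensions; Llama 2 70B would use 64 heads with d_head = 128.
batch, n_heads, seq_len, d_head = 1, 8, 16, 64

q = torch.randn(batch, n_heads, seq_len, d_head)
k = torch.randn(batch, n_heads, seq_len, d_head)  # kept in the KV-Cache
v = torch.randn(batch, n_heads, seq_len, d_head)  # kept in the KV-Cache

# Standard scaled dot-product attention: one K/V head per Q head.
scores = (q @ k.transpose(-2, -1)) / d_head**0.5   # (batch, heads, seq, seq)
out = torch.softmax(scores, dim=-1) @ v            # (batch, heads, seq, d_head)
print(out.shape)  # torch.Size([1, 8, 16, 64])
```
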
🔎 Why KV-Cache is a Problem

During autoregressive generation, the Key and Value vectors of all previous tokens must be kept in memory. This cache grows linearly with context length, so at long contexts it becomes a dominant memory cost.

Cache = 2 × n_kv × L × S × d_head × precision

where the factor 2 accounts for Keys and Values, n_kv is the number of KV heads, L the number of layers, S the sequence length, d_head the per-head dimension, and precision the number of bytes per value (e.g. 2 for fp16).

For Llama 2 70B with 128K context: 64+ GB just for the KV-Cache!
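
To sanity-check the formula, here is a small Python helper. The function name and defaults are my own; batch size 1 and fp16 are assumed, so larger batches or higher precision scale every figure up accordingly.

```python
def kv_cache_bytes(n_kv: int, n_layers: int, seq_len: int,
                   d_head: int, bytes_per_value: int = 2) -> int:
    """Cache = 2 (K and V) x n_kv x L x S x d_head x precision."""
    return 2 * n_kv * n_layers * seq_len * d_head * bytes_per_value

# Llama 2 70B-like dimensions: 80 layers, d_head = 8192 / 64 = 128.
# The model actually ships with GQA; the MHA row shows what full MHA would cost.
for label, n_kv in [("MHA (64 KV heads)", 64), ("GQA  (8 KV heads)", 8)]:
    for seq in (4_096, 131_072):
        gb = kv_cache_bytes(n_kv, 80, seq, 128) / 1e9
        print(f"{label} @ {seq:>7} tokens: {gb:7.1f} GB")
```

Even with GQA, a 128K context keeps tens of gigabytes of cache resident under these assumptions; without it, the same context would need hundreds.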

The GQA Compromise

GQA partitions the Query heads into groups, and all heads in a group share one Key-Value head. Llama 2 70B, for example, uses 64 Query heads with only 8 KV heads, so 8 Query heads share each KV pair.

Result: 8× less KV-Cache memory with nearly unchanged model quality. An ideal trade-off between MHA and MQA.
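
A minimal sketch of how the sharing works in practice (illustrative PyTorch, not Llama's actual implementation; masking, projections, and the cache itself are omitted): the few KV heads are simply broadcast so that each group of Query heads attends to the same Keys and Values.

```python
import torch

def grouped_query_attention(q, k, v):
    # q: (batch, n_q, seq, d_head); k, v: (batch, n_kv, seq, d_head)
    group_size = q.shape[1] // k.shape[1]       # 64 // 8 = 8 for Llama 2 70B
    k = k.repeat_interleave(group_size, dim=1)  # broadcast each KV head to its group
    v = v.repeat_interleave(group_size, dim=1)
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

q = torch.randn(1, 64, 16, 128)  # 64 Query heads, full width
k = torch.randn(1, 8, 16, 128)   # only these 8 heads land in the KV-Cache
v = torch.randn(1, 8, 16, 128)
print(grouped_query_attention(q, k, v).shape)  # torch.Size([1, 64, 16, 128])
```

Setting n_kv = n_q recovers MHA and n_kv = 1 recovers MQA; the group size is the knob that trades cache size against quality.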

📈 Models Compared
Model          Q Heads   KV Heads   Type
GPT-3 175B     96        96         MHA
Llama 2 70B    64        8          GQA
Llama 3 70B    64        8          GQA
Mistral 7B     32        8          GQA
Falcon 180B    232       1          MQA
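
Plugging these head counts into the cache formula makes the spread tangible. The layer counts and head dimensions below are not in the table; they are my recollection of the published configs and should be treated as assumptions.

```python
# (n_kv_heads, n_layers, d_head) -- layers and d_head are assumed values;
# only the KV head counts come from the table above.
models = {
    "GPT-3 175B (MHA)":  (96, 96, 128),
    "Llama 2 70B (GQA)": (8, 80, 128),
    "Mistral 7B (GQA)":  (8, 32, 128),
    "Falcon 180B (MQA)": (1, 80, 64),
}
seq_len, fp16_bytes = 4_096, 2
for name, (n_kv, n_layers, d_head) in models.items():
    gb = 2 * n_kv * n_layers * seq_len * d_head * fp16_bytes / 1e9
    print(f"{name:<18}: {gb:5.2f} GB KV-Cache @ 4K context")
```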