How sharing Key-Value heads drastically reduces KV-Cache memory requirements without significantly compromising model quality.
Grouped Query Attention (GQA) reduces KV-Cache memory by sharing Key-Value heads among groups of Query heads: instead of n K/V head pairs (one per Query head), there are only n/g shared pairs, a practical compromise between MHA (maximum capacity) and MQA (minimum memory).
After MoE (efficiency through expert selection), we now examine GQA: another efficiency optimization, this time at the attention level. GQA reduces the KV-Cache, which becomes a bottleneck with long contexts.
Llama 2 70B uses GQA with 8 KV heads shared across 64 Query heads, an 8x reduction in KV-Cache size (roughly 87.5% savings). At long contexts this saves hundreds of gigabytes (see the calculation below). GQA is now standard in Llama, Mistral, Gemma, and virtually all modern LLMs.
During autoregressive generation, the Key and Value vectors of all previous tokens must be kept in memory. With long contexts, this cache grows enormously.
For Llama 2 70B at a hypothetical 128K-token context, a full multi-head (MHA) cache in fp16 would take well over 300 GB; even with GQA it is still around 40 GB, just for the KV-Cache!
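To make these numbers concrete, here is a back-of-the-envelope sketch. The layer and head counts are the published Llama 2 70B dimensions; fp16 storage, batch size 1, and the 128K-token context are assumptions chosen for illustration.

```python
# Back-of-the-envelope KV-Cache size for a Llama-2-70B-like model.
# Assumptions: fp16 cache (2 bytes/value), batch size 1, 128K-token context.

N_LAYERS = 80          # transformer layers
HEAD_DIM = 128         # dimension per attention head
N_KV_HEADS_MHA = 64    # MHA: one K/V head per Query head
N_KV_HEADS_GQA = 8     # GQA: 8 shared K/V heads
BYTES_PER_VALUE = 2    # fp16
CONTEXT = 128 * 1024   # tokens

def kv_cache_bytes(n_kv_heads: int) -> int:
    # 2x for Key and Value, per layer, per token, per KV head
    per_token = 2 * N_LAYERS * n_kv_heads * HEAD_DIM * BYTES_PER_VALUE
    return per_token * CONTEXT

mha = kv_cache_bytes(N_KV_HEADS_MHA)
gqa = kv_cache_bytes(N_KV_HEADS_GQA)
print(f"MHA cache: {mha / 2**30:.0f} GiB")   # ~320 GiB
print(f"GQA cache: {gqa / 2**30:.0f} GiB")   # ~40 GiB
print(f"Reduction: {mha / gqa:.0f}x")        # 8x
```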
GQA partitions the Query heads into groups, and each group shares a single Key-Value head. Llama 2 70B uses 64 Query heads with only 8 KV heads, so 8 Query heads share each K/V pair.
Result: 8× less KV-Cache memory with nearly unchanged model quality. An ideal trade-off between MHA and MQA.
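The grouping itself is a small change to standard multi-head attention: the K and V projections simply produce fewer heads, and each K/V head is repeated so that its group of Query heads can attend to it. Below is a minimal PyTorch sketch; the class name, the tiny dimensions in the shape check, and the use of repeat_interleave are illustrative choices, not any particular model's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    """Minimal GQA: n_q_heads Query heads share n_kv_heads K/V heads."""

    def __init__(self, d_model: int, n_q_heads: int, n_kv_heads: int):
        super().__init__()
        assert n_q_heads % n_kv_heads == 0
        self.n_q_heads = n_q_heads
        self.n_kv_heads = n_kv_heads
        self.head_dim = d_model // n_q_heads
        self.group_size = n_q_heads // n_kv_heads  # Query heads per KV head

        self.q_proj = nn.Linear(d_model, n_q_heads * self.head_dim, bias=False)
        # K/V projections produce only n_kv_heads heads; this smaller
        # K/V representation is exactly what shrinks the KV-Cache.
        self.k_proj = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_q_heads * self.head_dim, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_q_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)

        # Only the small (n_kv_heads) k and v would go into the KV-Cache.
        # For the attention itself, each KV head is repeated for its Query group.
        k = k.repeat_interleave(self.group_size, dim=1)
        v = v.repeat_interleave(self.group_size, dim=1)

        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(b, t, -1)
        return self.o_proj(out)

# Llama-2-70B-like head counts, tiny d_model just for a quick shape check:
attn = GroupedQueryAttention(d_model=512, n_q_heads=64, n_kv_heads=8)
print(attn(torch.randn(1, 4, 512)).shape)  # torch.Size([1, 4, 512])
```

The table below shows where popular models sit on this MHA-GQA-MQA spectrum.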
| Model | Q Heads | KV Heads | Type |
|---|---|---|---|
| GPT-3 175B | 96 | 96 | MHA |
| Llama 2 70B | 64 | 8 | GQA |
| Llama 3 70B | 64 | 8 | GQA |
| Mistral 7B | 32 | 8 | GQA |
| Falcon 7B | 71 | 1 | MQA |
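The ratio of Query heads to KV heads is directly the KV-Cache reduction factor relative to plain MHA. A quick check over the table rows (head counts taken from the table above):

```python
# KV-Cache reduction relative to MHA is simply q_heads / kv_heads.
models = {
    "GPT-3 175B":  (96, 96),  # MHA
    "Llama 2 70B": (64, 8),   # GQA
    "Llama 3 70B": (64, 8),   # GQA
    "Mistral 7B":  (32, 8),   # GQA
    "Falcon 7B":   (71, 1),   # MQA
}

for name, (q_heads, kv_heads) in models.items():
    print(f"{name:12s}: {q_heads / kv_heads:.0f}x smaller KV-Cache than MHA")
```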