Model Configuration

Layers: transformer blocks in the stack
Query Heads: number of query heads (Multi-Head Attention)
KV Heads: number of key/value heads (MHA: = h_q, GQA: < h_q, MQA: = 1)
Head Dimension: dimension per head (typically 64, 80, or 128)
Sequence Length: number of tokens in context
Batch Size: number of parallel sequences
Precision: bytes per parameter in the cache
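
As a concrete reference, here is a minimal sketch of such a configuration in Python. The class name and the default values (in the style of Llama 2 70B: 80 layers, 64 query heads, 8 KV heads, head dimension 128) are illustrative assumptions, not part of the original calculator.

```python
from dataclasses import dataclass

@dataclass
class KVCacheConfig:
    # Default values are illustrative, in the style of Llama 2 70B.
    layers: int = 80          # transformer blocks in the stack
    q_heads: int = 64         # query heads
    kv_heads: int = 8         # MHA: = q_heads | GQA: < q_heads | MQA: = 1
    head_dim: int = 128       # dimension per head
    seq_len: int = 4096       # tokens in context
    batch_size: int = 1       # parallel sequences
    bytes_per_param: int = 2  # FP16
```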

KV-Cache Size

Cache per Token (KB)
Cache for Sequence (MB)
Cache with Batch (GB)
K-Cache (GB)
V-Cache (GB)
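
A sketch of how these outputs follow from the formula further below, building on the KVCacheConfig above; the helper name and the unit choices are illustrative.

```python
def kv_cache_sizes(cfg: KVCacheConfig) -> dict:
    # Factor 2: K and V are stored separately.
    per_token = 2 * cfg.layers * cfg.kv_heads * cfg.head_dim * cfg.bytes_per_param
    per_sequence = per_token * cfg.seq_len
    with_batch = per_sequence * cfg.batch_size
    return {
        "cache_per_token_kb": per_token / 1024,
        "cache_for_sequence_mb": per_sequence / 1024**2,
        "cache_with_batch_gb": with_batch / 1024**3,
        "k_cache_gb": with_batch / 2 / 1024**3,  # half of the total holds K
        "v_cache_gb": with_batch / 2 / 1024**3,  # the other half holds V
    }
```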

MHA vs. GQA vs. MQA Comparison

MHA (Multi-Head): baseline cache size
GQA (current configuration): reduced cache, savings relative to MHA
MQA (Single KV): smallest cache, maximum savings relative to MHA
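
A small sketch of that comparison using the kv_cache_sizes helper above; only the number of KV heads changes between the three variants.

```python
from dataclasses import replace

def compare_attention_variants(cfg: KVCacheConfig) -> None:
    # MHA baseline: as many KV heads as query heads.
    baseline = kv_cache_sizes(replace(cfg, kv_heads=cfg.q_heads))["cache_with_batch_gb"]
    variants = [("MHA", cfg.q_heads), ("GQA", cfg.kv_heads), ("MQA", 1)]
    for name, kv_heads in variants:
        size = kv_cache_sizes(replace(cfg, kv_heads=kv_heads))["cache_with_batch_gb"]
        savings = (1 - size / baseline) * 100
        print(f"{name}: {size:.2f} GB ({savings:.0f}% savings vs. MHA)")
```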

Cache Growth with Sequence Length

KV-Cache Formula
Cache = 2 × Layers × KV_Heads × Head_Dim × Seq_Len × Bytes × Batch_Size

Factor 2: K and V stored separately
GQA Advantage: KV_Heads < Q_Heads → proportional reduction
MQA Maximum: KV_Heads = 1 → maximum savings
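Worked example (assumed Llama 2 70B-style values: 80 layers, 8 KV heads, head dimension 128, a 4096-token context, FP16, batch 1): 2 × 80 × 8 × 128 × 4096 × 2 × 1 = 1,342,177,280 bytes ≈ 1.25 GiB, or 320 KiB per token.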
Why KV-Cache?
Without a cache, every new token requires recomputing attention over all previous tokens (O(n²) per token). With a cache, only the query for the new token is computed while K and V are read from memory (O(n) per token), typically a 5-10× speedup.
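
A minimal NumPy sketch of cached decoding for a single attention head; the shapes and names are illustrative, and it ignores layers, batching, and the projection matrices.

```python
import numpy as np

def attend(q, k, v):
    # Scaled dot-product attention: one query vector over all cached keys/values.
    scores = k @ q / np.sqrt(q.shape[-1])      # (t,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v                         # (head_dim,)

head_dim, k_cache, v_cache = 128, [], []
for step in range(4):
    q = np.random.randn(head_dim)              # query for the new token only
    k_cache.append(np.random.randn(head_dim))  # K for the new token, kept in memory
    v_cache.append(np.random.randn(head_dim))  # V for the new token, kept in memory
    out = attend(q, np.stack(k_cache), np.stack(v_cache))  # O(n) per step
# Without the cache, K and V for the whole prefix would be recomputed every step,
# which is the O(n^2)-per-token path described above.
```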
GQA Trade-off
Grouped Query Attention: several query heads share one KV head. Llama 2 70B uses 64 query heads and 8 KV heads, an 8× cache reduction with less than 1% quality loss, a practical best of both worlds between MHA and MQA.
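
A sketch of the sharing pattern at lookup time, assuming the 64-query-head / 8-KV-head split mentioned above; repeating each cached KV head across its group of query heads is one straightforward way to express it.

```python
import numpy as np

q_heads, kv_heads, head_dim, seq_len = 64, 8, 128, 16
group_size = q_heads // kv_heads                  # 8 query heads share one KV head

q = np.random.randn(q_heads, 1, head_dim)         # queries for the new token
k = np.random.randn(kv_heads, seq_len, head_dim)  # cached keys (only kv_heads stored)
v = np.random.randn(kv_heads, seq_len, head_dim)  # cached values (only kv_heads stored)

# Expand K/V along the head axis so every query head sees its group's shared K/V.
k_shared = np.repeat(k, group_size, axis=0)       # (q_heads, seq_len, head_dim)
v_shared = np.repeat(v, group_size, axis=0)

scores = q @ k_shared.transpose(0, 2, 1) / np.sqrt(head_dim)   # (q_heads, 1, seq_len)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ v_shared                          # (q_heads, 1, head_dim)
# Only kv_heads of K/V are ever cached: 64 / 8 = 8x smaller KV-cache than MHA.
```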
Precision Options
FP16 (standard): good balance of memory and quality. INT8: 2× memory reduction vs. FP16 with minimal quality loss. INT4: 4× reduction with noticeable but often acceptable loss. FP32: doubles the cache vs. FP16 and is mainly used for research.
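
Relating these options to the bytes-per-parameter factor of the formula, building on the KVCacheConfig sketch above; the byte counts are standard, the helper itself is illustrative, with INT4 counted as half a byte per value.

```python
PRECISION_BYTES = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def cache_gb_at(cfg: KVCacheConfig, precision: str) -> float:
    # Same formula as above, with the bytes factor taken from the chosen precision.
    bytes_per = PRECISION_BYTES[precision]
    total = (2 * cfg.layers * cfg.kv_heads * cfg.head_dim
             * cfg.seq_len * cfg.batch_size * bytes_per)
    return total / 1024**3
```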
GPU Limits
A100 80GB: ~64K tokens (Llama 2 70B, FP16). H100 80GB: similar. With INT8: ~128K. With GQA instead of MHA: 8× more. Combining these techniques enables 1M+ token contexts.
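
A rough budgeting sketch in the same spirit, reusing the KVCacheConfig above; the 80 GB default and the idea of reserving a fixed amount for weights and activations are assumptions for illustration only.

```python
def max_context_tokens(cfg: KVCacheConfig, gpu_gb: float = 80.0,
                       reserved_gb: float = 0.0) -> int:
    # KV-cache bytes per token across all layers (K and V, all KV heads).
    per_token = 2 * cfg.layers * cfg.kv_heads * cfg.head_dim * cfg.bytes_per_param
    budget_bytes = (gpu_gb - reserved_gb) * 1024**3
    return int(budget_bytes // (per_token * cfg.batch_size))
```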