Model Configuration

Layers: transformer blocks in the stack
Query Heads: number of query heads (Multi-Head Attention)
KV Heads: number of key/value heads (MHA: = h_q, GQA: < h_q, MQA: = 1)
Head Dimension: dimension per head (typically 64, 80, or 128)
Sequence Length: number of tokens in context
Batch Size: number of parallel sequences
Precision: bytes per parameter in the cache
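
As a concrete reference, here is a minimal sketch of such a configuration in Python. The class name and the default values (in the style of Llama 2 70B: 80 layers, 64 query heads, 8 KV heads, head dimension 128) are illustrative assumptions, not part of the original calculator.

```python
from dataclasses import dataclass

@dataclass
class KVCacheConfig:
    # Default values are illustrative, in the style of Llama 2 70B.
    layers: int = 80          # transformer blocks in the stack
    q_heads: int = 64         # query heads
    kv_heads: int = 8         # MHA: = q_heads | GQA: < q_heads | MQA: = 1
    head_dim: int = 128       # dimension per head
    seq_len: int = 4096       # tokens in context
    batch_size: int = 1       # parallel sequences
    bytes_per_param: int = 2  # FP16
```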

KV-Cache Size

Cache per Token (KB)
Cache for Sequence (MB)
Cache with Batch (GB)
K-Cache (GB)
V-Cache (GB)
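
A sketch of how these outputs follow from the formula further below, building on the KVCacheConfig above; the helper name and the unit choices are illustrative.

```python
def kv_cache_sizes(cfg: KVCacheConfig) -> dict:
    # Factor 2: K and V are stored separately.
    per_token = 2 * cfg.layers * cfg.kv_heads * cfg.head_dim * cfg.bytes_per_param
    per_sequence = per_token * cfg.seq_len
    with_batch = per_sequence * cfg.batch_size
    return {
        "cache_per_token_kb": per_token / 1024,
        "cache_for_sequence_mb": per_sequence / 1024**2,
        "cache_with_batch_gb": with_batch / 1024**3,
        "k_cache_gb": with_batch / 2 / 1024**3,  # half of the total holds K
        "v_cache_gb": with_batch / 2 / 1024**3,  # the other half holds V
    }
```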

MHA vs. GQA vs. MQA Comparison

MHA (Multi-Head): baseline cache size
GQA (current configuration): reduced cache, savings relative to MHA
MQA (Single KV): smallest cache, maximum savings relative to MHA
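
A small sketch of that comparison using the kv_cache_sizes helper above; only the number of KV heads changes between the three variants.

```python
from dataclasses import replace

def compare_attention_variants(cfg: KVCacheConfig) -> None:
    # MHA baseline: as many KV heads as query heads.
    baseline = kv_cache_sizes(replace(cfg, kv_heads=cfg.q_heads))["cache_with_batch_gb"]
    variants = [("MHA", cfg.q_heads), ("GQA", cfg.kv_heads), ("MQA", 1)]
    for name, kv_heads in variants:
        size = kv_cache_sizes(replace(cfg, kv_heads=kv_heads))["cache_with_batch_gb"]
        savings = (1 - size / baseline) * 100
        print(f"{name}: {size:.2f} GB ({savings:.0f}% savings vs. MHA)")
```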

Cache Growth with Sequence Length

KV-Cache Formula
Cache = 2 × Layers × KV_Heads × Head_Dim × Seq_Len × Bytes × Batch_Size

Factor 2: K and V stored separately
GQA Advantage: KV_Heads < Q_Heads → proportional reduction
MQA Maximum: KV_Heads = 1 → maximum savings
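Worked example (assumed Llama 2 70B-style values: 80 layers, 8 KV heads, head dimension 128, a 4096-token context, FP16, batch 1): 2 × 80 × 8 × 128 × 4096 × 2 × 1 = 1,342,177,280 bytes ≈ 1.25 GiB, or 320 KiB per token.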
Why KV-Cache?
Without a cache, every new token requires recomputing attention over all previous tokens (O(n²) per token). With a cache, only the query for the new token is computed while K and V are read from memory (O(n) per token), typically a 5-10× speedup.
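
A minimal NumPy sketch of cached decoding for a single attention head; the shapes and names are illustrative, and it ignores layers, batching, and the projection matrices.

```python
import numpy as np

def attend(q, k, v):
    # Scaled dot-product attention: one query vector over all cached keys/values.
    scores = k @ q / np.sqrt(q.shape[-1])      # (t,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v                         # (head_dim,)

head_dim, k_cache, v_cache = 128, [], []
for step in range(4):
    q = np.random.randn(head_dim)              # query for the new token only
    k_cache.append(np.random.randn(head_dim))  # K for the new token, kept in memory
    v_cache.append(np.random.randn(head_dim))  # V for the new token, kept in memory
    out = attend(q, np.stack(k_cache), np.stack(v_cache))  # O(n) per step
# Without the cache, K and V for the whole prefix would be recomputed every step,
# which is the O(n^2)-per-token path described above.
```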
GQA Trade-off
Grouped Query Attention: several query heads share one KV head. Llama 2 70B uses 64 query heads and 8 KV heads, an 8× cache reduction with less than 1% quality loss, a practical best of both worlds between MHA and MQA.
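
A sketch of the sharing pattern at lookup time, assuming the 64-query-head / 8-KV-head split mentioned above; repeating each cached KV head across its group of query heads is one straightforward way to express it.

```python
import numpy as np

q_heads, kv_heads, head_dim, seq_len = 64, 8, 128, 16
group_size = q_heads // kv_heads                  # 8 query heads share one KV head

q = np.random.randn(q_heads, 1, head_dim)         # queries for the new token
k = np.random.randn(kv_heads, seq_len, head_dim)  # cached keys (only kv_heads stored)
v = np.random.randn(kv_heads, seq_len, head_dim)  # cached values (only kv_heads stored)

# Expand K/V along the head axis so every query head sees its group's shared K/V.
k_shared = np.repeat(k, group_size, axis=0)       # (q_heads, seq_len, head_dim)
v_shared = np.repeat(v, group_size, axis=0)

scores = q @ k_shared.transpose(0, 2, 1) / np.sqrt(head_dim)   # (q_heads, 1, seq_len)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ v_shared                          # (q_heads, 1, head_dim)
# Only kv_heads of K/V are ever cached: 64 / 8 = 8x smaller KV-cache than MHA.
```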
Precision Options
FP16 (standard): good balance of memory and quality. INT8: 2× memory reduction vs. FP16 with minimal quality loss. INT4: 4× reduction with noticeable but often acceptable loss. FP32: doubles the cache vs. FP16 and is mainly used for research.
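
Relating these options to the bytes-per-parameter factor of the formula, building on the KVCacheConfig sketch above; the byte counts are standard, the helper itself is illustrative, with INT4 counted as half a byte per value.

```python
PRECISION_BYTES = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def cache_gb_at(cfg: KVCacheConfig, precision: str) -> float:
    # Same formula as above, with the bytes factor taken from the chosen precision.
    bytes_per = PRECISION_BYTES[precision]
    total = (2 * cfg.layers * cfg.kv_heads * cfg.head_dim
             * cfg.seq_len * cfg.batch_size * bytes_per)
    return total / 1024**3
```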
GPU Limits
A100 80GB: ~64K tokens (Llama 2 70B, FP16). H100 80GB: similar. With INT8: ~128K. With GQA instead of MHA: 8× more. Combining these techniques enables 1M+ token contexts.
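
A rough budgeting sketch in the same spirit, reusing the KVCacheConfig above; the 80 GB default and the idea of reserving a fixed amount for weights and activations are assumptions for illustration only.

```python
def max_context_tokens(cfg: KVCacheConfig, gpu_gb: float = 80.0,
                       reserved_gb: float = 0.0) -> int:
    # KV-cache bytes per token across all layers (K and V, all KV heads).
    per_token = 2 * cfg.layers * cfg.kv_heads * cfg.head_dim * cfg.bytes_per_param
    budget_bytes = (gpu_gb - reserved_gb) * 1024**3
    return int(budget_bytes // (per_token * cfg.batch_size))
```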