Calculate the memory requirements of the Key-Value cache at different sequence lengths, precisions, and batch sizes. Compare MHA vs. GQA vs. MQA.
This calculator shows the practical reality of KV-cache memory. A 70B model with 128K context needs over 40 GB just for the cache, and at realistic batch sizes the cache can exceed the model weights themselves. GQA and MQA are not optimizations; they are necessities.
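The underlying arithmetic can be sketched in a few lines. This is a minimal example, assuming a hypothetical 70B-class configuration (80 layers, head dimension 128, 64 query heads, fp16) with 8 KV heads for GQA and 1 for MQA; the function name `kv_cache_bytes` is illustrative, not from any library.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """KV-cache size: 2 (K and V) x layers x KV heads x head_dim
    x sequence length x batch x bytes per element (2 for fp16/bf16)."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

GiB = 1024 ** 3
# Assumed 70B-class config: 80 layers, head_dim 128, 64 query heads, fp16.
for name, kv_heads in [("MHA", 64), ("GQA", 8), ("MQA", 1)]:
    size = kv_cache_bytes(layers=80, kv_heads=kv_heads, head_dim=128,
                          seq_len=128 * 1024)
    print(f"{name}: {size / GiB:.1f} GiB")
# MHA: 320.0 GiB, GQA: 40.0 GiB, MQA: 5.0 GiB
```

Note that only the number of KV heads changes between the three variants: MHA keeps one KV head per query head, GQA shares each KV head across a group of query heads (8x savings here), and MQA shares a single KV head across all of them.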
Complements the KV-Cache animation with concrete numbers, explaining why modern models must use GQA and how batch size affects memory.
GPU memory is the bottleneck in LLM inference. A 200K-token context with Llama 3 8B needs ~25 GB of cache alone; together with the ~16 GB of fp16 weights, that exceeds an A100-40GB even for a single request. This calculation determines which models run on which hardware.
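The single-request check above can be reproduced directly. A sketch, assuming a Llama-3-8B-style configuration (32 layers, 8 KV heads, head_dim 128, fp16 weights and cache) and counting only weights plus cache, ignoring activation and framework overhead:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    # 2 accounts for storing both K and V per layer.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

GB = 1e9  # GPU memory sizes are quoted in decimal gigabytes
# Assumed Llama-3-8B-style config: 32 layers, 8 KV heads, head_dim 128.
cache = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=200_000)
weights = 8e9 * 2  # ~8B parameters in fp16
total = cache + weights
print(f"cache {cache / GB:.1f} GB + weights {weights / GB:.1f} GB "
      f"= {total / GB:.1f} GB")
print("fits on A100-40GB:", total < 40 * GB)  # False: ~42 GB > 40 GB
```

Even before any activations are allocated, the request no longer fits; this is why long-context serving quickly pushes toward 80 GB cards, quantized caches, or offloading.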