How the KV-Cache grows during generation: Interactive visualization with and without GQA
The Memory Growth view visualizes the problem that every context-window extension must address: each new token appends its own Key and Value vectors to the cache. At 1M context this adds up to gigabytes per request, which is why Sparse Attention, Sliding Windows, and Paged Attention exist.
It connects the KV-Cache animation and the calculator into one dynamic picture, showing why the optimizations covered next (Position Encoding, Sliding Window, Paged Attention) became necessary; the sketch below puts numbers on the growth.
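A minimal back-of-the-envelope sketch in Python of the same calculation the visualization performs. The dimensions are illustrative assumptions (roughly Llama-2-70B-like: 80 layers, 64 query heads, head dimension 128, fp16), not values taken from the tool:

```python
def kv_cache_bytes(seq_len, n_layers=80, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    # Per token, each layer stores 2 tensors (K and V) of shape [n_kv_heads, head_dim].
    # Memory grows linearly in seq_len -- nothing is ever evicted during generation.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

# Without GQA (full multi-head attention): all 64 query heads keep their own K/V head.
mha = kv_cache_bytes(1_000_000, n_kv_heads=64)
# With GQA: 8 KV heads are shared across the 64 query heads (assumed group size of 8).
gqa = kv_cache_bytes(1_000_000, n_kv_heads=8)

print(f"MHA @ 1M tokens: {mha / 1e9:,.0f} GB")  # ~2,621 GB
print(f"GQA @ 1M tokens: {gqa / 1e9:,.0f} GB")  # ~328 GB
```

Even with GQA's 8x reduction, a single 1M-token request would need hundreds of gigabytes of cache, which is exactly the pressure the later optimizations relieve.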
This growth explains why Claude offers a 200K context window while ChatGPT stayed at 8K for a long time. It is not the model size but the KV-Cache that limits context length.