How PagedAttention optimizes KV-Cache memory usage through Virtual Memory Paging
PagedAttention is the server-side answer to KV-Cache costs: while GQA shrinks the cache itself, paging maximizes how efficiently the available GPU memory is used.
vLLM is the de-facto open-source standard for LLM serving. PagedAttention is its core technique and enables 2-4x higher throughput than earlier serving systems on production workloads with variable sequence lengths.
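A minimal serving sketch with vLLM; the model name and sampling settings are illustrative assumptions, not taken from the text. PagedAttention is applied automatically, so the block-based KV-Cache needs no extra configuration:

```python
from vllm import LLM, SamplingParams

# Load any HF-compatible checkpoint; vLLM manages the KV-Cache in pages internally.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
sampling = SamplingParams(temperature=0.7, max_tokens=256)

prompts = [
    "Explain paging in operating systems in one sentence.",
    "Summarize the benefits of KV-cache reuse.",
]
outputs = llm.generate(prompts, sampling)
for out in outputs:
    print(out.outputs[0].text)
```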
PagedAttention divides the KV-Cache into fixed-size blocks (16 tokens per block by default in vLLM, configurable) and manages them like virtual-memory pages via a per-sequence block table. The cache itself does not get smaller; it is allocated with almost no waste, at the same model quality.
Standard allocation: each sequence's KV tensors must be contiguous and reserved up front for its maximum length → internal and external fragmentation. PagedAttention: blocks can sit anywhere in GPU memory and are allocated on demand → near-optimal memory usage.
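A toy sketch of the block-table idea under simplified assumptions (class and method names are invented for illustration; real vLLM tracks blocks on the GPU, per layer and per KV head):

```python
BLOCK_SIZE = 16  # tokens per block, vLLM's default

class BlockManager:
    """Maps each sequence's logical token positions to physical KV blocks."""

    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids

    def append_token(self, seq_id: int, seq_len: int) -> None:
        # A new physical block is taken only when the sequence crosses a block boundary.
        table = self.block_tables.setdefault(seq_id, [])
        if (seq_len - 1) % BLOCK_SIZE == 0:
            table.append(self.free_blocks.pop())

    def lookup(self, seq_id: int, token_pos: int) -> tuple:
        # The attention kernel performs this indirection when gathering K/V vectors.
        table = self.block_tables[seq_id]
        return table[token_pos // BLOCK_SIZE], token_pos % BLOCK_SIZE

manager = BlockManager(num_physical_blocks=1024)
for length in range(1, 40):                 # sequence 0 grows to 39 tokens
    manager.append_token(seq_id=0, seq_len=length)
print(manager.block_tables[0])              # 3 physical blocks; adjacency is irrelevant
print(manager.lookup(0, token_pos=35))      # (physical block id, offset within block)
```

In the real system the block table lives on the GPU and the lookup is fused into the attention kernel, which is what keeps the indirection cheap.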
With the standard approach, batching sequences of different lengths forces padding or over-reservation to the longest sequence. PagedAttention needs no padding and can additionally share blocks between sequences (e.g. a common prompt prefix, or several samples forked from one prompt, with copy-on-write once they diverge) → higher GPU utilization and larger effective batch sizes.
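A sketch of that sharing mechanism, again with invented names and simplified to block granularity: several sequences point at the same physical prompt blocks, and a block is copied only when a sequence that still shares it needs to write.

```python
class SharedKVCache:
    """Toy block sharing with reference counts and copy-on-write."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.tables = {}   # seq_id -> list of physical block ids
        self.refs = {}     # physical block id -> reference count

    def alloc(self, seq_id) -> int:
        block = self.free.pop()
        self.tables.setdefault(seq_id, []).append(block)
        self.refs[block] = 1
        return block

    def fork(self, parent, child) -> None:
        # The child reuses every block of the parent (e.g. a shared prompt prefix).
        self.tables[child] = list(self.tables[parent])
        for block in self.tables[child]:
            self.refs[block] += 1

    def write(self, seq_id, block_idx: int) -> int:
        # Copy-on-write: a block that is still shared is duplicated before writing.
        block = self.tables[seq_id][block_idx]
        if self.refs[block] > 1:
            self.refs[block] -= 1
            block = self.free.pop()
            self.refs[block] = 1
            self.tables[seq_id][block_idx] = block
        return block

cache = SharedKVCache(num_blocks=64)
cache.alloc("prompt")                  # prompt KV filled once
cache.fork("prompt", "sample_a")       # both samples reuse the prompt block ...
cache.fork("prompt", "sample_b")
cache.write("sample_a", block_idx=0)   # ... until one of them appends new tokens
print(cache.tables)                    # sample_a now owns a private copy
```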
The reference implementation comes from UC Berkeley and is in production use across the industry, including at companies such as Databricks and Meta. Block-based paging has become the foundation of modern KV-Cache management.
Thanks to GPU-optimized block-table lookups fused into the attention kernel, the latency overhead is small. The savings come from eliminating waste rather than shrinking the cache: standard contiguous allocation leaves roughly 60-80% of reserved KV memory unused (fragmentation plus over-reservation for the maximum length), while PagedAttention reduces this waste to under 4%, so the freed memory goes into larger batches.
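A back-of-the-envelope calculation of what that waste means, with illustrative model dimensions (roughly a 7B-class model with GQA; the numbers are assumptions, not taken from the text):

```python
layers, kv_heads, head_dim = 32, 8, 128
bytes_per_elem = 2                                             # fp16, K and V
kv_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
print(kv_per_token // 1024, "KiB per token")                   # 128 KiB

# Contiguous allocation reserves for the maximum length up front;
# PagedAttention only allocates blocks for tokens that actually exist.
max_len, actual_len = 4096, 700
reserved = max_len * kv_per_token
used = actual_len * kv_per_token
print(f"utilization without paging: {used / reserved:.0%}")    # ~17%
```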
The gains grow with context length and with variance in sequence lengths, where fragmentation would otherwise dominate: reported throughput improvements are 2-4x over prior serving systems, and can be larger for workloads with very long contexts and many concurrent requests. Paging is orthogonal to weight quantization, so it combines with GPTQ (and with GQA). Block-based paging has become the standard architecture for serving long contexts.
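A sketch of combining PagedAttention serving with a GPTQ-quantized checkpoint; quantization="gptq" and gpu_memory_utilization are real vLLM arguments, while the model name and the 0.90 value are illustrative assumptions:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-7B-Chat-GPTQ",  # any GPTQ-quantized HF checkpoint
    quantization="gptq",
    gpu_memory_utilization=0.90,            # fraction of VRAM vLLM may claim for weights + KV blocks
)
out = llm.generate(
    ["Why does paging help with long contexts?"],
    SamplingParams(max_tokens=128),
)
print(out[0].outputs[0].text)
```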