DeepSeek Sparse Attention (DSA): Move the sparsity slider and compare Dense Attention (chaotic, memory-intensive) with Sparse Attention (structured, 70% less memory).
Sparse attention computes only the most relevant token pairs instead of the full n×n matrix. DeepSeek Sparse Attention (DSA) uses a lightweight learned indexer to pick the most important connections — roughly 70% less memory at minimal quality loss.
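The idea can be sketched as top-k attention: score each query against the keys, keep only the k best keys per query, and run softmax attention over just those. This is a minimal illustrative sketch, not DSA's actual kernel — a real implementation uses a cheap indexer so the full n×n score matrix is never materialized, whereas here we compute it for clarity.

```python
import numpy as np

def topk_sparse_attention(Q, K, V, k):
    """Per query, attend only to the k highest-scoring keys
    instead of all n keys (simplified illustrative sketch)."""
    n, d = Q.shape
    # Full score matrix, for illustration only; a production sparse
    # kernel avoids ever building this n x n array.
    scores = Q @ K.T / np.sqrt(d)
    # Indices of the top-k keys for each query (unordered within the k).
    idx = np.argpartition(scores, -k, axis=1)[:, -k:]
    out = np.zeros_like(Q)
    for i in range(n):
        s = scores[i, idx[i]]
        w = np.exp(s - s.max())   # numerically stable softmax
        w /= w.sum()
        out[i] = w @ V[idx[i]]    # weighted sum over selected values only
    return out
```

With k = n this reduces exactly to dense softmax attention; shrinking k is the "sparsity slider" in the visualization above.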
Step 4/5 in Chapter 2 "Modern Architecture Variants"
After Flash Attention comes algorithmic optimization for extreme context lengths (200K+): DSA makes 1M+ token contexts practical.
DSA shipped in DeepSeek-V3.2-Exp for long contexts (128K). It scores only ~10-20% of token pairs, with no measurable quality loss on most tasks.
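A back-of-the-envelope count shows where the savings come from. The helper name and the 15% keep rate (30K keys per query at n = 200K, within the ~10-20% range above) are illustrative assumptions:

```python
def scored_pairs(n_tokens, k=None):
    """Query-key pairs scored: dense attention scores all n*n pairs;
    sparse attention scores only k keys per query (illustrative helper)."""
    return n_tokens * (n_tokens if k is None else k)

n = 200_000
dense = scored_pairs(n)             # 40,000,000,000 pairs
sparse = scored_pairs(n, k=30_000)  # 15% of keys kept per query (assumed rate)
print(f"pairs skipped: {1 - sparse / dense:.0%}")  # prints "pairs skipped: 85%"
```

Attention memory and compute scale with the number of scored pairs, which is why keeping only a small fraction of them translates directly into the large savings quoted above.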