Devices: 4 · Total sequence length: 256K · Block/device: 64K · Communication volume: 128 GB · Compute-comm overlap: 92% · Effective context: 256K tokens
Fig. 1 | Ring topology with 4-16 GPUs. Each GPU computes attention over its local KV block while the KV blocks circulate around the ring. Colors: green = compute phase, orange = communication phase. Overlapping the two keeps communication off the critical path.
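The circulation in Fig. 1 can be simulated on one machine. Below is a minimal NumPy sketch (not the distributed implementation): each "device" keeps its query block, the KV blocks rotate for N phases, and an online softmax (running max, normalizer, and accumulator) merges the per-block results so the output matches full attention exactly.

```python
import numpy as np

def full_attention(Q, K, V):
    # Reference: standard softmax attention over the full sequence.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def ring_attention(Q, K, V, n_dev):
    # Each simulated device j keeps query block j; KV blocks rotate
    # for n_dev phases. Online softmax state per device: running max m,
    # normalizer l, unnormalized accumulator acc.
    d = Q.shape[-1]
    Qb, Kb, Vb = np.split(Q, n_dev), np.split(K, n_dev), np.split(V, n_dev)
    out = []
    for j in range(n_dev):                        # "device" j
        m = np.full((Qb[j].shape[0], 1), -np.inf)
        l = np.zeros((Qb[j].shape[0], 1))
        acc = np.zeros_like(Qb[j])
        for i in range(n_dev):                    # phase i
            b = (j + i) % n_dev                   # KV block held this phase
            S = Qb[j] @ Kb[b].T / np.sqrt(d)
            m_new = np.maximum(m, S.max(axis=-1, keepdims=True))
            P = np.exp(S - m_new)
            scale = np.exp(m - m_new)             # rescale old state
            l = l * scale + P.sum(axis=-1, keepdims=True)
            acc = acc * scale + P @ Vb[b]
            m = m_new
        out.append(acc / l)
    return np.concatenate(out)
```

After N phases every query block has seen every KV block, so the concatenated outputs agree with the monolithic computation.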
| Metric | 4 GPUs | 8 GPUs | 16 GPUs | Trend |
|---|---|---|---|---|
| Sequence length/GPU | 64K | 32K | 16K | Shrinks as 1/N |
| Total sequence | 256K | 256K | 256K | Constant (scalable) |
| Local attention ops | O(64K²) | O(32K²) | O(16K²) | Quadratic per GPU |
| Communication volume | 512 GB/phase | 512 GB/phase | 512 GB/phase | Constant (good!) |
| Phases | 4 | 8 | 16 | N = # devices |
| Comm latency | ~80 ms (NVLink) | ~80 ms | ~80 ms | Constant (ring) |
| Compute time/phase | ~200 ms | ~50 ms | ~12 ms | Falls as 1/N² |
| Overlap % | 88% | 92% | 96% | Improves with more GPUs |
| Effective speedup | 3.8× | 7.2× | 14.1× | Near-linear |
🔄 Ring Topology Scales Linearly
Sequence length per device shrinks as 1/N for a fixed total; equivalently, for a fixed per-device block the total context grows linearly with N. 16 GPUs process 256K tokens at a near-14× speedup. Key insight: communication volume stays constant while the computation is distributed.
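The scaling arithmetic from the table reduces to a few lines. A small sketch (helper name and return keys are illustrative, not from a library):

```python
def ring_plan(total_tokens: int, n_dev: int) -> dict:
    """Per-device numbers for a fixed total context, as in the table above."""
    assert total_tokens % n_dev == 0
    block = total_tokens // n_dev      # shrinks as 1/N
    return {
        "block_per_device": block,     # e.g. 256K / 4 devices = 64K
        "phases": n_dev,               # one ring hop per KV block
        "local_attn_ops": block ** 2,  # O(block²) score entries per device
    }
```

Doubling the device count quarters the per-device attention work, which is why compute time per phase falls as 1/N² in the table.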
Compute-Communication Overlap is Critical
In each phase, GPU i computes attention with its current KV block while simultaneously sending that block to GPU i+1 and receiving the next one from GPU i-1. Circulation thus runs in parallel with computation. Without overlap, GPU utilization drops to roughly 50%; with overlap, 92-96% is achievable.
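The utilization claim follows from a simple timing model. A sketch under the assumption that a phase is fully overlapped (wall-clock = max of compute and comm) or fully serialized (wall-clock = their sum):

```python
def gpu_utilization(compute_s: float, comm_s: float, overlap: bool) -> float:
    # Fraction of wall-clock time per phase spent computing.
    wall = max(compute_s, comm_s) if overlap else compute_s + comm_s
    return compute_s / wall
```

With equal compute and comm time, serializing them yields exactly 50% utilization; with the table's ~200 ms compute and ~80 ms comm at 4 GPUs, the overlapped phase is compute-bound.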
📊 Phase Structure: N Phases per Circulation
With 4 GPUs, 4 phases suffice for all KV blocks to circulate. In phase i, GPU j computes with KV block (j + i) mod N. After N phases, every query block has attended to every KV block, so all attention scores are computed. The schedule is deterministic and easy to synchronize.
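The (j + i) mod N schedule can be enumerated directly, which makes the "every block exactly once" property checkable. A minimal sketch (function name is illustrative):

```python
def kv_schedule(n_dev: int) -> list[list[int]]:
    # schedule[i][j] = index of the KV block GPU j holds in phase i.
    return [[(j + i) % n_dev for j in range(n_dev)] for i in range(n_dev)]
```

In phase 0 each GPU holds its own block; over N phases each GPU sees each of the N blocks exactly once, so no score is computed twice and none is missed.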
🚀 Communication Volume Remains Constant
Per phase, ~512 GB of KV blocks circulate around the ring, regardless of whether there are 4 or 16 GPUs. This differs from tree-reduction schemes, whose communication pattern scales logarithmically with device count. The ring topology is therefore well suited to attention workloads.
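Why the device count cancels out: each GPU forwards one KV block (size ∝ 1/N) per phase, and N GPUs do so simultaneously, so the summed per-phase volume is always the full K+V tensor. A sketch, with a hypothetical model width `d_model` and fp16 storage (the 512 GB figure above depends on the actual model configuration):

```python
def ring_comm_gb_per_phase(total_tokens: int, d_model: int,
                           dtype_bytes: int = 2) -> float:
    # N devices each forward a K and V block of total_tokens/N rows per
    # phase; summed over the ring this is the full K+V tensor, so the
    # device count does not appear in the formula at all.
    return 2 * total_tokens * d_model * dtype_bytes / 1e9
```

The absence of an `n_dev` parameter is the point: the aggregate per-phase volume is a property of the model and sequence, not of the ring size.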
💡 Practically Limited by NVLink Bandwidth
At 141 GB/s, a 512 GB transfer takes ~3.6 s. With good compute-communication overlap, a phase takes ~200 ms at 4 GPUs. Adding GPUs keeps paying off only as long as per-phase compute time does not fall below the (constant) communication time.
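The bandwidth arithmetic above is a one-liner; a second helper makes the compute-bound vs. comm-bound distinction explicit (both names are illustrative):

```python
def transfer_time_s(volume_gb: float, bandwidth_gb_s: float) -> float:
    # Time to move volume_gb at a sustained bandwidth of bandwidth_gb_s.
    return volume_gb / bandwidth_gb_s

def phase_is_compute_bound(compute_s: float, comm_s: float) -> bool:
    # With full overlap, the slower of the two dominates the phase.
    return compute_s >= comm_s
```

At the table's numbers (~200 ms compute, ~80 ms comm for 4 GPUs), the phase is compute-bound, so the transfer hides entirely behind the matmuls.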
🎯 8-16 GPUs are the Sweet Spot
4 GPUs reach 88% overlap and a 3.8× speedup; 16 GPUs reach 96% overlap and 14.1×. Beyond 16 GPUs, traffic must cross network switches instead of NVLink and latency rises. Within a node, the ring is optimal.