Fig. 1 | MoE Load Balancing Visualization. Left: Balanced – Tokens are evenly distributed across experts. Right: Imbalanced – All tokens route to Expert 1 (bottleneck, red warning).

✓ Optimal Scenario: Balanced

Tokens are evenly distributed across all experts. Each expert processes about 20% of tokens per layer.
Tokens per expert: ~20% each
GPU utilization: ~80% (optimal)
Throughput: maximum
Latency: minimal

✗ Problem Scenario: Imbalanced

The vast majority of tokens route to Expert 1. The system effectively degenerates into a dense model, while still paying the communication overhead of MoE routing.
Tokens on Expert 1: 80%
Tokens on other experts: 5% each
GPU utilization: Expert 1 at 100% (bottleneck)
Latency: 3-5× higher
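
To make the comparison concrete, the short sketch below replays the two scenarios with the illustrative numbers from above (five experts; a 20/20/20/20/20 split versus an 80/5/5/5/5 split). It assumes, purely for illustration, that an expert's processing time is proportional to its token count and that a layer finishes only when its busiest expert does; the token count and variable names are made up.

```python
# Toy comparison of balanced vs. imbalanced expert routing.
# Assumption (for illustration only): per-expert processing time is
# proportional to its token count, and a layer's step time is set by
# its busiest expert (critical path).

NUM_TOKENS = 1000  # illustrative batch size

balanced   = [0.20, 0.20, 0.20, 0.20, 0.20]  # fraction of tokens per expert
imbalanced = [0.80, 0.05, 0.05, 0.05, 0.05]

def analyze(name, load_fractions):
    tokens_per_expert = [f * NUM_TOKENS for f in load_fractions]
    step_time = max(tokens_per_expert)              # slowest expert dominates
    # Utilization of each expert relative to the busiest one:
    utilization = [t / step_time for t in tokens_per_expert]
    avg_util = sum(utilization) / len(utilization)
    print(f"{name:<10} step_time={step_time:6.0f} tokens  "
          f"avg_utilization={avg_util:.1%}")

analyze("balanced", balanced)
analyze("imbalanced", imbalanced)
# balanced   step_time=   200 tokens  avg_utilization=100.0%
# imbalanced step_time=   800 tokens  avg_utilization=25.0%
```

Under these toy assumptions the imbalanced layer takes 4× longer per step, consistent with the 3-5× latency figure above, and most of the hardware sits idle.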

Why is Load Imbalance a Problem?

1. GPU Utilization Inefficiency: If Expert 1 runs at 100% utilization while Experts 2-8 sit at 5%, the average utilization is (100 + 7 × 5) / 8 ≈ 16.9%. Most GPUs are effectively idle and do no useful work.
2. Bottleneck Effect: Expert 1 determines the overall throughput. All other experts must wait for it, so latency is dominated by the slowest component (the critical path).
3. Network Overhead: In distributed systems (multiple GPUs), token activations and expert outputs must be transferred between devices. Under imbalance, the network links to Expert 1 saturate while the others sit idle.
4. Router Learning Problem: The router is trained by gradient descent. If its top-2 selection systematically favors Expert 1, the task loss alone provides no signal to correct the imbalance, so the collapse can reinforce itself.
5. Solution Approaches: Modern MoE systems add an auxiliary loss to enforce balance, L_aux = α × Σ_i P_i × E_i, where P_i is the average router probability assigned to expert i and E_i is the fraction of tokens routed to expert i (a minimal implementation sketch follows this list).
6. Practical Observation: In real training runs, load imbalance can lead to 2-3× longer training times. Systems such as DeepSeek and Mixtral add their own strategies, e.g. expert dropout during training and dynamic expert selection.