What happens when all tokens route to the same experts: The critical load balancing problem in MoE systems
Load balancing is the central challenge in MoE: when all tokens choose the same experts, compute bottlenecks emerge and most GPUs sit idle. An auxiliary loss enforces an even distribution during training.
This page complements the router simulation with the training perspective: having seen how routing works, we now examine why problems arise without regularization.
Without balancing, 1-2 experts become overloaded while the rest go unused, and GPU utilization drops to 15-20%. The auxiliary loss accounts for only ~1% of the total loss, yet it prevents expert collapse and avoids 2-3× longer training times.
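To make this concrete, here is a minimal sketch of a Switch-Transformer-style auxiliary load-balancing loss in PyTorch. The function name `aux_load_balancing_loss` and the top-1 routing framing are illustrative assumptions, not the exact code behind this page's simulation:

```python
import torch
import torch.nn.functional as F

def aux_load_balancing_loss(router_logits: torch.Tensor, num_experts: int) -> torch.Tensor:
    """Sketch of a Switch-style auxiliary loss (assumes top-1 routing).

    router_logits: (num_tokens, num_experts) raw router scores.
    Returns a scalar that equals 1.0 at a perfectly uniform split
    and grows as tokens concentrate on few experts.
    """
    # Per-token routing distribution over experts
    probs = F.softmax(router_logits, dim=-1)                 # (tokens, experts)
    # f_i: fraction of tokens whose top-1 choice is expert i (hard assignment)
    top1 = probs.argmax(dim=-1)
    f = F.one_hot(top1, num_experts).float().mean(dim=0)     # (experts,)
    # P_i: mean router probability mass assigned to expert i (soft assignment)
    p = probs.mean(dim=0)                                    # (experts,)
    # N * sum(f_i * P_i) is minimized when both f and p are uniform
    return num_experts * torch.sum(f * p)

# Usage sketch: weight the auxiliary term at ~1% of the total loss,
# matching the coefficient mentioned above.
logits = torch.randn(64, 8)                                  # 64 tokens, 8 experts
task_loss = torch.tensor(2.5)                                # placeholder task loss
total_loss = task_loss + 0.01 * aux_load_balancing_loss(logits, num_experts=8)
```

The hard fraction `f` is not differentiable, so gradients flow only through the soft probabilities `p`; this still pushes the router toward a uniform split while keeping the penalty small relative to the task loss.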