Interactive MoE Routing Visualization
[Interactive panel: live counters for tokens processed (of 8), active parameters per token (k=2 of 8 → 25%), compute savings vs. a dense model (4x), and per-expert load balancing.]
💡 How does routing work?
Each token passes through the router network – a small neural network that computes a score for each expert. The Top-k experts with the highest scores are activated, and their outputs are weighted by the router scores and summed.

$$G(x) = \mathrm{Softmax}\big(\mathrm{TopK}(x \cdot W_{\text{router}})\big), \qquad y = \sum_{i \in \mathrm{TopK}} G(x)_i \cdot E_i(x)$$
Fig. 1 | Sparse Mixture of Experts Routing. The router network assigns each token to the Top-k experts. Only these experts are activated – with k=2 of 8 experts, only 25% of the FFN parameters are used per token, while the model has access to 4x more parameters.
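Concretely, the routing step above can be sketched in a few lines of PyTorch. This is a minimal illustration under assumed names and sizes (`Expert`, `TopKRouter`, d_model=512 are made up for the example), not any particular library's implementation; the per-expert loop is written for clarity rather than speed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """Toy FFN expert: d_model -> 4*d_model -> d_model."""
    def __init__(self, d_model):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        return self.net(x)

class TopKRouter(nn.Module):
    """Sparse MoE layer: each token is processed by only its top-k experts."""
    def __init__(self, d_model, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)   # W_router
        self.experts = nn.ModuleList([Expert(d_model) for _ in range(num_experts)])

    def forward(self, x):                       # x: (num_tokens, d_model)
        logits = self.router(x)                 # (num_tokens, num_experts)
        top_logits, top_idx = logits.topk(self.top_k, dim=-1)
        gates = F.softmax(top_logits, dim=-1)   # softmax over the top-k scores only
        y = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx = top_idx[:, slot]              # chosen expert per token for this slot
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():                  # run each expert only on its tokens
                    y[mask] += gates[mask, slot, None] * expert(x[mask])
        return y

tokens = torch.randn(8, 512)                    # 8 tokens, d_model = 512
out = TopKRouter(d_model=512)(tokens)           # only 2 of 8 experts run per token
```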
Why MoE?
  • More parameters, less compute: Mixtral 8x7B has 47B parameters but activates only ~13B per token (see the back-of-the-envelope check after this list)
  • Scalability: GPT-4 is widely rumored to use MoE with ~1.76T total parameters
  • Specialization: Experts learn different aspects of language
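A rough back-of-the-envelope check of the Mixtral figures above, using the published Mixtral 8x7B config values. Layer norms and the tiny router weights are ignored, so the numbers are approximate, and the exact "active" count depends on what you include.

```python
# Approximate parameter count for Mixtral 8x7B (published config values).
d_model, d_ff, n_layers = 4096, 14336, 32
n_experts, top_k = 8, 2
vocab = 32000
n_heads, n_kv_heads, head_dim = 32, 8, 128

attn = n_layers * (2 * d_model * n_heads * head_dim            # W_q, W_o
                   + 2 * d_model * n_kv_heads * head_dim)      # W_k, W_v (grouped-query attention)
expert_ffn = 3 * d_model * d_ff                                # gate, up, down projections per expert
all_experts = n_layers * n_experts * expert_ffn
embeddings = 2 * vocab * d_model                               # input + output embeddings

total = embeddings + attn + all_experts
active = embeddings + attn + n_layers * top_k * expert_ffn     # only k=2 experts per token

print(f"total  ≈ {total / 1e9:.1f}B")    # ≈ 47B
print(f"active ≈ {active / 1e9:.1f}B")   # ≈ 13B
```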
⚖️ Load Balancing
  • Problem: Without balancing, a few experts become overloaded while the rest are rarely used
  • Auxiliary Loss: Penalizes uneven token distribution across experts during training (see the sketch after this list)
  • Capacity Factor: Caps the number of tokens each expert can process (typical value: 1.25)
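A minimal sketch of such an auxiliary loss, following the common Switch-Transformer-style "fraction of tokens × mean router probability" form. Exact definitions vary between papers, and the coefficient used to weight this term against the language-modeling loss is chosen per model.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, top_k=2):
    """Penalize uneven expert usage.

    router_logits: (num_tokens, num_experts) raw router scores.
    The loss reaches its minimum (1.0) when tokens and router probability
    are spread uniformly across experts.
    """
    num_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)                        # (tokens, experts)
    # f_i: fraction of routing slots assigned to expert i (top-k choices)
    top_idx = probs.topk(top_k, dim=-1).indices
    dispatch = F.one_hot(top_idx, num_experts).sum(dim=1).float()   # (tokens, experts)
    f = dispatch.mean(dim=0) / top_k
    # P_i: mean router probability assigned to expert i
    P = probs.mean(dim=0)
    return num_experts * torch.sum(f * P)

logits = torch.randn(64, 8)            # 64 tokens, 8 experts
aux = load_balancing_loss(logits)      # added to the LM loss with a small weight
```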
📊 Models with MoE
| Model | Experts | Top-k | Total Params |
|---|---|---|---|
| Mixtral 8x7B | 8 | 2 | 47B |
| DeepSeek V3 | 256 | 8 | 671B |
| Grok-1 | 8 | 2 | 314B |
| GPT-4 (estimated) | 16 | 2 | ~1.76T |
| Llama 4 Scout | 16 | 2 | 109B |
| Llama 4 Maverick | 128 | 8 | 400B |
| Llama 4 Behemoth | 16 | 2 | 2T |