Interactive MoE Routing Visualization
[Interactive panel: live counters for tokens processed (of 8), active parameters per token (k=2 of 8 → 25%), compute savings vs. a dense model (4x), and per-expert load balancing.]
💡 How does routing work?
Each token passes through the router network – a small neural network that computes a score for each expert. The Top-k experts with the highest scores are activated, and their outputs are weighted by the router scores and summed.

$$G(x) = \mathrm{Softmax}\big(\mathrm{TopK}(x \cdot W_{\text{router}})\big), \qquad y = \sum_{i \in \mathrm{TopK}} G(x)_i \cdot E_i(x)$$
Fig. 1 | Sparse Mixture of Experts Routing. The router network assigns each token to the Top-k experts. Only these experts are activated – with k=2 of 8 experts, only 25% of the FFN parameters are used per token, while the model has access to 4x more parameters.
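Concretely, the routing step above can be sketched in a few lines of PyTorch. This is a minimal illustration under assumed names and sizes (`Expert`, `TopKRouter`, d_model=512 are made up for the example), not any particular library's implementation; the per-expert loop is written for clarity rather than speed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """Toy FFN expert: d_model -> 4*d_model -> d_model."""
    def __init__(self, d_model):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        return self.net(x)

class TopKRouter(nn.Module):
    """Sparse MoE layer: each token is processed by only its top-k experts."""
    def __init__(self, d_model, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)   # W_router
        self.experts = nn.ModuleList([Expert(d_model) for _ in range(num_experts)])

    def forward(self, x):                       # x: (num_tokens, d_model)
        logits = self.router(x)                 # (num_tokens, num_experts)
        top_logits, top_idx = logits.topk(self.top_k, dim=-1)
        gates = F.softmax(top_logits, dim=-1)   # softmax over the top-k scores only
        y = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx = top_idx[:, slot]              # chosen expert per token for this slot
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():                  # run each expert only on its tokens
                    y[mask] += gates[mask, slot, None] * expert(x[mask])
        return y

tokens = torch.randn(8, 512)                    # 8 tokens, d_model = 512
out = TopKRouter(d_model=512)(tokens)           # only 2 of 8 experts run per token
```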
Why MoE?
  • More parameters, less compute: Mixtral 8x7B has 47B parameters but activates only ~13B per token (see the back-of-the-envelope check after this list)
  • Scalability: GPT-4 is widely rumored to use MoE with ~1.76T total parameters
  • Specialization: Experts learn different aspects of language
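A rough back-of-the-envelope check of the Mixtral figures above, using the published Mixtral 8x7B config values. Layer norms and the tiny router weights are ignored, so the numbers are approximate, and the exact "active" count depends on what you include.

```python
# Approximate parameter count for Mixtral 8x7B (published config values).
d_model, d_ff, n_layers = 4096, 14336, 32
n_experts, top_k = 8, 2
vocab = 32000
n_heads, n_kv_heads, head_dim = 32, 8, 128

attn = n_layers * (2 * d_model * n_heads * head_dim            # W_q, W_o
                   + 2 * d_model * n_kv_heads * head_dim)      # W_k, W_v (grouped-query attention)
expert_ffn = 3 * d_model * d_ff                                # gate, up, down projections per expert
all_experts = n_layers * n_experts * expert_ffn
embeddings = 2 * vocab * d_model                               # input + output embeddings

total = embeddings + attn + all_experts
active = embeddings + attn + n_layers * top_k * expert_ffn     # only k=2 experts per token

print(f"total  ≈ {total / 1e9:.1f}B")    # ≈ 47B
print(f"active ≈ {active / 1e9:.1f}B")   # ≈ 13B
```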
⚖️ Load Balancing
  • Problem: Without balancing, a few experts become overloaded while the rest are rarely used
  • Auxiliary Loss: Penalizes uneven token distribution across experts during training (see the sketch after this list)
  • Capacity Factor: Caps the number of tokens each expert can process (typical value: 1.25)
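A minimal sketch of such an auxiliary loss, following the common Switch-Transformer-style "fraction of tokens × mean router probability" form. Exact definitions vary between papers, and the coefficient used to weight this term against the language-modeling loss is chosen per model.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, top_k=2):
    """Penalize uneven expert usage.

    router_logits: (num_tokens, num_experts) raw router scores.
    The loss reaches its minimum (1.0) when tokens and router probability
    are spread uniformly across experts.
    """
    num_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)                        # (tokens, experts)
    # f_i: fraction of routing slots assigned to expert i (top-k choices)
    top_idx = probs.topk(top_k, dim=-1).indices
    dispatch = F.one_hot(top_idx, num_experts).sum(dim=1).float()   # (tokens, experts)
    f = dispatch.mean(dim=0) / top_k
    # P_i: mean router probability assigned to expert i
    P = probs.mean(dim=0)
    return num_experts * torch.sum(f * P)

logits = torch.randn(64, 8)            # 64 tokens, 8 experts
aux = load_balancing_loss(logits)      # added to the LM loss with a small weight
```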
📊 Models with MoE
| Model | Experts | Top-k | Total Params |
|---|---|---|---|
| Mixtral 8x7B | 8 | 2 | 47B |
| DeepSeek V3 | 256 | 8 | 671B |
| Grok-1 | 8 | 2 | 314B |
| GPT-4 (estimated) | 16 | 2 | ~1.76T |
| Llama 4 Scout | 16 | 2 | 109B |
| Llama 4 Maverick | 128 | 8 | 400B |
| Llama 4 Behemoth | 16 | 2 | 2T |