How the router network decides which expert subnetworks to activate for each token – the key to efficient models with trillions of parameters.
MoE routers select the most relevant experts for each token. Instead of activating all parameters, only the top-k experts process each token, which keeps compute per token far below what the total parameter count suggests and makes models with trillions of parameters practical.
Chapter 2 begins with MoE as a fundamental alternative to the dense architecture from Chapter 1. MoE lets a model hold many more parameters at roughly the same compute per token, which is the key to scaling to trillions of parameters.
Mixtral 8×7B activates only 2 of its 8 experts per token: roughly 13B active parameters out of 47B total. GPT-4 is rumored to use 16 experts for a total of about 1.76T parameters. Without MoE, models of this size would be practically untrainable.
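The routing step itself is small: a linear projection scores every expert for every token, the top-k scores are kept, and a softmax over just those k scores produces the mixing weights. The sketch below shows this in PyTorch; the function name, tensor shapes, and toy sizes are illustrative assumptions, not the exact Mixtral or DeepSeek implementation.

```python
# Minimal top-k router sketch (hypothetical names and shapes).
import torch
import torch.nn.functional as F

def top_k_route(hidden, router_weight, k=2):
    """Pick k experts per token and return their normalized routing weights.

    hidden:        (num_tokens, d_model) token representations
    router_weight: (d_model, num_experts) learned router projection
    """
    logits = hidden @ router_weight                  # (num_tokens, num_experts)
    topk_logits, topk_idx = logits.topk(k, dim=-1)   # keep only k experts per token
    weights = F.softmax(topk_logits, dim=-1)         # renormalize over the chosen k
    return topk_idx, weights

# Toy usage: 4 tokens, d_model=16, 8 experts, top-2 as in Mixtral 8x7B.
hidden = torch.randn(4, 16)
router_weight = torch.randn(16, 8)
idx, w = top_k_route(hidden, router_weight, k=2)
print(idx.shape, w.shape)  # torch.Size([4, 2]) torch.Size([4, 2])
```

Each token's output is then the weighted sum of the k selected experts' outputs; the remaining experts are never evaluated for that token, which is where the compute savings come from.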
| Model | Experts | Top-k | Total Params |
|---|---|---|---|
| Mixtral 8x7B | 8 | 2 | 47B |
| DeepSeek V3 | 256 | 8 | 671B |
| Grok-1 | 8 | 2 | 314B |
| GPT-4 (estimated) | 16 | 2 | ~1.76T |
| Llama 4 Scout | 16 | 2 | 109B |
| Llama 4 Maverick | 128 | 8 | 400B |
| Llama 4 Behemoth | 16 | 2 | 2T |
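The gap between active and total parameters follows directly from the expert count and top-k. A back-of-the-envelope sketch for Mixtral 8×7B, using only the 47B total and 13B active figures quoted above (the resulting split into per-expert and shared parameters is an estimate, not an official breakdown):

```python
# Estimate per-expert and shared parameter counts for Mixtral 8x7B
# from total = shared + num_experts * per_expert
# and  active = shared + top_k * per_expert.
total_params = 47e9
active_params = 13e9
num_experts, top_k = 8, 2

per_expert = (total_params - active_params) / (num_experts - top_k)
shared = total_params - num_experts * per_expert
print(f"per-expert params ~ {per_expert/1e9:.1f}B, shared params ~ {shared/1e9:.1f}B")
# per-expert params ~ 5.7B, shared params ~ 1.7B
```

The same arithmetic explains the table: DeepSeek V3 and Llama 4 Maverick push the expert count far higher while keeping top-k small, so total parameters grow much faster than per-token compute.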