How MoE models can have more parameters with the same compute cost as dense models
The parameter-vs.-compute trade-off is central to MoE models: they hold many parameters, but only a fraction of them is active for each token. This enables "cheap" large models whose inference cost is close to that of a much smaller dense model, as the sketch below illustrates.
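A minimal sketch of how this works (generic top-k routing in PyTorch; the layer sizes and names are illustrative, not taken from any of the models listed below): a router scores all experts for every token, but only the k highest-scoring expert FFNs are actually executed, so the parameter count grows with the number of experts while the per-token compute grows only with k.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    """Illustrative top-k MoE feed-forward layer (not any specific model's code)."""
    def __init__(self, d_model=64, d_ff=256, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts)   # gating network scores every expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                                # x: (tokens, d_model)
        logits = self.router(x)                          # (tokens, num_experts)
        weights, idx = torch.topk(logits, self.k, dim=-1)  # keep only k experts per token
        weights = F.softmax(weights, dim=-1)             # renormalize over the chosen k
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = (idx == e)                            # tokens that routed to expert e
            token_ids, slot = mask.nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue                                 # unused experts cost no compute
            out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out

layer = TinyMoELayer()
tokens = torch.randn(10, 64)
print(layer(tokens).shape)  # torch.Size([10, 64]): 8 experts' parameters, 2 experts' compute per token
```

With num_experts=8 and k=2 the layer stores eight experts' worth of FFN weights but runs only two of them per token, mirroring Mixtral's 8-expert / top-2 setup. The total-to-active ratios in the table below are smaller than 8:2 because attention and embedding parameters are shared and always active.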
Scaling Laws (1/5) explains the relationship between parameter count and compute for different architectures.
Models like Mixtral 8x7B and Llama 4 Maverick (17B active / 400B total parameters) show that MoE can deliver dense-level performance at a fraction of the compute cost, which changes the economics of LLMs.
| Model | Architecture | Total Parameters | Active Parameters | Total:Active Ratio (% Active) | Release |
|---|---|---|---|---|---|
| GPT-3 | Dense Transformer | 175B | 175B | 1:1 (100%) | 2020 |
| Mixtral 8x7B | 8-Expert MoE | 47B | 13B | 3.6:1 (27%) | 2023 |
| Mixtral 8x22B | 8-Expert MoE | 141B | 39B | 3.6:1 (27%) | 2024 |
| Grok-1 | ?-Expert MoE | 314B | ~86B | ~3.6:1 (~27%) | 2023 |
| DeepSeek R1 | Multi-Expert MoE | 671B | 37B | 18.1:1 (5.5%) | 2025 |
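A quick sanity check of the ratio column, using only the total and active parameter counts from the table above (a minimal sketch; the figures are not recomputed from the model cards):

```python
# Total vs. active parameters as listed in the table above.
models = {
    "GPT-3":         (175e9, 175e9),
    "Mixtral 8x7B":  (47e9,  13e9),
    "Mixtral 8x22B": (141e9, 39e9),
    "Grok-1":        (314e9, 86e9),
    "DeepSeek R1":   (671e9, 37e9),
}

for name, (total, active) in models.items():
    # Ratio of stored to executed parameters, and the share that runs per token.
    print(f"{name:14s} total:active = {total / active:4.1f}:1   active share = {active / total:6.1%}")
```

Running this reproduces the roughly 3.6:1 (about 27%) ratios for Mixtral and Grok-1 and the 18.1:1 (5.5%) ratio for DeepSeek R1.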