Fig. 1 | Comparison: Dense Model (13B parameters, 13B active) vs MoE (47B parameters, 13B active). The "active" compute is identical, but MoE can store 3.6x more parameters.

Dense Model (e.g., GPT-3 13B)
- Total parameters: 13B
- Active parameters per token: 13B (100%)
- Memory required: 26 GB (FP16)
- Compute per token: ~26B FLOPs

MoE Model (e.g., Mixtral 8x7B)
- Total parameters: 47B
- Active parameters per token: 13B (27%)
- Memory required: 94 GB (FP16, all 8 experts loaded)
- Compute per token: ~26B FLOPs (same as the dense model)
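
The numbers above follow from two common rules of thumb: FP16 stores each parameter in 2 bytes, and a Transformer forward pass costs roughly 2 FLOPs per active parameter per token. Here is a minimal sketch of that arithmetic in Python; the parameter counts come from the figure, while the 2-bytes and 2-FLOPs factors are approximations rather than measurements.

```python
# Back-of-the-envelope memory and compute for the dense vs. MoE comparison above.
# Assumptions: FP16 ~ 2 bytes per parameter; forward pass ~ 2 FLOPs per active
# parameter per token. Both are rough rules of thumb, not exact measurements.

def memory_gb(total_params_billions: float, bytes_per_param: float = 2.0) -> float:
    """GB (1e9 bytes) needed just to hold the weights."""
    return total_params_billions * bytes_per_param

def flops_per_token_billions(active_params_billions: float) -> float:
    """Approximate forward-pass FLOPs per token, in billions."""
    return 2.0 * active_params_billions

# Dense 13B model: every parameter is active for every token.
print(memory_gb(13), flops_per_token_billions(13))   # -> 26.0 GB, 26.0B FLOPs

# Mixtral-style MoE: 47B parameters stored, ~13B active per token.
print(memory_gb(47), flops_per_token_billions(13))   # -> 94.0 GB, 26.0B FLOPs (same compute)
```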

The MoE Advantage: Why Does This Work?

1. Sparse Activation: In a dense model, every parameter participates in every token. In an MoE model, the router activates only a subset of experts per token (e.g., the top 2 of 8), so compute scales with the active parameters rather than the total (see the routing sketch after this list).
2. Memory vs. Compute Trade-off: All expert parameters must be kept in memory (94 GB here), even though only 27% are used per token. This is practical in large multi-GPU clusters, where the experts can be distributed across different devices (expert parallelism).
3. Specialization: With 8 experts, different experts can specialize in, e.g., grammar, semantics, code, entities, or mathematics. This enables finer control and better performance on specialized tasks.
4. Scaling Law: According to the Chinchilla optimum, model size and training-data quantity should be scaled proportionally. MoE allows asymmetric scaling: parameters can be added "cheaply" (at little extra per-token compute) as long as the experts remain specialized.
5. Practical Benefit: An MoE with 13B active parameters runs at roughly the speed of a dense 13B model but reaches higher quality, because the additional (per-token inactive) parameters still contribute capacity for representing more concepts. Put differently, a dense model of the same quality would have to be larger, and therefore slower.
6. Model Size Comparison: Mixtral 8x7B has 47B parameters, but its per-token cost is often compared to that of a 13B dense model such as GPT-3 13B (only 13B parameters are active). DeepSeek R1 has 671B parameters but only 37B active: an 18x "enlargement" at the same per-token compute.
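
Point 1 is easiest to see in code. The following is a minimal, illustrative sketch of a top-2-of-8 MoE layer in PyTorch; it is not the implementation of Mixtral or any other named model (production systems add load-balancing losses, capacity limits, and expert parallelism across GPUs, as mentioned in point 2), but it shows how a router selects experts so that only the chosen experts' weights are used for a given token.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Sketch of a sparsely activated MoE layer (top-2 of 8 experts)."""

    def __init__(self, d_model: int = 512, d_ff: int = 2048,
                 num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model) -- batch and sequence dims flattened for simplicity.
        logits = self.router(x)                              # (tokens, num_experts)
        weights, indices = logits.topk(self.top_k, dim=-1)   # keep only the top-k experts
        weights = F.softmax(weights, dim=-1)                 # renormalize their scores

        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            # Only tokens routed to expert e pass through it; the parameters of
            # the other experts are never touched for these tokens, which is
            # where the compute saving comes from.
            token_idx, slot_idx = (indices == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue
            out[token_idx] += weights[token_idx, slot_idx, None] * expert(x[token_idx])
        return out

# Usage: 10 tokens, each processed by only 2 of the 8 expert FFNs.
layer = TopKMoE()
y = layer(torch.randn(10, 512))
print(y.shape)  # torch.Size([10, 512])
```

With num_experts = 8 and top_k = 2, each token pays the compute cost of two expert FFNs while all eight sets of expert weights stay resident in memory: exactly the memory-vs-compute trade-off described in point 2.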

Model Comparison Table

| Model | Architecture | Total Parameters | Active Parameters | Ratio (total : active) | Release |
|---|---|---|---|---|---|
| GPT-3 | Dense Transformer | 175B | 175B | 1:1 (100% active) | 2020 |
| Mixtral 8x7B | 8-expert MoE | 47B | 13B | 3.6:1 (27% active) | 2023 |
| Mixtral 8x22B | 8-expert MoE | 141B | 39B | 3.6:1 (27% active) | 2024 |
| Grok-1 | MoE (? experts) | 314B | ~86B | ~3.6:1 | 2023 |
| DeepSeek R1 | Fine-grained MoE | 671B | 37B | 18.1:1 (5.5% active) | 2025 |
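
For completeness, the Ratio column is just total parameters divided by active parameters. A quick sanity check using the rounded counts from the table (small deviations, e.g. ~3.7 instead of ~3.6 for Grok-1, are rounding artifacts):

```python
# Recompute the "Ratio" column from the rounded parameter counts in the table above.
models = {
    "GPT-3":         (175, 175),
    "Mixtral 8x7B":  (47, 13),
    "Mixtral 8x22B": (141, 39),
    "Grok-1":        (314, 86),
    "DeepSeek R1":   (671, 37),
}
for name, (total_b, active_b) in models.items():
    print(f"{name:14s} {total_b / active_b:4.1f}:1  "
          f"({100 * active_b / total_b:.1f}% of parameters active per token)")
```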