How MoE models can have more parameters with the same compute cost as dense models
The parameter-vs.-compute trade-off is central to MoE models: they hold many parameters, but only a fraction of them is active for each token. This enables "cheap" large models whose inference cost is close to that of a much smaller dense model, as the sketch below illustrates.
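A minimal sketch of how this works (generic top-k routing in PyTorch; the layer sizes and names are illustrative, not taken from any of the models listed below): a router scores all experts for every token, but only the k highest-scoring expert FFNs are actually executed, so the parameter count grows with the number of experts while the per-token compute grows only with k.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    """Illustrative top-k MoE feed-forward layer (not any specific model's code)."""
    def __init__(self, d_model=64, d_ff=256, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts)   # gating network scores every expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                                # x: (tokens, d_model)
        logits = self.router(x)                          # (tokens, num_experts)
        weights, idx = torch.topk(logits, self.k, dim=-1)  # keep only k experts per token
        weights = F.softmax(weights, dim=-1)             # renormalize over the chosen k
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = (idx == e)                            # tokens that routed to expert e
            token_ids, slot = mask.nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue                                 # unused experts cost no compute
            out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out

layer = TinyMoELayer()
tokens = torch.randn(10, 64)
print(layer(tokens).shape)  # torch.Size([10, 64]): 8 experts' parameters, 2 experts' compute per token
```

With num_experts=8 and k=2 the layer stores eight experts' worth of FFN weights but runs only two of them per token, mirroring Mixtral's 8-expert / top-2 setup. The total-to-active ratios in the table below are smaller than 8:2 because attention and embedding parameters are shared and always active.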
Scaling Laws (1/5) explains the relationship between parameter count and compute for different architectures.
Models like Mixtral 8x7B and Llama 4 Maverick (17B active / 400B total parameters) show that MoE can deliver dense-level performance at a fraction of the compute cost, which changes the economics of LLMs.
| Model | Architecture | Total Parameters | Active Parameters | Total:Active Ratio (% Active) | Release |
|---|---|---|---|---|---|
| GPT-3 | Dense Transformer | 175B | 175B | 1:1 (100%) | 2020 |
| Mixtral 8x7B | 8-Expert MoE | 47B | 13B | 3.6:1 (27%) | 2023 |
| Mixtral 8x22B | 8-Expert MoE | 141B | 39B | 3.6:1 (27%) | 2024 |
| Grok-1 | ?-Expert MoE | 314B | ~86B | ~3.6:1 (~27%) | 2023 |
| DeepSeek R1 | Multi-Expert MoE | 671B | 37B | 18.1:1 (5.5%) | 2025 |
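A quick sanity check of the ratio column, using only the total and active parameter counts from the table above (a minimal sketch; the figures are not recomputed from the model cards):

```python
# Total vs. active parameters as listed in the table above.
models = {
    "GPT-3":         (175e9, 175e9),
    "Mixtral 8x7B":  (47e9,  13e9),
    "Mixtral 8x22B": (141e9, 39e9),
    "Grok-1":        (314e9, 86e9),
    "DeepSeek R1":   (671e9, 37e9),
}

for name, (total, active) in models.items():
    # Ratio of stored to executed parameters, and the share that runs per token.
    print(f"{name:14s} total:active = {total / active:4.1f}:1   active share = {active / total:6.1%}")
```

Running this reproduces the roughly 3.6:1 (about 27%) ratios for Mixtral and Grok-1 and the 18.1:1 (5.5%) ratio for DeepSeek R1.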