The Chinchilla Recipe for Scaling

The Chinchilla scaling laws (DeepMind, 2022) reveal a surprising finding: historically, LLMs were undertrained, with far too little training data for their parameter count.

The optimal rule: model size and training data should be scaled in equal proportion as the compute budget grows. Specifically, for a given compute budget C (in FLOPs), one should choose:

  • Parameters N ≈ sqrt(C / 120) – model size grows with the square root of the compute budget
  • Tokens D ≈ 20N – about 20 training tokens per parameter (the "20x ratio")

Optimal ratio: D ≈ 20N
Compute: C ≈ 6ND (approximately)
→ N = sqrt(C / 120), D = 20N = 20 × sqrt(C / 120)
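As a quick sanity check, these two rules can be turned into a few lines of code (a minimal sketch; the function name chinchilla_optimal is ours, and the constants follow directly from C ≈ 6ND and D ≈ 20N):

```python
import math

def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    """Approximate compute-optimal allocation from C ≈ 6*N*D and D ≈ 20*N."""
    n_params = math.sqrt(compute_flops / 120)   # C = 6 * N * (20 * N) = 120 * N^2
    n_tokens = 20 * n_params                    # the "20x ratio"
    return n_params, n_tokens

# Example: a 10^23 FLOP budget
n, d = chinchilla_optimal(1e23)
print(f"N ≈ {n / 1e9:.0f}B parameters, D ≈ {d / 1e9:.0f}B tokens")
# -> N ≈ 29B parameters, D ≈ 577B tokens
```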

The key insight: for the same compute budget, a 10x smaller model trained on 10x more data (so that N × D, and hence C ≈ 6ND, stays constant) achieves better quality than a 10x larger model trained on less data.
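This claim can be made concrete with the parametric loss fit used in the Chinchilla paper, L(N, D) = E + A/N^α + B/D^β. The sketch below uses the approximate published constants (E ≈ 1.69, A ≈ 406.4, B ≈ 410.7, α ≈ 0.34, β ≈ 0.28); treat the exact numbers as illustrative, not authoritative:

```python
def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Parametric loss fit L(N, D) = E + A/N^alpha + B/D^beta.
    Constants are the approximate published fits (illustrative only)."""
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
    return E + A / n_params**alpha + B / n_tokens**beta

# Two allocations of the same compute budget C ≈ 6*N*D ≈ 1e23 FLOPs
small_n, small_d = 28.9e9, 577e9            # compute-optimal, 1:20 ratio
big_n, big_d = 10 * small_n, small_d / 10   # 10x larger model, 10x less data

print(chinchilla_loss(small_n, small_d))    # ≈ 2.01
print(chinchilla_loss(big_n, big_d))        # ≈ 2.14, worse at the same compute
```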

Fig. 1 | Chinchilla scaling laws: iso-loss curves in log-log space. Green diagonal lines mark points of equal compute (iso-FLOP lines). The red point marks the optimum for the current compute budget, where the parameter-to-token ratio is about 1:20.

Scaling Scenarios Compared

For the same compute budget, there are different ways to allocate it between parameters and data:

Fig. 2 | Four scenarios: (1) Chinchilla-Optimal = the best ratio of 1:20. (2) Parameter-Heavy = too many parameters for the data. (3) Data-Heavy = too much data for the model size. (4) Historical (GPT-3 style) = severely undertrained, only about 2 tokens per parameter.

Optimal vs. Suboptimal Allocation

  • Chinchilla (Optimal): N:D = 1:20 ✅
  • GPT-3 Style: N:D ≈ 1:1.7 (175B parameters, 300B tokens) ❌
  • Parameter-Heavy: N:D = 1:2 ❌
  • Data-Heavy: N:D = 1:200 ⚠️
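To see what these ratios mean in absolute numbers, fix a compute budget and solve C ≈ 6ND together with a target ratio r = D/N (a small sketch; the helper name allocation is ours):

```python
import math

def allocation(compute_flops: float, tokens_per_param: float) -> tuple[float, float]:
    """Given C ≈ 6*N*D and a target ratio r = D/N, solve for N and D."""
    n_params = math.sqrt(compute_flops / (6 * tokens_per_param))
    return n_params, tokens_per_param * n_params

C = 1e23  # fixed compute budget in FLOPs
for name, ratio in [("Chinchilla-optimal (1:20)", 20),
                    ("Parameter-heavy / GPT-3 style (~1:2)", 2),
                    ("Data-heavy (1:200)", 200)]:
    n, d = allocation(C, ratio)
    print(f"{name}: N ≈ {n / 1e9:.1f}B params, D ≈ {d / 1e9:.0f}B tokens")
# Chinchilla-optimal (1:20): N ≈ 28.9B params, D ≈ 577B tokens
# Parameter-heavy / GPT-3 style (~1:2): N ≈ 91.3B params, D ≈ 183B tokens
# Data-heavy (1:200): N ≈ 9.1B params, D ≈ 1826B tokens
```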

Practical Implications

Chinchilla vs. GPT-3 of the same size:
A 70B Chinchilla-style model trained on 1.4T tokens (20 tokens per parameter) significantly outperforms a 70B GPT-3-style model trained on only 300B tokens (about 4 tokens per parameter). The reason is simple: at the same size, the Chinchilla-style model sees far more data and therefore learns more.
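The gap behind this comparison is easy to quantify (a back-of-the-envelope sketch; note that at a fixed parameter count, more tokens also means more training compute):

```python
def tokens_per_param(n_params: float, n_tokens: float) -> float:
    return n_tokens / n_params

def train_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens  # C ≈ 6*N*D

# 70B model with a Chinchilla-style vs. a GPT-3-style data budget
print(tokens_per_param(70e9, 1.4e12), train_flops(70e9, 1.4e12))  # 20.0, ≈ 5.9e23
print(tokens_per_param(70e9, 300e9), train_flops(70e9, 300e9))    # ≈ 4.3, ≈ 1.3e23
```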

Scaling curve:
Test error decreases as a power law: Error ∝ (N × D)^(-α) with α ≈ 0.07. Since C ≈ 6ND, this is equivalently a power law in compute: each doubling of compute multiplies the error by the same constant factor, 2^(-α) ≈ 0.95, i.e. roughly a 5% reduction per doubling.
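A quick numerical check of the "constant factor per doubling" claim, using the α ≈ 0.07 quoted above (illustrative):

```python
alpha = 0.07
per_doubling = 2 ** (-alpha)   # error multiplier for each doubling of compute
print(per_doubling)            # ≈ 0.95 -> about 5% lower error per doubling
print(per_doubling ** 10)      # 10 doublings (1024x compute) -> ≈ 0.62x the error
```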

Modern models:
The Llama and DeepSeek model families were sized with these scaling laws in mind. In practice they often train well beyond the Chinchilla ratio (Llama 3 70B, for example, uses far more than 20 tokens per parameter), because additional data keeps improving quality at a fixed model size while keeping inference cheap.

Fig. 3 | Validation loss over training steps: the Chinchilla-optimal allocation (1:20) converges faster and to a lower final loss than the undertrained GPT-3-style and Parameter-Heavy allocations (≈1:2).

Key Insights

🎯 The 20x Ratio

Optimal is about 20 training tokens per parameter – and this ratio stays roughly constant across compute budgets, more universal than earlier scaling laws suggested.

⚖️ Equal Scaling

Parameters and training tokens should be scaled in equal proportion: as the compute budget grows, both N and D grow roughly as √C.

🔄 Undertraining is Costly

Too many parameters with too little data leads to worse final quality despite similar compute.

📊 Power-Law Scaling

Error decreases as a power law with compute, making quality predictable across orders of magnitude.

💰 Practical Impact

Smaller, well-trained models beat large, undertrained models. Compute is expensive.

🧠 Emergent Capabilities

Larger models learn new capabilities, but sufficient training data is needed for those capabilities to fully emerge.