The Chinchilla Recipe for Scaling

The Chinchilla scaling laws (DeepMind, 2022) reveal a surprising finding: historically, LLMs were undertrained, with far too little training data for their parameter count.

The optimal rule: model size and training data should be scaled in equal proportion as the compute budget grows. Specifically, for a given compute budget C (in FLOPs), one should choose:

  • Parameters N ≈ sqrt(C / 120) – model size grows with the square root of the compute budget
  • Tokens D ≈ 20N – about 20 training tokens per parameter (the "20x ratio")

Optimal ratio: D ≈ 20N
Compute: C ≈ 6ND (approximately)
→ N = sqrt(C / 120), D = 20N = 20 × sqrt(C / 120)
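As a quick sanity check, these two rules can be turned into a few lines of code (a minimal sketch; the function name chinchilla_optimal is ours, and the constants follow directly from C ≈ 6ND and D ≈ 20N):

```python
import math

def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    """Approximate compute-optimal allocation from C ≈ 6*N*D and D ≈ 20*N."""
    n_params = math.sqrt(compute_flops / 120)   # C = 6 * N * (20 * N) = 120 * N^2
    n_tokens = 20 * n_params                    # the "20x ratio"
    return n_params, n_tokens

# Example: a 10^23 FLOP budget
n, d = chinchilla_optimal(1e23)
print(f"N ≈ {n / 1e9:.0f}B parameters, D ≈ {d / 1e9:.0f}B tokens")
# -> N ≈ 29B parameters, D ≈ 577B tokens
```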

The key insight: for the same compute budget, a 10x smaller model trained on 10x more data (so that N × D, and hence C ≈ 6ND, stays constant) achieves better quality than a 10x larger model trained on less data.
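This claim can be made concrete with the parametric loss fit used in the Chinchilla paper, L(N, D) = E + A/N^α + B/D^β. The sketch below uses the approximate published constants (E ≈ 1.69, A ≈ 406.4, B ≈ 410.7, α ≈ 0.34, β ≈ 0.28); treat the exact numbers as illustrative, not authoritative:

```python
def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Parametric loss fit L(N, D) = E + A/N^alpha + B/D^beta.
    Constants are the approximate published fits (illustrative only)."""
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
    return E + A / n_params**alpha + B / n_tokens**beta

# Two allocations of the same compute budget C ≈ 6*N*D ≈ 1e23 FLOPs
small_n, small_d = 28.9e9, 577e9            # compute-optimal, 1:20 ratio
big_n, big_d = 10 * small_n, small_d / 10   # 10x larger model, 10x less data

print(chinchilla_loss(small_n, small_d))    # ≈ 2.01
print(chinchilla_loss(big_n, big_d))        # ≈ 2.14, worse at the same compute
```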

Fig. 1 | Chinchilla scaling laws: iso-loss curves in log-log space. Green diagonal lines mark points of equal compute (iso-FLOP lines). The red point marks the optimum for the current compute budget, where the parameter-to-token ratio is about 1:20.

Scaling Scenarios Compared

For the same compute budget, there are different ways to allocate it between parameters and data:

Fig. 2 | Four scenarios: (1) Chinchilla-Optimal = the best ratio of 1:20. (2) Parameter-Heavy = too many parameters for the data. (3) Data-Heavy = too much data for the model size. (4) Historical (GPT-3 style) = severely undertrained, only about 2 tokens per parameter.

Optimal vs. Suboptimal Allocation

  • Chinchilla (Optimal): N:D = 1:20 ✅
  • GPT-3 Style: N:D ≈ 1:1.7 (175B parameters, 300B tokens) ❌
  • Parameter-Heavy: N:D = 1:2 ❌
  • Data-Heavy: N:D = 1:200 ⚠️
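To see what these ratios mean in absolute numbers, fix a compute budget and solve C ≈ 6ND together with a target ratio r = D/N (a small sketch; the helper name allocation is ours):

```python
import math

def allocation(compute_flops: float, tokens_per_param: float) -> tuple[float, float]:
    """Given C ≈ 6*N*D and a target ratio r = D/N, solve for N and D."""
    n_params = math.sqrt(compute_flops / (6 * tokens_per_param))
    return n_params, tokens_per_param * n_params

C = 1e23  # fixed compute budget in FLOPs
for name, ratio in [("Chinchilla-optimal (1:20)", 20),
                    ("Parameter-heavy / GPT-3 style (~1:2)", 2),
                    ("Data-heavy (1:200)", 200)]:
    n, d = allocation(C, ratio)
    print(f"{name}: N ≈ {n / 1e9:.1f}B params, D ≈ {d / 1e9:.0f}B tokens")
# Chinchilla-optimal (1:20): N ≈ 28.9B params, D ≈ 577B tokens
# Parameter-heavy / GPT-3 style (~1:2): N ≈ 91.3B params, D ≈ 183B tokens
# Data-heavy (1:200): N ≈ 9.1B params, D ≈ 1826B tokens
```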

Practical Implications

Chinchilla vs. GPT-3 of the same size:
A 70B Chinchilla-style model trained on 1.4T tokens (20 tokens per parameter) significantly outperforms a 70B GPT-3-style model trained on only 300B tokens (about 4 tokens per parameter). The reason is simple: at the same size, the Chinchilla-style model sees far more data and therefore learns more.
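The gap behind this comparison is easy to quantify (a back-of-the-envelope sketch; note that at a fixed parameter count, more tokens also means more training compute):

```python
def tokens_per_param(n_params: float, n_tokens: float) -> float:
    return n_tokens / n_params

def train_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens  # C ≈ 6*N*D

# 70B model with a Chinchilla-style vs. a GPT-3-style data budget
print(tokens_per_param(70e9, 1.4e12), train_flops(70e9, 1.4e12))  # 20.0, ≈ 5.9e23
print(tokens_per_param(70e9, 300e9), train_flops(70e9, 300e9))    # ≈ 4.3, ≈ 1.3e23
```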

Scaling curve:
Test error decreases as a power law: Error ∝ (N × D)^(-α) with α ≈ 0.07. Since C ≈ 6ND, this is equivalently a power law in compute: each doubling of compute multiplies the error by the same constant factor, 2^(-α) ≈ 0.95, i.e. roughly a 5% reduction per doubling.
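A quick numerical check of the "constant factor per doubling" claim, using the α ≈ 0.07 quoted above (illustrative):

```python
alpha = 0.07
per_doubling = 2 ** (-alpha)   # error multiplier for each doubling of compute
print(per_doubling)            # ≈ 0.95 -> about 5% lower error per doubling
print(per_doubling ** 10)      # 10 doublings (1024x compute) -> ≈ 0.62x the error
```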

Modern models:
The Llama and DeepSeek model families were sized with these scaling laws in mind. In practice they often train well beyond the Chinchilla ratio (Llama 3 70B, for example, uses far more than 20 tokens per parameter), because additional data keeps improving quality at a fixed model size while keeping inference cheap.

Fig. 3 | Validation loss over training steps: the Chinchilla-optimal allocation (1:20) converges faster and to a lower final loss than the undertrained GPT-3-style and Parameter-Heavy allocations (≈1:2).

Key Insights

🎯 The 20x Ratio

Optimal is about 20 training tokens per parameter – and this ratio stays roughly constant across compute budgets, more universal than earlier scaling laws suggested.

⚖️ Equal Scaling

Parameters and training tokens should be scaled in equal proportion: as the compute budget grows, both N and D grow roughly as √C.

🔄 Undertraining is Costly

Too many parameters with too little data leads to worse final quality despite similar compute.

📊 Power-Law Scaling

Error decreases as a power law with compute, making quality predictable across orders of magnitude.

💰 Practical Impact

Smaller, well-trained models beat large, undertrained models. Compute is expensive.

🧠 Emergent Capabilities

Larger models learn new capabilities, but sufficient training data is needed for those capabilities to fully emerge.