The Chinchilla scaling laws (DeepMind, 2022) reveal a surprising finding: historically, LLMs were undertrained, with far too little training data for their model size.
The compute-optimal rule is to scale model size and training data in equal proportion: doubling the compute budget should grow both by roughly the same factor. Specifically, for a given compute budget C, one should choose (see the sketch after this list):
- Parameters N ≈ D / 20 – the model has about one parameter for every 20 training tokens
- Tokens D ≈ 20N – About 20 training tokens per parameter (the "20x ratio")
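Combining this rule with the common approximation C ≈ 6 · N · D (about six FLOPs per parameter per training token) pins down both quantities for a given budget: N ≈ sqrt(C / 120) and D ≈ 20N. The snippet below is a minimal sketch of that calculation; the function name `chinchilla_optimal` is illustrative, and the example budget of ~5.76e23 FLOPs is the Gopher-scale budget reported in the Chinchilla paper.

```python
import math

def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a compute budget C into (parameters N, training tokens D).

    Assumes C ≈ 6 * N * D and the Chinchilla optimum D ≈ tokens_per_param * N,
    which gives N ≈ sqrt(C / (6 * tokens_per_param)).
    """
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: a budget of ~5.76e23 FLOPs yields roughly 70B parameters and
# 1.4T tokens, matching the published Chinchilla configuration.
n, d = chinchilla_optimal(5.76e23)
print(f"Parameters N ≈ {n:.2e}, Tokens D ≈ {d:.2e}")
```

For comparison, GPT-3 (175B parameters, ~300B tokens) sits far from this ratio at under 2 tokens per parameter, which is exactly the sense in which earlier LLMs were undertrained.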