How language models achieve better results through longer thinking and more compute
Test-Time Scaling: a new scaling axis. Instead of adding parameters, we simply invest more compute during inference. The curves show that quality scales roughly log-linearly with the number of thinking tokens.
Following Hidden Reasoning: the quantitative perspective on the thinking budget.
Test-Time Compute is cheaper than Pre-Training! A 70B model with 10× more inference compute can beat a 400B model. This changes the economics of AI.
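As a rough back-of-envelope illustration (not a measurement), here is a Python sketch using the common approximations of about 2N FLOPs per generated token for inference and about 6ND FLOPs for pre-training; the token counts are assumptions chosen only to make the comparison concrete.

```python
# Back-of-envelope FLOPs comparison; all token counts are illustrative assumptions.
# Common approximations: inference ~ 2 * N FLOPs per token, pre-training ~ 6 * N * D FLOPs.

PARAMS_SMALL = 70e9        # 70B-parameter model
PARAMS_LARGE = 400e9       # 400B-parameter model
TOKENS_PER_ANSWER = 1_000  # assumed output length of a single answer
PRETRAIN_TOKENS = 15e12    # assumed pre-training corpus size in tokens

# Per-query inference cost: the 70B model gets 10x more thinking tokens.
flops_small = 2 * PARAMS_SMALL * TOKENS_PER_ANSWER * 10
flops_large = 2 * PARAMS_LARGE * TOKENS_PER_ANSWER

# One-off pre-training cost of the larger model.
pretrain_large = 6 * PARAMS_LARGE * PRETRAIN_TOKENS

print(f"70B + 10x thinking: {flops_small:.1e} FLOPs per query")
print(f"400B, single pass:  {flops_large:.1e} FLOPs per query")
print(f"400B pre-training:  {pretrain_large:.1e} FLOPs (one-off)")
```

The one-off pre-training run dominates by roughly ten orders of magnitude; how the per-query totals compare depends on how many thinking tokens are actually spent, so the break-even point shifts with query volume.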
Traditionally, we scale models along two dimensions: Model Size (more parameters) and Data (more training tokens).
But there's a third dimension that's often overlooked: Test-Time Compute - how much computing power we invest during inference (after the model is finished training).
This can happen through various strategies: sampling several candidate answers and keeping the best one (Best-of-N), iteratively revising an answer (sequential refinement), or longer internal reasoning chains (Chain-of-Thought, o1-style).
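As a minimal sketch of the Best-of-N idea: `generate` (one sampled model call) and `score` (a verifier or reward model) are hypothetical helpers, not a particular library's API.

```python
from typing import Callable

def best_of_n(
    prompt: str,
    generate: Callable[[str], str],      # samples one candidate answer (assumed helper)
    score: Callable[[str, str], float],  # verifier / reward-model score (assumed helper)
    n: int = 16,
) -> str:
    """Spend more inference compute by sampling n candidates and keeping the best one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda answer: score(prompt, answer))
```

The scoring step is the hard part in practice; self-consistency (majority voting over final answers) is a common verifier-free variant.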
Surprising Finding (Snell et al., 2024): With optimal test-time allocation, a 7B model can achieve better results than a 70B model without this optimization!
[Interactive figure: the optimal model size as a function of the compute budget.]
At constant model size, we get better results the more time and compute we invest during inference.
Accuracy as a function of Test-Time Compute approximately follows a Power Law:
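One plausible functional form for such a saturating power law is the following; the parametrization is an assumption for illustration, not taken from a specific paper. Over the compute ranges typically plotted, it is hard to distinguish from the log-linear scaling mentioned above.

$$
\mathrm{Acc}(C) \;\approx\; \mathrm{Acc}_{\max} - b\,C^{-\alpha}, \qquad \alpha, b > 0,
$$

where $C$ is the test-time compute (e.g. the number of thinking tokens) and $\mathrm{Acc}_{\max}$ is the accuracy ceiling the model saturates toward.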
Test-Time Scaling is particularly valuable when latency is not critical and answer quality or correctness matters most; it is less worthwhile when responses must arrive in real time. The table below summarizes typical scenarios:
| Scenario | Best Strategy | Cost vs. Quality |
|---|---|---|
| Real-time application (Chat) | Small models, no Test-Time Scaling | Low, but quality limited |
| Offline Batch Processing | Best-of-N or Sequential refinement (sketched below) | Moderate, but high quality |
| Critical tasks (Medicine, Law) | Large model + Verification | High, but necessary |
| Research/Development | o1-style (internal) | Higher, but best quality |
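For the Sequential refinement entry in the table, a minimal sketch of iterative self-revision; `generate`, `critique`, and `revise` are hypothetical stand-ins for model calls, not a particular library's API.

```python
from typing import Callable

def sequential_refine(
    prompt: str,
    generate: Callable[[str], str],          # first draft (assumed model call)
    critique: Callable[[str, str], str],     # feedback on the current draft (assumed)
    revise: Callable[[str, str, str], str],  # new draft from prompt, draft, feedback (assumed)
    steps: int = 3,
) -> str:
    """Spend inference compute sequentially: each step revises the previous draft."""
    draft = generate(prompt)
    for _ in range(steps):
        feedback = critique(prompt, draft)
        draft = revise(prompt, draft, feedback)
    return draft
```

Unlike Best-of-N, the extra compute here is spent in sequence, so each step can correct mistakes identified in the previous draft.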
Question: Where are the limits of Test-Time Scaling?
The Power Law might eventually reach a saturation point - where more compute no longer helps. Where this point lies is still unclear.
Prediction: The next breakthrough in AI won't come from even larger models, but from smarter test-time algorithms that sensibly decide when and how long a model should think.