Test-Time Compute as Third Scaling Axis

Traditionally, we scale models along two dimensions: Model Size (more parameters) and Data (more training tokens).

But there's a third dimension that's often overlooked: Test-Time Compute - how much computing power we invest during inference (after the model is finished training).

This can happen through various strategies: sampling several candidate answers and picking the best one (Best-of-N), iterative self-refinement (generate, check, revise), or extended internal reasoning before answering (o1-style).

Surprising Finding (Snell et al., 2024): With optimal test-time allocation, a 7B model can achieve better results than a 70B model without this optimization!


How Does Test-Time Scaling Work?

At constant model size, we get better results when we invest more time/compute during inference:

🔄 Parallel (Best-of-N): Generate N complete answers in parallel and choose the best one.
✓ Easy to implement
✗ Compute cost grows linearly with N

📝 Sequential (Iterative): Generate an answer → check it → refine, repeating until it is good enough.
✓ Better quality
✗ High latency

🧠 Internal (o1-style): The model reasons internally before answering.
✓ Hidden from the user, good quality
✗ Expensive, less controllable
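The parallel strategy is simple enough to sketch in a few lines. In this minimal sketch, generate() and score() are hypothetical stand-ins for a model call and a verifier; a real system would plug in an inference API and a reward model or task-specific checker.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def generate(prompt: str, seed: int) -> str:
    """Hypothetical stand-in for one model call returning a candidate answer."""
    rng = random.Random(seed)
    return f"{prompt}: candidate answer #{rng.randint(0, 99)}"

def score(answer: str) -> float:
    """Hypothetical stand-in for a verifier/reward model rating an answer."""
    return sum(answer.encode()) % 100 / 100.0

def best_of_n(prompt: str, n: int = 8) -> str:
    # Sample N candidates in parallel: total compute grows linearly with N,
    # but wall-clock time stays close to that of a single call.
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(lambda s: generate(prompt, s), range(n)))
    # Return the candidate the verifier rates highest.
    return max(candidates, key=score)
```

Most of the quality gain in Best-of-N comes from the verifier: swapping the toy score() for unit tests, a reward model, or a domain checker is what makes the extra samples pay off.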

The Scaling Law Formula

Accuracy as a function of Test-Time Compute approximately follows a Power Law:

Accuracy(c) = a − b × c^(−α)

where:
c = test-time compute (e.g., FLOPs or seconds of thinking time)
α ≈ 0.5–0.8 (depends on model and task)
a, b = constants fitted per model and task

Example: with α = 0.6, accuracy at 10× compute is ~1.5% higher.
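A quick numerical check of this power law; the constants a = 0.9 and b = 0.3 and the starting point c = 100 are illustrative choices, not fitted values:

```python
def accuracy(c: float, a: float = 0.9, b: float = 0.3, alpha: float = 0.6) -> float:
    """Power-law fit: Accuracy(c) = a - b * c**(-alpha)."""
    return a - b * c ** (-alpha)

# Gain from a 10x compute increase, starting from c = 100:
gain = accuracy(1000) - accuracy(100)
print(f"{gain:.3f}")  # 0.014, i.e. in the ballpark of the ~1.5% example
```

Note the diminishing returns built into the power law: the same 10× increase starting from a smaller c yields a much larger gain, which is why where you start on the curve matters.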

When Is Test-Time Scaling Worth It?

Test-Time Scaling is particularly valuable when:
✓ Quality matters more than latency (offline batch processing, critical tasks)
✓ A verifier can reliably tell good answers from bad ones

It is not sensible when:
✗ Latency is critical (real-time chat applications)
✗ A single pass already delivers sufficient quality

🔑 Key Insight: With optimal test-time allocation, a smaller model can match the quality of a model roughly 10× its size, with no additional training required. The costs simply shift from training to inference.
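To make this cost shift concrete, here is a back-of-envelope comparison using the common rule of thumb that a forward pass costs roughly 2 × parameters FLOPs per token (the token count is an illustrative assumption):

```python
def inference_flops(params_billion: float, tokens: int, samples: int = 1) -> float:
    # Rule of thumb: one forward pass costs ~2 * parameters FLOPs per token.
    return 2 * params_billion * 1e9 * tokens * samples

small = inference_flops(7, tokens=1000, samples=10)   # 7B model, Best-of-10
large = inference_flops(70, tokens=1000)              # 70B model, one answer
print(small == large)  # True: the same per-query inference budget
```

In other words, a 7B model answering with Best-of-10 spends the same per-query inference compute as a single pass of a 70B model; the question test-time scaling asks is which of the two spends it better.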

Practical Implications

| Scenario | Best Strategy | Cost Comparison |
|---|---|---|
| Real-time application (chat) | Small model, no Test-Time Scaling | Low, but quality limited |
| Offline batch processing | Best-of-N or Sequential | Moderate, but high quality |
| Critical tasks (medicine, law) | Large model + verification | High, but necessary |
| Research/development | o1-style (internal) | Higher, but best quality |

Future Developments

Question: Where are the limits of Test-Time Scaling?

The power law might eventually reach a saturation point beyond which more compute no longer helps. Where that point lies is still unclear.

Prediction: The next breakthrough in AI won't come from even larger models, but from smarter test-time algorithms that sensibly decide when and how long a model should think.