How language models achieve better results through longer thinking and more compute
Test-Time Scaling: a new scaling axis. Instead of adding parameters, we simply invest more compute during inference. The curves show that quality scales roughly log-linearly with the number of thinking tokens.
Following Hidden Reasoning: the quantitative perspective on the thinking budget.
Test-Time Compute is cheaper than Pre-Training! A 70B model with 10× more inference compute can beat a 400B model. This changes the economics of AI.
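As a rough back-of-envelope illustration (not a measurement), here is a Python sketch using the common approximations of about 2N FLOPs per generated token for inference and about 6ND FLOPs for pre-training; the token counts are assumptions chosen only to make the comparison concrete.

```python
# Back-of-envelope FLOPs comparison; all token counts are illustrative assumptions.
# Common approximations: inference ~ 2 * N FLOPs per token, pre-training ~ 6 * N * D FLOPs.

PARAMS_SMALL = 70e9        # 70B-parameter model
PARAMS_LARGE = 400e9       # 400B-parameter model
TOKENS_PER_ANSWER = 1_000  # assumed output length of a single answer
PRETRAIN_TOKENS = 15e12    # assumed pre-training corpus size in tokens

# Per-query inference cost: the 70B model gets 10x more thinking tokens.
flops_small = 2 * PARAMS_SMALL * TOKENS_PER_ANSWER * 10
flops_large = 2 * PARAMS_LARGE * TOKENS_PER_ANSWER

# One-off pre-training cost of the larger model.
pretrain_large = 6 * PARAMS_LARGE * PRETRAIN_TOKENS

print(f"70B + 10x thinking: {flops_small:.1e} FLOPs per query")
print(f"400B, single pass:  {flops_large:.1e} FLOPs per query")
print(f"400B pre-training:  {pretrain_large:.1e} FLOPs (one-off)")
```

The one-off pre-training run dominates by roughly ten orders of magnitude; how the per-query totals compare depends on how many thinking tokens are actually spent, so the break-even point shifts with query volume.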
Traditionally, we scale models along two dimensions: Model Size (more parameters) and Data (more training tokens).
But there's a third dimension that's often overlooked: Test-Time Compute - how much computing power we invest during inference (after the model is finished training).
This can happen through various strategies: sampling several candidate answers and keeping the best one (Best-of-N), iteratively revising an answer (sequential refinement), or longer internal reasoning chains (Chain-of-Thought, o1-style).
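As a minimal sketch of the Best-of-N idea: `generate` (one sampled model call) and `score` (a verifier or reward model) are hypothetical helpers, not a particular library's API.

```python
from typing import Callable

def best_of_n(
    prompt: str,
    generate: Callable[[str], str],      # samples one candidate answer (assumed helper)
    score: Callable[[str, str], float],  # verifier / reward-model score (assumed helper)
    n: int = 16,
) -> str:
    """Spend more inference compute by sampling n candidates and keeping the best one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda answer: score(prompt, answer))
```

The scoring step is the hard part in practice; self-consistency (majority voting over final answers) is a common verifier-free variant.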
Surprising Finding (Snell et al., 2024): With optimal test-time allocation, a 7B model can achieve better results than a 70B model without this optimization!
[Interactive figure: the optimal model size as a function of the compute budget.]
At constant model size, we get better results the more time and compute we invest during inference.
Accuracy as a function of Test-Time Compute approximately follows a Power Law:
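One plausible functional form for such a saturating power law is the following; the parametrization is an assumption for illustration, not taken from a specific paper. Over the compute ranges typically plotted, it is hard to distinguish from the log-linear scaling mentioned above.

$$
\mathrm{Acc}(C) \;\approx\; \mathrm{Acc}_{\max} - b\,C^{-\alpha}, \qquad \alpha, b > 0,
$$

where $C$ is the test-time compute (e.g. the number of thinking tokens) and $\mathrm{Acc}_{\max}$ is the accuracy ceiling the model saturates toward.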
Test-Time Scaling is particularly valuable when latency is not critical and answer quality or correctness matters most; it is less worthwhile when responses must arrive in real time. The table below summarizes typical scenarios:
| Scenario | Best Strategy | Cost vs. Quality |
|---|---|---|
| Real-time application (Chat) | Small models, no Test-Time Scaling | Low, but quality limited |
| Offline Batch Processing | Best-of-N or Sequential refinement (sketched below) | Moderate, but high quality |
| Critical tasks (Medicine, Law) | Large model + Verification | High, but necessary |
| Research/Development | o1-style (internal) | Higher, but best quality |
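For the Sequential refinement entry in the table, a minimal sketch of iterative self-revision; `generate`, `critique`, and `revise` are hypothetical stand-ins for model calls, not a particular library's API.

```python
from typing import Callable

def sequential_refine(
    prompt: str,
    generate: Callable[[str], str],          # first draft (assumed model call)
    critique: Callable[[str, str], str],     # feedback on the current draft (assumed)
    revise: Callable[[str, str, str], str],  # new draft from prompt, draft, feedback (assumed)
    steps: int = 3,
) -> str:
    """Spend inference compute sequentially: each step revises the previous draft."""
    draft = generate(prompt)
    for _ in range(steps):
        feedback = critique(prompt, draft)
        draft = revise(prompt, draft, feedback)
    return draft
```

Unlike Best-of-N, the extra compute here is spent in sequence, so each step can correct mistakes identified in the previous draft.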
Question: Where are the limits of Test-Time Scaling?
The Power Law might eventually reach a saturation point - where more compute no longer helps. Where this point lies is still unclear.
Prediction: The next breakthrough in AI won't come from even larger models, but from smarter test-time algorithms that sensibly decide when and how long a model should think.