Fig. 1 | Accuracy by test-time compute strategy on a mathematical benchmark (AIME 2024), comparing Parallel (Best-of-N, Majority Vote), Sequential (Iterative Refinement), and Internal (o1/o3 hidden thinking). Internal (o3) shows the best accuracy but also the highest latency; Parallel is fastest, and Sequential offers a balance.
Fig. 2 | Latency vs. accuracy Pareto frontier. Internal (slow, best quality) dominates on accuracy but sacrifices latency; Parallel (fast, good quality) maximizes throughput; Sequential (medium latency, very good quality) balances the two.
| Criterion | Parallel (Best-of-N) | Sequential (Iterative) | Internal (o1/o3) |
| --- | --- | --- | --- |
| Latency (ms) | 200-500 | 800-1500 | 2000-5000 |
| Throughput (req/s) | 2-5 | 0.7-1.5 | 0.2-0.5 |
| Accuracy (math) | 65-75% | 78-88% | 85-94% |
| Memory required | N × base model (high) | 1.2 × base model (moderate) | 1.1 × base model (low) |
| Implementation | Simple | Moderate | Complex |
| Optimal for | Ensemble + voting | Step-by-step refinement | Complex reasoning |
| Example models | Llama 2, Mistral, Claude | Llama 3.1, GPT-4 | o1, o3, DeepSeek R1 |
| Parallelizable? | Yes, fully | Partially (per step) | Yes (as ensemble) |
| Cost efficiency | Good under latency SLAs | Good balance | Best for quality |
| Fallback on error | Other sampled outputs | Restart with different prompts | Intrinsic self-correction |

The 3 Strategies in Detail

🔀 Parallel: Best-of-N & Majority Voting
Generate N independent outputs simultaneously, then select one: either Top-1 (the sample with the highest log-likelihood) or Majority Voting (pick the answer that most samples agree on).

Formula: y* = argmax_{y ∈ {y_1, …, y_N}} P(y | x)
Advantage: Perfectly parallelizable (N GPUs), easy to implement, fast.
Disadvantage: Needs N× the memory, no intrinsic self-correction.
When to use: Large batches, spare GPU capacity, latency-sensitive workloads.
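A minimal sketch of Best-of-N with majority voting, assuming a hypothetical `generate(prompt)` helper that returns one sampled answer string (any LLM sampling API could stand in here):

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def generate(prompt: str) -> str:
    """Hypothetical stand-in for one sampled LLM completion."""
    raise NotImplementedError("plug in your model/API call here")

def best_of_n(prompt: str, n: int = 5) -> str:
    # Sample N candidates concurrently -- the strategy is embarrassingly parallel.
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(generate, [prompt] * n))
    # Majority vote: return the most frequent answer among the samples.
    answer, _count = Counter(candidates).most_common(1)[0]
    return answer
```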
🔄 Sequential: Iterative Refinement & CoT
Generate the output iteratively. First pass: Chain-of-Thought reasoning. Second pass: self-critique (feedback on the reasoning). Third pass: final answer.

Formula: y_1 → critique(y_1) → y_2 → … → y_final
Advantage: Better reasoning quality, less memory than Parallel, errors are often self-corrected.
Disadvantage: Slower (iterative); needs multiple sequential forward passes.
When to use: Medium-complexity tasks, or when a balance between speed and quality is desired.
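A minimal sketch of the refinement loop, again assuming a hypothetical `generate` helper rather than any real library API:

```python
def generate(prompt: str) -> str:
    """Hypothetical: one LLM completion for the prompt."""
    raise NotImplementedError("plug in your model/API call here")

def refine(question: str, max_rounds: int = 3) -> str:
    # First pass: chain-of-thought draft.
    answer = generate(f"Think step by step, then answer:\n{question}")
    for _ in range(max_rounds):
        # Self-critique pass: ask the model to review its own reasoning.
        critique = generate(
            f"Question: {question}\nDraft answer: {answer}\n"
            "List any errors in the reasoning, or reply OK."
        )
        if critique.strip() == "OK":
            break  # model found no errors; stop early
        # Revision pass: fold the critique back into a new answer.
        answer = generate(
            f"Question: {question}\nDraft: {answer}\nCritique: {critique}\n"
            "Write a corrected final answer."
        )
    return answer
```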
🧠 Internal: o1/o3 Hidden Thinking
The model generates internal "thinking tokens" (hidden from the user) before producing the final output, and is trained with RL on verifiable rewards.

Formula: hidden_thoughts = model(x, internal=True); y = model(x, hidden_thoughts)
Advantage: Best quality, intrinsic self-correction, user doesn't see failed attempts.
Disadvantage: Mostly proprietary (o1/o3; DeepSeek R1 is an open exception), expensive, requires specialized RL training.
When to use: Very complex tasks (math, code) where quality matters more than speed.
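A conceptual sketch of the hidden-thinking flow (not OpenAI's actual implementation; `generate` is again a hypothetical helper). In o1/o3 this behavior is learned via RL rather than prompted; the prompt here only mimics the two-stage effect:

```python
def generate(prompt: str) -> str:
    """Hypothetical: one LLM completion."""
    raise NotImplementedError("plug in your model/API call here")

def answer_with_hidden_thinking(question: str) -> str:
    # Stage 1: produce a long internal reasoning trace.
    hidden_thoughts = generate(
        f"Reason privately and at length about:\n{question}"
    )
    # Stage 2: condition the final answer on the hidden trace.
    final = generate(
        f"Question: {question}\nPrivate notes: {hidden_thoughts}\n"
        "Give only the final answer."
    )
    # Only the final answer is surfaced; failed attempts stay hidden.
    return final
```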
Test-Time Compute ≈ a 14× Larger Model
Snell et al. (2024): compute-optimal allocation of test-time compute can let a small model match or beat one ~14× larger. More thinking time beats more parameters.
📊 Parallel is fast, Sequential is smarter
Parallel: 200-500 ms latency but needs up to N GPUs. Sequential: 800-1500 ms but only ~1.2× the memory. The choice depends on your infrastructure.
🎯 Internal (o1/o3) dominates quality
o3 reaches 88.9% on AIME 2025 (o1: 92.3%; standard models: <5%). But 2-5 s of latency makes it impractical for many applications.
🔄 Self-Critique works empirically
Sequential with self-critique reaches 80-88% accuracy. A sufficiently large model can often recognize and correct its own errors.
💡 Majority Voting needs ~5-10 samples
At N=5: ~10% accuracy boost; at N=10: ~13%. Gains plateau after ~15 samples (sampling stochasticity limits further improvement). The coefficient of variation (CoV) across outputs signals uncertainty, as in the sketch below.
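A small sketch of the CoV idea for numeric answers (e.g. math problems): dispersion across the N samples, relative to their mean, serves as an uncertainty signal. The helper name and threshold usage are illustrative, not from the source:

```python
import statistics

def sample_uncertainty(answers: list[float]) -> float:
    """Coefficient of variation (std / mean) over numeric answers
    from N parallel samples; higher CoV = less agreement = less trust."""
    mean = statistics.mean(answers)
    if mean == 0:
        return float("inf")  # avoid division by zero
    return statistics.stdev(answers) / abs(mean)

# Example: 10 sampled answers to the same math problem.
answers = [42.0, 42.0, 42.0, 41.0, 42.0, 42.0, 43.0, 42.0, 42.0, 42.0]
print(f"CoV = {sample_uncertainty(answers):.3f}")  # low CoV -> high confidence
```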
⚙️ RL training makes Internal possible
The GRPO algorithm uses RL to train the model to decide for itself how long to "think". DeepSeek R1-Zero was trained without SFT: rule-based rewards plus RL alone were enough for complex reasoning to emerge.
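For illustration, the core of GRPO is a group-relative advantage: each sampled response is scored against the mean and standard deviation of rewards within its own group, so no separate value network is needed. A minimal sketch of just that step (not a full trainer):

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage estimate: normalize each sample's reward
    by the mean and std of its own group of rollouts."""
    mean = statistics.mean(rewards)
    std = statistics.stdev(rewards) or 1.0  # guard against zero spread
    return [(r - mean) / std for r in rewards]

# Example: rule-based rewards (1 = verifiably correct, 0 = wrong)
# for a group of 8 sampled solutions to the same problem.
print(group_relative_advantages([1, 0, 0, 1, 1, 0, 0, 0]))
```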