Fig. 1 | Accuracy by test-time compute strategy on a mathematical benchmark (AIME 2024), comparing Parallel (Best-of-N, Majority Vote), Sequential (Iterative Refinement), and Internal (o1/o3 hidden thinking). Internal (o3) shows the best accuracy but also the highest latency; Parallel is fastest, and Sequential offers a balance.
Fig. 2 | Latency vs. accuracy Pareto frontier. Internal (slow, best quality) dominates on accuracy but sacrifices latency; Parallel (fast, good quality) maximizes throughput; Sequential (medium latency, very good quality) balances the two.
| Criterion | Parallel (Best-of-N) | Sequential (Iterative) | Internal (o1/o3) |
| --- | --- | --- | --- |
| Latency (ms) | 200-500 | 800-1500 | 2000-5000 |
| Throughput (req/s) | 2-5 | 0.7-1.5 | 0.2-0.5 |
| Accuracy (math) | 65-75% | 78-88% | 85-94% |
| Memory required | N × base model (high) | 1.2 × base model (moderate) | 1.1 × base model (low) |
| Implementation | Simple | Moderate | Complex |
| Optimal for | Ensemble + voting | Step-by-step refinement | Complex reasoning |
| Example models | Llama 2, Mistral, Claude | Llama 3.1, GPT-4 | o1, o3, DeepSeek R1 |
| Parallelizable? | Yes, fully | Partially (per step) | Yes (as ensemble) |
| Cost efficiency | Good under latency SLAs | Good balance | Best for quality |
| Fallback on error | Other sampled outputs | Restart with different prompts | Intrinsic self-correction |

The 3 Strategies in Detail

🔀 Parallel: Best-of-N & Majority Voting
Generate N independent outputs simultaneously, then select one: either Top-1 (the sample with the highest log-likelihood) or Majority Voting (pick the answer that most samples agree on).

Formula: y* = argmax_{y ∈ {y_1, …, y_N}} P(y | x)
Advantage: Perfectly parallelizable (N GPUs), easy to implement, fast.
Disadvantage: Needs N× the memory, no intrinsic self-correction.
When to use: Large batches, spare GPU capacity, latency-sensitive workloads.
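A minimal sketch of Best-of-N with majority voting, assuming a hypothetical `generate(prompt)` helper that returns one sampled answer string (any LLM sampling API could stand in here):

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def generate(prompt: str) -> str:
    """Hypothetical stand-in for one sampled LLM completion."""
    raise NotImplementedError("plug in your model/API call here")

def best_of_n(prompt: str, n: int = 5) -> str:
    # Sample N candidates concurrently -- the strategy is embarrassingly parallel.
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(generate, [prompt] * n))
    # Majority vote: return the most frequent answer among the samples.
    answer, _count = Counter(candidates).most_common(1)[0]
    return answer
```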
🔄 Sequential: Iterative Refinement & CoT
Generate the output iteratively. First pass: Chain-of-Thought reasoning. Second pass: self-critique (feedback on the reasoning). Third pass: final answer.

Formula: y_1 → critique(y_1) → y_2 → … → y_final
Advantage: Better reasoning quality, less memory than Parallel, errors are often self-corrected.
Disadvantage: Slower (iterative); needs multiple sequential forward passes.
When to use: Medium-complexity tasks, or when a balance between speed and quality is desired.
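A minimal sketch of the refinement loop, again assuming a hypothetical `generate` helper rather than any real library API:

```python
def generate(prompt: str) -> str:
    """Hypothetical: one LLM completion for the prompt."""
    raise NotImplementedError("plug in your model/API call here")

def refine(question: str, max_rounds: int = 3) -> str:
    # First pass: chain-of-thought draft.
    answer = generate(f"Think step by step, then answer:\n{question}")
    for _ in range(max_rounds):
        # Self-critique pass: ask the model to review its own reasoning.
        critique = generate(
            f"Question: {question}\nDraft answer: {answer}\n"
            "List any errors in the reasoning, or reply OK."
        )
        if critique.strip() == "OK":
            break  # model found no errors; stop early
        # Revision pass: fold the critique back into a new answer.
        answer = generate(
            f"Question: {question}\nDraft: {answer}\nCritique: {critique}\n"
            "Write a corrected final answer."
        )
    return answer
```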
🧠 Internal: o1/o3 Hidden Thinking
The model generates internal "thinking tokens" (hidden from the user) before producing the final output, and is trained with RL on verifiable rewards.

Formula: hidden_thoughts = model(x, internal=True); y = model(x, hidden_thoughts)
Advantage: Best quality, intrinsic self-correction, user doesn't see failed attempts.
Disadvantage: Mostly proprietary (o1/o3; DeepSeek R1 is an open exception), expensive, requires specialized RL training.
When to use: Very complex tasks (math, code) where quality matters more than speed.
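A conceptual sketch of the hidden-thinking flow (not OpenAI's actual implementation; `generate` is again a hypothetical helper). In o1/o3 this behavior is learned via RL rather than prompted; the prompt here only mimics the two-stage effect:

```python
def generate(prompt: str) -> str:
    """Hypothetical: one LLM completion."""
    raise NotImplementedError("plug in your model/API call here")

def answer_with_hidden_thinking(question: str) -> str:
    # Stage 1: produce a long internal reasoning trace.
    hidden_thoughts = generate(
        f"Reason privately and at length about:\n{question}"
    )
    # Stage 2: condition the final answer on the hidden trace.
    final = generate(
        f"Question: {question}\nPrivate notes: {hidden_thoughts}\n"
        "Give only the final answer."
    )
    # Only the final answer is surfaced; failed attempts stay hidden.
    return final
```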
Test-Time Compute ≈ a 14× Larger Model
Snell et al. (2024): compute-optimal allocation of test-time compute can let a small model match or beat one ~14× larger. More thinking time beats more parameters.
📊 Parallel is fast, Sequential is smarter
Parallel: 200-500 ms latency but needs up to N GPUs. Sequential: 800-1500 ms but only ~1.2× the memory. The choice depends on your infrastructure.
🎯 Internal (o1/o3) dominates quality
o3 reaches 88.9% on AIME 2025 (o1: 92.3%; standard models: <5%). But 2-5 s of latency makes it impractical for many applications.
🔄 Self-Critique works empirically
Sequential with self-critique reaches 80-88% accuracy. A sufficiently large model can often recognize and correct its own errors.
💡 Majority Voting needs ~5-10 samples
At N=5: ~10% accuracy boost; at N=10: ~13%. Gains plateau after ~15 samples (sampling stochasticity limits further improvement). The coefficient of variation (CoV) across outputs signals uncertainty, as in the sketch below.
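A small sketch of the CoV idea for numeric answers (e.g. math problems): dispersion across the N samples, relative to their mean, serves as an uncertainty signal. The helper name and threshold usage are illustrative, not from the source:

```python
import statistics

def sample_uncertainty(answers: list[float]) -> float:
    """Coefficient of variation (std / mean) over numeric answers
    from N parallel samples; higher CoV = less agreement = less trust."""
    mean = statistics.mean(answers)
    if mean == 0:
        return float("inf")  # avoid division by zero
    return statistics.stdev(answers) / abs(mean)

# Example: 10 sampled answers to the same math problem.
answers = [42.0, 42.0, 42.0, 41.0, 42.0, 42.0, 43.0, 42.0, 42.0, 42.0]
print(f"CoV = {sample_uncertainty(answers):.3f}")  # low CoV -> high confidence
```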
⚙️ RL training makes Internal possible
The GRPO algorithm uses RL to train the model to decide for itself how long to "think". DeepSeek R1-Zero was trained without SFT: rule-based rewards plus RL alone were enough for complex reasoning to emerge.
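For illustration, the core of GRPO is a group-relative advantage: each sampled response is scored against the mean and standard deviation of rewards within its own group, so no separate value network is needed. A minimal sketch of just that step (not a full trainer):

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage estimate: normalize each sample's reward
    by the mean and std of its own group of rollouts."""
    mean = statistics.mean(rewards)
    std = statistics.stdev(rewards) or 1.0  # guard against zero spread
    return [(r - mean) / std for r in rewards]

# Example: rule-based rewards (1 = verifiably correct, 0 = wrong)
# for a group of 8 sampled solutions to the same problem.
print(group_relative_advantages([1, 0, 0, 1, 1, 0, 0, 0]))
```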