Three approaches to improving model performance by spending additional compute at inference time: Parallel, Sequential, and Internal (o1/o3-style).
Test-Time Strategies: three ways to invest more compute: Parallel (Best-of-N), Sequential (Refinement), and Internal (Hidden CoT). Each strategy has different strengths depending on the task.
Practical strategies for test-time compute, from concept to implementation.
Best-of-N is the easiest to implement, while Internal (o1-style) is the most efficient. The choice of strategy can mean a 2-3× cost difference at the same quality.
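The Parallel strategy can be sketched in a few lines: draw N independent samples and keep the majority answer (self-consistency voting). The `generate` callable and the `noisy_model` stub below are hypothetical stand-ins for a real model API, used only to make the sketch runnable.

```python
import random
from collections import Counter

def best_of_n(generate, prompt, n=5, seed=0):
    """Best-of-N: draw n independent samples, return the majority answer.

    `generate` is a hypothetical sampling function (prompt, rng) -> answer;
    in practice this would be n parallel calls to a model with temperature > 0.
    """
    rng = random.Random(seed)
    answers = [generate(prompt, rng) for _ in range(n)]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / n  # majority answer and its vote share

# Stub "model": answers correctly 70% of the time, otherwise a random digit.
def noisy_model(prompt, rng):
    return "42" if rng.random() < 0.7 else str(rng.randint(0, 9))

ans, share = best_of_n(noisy_model, "What is 6*7?", n=25)
```

Because the N samples are independent, this is trivially parallelizable, which is why the latency column below stays low even though total compute scales with N.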
| Criterion | Parallel (Best-of-N) | Sequential (Iterative) | Internal (o1/o3) |
|---|---|---|---|
| Latency (ms) | 200-500 | 800-1500 | 2000-5000 |
| Throughput (req/s) | 2-5 | 0.7-1.5 | 0.2-0.5 |
| Accuracy (Math) | 65-75% | 78-88% | 85-94% |
| Memory Required | N × Base Model (High) | 1.2 × Base Model (Moderate) | 1.1 × Base Model (Low) |
| Implementation | Simple | Moderate | Complex |
| Optimal For | Ensemble + Voting | Step-by-Step Refinement | Complex Reasoning |
| Example Models | Llama 2, Mistral, Claude | Llama 3.1, GPT-4 | o1, o3, DeepSeek R1 |
| Parallelizable? | Yes, fully | Partially (Steps) | Yes (Ensemble) |
| Cost Efficiency | Good for Latency-SLA | Good for Balance | Best for Quality |
| Fallback on Error | Other Outputs | Restart with different prompts | Intrinsic Self-Correction |
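The Sequential column's critique-then-revise loop can be sketched generically: keep revising the current draft until a critic finds no remaining issue or a round budget runs out. The `critique` and `revise` helpers below are hypothetical; the toy instance refines an approximation of sqrt(2) with Newton steps purely to make the control flow runnable.

```python
def sequential_refine(draft, critique, revise, max_rounds=4):
    """Sequential refinement: repeatedly critique the current draft and
    revise it, stopping early once the critic is satisfied.

    `critique` and `revise` are hypothetical helpers; with an LLM they would
    be a self-critique prompt and a revision prompt, respectively.
    """
    for _ in range(max_rounds):
        issue = critique(draft)
        if issue is None:       # critic found nothing to fix -> stop early
            break
        draft = revise(draft, issue)
    return draft

# Toy instance: "refine" an estimate of sqrt(2) until the error is tiny.
def critique(x):
    err = x * x - 2.0
    return err if abs(err) > 1e-6 else None  # None signals "good enough"

def revise(x, err):
    return x - err / (2 * x)  # one Newton step toward sqrt(2)

ans = sequential_refine(1.0, critique, revise)
```

Note the latency cost visible in the table: each round depends on the previous draft, so the loop runs serially, trading throughput for per-answer quality.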