How modern reasoning models use internal thinking processes to solve complex problems, scaling with thinking time instead of model size
Hidden Reasoning in o1/o3: OpenAI's models generate internal "Thinking Tokens" that are not shown to the user. This hidden chain can span hundreds of steps, establishing Test-Time Compute as a new scaling axis.
After CoT basics: How commercial models industrialize reasoning.
o1 achieves 83% on the AIME 2024 math olympiad qualifier (vs. 13% for GPT-4o). The key: Thinking Tokens generated during inference. Compute time becomes the new resource.
OpenAI's o-series (o1, September 2024; o3, April 2025) represents a paradigm shift in LLM development. These models generate an internal Chain-of-Thought that remains hidden from the user: the model "thinks" before answering, and quality improves with more thinking time (a minimal API sketch follows below).
The model corrects errors internally, without the user seeing inconsistent intermediate steps.
The model explores multiple solution paths internally and selects the best one.
The thinking output can be filtered out without safety concerns; only the final answer is shown to the user.
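A minimal sketch of what this looks like from the API side, assuming the OpenAI Python SDK and an o-series chat model; the model name, prompt, and effort level are illustrative placeholders, and the exact field names should be checked against the current API reference:

```python
# Sketch: calling a reasoning model and inspecting hidden thinking tokens.
# Assumes the OpenAI Python SDK (pip install openai) and an o-series model;
# model name, prompt, and effort level are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o3-mini",                      # placeholder reasoning model
    reasoning_effort="high",              # more hidden thinking time
    messages=[{"role": "user", "content": "How many primes are below 100?"}],
)

# Only the final answer is returned as text ...
print(response.choices[0].message.content)

# ... but the hidden chain is still counted (and billed) in the usage stats.
details = response.usage.completion_tokens_details
print("visible output tokens:", response.usage.completion_tokens - details.reasoning_tokens)
print("hidden reasoning tokens:", details.reasoning_tokens)
```

The hidden chain never appears in the response text, but it shows up as part of the completion-token count.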
A key insight from research: optimal allocation of Test-Time Compute can compensate for up to a 14× parameter disadvantage (Snell et al., 2024). Instead of making the model larger, you can spend more inference time and let the model "think harder".
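To make the trade-off concrete, here is a back-of-the-envelope sketch using the common approximation of roughly 2 × parameters FLOPs per generated token; the model sizes and token counts are invented for illustration and are not figures from the cited work:

```python
# Back-of-the-envelope: inference FLOPs ~ 2 * parameters * generated tokens.
# Model sizes and token counts below are illustrative assumptions.

def inference_flops(params: float, tokens: int) -> float:
    """Rough forward-pass cost of generating `tokens` with a `params`-parameter model."""
    return 2 * params * tokens

large = inference_flops(params=70e9, tokens=500)        # big model, short answer
small = inference_flops(params=5e9,  tokens=500 * 14)   # 14x smaller, 14x more thinking tokens

print(f"large model : {large:.2e} FLOPs")
print(f"small model : {small:.2e} FLOPs")
# Both come out to ~7.0e13 FLOPs: the same budget, spent on parameters vs. thinking time.
```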
The o-series shows dramatic improvements on difficult reasoning benchmarks that previous models couldn't solve:
| Benchmark | Description | o3 Result | Context |
|---|---|---|---|
| AIME 2025 | American Invitational Mathematics Examination | 88.9% | Olympiad-level mathematics |
| SWE-bench Verified | Software engineering | 69.1% | Real-world code changes |
| FrontierMath | Research mathematics | 25.2% | Previously <2% for all models |
How do explicit CoT prompting and hidden reasoning differ in practice?
Explicit CoT: a prompting technique; the user sees all errors and detours. Works better with larger models.
Hidden reasoning: RL-trained, with internal error correction and clean output. A paradigm shift toward Test-Time Compute.
Hidden: more expensive inference. Explicit: more transparent. The choice depends on the use case.
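For contrast, explicit CoT lives entirely in the prompt and every reasoning step is visible. The sketch below samples several chains from a standard (non-reasoning) chat model and majority-votes the final answers (self-consistency); the model name, prompt wording, and naive answer extraction are assumptions for illustration:

```python
# Sketch: explicit CoT with self-consistency on a standard chat model.
# All reasoning is visible in the responses; the model name, prompt wording,
# and the naive answer extraction are illustrative assumptions.
from collections import Counter
from openai import OpenAI

client = OpenAI()
question = "How many primes are below 100?"

answers = []
for _ in range(5):  # sample several visible reasoning chains
    resp = client.chat.completions.create(
        model="gpt-4o-mini",           # placeholder non-reasoning model
        temperature=1.0,               # diversity across samples
        messages=[{
            "role": "user",
            "content": f"{question}\nThink step by step, then give the final answer "
                       "on a last line formatted as 'Answer: <number>'.",
        }],
    )
    text = resp.choices[0].message.content
    final = text.rsplit("Answer:", 1)[-1].strip() if "Answer:" in text else text.strip()
    answers.append(final)

# Majority vote over the visible chains (self-consistency).
print(Counter(answers).most_common(1)[0])
```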
o1/o3 are not trained through Supervised Fine-Tuning alone. Instead, they use Reinforcement Learning with verifiable rewards. The model learns through trial and error (see the reward sketch below):
Mathematical correctness, code execution, formal verification – rewards only for objectively verifiable outputs.
No manual annotation of thinking processes. The model discovers reasoning spontaneously through RL.
The model learns to recognize and correct its own errors – all internally, before answering.
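A minimal sketch of what "verifiable rewards" can mean in practice, assuming a math task with a known reference answer and a coding task checked by executing tests; the helper names and the toy training step are hypothetical and not OpenAI's actual pipeline:

```python
# Sketch: verifiable rewards for RL on reasoning traces.
# The task format, helper names, and toy update step are hypothetical;
# the point is that the reward comes from an objective check, not a human label.
import subprocess
import tempfile

def math_reward(model_answer: str, reference: str) -> float:
    """Reward 1.0 only if the final answer matches the verifiable reference."""
    return 1.0 if model_answer.strip() == reference.strip() else 0.0

def code_reward(candidate_code: str, test_code: str) -> float:
    """Reward 1.0 only if the generated code passes the provided unit tests."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n\n" + test_code)
        path = f.name
    result = subprocess.run(["python", path], capture_output=True, timeout=30)
    return 1.0 if result.returncode == 0 else 0.0

# Toy training step: sample a reasoning trace, score it, reinforce if correct.
# `policy.sample` and `policy.reinforce` stand in for a real RL algorithm (e.g. PPO).
def train_step(policy, problem: str, reference: str) -> None:
    trace, answer = policy.sample(problem)      # hidden chain + final answer
    reward = math_reward(answer, reference)     # objective, automatic check
    policy.reinforce(trace, reward)             # no human annotation of the trace
```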
Test-Time Compute can substitute for additional parameters: an efficiency paradigm shift.
Traditionally, LLM scaling follows the Chinchilla law, which balances model size against training data. With o1/o3, a third dimension is added: Test-Time Compute.
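As a rough formula sketch (standard public approximations, not OpenAI-reported numbers): training compute couples the parameter count N with the training tokens D, while Test-Time Compute adds a per-query term that can be scaled independently via the number of generated thinking tokens T.

```latex
% Training compute (Chinchilla regime): parameters N, training tokens D
C_{\text{train}} \approx 6\,N\,D, \qquad D \approx 20\,N \ \text{(compute-optimal)}

% New axis: test-time compute per query, T = generated thinking tokens
C_{\text{test}} \approx 2\,N\,T
```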
User cannot see how the model thinks. Debugging errors is difficult.
More thinking time = higher inference costs. An ROI calculation is needed for each use case (a rough cost sketch follows below).
Works best for problems with objective answers (math, code). Weaker for open-ended tasks.
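A rough cost sketch for that ROI question, with hypothetical per-token prices (check current provider pricing); it assumes hidden reasoning tokens are billed at the output-token rate, as OpenAI's usage reporting suggests:

```python
# Rough per-request cost estimate for a reasoning model.
# Prices and token counts are hypothetical placeholders; hidden reasoning
# tokens are assumed to be billed at the output-token rate.
def request_cost_usd(prompt_tokens: int,
                     visible_output_tokens: int,
                     reasoning_tokens: int,
                     price_in_per_1m: float = 2.00,    # hypothetical $/1M input tokens
                     price_out_per_1m: float = 8.00):  # hypothetical $/1M output tokens
    billed_output = visible_output_tokens + reasoning_tokens
    return (prompt_tokens * price_in_per_1m + billed_output * price_out_per_1m) / 1_000_000

# Same question, low vs. high thinking effort: the visible answer barely changes,
# but the hidden chain (and therefore the bill) does.
print(f"low effort : ${request_cost_usd(1_000, 300, 2_000):.4f}")
print(f"high effort: ${request_cost_usd(1_000, 300, 20_000):.4f}")
```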
No longer: Bigger = better. New: Thinking time = better. RL training instead of just Supervised Fine-Tuning.
RL with objective rewards (correctness) enables spontaneous reasoning without manual annotation.
Thinking time can compensate for up to a 14× parameter disadvantage. New efficiency trade-offs for deployment.
AIME 2025 88.9%, FrontierMath 25.2% (up from <2%) – a qualitative leap in reasoning capabilities.
Hidden Reasoning: Expensive, black-box, but clean output. Explicit CoT: Cheaper, transparent, error-prone.
Integration with context, MoE, multi-domain reasoning. But transparency questions remain open.