The New Paradigm: Thinking Before Answering

OpenAI's o-series (o1, introduced September 2024; o3, released April 2025) represents a paradigm shift in LLM development. These models generate an internal Chain-of-Thought that remains hidden from the user: the model "thinks" before answering, and answer quality improves with more thinking time.

Fig. 1 | Normal (top): all tokens are visible. o1-style (bottom): the thinking process is hidden, only the answer is visible.

What is "Hidden Reasoning"?

Error Correction

The model corrects errors internally, without the user seeing inconsistent intermediate steps.

Multiple Approaches

The model explores multiple solution paths internally and selects the best one.

Safety

The raw thinking output never reaches the user, so only the final answer needs to be checked and filtered for safety.

Test-Time Compute: More Thinking Instead of Size

A key insight from research: optimally allocated Test-Time Compute can compensate for a 14× disadvantage in parameter count. Instead of making the model larger, you can increase inference time and let the model "think" more intensively.

Fig. 2 | Test-Time Compute Scaling: How thinking time (sequential scaling) improves quality without enlarging the model. Three approaches: Parallel (multiple outputs), Sequential (iterative), Internal (o1-style).
Three Approaches to Test-Time Scaling
1. Parallel: Generate N outputs, select best
2. Sequential: Iterative refinement
3. Internal (o1): Model decides allocation
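
The first two strategies are easy to illustrate. Below is a minimal sketch, assuming placeholder `generate` and `score` functions that stand in for an LLM sampling call and a verifier/reward model – neither is a specific vendor API.

```python
# Sketch of parallel (best-of-N) and sequential test-time scaling.
# `generate` and `score` are placeholders for an LLM call and a verifier.
import random

def generate(prompt: str, temperature: float = 0.8) -> str:
    """Placeholder: return one sampled completion for the prompt."""
    return f"candidate answer ({random.random():.3f})"

def score(prompt: str, answer: str) -> float:
    """Placeholder: verifier / reward-model score, higher is better."""
    return random.random()

def best_of_n(prompt: str, n: int = 8) -> str:
    """Parallel scaling: sample N candidates independently, keep the best."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda answer: score(prompt, answer))

def sequential_refine(prompt: str, steps: int = 4) -> str:
    """Sequential scaling: iteratively revise a single draft."""
    draft = generate(prompt)
    for _ in range(steps):
        draft = generate(f"{prompt}\nPrevious draft:\n{draft}\nRevise and improve:")
    return draft

print(best_of_n("Prove that the sum of two even numbers is even."))
```

The third strategy (internal, o1-style) cannot be reproduced from the outside: the model itself learns during RL training how much of this budget to spend.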

Benchmark Results: The Performance Leap

The o-series shows dramatic improvements on difficult reasoning benchmarks where previous models made little progress:

Fig. 3 | Performance comparison: o3 shows impressive results on FrontierMath (25.2% vs. <2% for previous models), AIME (88.9%), and SWE-Bench (69.1%).
Benchmark    | Description                                   | o3 Result | Context
AIME 2025    | American Invitational Mathematics Examination | 88.9%     | Olympiad-level mathematics
SWE-Bench    | Software engineering                          | 69.1%     | Real-world code changes
FrontierMath | Research mathematics                          | 25.2%     | Previously <2% for all models

Explicit vs. Hidden CoT

How do the two approaches differ in practice?

Fig. 4 | Left: Explicit CoT shows all thinking processes, errors are visible. Right: o1-style hides thinking, shows only clean answer.

🔍 Explicit CoT

A prompting technique: the user sees all errors and detours. Works better with larger models (a prompt-level contrast is sketched below).

🧠 Hidden Reasoning

RL-trained, internal error correction, clean output. Paradigm shift to Test-Time Compute.

⚖️ Trade-off

Hidden: more expensive inference. Explicit: more transparent. Choice depends on use-case.
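
To make the user-facing difference concrete, here is an illustrative prompt-level contrast. These are only example prompts, not a specific API or recommended wording.

```python
# Explicit CoT: the reasoning is requested as part of the visible output.
explicit_cot_prompt = (
    "A train travels 120 km in 1.5 hours. What is its average speed?\n"
    "Let's think step by step, then state the final answer."
)

# Hidden-reasoning style: the prompt asks only for the result; an o1-style
# model spends its reasoning tokens internally and returns a short answer.
hidden_reasoning_prompt = (
    "A train travels 120 km in 1.5 hours. What is its average speed? "
    "Give only the final result."
)
```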

How o1/o3 Thinks Internally: The RL Training Loop

o1/o3 are not trained primarily through Supervised Fine-Tuning on annotated reasoning traces. Instead, they use Reinforcement Learning with verifiable rewards. The model learns through trial and error:

RL Training Process
1. Model generates internal reasoning tokens
2. Explores many thought paths
3. Verifiable Rewards: Correct? ✓ Code runs? ✓
4. RL penalizes wrong paths, rewards correct ones
5. Model learns to use thinking time efficiently
Fig. 5 | The RL Loop: The model generates reasoning tokens, receives feedback (verifiable rewards), and optimizes its behavior. This happens completely internally – the user only sees the final answer.
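
A toy version of this loop, with a binary verifiable reward for exact-match math answers, might look like the sketch below. `policy_sample` and `policy_update` are placeholders for sampling from and updating the model (e.g. a policy-gradient step); this is not OpenAI's actual training code.

```python
import random

def verifiable_reward(predicted: str, ground_truth: str) -> float:
    """Binary reward: 1.0 only if the final answer is exactly correct."""
    return 1.0 if predicted.strip() == ground_truth.strip() else 0.0

def policy_sample(problem: str) -> tuple[str, str]:
    """Placeholder: return (hidden reasoning trace, final answer)."""
    answer = random.choice(["42", "41", "43"])
    return f"<reasoning about {problem!r}>", answer

def policy_update(trace: str, reward: float) -> None:
    """Placeholder: reinforce traces that led to correct answers."""
    pass

for problem, truth in [("What is 6 * 7?", "42")] * 100:
    trace, answer = policy_sample(problem)     # steps 1-2: generate and explore
    reward = verifiable_reward(answer, truth)  # step 3: objective check
    policy_update(trace, reward)               # step 4: reward/penalize paths
```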

Key Insights from Training

Verifiable Rewards

Mathematical correctness, code execution, formal verification – rewards are given only for objectively verifiable outputs (a minimal reward-function sketch follows after these points).

No Supervised Demos

No manual annotation of thinking processes. The model discovers reasoning spontaneously through RL.

Error Correction

The model learns to recognize and correct its own errors – all internally, before answering.

Thinking Time ≠ Model Size

Test-Time Compute can partially substitute for a larger model – a paradigm shift in efficiency trade-offs.
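
As a concrete illustration of the "Verifiable Rewards" point above: a code-execution reward can simply run the candidate program against unit tests and pay out only on a full pass. The entry-point name `solution` and the test format are assumptions for this sketch, not part of any specific training pipeline.

```python
def code_reward(candidate_source: str, tests: list[tuple[tuple, object]]) -> float:
    """Return 1.0 only if the candidate passes every unit test, else 0.0."""
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)  # define the candidate function
        fn = namespace["solution"]         # assumed entry-point name
        return 1.0 if all(fn(*args) == expected for args, expected in tests) else 0.0
    except Exception:
        return 0.0

# Example: a correct absolute-value implementation earns the full reward.
print(code_reward("def solution(x):\n    return x if x >= 0 else -x",
                  [((3,), 3), ((-4,), 4), ((0,), 0)]))
```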

New Scaling Laws: Training + Inference

Traditionally, LLMs follow the Chinchilla Scaling Law, which trades off model size against training data for a fixed training budget. With o1/o3, a new dimension is added: Test-Time Compute.
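
For reference, the Chinchilla law fits pre-training loss as a function of parameter count N and training tokens D; the constants below are the approximate fits reported by Hoffmann et al. (2022). The second line is only a schematic way of writing the new third axis, not a fitted law.

```latex
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\quad E \approx 1.69,\; A \approx 406.4,\; B \approx 410.7,\;
\alpha \approx 0.34,\; \beta \approx 0.28

\text{Quality} \approx f\bigl(N,\; D,\; C_{\text{test-time}}\bigr),
\qquad \frac{\partial f}{\partial C_{\text{test-time}}} > 0
```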

Fig. 6 | 3D scaling: instead of only increasing model size and training data, you can also increase Test-Time Compute – a new axis for efficiency trade-offs.

Implications for the Future

Limitations and Future Questions

Current Limitations

🔍 Black Box

User cannot see how the model thinks. Debugging errors is difficult.

💰 Expensive

More thinking time means higher inference costs; an ROI calculation is needed for each use-case (a back-of-the-envelope example follows below).

✓ Verifiability

Works best for problems with objective answers (math, code). Weaker for open-ended tasks.
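
Back-of-the-envelope, the cost driver is that hidden reasoning tokens are billed like output tokens even though they are never shown. All prices and token counts below are illustrative placeholders, not real rates.

```python
# Rough per-request cost with hidden reasoning tokens (illustrative numbers).
PRICE_PER_M_OUTPUT_TOKENS = 40.00    # assumed USD per 1M output tokens
visible_answer_tokens = 300
hidden_reasoning_tokens = 12_000     # billed, but never shown to the user

cost = (visible_answer_tokens + hidden_reasoning_tokens) / 1_000_000 \
       * PRICE_PER_M_OUTPUT_TOKENS
print(f"~${cost:.4f} per request")   # reasoning tokens dominate the bill
```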

Future Directions (Q4 2025–2026)

Vision for Next Frontier
"The next frontier lies in the integration of these advances: Reasoning models with unlimited context, efficient through MoE and Quantization, aligned through scalable AI feedback methods."

Key Insights

1️⃣ Paradigm Shift

No longer: Bigger = better. New: Thinking time = better. RL training instead of just Supervised Fine-Tuning.

2️⃣ Verifiable Rewards

RL with objective rewards (correctness) enables spontaneous reasoning without manual annotation.

3️⃣ Test-Time Compute

Thinking time can compensate for up to a 14× parameter disadvantage. New efficiency trade-offs for deployment.

4️⃣ Performance Leap

AIME 88.9%, FrontierMath 25.2% (up from <2%) – a qualitative leap in reasoning capabilities.

5️⃣ Trade-offs

Hidden Reasoning: Expensive, black-box, but clean output. Explicit CoT: Cheaper, transparent, error-prone.

6️⃣ Multi-Modal Future

Integration with long context, MoE, and multi-domain reasoning – but transparency questions remain open.