The New Paradigm: Thinking Before Answering

OpenAI's o-series (o1, introduced September 2024; o3, released April 2025) represents a paradigm shift in LLM development. These models generate an internal Chain-of-Thought that remains hidden from the user: the model "thinks" before answering, and answer quality improves with more thinking time.

Fig. 1 | Normal (top): all tokens are visible. o1-style (bottom): the thinking process is hidden, only the answer is visible.

What is "Hidden Reasoning"?

Error Correction

The model corrects errors internally, without the user seeing inconsistent intermediate steps.

Multiple Approaches

The model explores multiple solution paths internally and selects the best one.

Safety

The raw thinking output never reaches the user, so only the final answer needs to be checked and filtered for safety.

Test-Time Compute: More Thinking Instead of Size

A key insight from research: optimally allocated Test-Time Compute can compensate for a 14× disadvantage in parameter count. Instead of making the model larger, you can increase inference time and let the model "think" more intensively.

Fig. 2 | Test-Time Compute Scaling: How thinking time (sequential scaling) improves quality without enlarging the model. Three approaches: Parallel (multiple outputs), Sequential (iterative), Internal (o1-style).
Three Approaches to Test-Time Scaling
1. Parallel: Generate N outputs, select best
2. Sequential: Iterative refinement
3. Internal (o1): Model decides allocation
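
The first two strategies are easy to illustrate. Below is a minimal sketch, assuming placeholder `generate` and `score` functions that stand in for an LLM sampling call and a verifier/reward model – neither is a specific vendor API.

```python
# Sketch of parallel (best-of-N) and sequential test-time scaling.
# `generate` and `score` are placeholders for an LLM call and a verifier.
import random

def generate(prompt: str, temperature: float = 0.8) -> str:
    """Placeholder: return one sampled completion for the prompt."""
    return f"candidate answer ({random.random():.3f})"

def score(prompt: str, answer: str) -> float:
    """Placeholder: verifier / reward-model score, higher is better."""
    return random.random()

def best_of_n(prompt: str, n: int = 8) -> str:
    """Parallel scaling: sample N candidates independently, keep the best."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda answer: score(prompt, answer))

def sequential_refine(prompt: str, steps: int = 4) -> str:
    """Sequential scaling: iteratively revise a single draft."""
    draft = generate(prompt)
    for _ in range(steps):
        draft = generate(f"{prompt}\nPrevious draft:\n{draft}\nRevise and improve:")
    return draft

print(best_of_n("Prove that the sum of two even numbers is even."))
```

The third strategy (internal, o1-style) cannot be reproduced from the outside: the model itself learns during RL training how much of this budget to spend.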

Benchmark Results: The Performance Leap

The o-series shows dramatic improvements on difficult reasoning benchmarks where previous models made little progress:

Fig. 3 | Performance comparison: o3 shows impressive results on FrontierMath (25.2% vs. <2% for previous models), AIME (88.9%), and SWE-Bench (69.1%).
Benchmark    | Description                                   | o3 Result | Context
AIME 2025    | American Invitational Mathematics Examination | 88.9%     | Olympiad-level mathematics
SWE-Bench    | Software engineering                          | 69.1%     | Real-world code changes
FrontierMath | Research mathematics                          | 25.2%     | Previously <2% for all models

Explicit vs. Hidden CoT

How do the two approaches differ in practice?

Fig. 4 | Left: Explicit CoT shows all thinking processes, errors are visible. Right: o1-style hides thinking, shows only clean answer.

🔍 Explicit CoT

A prompting technique: the user sees all errors and detours. Works better with larger models (a prompt-level contrast is sketched below).

🧠 Hidden Reasoning

RL-trained, internal error correction, clean output. Paradigm shift to Test-Time Compute.

⚖️ Trade-off

Hidden: more expensive inference. Explicit: more transparent. Choice depends on use-case.
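
To make the user-facing difference concrete, here is an illustrative prompt-level contrast. These are only example prompts, not a specific API or recommended wording.

```python
# Explicit CoT: the reasoning is requested as part of the visible output.
explicit_cot_prompt = (
    "A train travels 120 km in 1.5 hours. What is its average speed?\n"
    "Let's think step by step, then state the final answer."
)

# Hidden-reasoning style: the prompt asks only for the result; an o1-style
# model spends its reasoning tokens internally and returns a short answer.
hidden_reasoning_prompt = (
    "A train travels 120 km in 1.5 hours. What is its average speed? "
    "Give only the final result."
)
```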

How o1/o3 Thinks Internally: The RL Training Loop

o1/o3 are not trained primarily through Supervised Fine-Tuning on annotated reasoning traces. Instead, they use Reinforcement Learning with verifiable rewards. The model learns through trial and error:

RL Training Process
1. Model generates internal reasoning tokens
2. Explores many thought paths
3. Verifiable Rewards: Correct? ✓ Code runs? ✓
4. RL penalizes wrong paths, rewards correct ones
5. Model learns to use thinking time efficiently
Fig. 5 | The RL Loop: The model generates reasoning tokens, receives feedback (verifiable rewards), and optimizes its behavior. This happens completely internally – the user only sees the final answer.
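
A toy version of this loop, with a binary verifiable reward for exact-match math answers, might look like the sketch below. `policy_sample` and `policy_update` are placeholders for sampling from and updating the model (e.g. a policy-gradient step); this is not OpenAI's actual training code.

```python
import random

def verifiable_reward(predicted: str, ground_truth: str) -> float:
    """Binary reward: 1.0 only if the final answer is exactly correct."""
    return 1.0 if predicted.strip() == ground_truth.strip() else 0.0

def policy_sample(problem: str) -> tuple[str, str]:
    """Placeholder: return (hidden reasoning trace, final answer)."""
    answer = random.choice(["42", "41", "43"])
    return f"<reasoning about {problem!r}>", answer

def policy_update(trace: str, reward: float) -> None:
    """Placeholder: reinforce traces that led to correct answers."""
    pass

for problem, truth in [("What is 6 * 7?", "42")] * 100:
    trace, answer = policy_sample(problem)     # steps 1-2: generate and explore
    reward = verifiable_reward(answer, truth)  # step 3: objective check
    policy_update(trace, reward)               # step 4: reward/penalize paths
```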

Key Insights from Training

Verifiable Rewards

Mathematical correctness, code execution, formal verification – rewards are given only for objectively verifiable outputs (a minimal reward-function sketch follows after these points).

No Supervised Demos

No manual annotation of thinking processes. The model discovers reasoning spontaneously through RL.

Error Correction

The model learns to recognize and correct its own errors – all internally, before answering.

Thinking Time ≠ Model Size

Test-Time Compute can partially substitute for a larger model – a paradigm shift in efficiency trade-offs.
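
As a concrete illustration of the "Verifiable Rewards" point above: a code-execution reward can simply run the candidate program against unit tests and pay out only on a full pass. The entry-point name `solution` and the test format are assumptions for this sketch, not part of any specific training pipeline.

```python
def code_reward(candidate_source: str, tests: list[tuple[tuple, object]]) -> float:
    """Return 1.0 only if the candidate passes every unit test, else 0.0."""
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)  # define the candidate function
        fn = namespace["solution"]         # assumed entry-point name
        return 1.0 if all(fn(*args) == expected for args, expected in tests) else 0.0
    except Exception:
        return 0.0

# Example: a correct absolute-value implementation earns the full reward.
print(code_reward("def solution(x):\n    return x if x >= 0 else -x",
                  [((3,), 3), ((-4,), 4), ((0,), 0)]))
```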

New Scaling Laws: Training + Inference

Traditionally, LLMs follow the Chinchilla Scaling Law, which trades off model size against training data for a fixed training budget. With o1/o3, a new dimension is added: Test-Time Compute.
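
For reference, the Chinchilla law fits pre-training loss as a function of parameter count N and training tokens D; the constants below are the approximate fits reported by Hoffmann et al. (2022). The second line is only a schematic way of writing the new third axis, not a fitted law.

```latex
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\quad E \approx 1.69,\; A \approx 406.4,\; B \approx 410.7,\;
\alpha \approx 0.34,\; \beta \approx 0.28

\text{Quality} \approx f\bigl(N,\; D,\; C_{\text{test-time}}\bigr),
\qquad \frac{\partial f}{\partial C_{\text{test-time}}} > 0
```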

Fig. 6 | 3D scaling: instead of only increasing model size and training data, you can also increase Test-Time Compute – a new axis for efficiency trade-offs.

Implications for the Future

Limitations and Future Questions

Current Limitations

🔍 Black Box

User cannot see how the model thinks. Debugging errors is difficult.

💰 Expensive

More thinking time means higher inference costs; an ROI calculation is needed for each use-case (a back-of-the-envelope example follows below).

✓ Verifiability

Works best for problems with objective answers (math, code). Weaker for open-ended tasks.
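
Back-of-the-envelope, the cost driver is that hidden reasoning tokens are billed like output tokens even though they are never shown. All prices and token counts below are illustrative placeholders, not real rates.

```python
# Rough per-request cost with hidden reasoning tokens (illustrative numbers).
PRICE_PER_M_OUTPUT_TOKENS = 40.00    # assumed USD per 1M output tokens
visible_answer_tokens = 300
hidden_reasoning_tokens = 12_000     # billed, but never shown to the user

cost = (visible_answer_tokens + hidden_reasoning_tokens) / 1_000_000 \
       * PRICE_PER_M_OUTPUT_TOKENS
print(f"~${cost:.4f} per request")   # reasoning tokens dominate the bill
```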

Future Directions (Q4 2025–2026)

Vision for Next Frontier
"The next frontier lies in the integration of these advances: Reasoning models with unlimited context, efficient through MoE and Quantization, aligned through scalable AI feedback methods."

Key Insights

1️⃣ Paradigm Shift

No longer: Bigger = better. New: Thinking time = better. RL training instead of just Supervised Fine-Tuning.

2️⃣ Verifiable Rewards

RL with objective rewards (correctness) enables spontaneous reasoning without manual annotation.

3️⃣ Test-Time Compute

Thinking time can compensate for up to a 14× parameter disadvantage. New efficiency trade-offs for deployment.

4️⃣ Performance Leap

AIME 88.9%, FrontierMath 25.2% (up from <2%) – a qualitative leap in reasoning capabilities.

5️⃣ Trade-offs

Hidden Reasoning: Expensive, black-box, but clean output. Explicit CoT: Cheaper, transparent, error-prone.

6️⃣ Multi-Modal Future

Integration with long context, MoE, and multi-domain reasoning – but transparency questions remain open.