Emergence Timeline – DeepSeek R1

How reasoning capabilities emerge during GRPO training: DeepSeek-R1-Zero's pass@1 on AIME 2024 rose from 15.6% to 71.0% through reinforcement learning alone

DeepSeek R1 shows how reasoning capabilities can arise during GRPO training: the model moves from weak initial performance to complex multi-step thinking, without explicit Chain-of-Thought supervision.

Learning Context

Understand GRPO as an RL method for reasoning
Follow the phases of emergence
Contextualize DeepSeek R1's architecture

A deep dive into emergent capabilities, using DeepSeek R1 as an example.

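DeepSeek R1 (January 2025) shows that open-weight models can compete with OpenAI's o1 on reasoning benchmarks. The methodology, GRPO with rule-based rewards in place of the usual RLHF pipeline of PPO plus a learned reward model, marks a shift in how reasoning models are trained.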

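GRPO: Group Relative Policy Optimization, a PPO variant that drops the separate value (critic) model
No SFT start (R1-Zero): Begins without CoT demonstrations and develops long reasoning on its own
Open weights: Model weights and a technical report are public (MIT license); the training data is not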

Key Insights

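Emergence appears during RL: DeepSeek-R1-Zero starts with weak reasoning performance (15.6% pass@1 on AIME 2024) and reaches 71.0% over the course of RL training. Behaviors such as self-verification and re-evaluating an earlier approach (the "aha moment" described in the DeepSeek-R1 report) show up at intermediate checkpoints without ever being demonstrated to the model. This is emergent behavior: it is not programmed in.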

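GRPO ≠ PPO: Group Relative Policy Optimization keeps a PPO-style clipped objective but removes the value (critic) model. For each prompt it samples a group of answers and scores every answer relative to the group: its reward minus the group mean, divided by the group standard deviation. In R1-Zero this RL stage is applied directly to the base model, without prior Supervised Fine-Tuning.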
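
The group-relative scoring can be sketched in a few lines. This is a minimal illustration, not DeepSeek's training code: the function name, the `eps` term, and the use of the population standard deviation are assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled answer relative to its own group.

    rewards: one scalar reward per answer sampled for the same prompt.
    eps: hypothetical stabilizer so an all-equal group does not divide by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt: two correct (1.0), two incorrect (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct answers receive a positive advantage and incorrect ones a negative advantage, so no learned value model is needed to provide a baseline. If every answer in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.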

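Reward is verification-based: DeepSeek-R1-Zero uses rule-based rewards only: an accuracy reward for a correct final result and a format reward for placing the reasoning inside think tags. Individual reasoning steps are never scored. Yet the model learns Chain-of-Thought, which is the surprising part.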
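
A rule-based reward of this kind can be sketched as follows. The `<think>`/`<answer>` tag layout follows the template described in the DeepSeek-R1 report; the function name, the exact-match comparison, and the `format_bonus` weight are hypothetical choices for illustration, not published values.

```python
import re

# Expected layout: reasoning in <think> tags, final result in <answer> tags.
THINK_ANSWER = re.compile(
    r"^<think>(.+?)</think>\s*<answer>(.+?)</answer>$", re.DOTALL
)


def rule_based_reward(completion, reference, format_bonus=0.1):
    """Return accuracy reward plus format reward for one completion.

    Only the final answer is checked; the reasoning text is never scored.
    format_bonus is a hypothetical weight chosen for this sketch.
    """
    match = THINK_ANSWER.match(completion.strip())
    if match is None:
        return 0.0  # wrong format: no reward at all in this sketch
    answer = match.group(2).strip()
    accuracy = 1.0 if answer == reference.strip() else 0.0
    return accuracy + format_bonus


good = rule_based_reward("<think>2 + 2 = 4</think>\n<answer>4</answer>", "4")
wrong = rule_based_reward("<think>guess</think><answer>5</answer>", "4")
unformatted = rule_based_reward("The answer is 4", "4")
```

Because the check is a deterministic rule, there is no learned reward model for the policy to exploit. Real math graders typically normalize answers (for example, equivalent fractions) instead of comparing strings exactly.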

Base model quality is critical: R1-Zero is trained from DeepSeek-V3-Base, not from a chat-tuned model. Starting from a chat model would carry SFT bias into RL and could suppress the exploration that reasoning needs. The base model carries no such bias.

Long-chain CoT learns by itself: No procedure enforces long outputs. Response length grows during training because the model discovers that "thinking" (spending more tokens) yields better accuracy. The behavior is learned, not designed.

Scaling thinking time is possible: OpenAI's o3 shows that more compute at inference time (more tokens for thinking) yields better results. This opens a new scaling axis alongside model size and training data.
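
Longer thinking is one way to spend more inference compute. A simpler, related way is to sample several answers and take a majority vote, which is a different technique from long chains of thought but illustrates the same scaling axis. The sketch below assumes the final answers have already been extracted as strings; the function name is hypothetical.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent final answer among several samples.

    Ties resolve to the answer that was seen first.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]


# Four sampled final answers for the same question.
voted = majority_vote(["42", "41", "42", "7"])
```

Each extra sample costs additional inference compute, and the vote tends to become more reliable as the number of samples grows, provided the correct answer is the most likely single outcome.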
