[Figure: Total Training Progress and Important Milestones]

Key Insights

1. Emergence is sudden: DeepSeek R1-Zero showed close to 0% reasoning accuracy during early training, then jumped abruptly to ~20% around iteration ~400k. The capability appears as emergent behavior rather than a gradual improvement.
2. GRPO ≠ standard RL: Group Relative Policy Optimization is not PPO. Rather than relying on a learned value function, it scores each sampled solution relative to the other solutions in its group. This is why reasoning capabilities can emerge without Supervised Fine-Tuning (a minimal advantage sketch follows after this list).
3. Reward is verification-based: DeepSeek R1 uses only the correct-vs-incorrect outcome of a solution as its reward signal, not step-by-step supervision. The model nevertheless learns Chain-of-Thought reasoning, which is surprising (a reward sketch also follows below).
4. Base model quality is critical: R1-Zero is trained from Qwen-70B-Base, not Qwen-Chat. Starting from a chat-tuned model, the bias introduced by SFT would suppress the emergent reasoning; the base model carries no such bias.
5. Long-chain CoT is learned on its own: no part of the training procedure enforces long outputs. The model independently discovers that spending more tokens on "thinking" yields better accuracy. This is an insight about the learning dynamics themselves.
6. Thinking time can be scaled: O3 shows that more compute at inference time (more tokens spent on thinking) yields better results. This opens up a new scaling axis.
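To make the group-relative idea in insight 2 concrete, here is a minimal sketch of how per-group advantages could be computed and plugged into a clipped policy-gradient loss. It assumes one scalar reward per sampled completion; the function names and the PPO-style clipping constant are illustrative assumptions, not DeepSeek's actual implementation.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Normalize each reward against its own group.

    rewards has shape (num_prompts, group_size): one scalar reward per
    sampled completion. The group mean acts as the baseline, so no
    learned value function (critic) is required.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)

def grpo_policy_loss(logp_new: torch.Tensor,
                     logp_old: torch.Tensor,
                     advantages: torch.Tensor,
                     clip_eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped surrogate objective driven by group-relative advantages."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```

In this sketch, all completions sampled for the same prompt form one group, so a solution is rewarded for being better than its siblings rather than for hitting an absolute score.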
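The outcome-only signal from insight 3 can be as simple as the following check. The "Answer:" marker is an assumed prompt convention; DeepSeek R1 also uses additional format rewards that are omitted here.

```python
import re

def outcome_reward(completion: str, reference_answer: str) -> float:
    """Return 1.0 if the final answer matches the reference, else 0.0.

    Only the end result is verified; the reasoning steps in between
    receive no direct supervision.
    """
    match = re.search(r"Answer:\s*(.+)", completion)  # assumed answer format
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0
```

A reward like this could feed directly into the grpo_advantages sketch above.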