DeepSeek R1-Zero is remarkable: it is a base model (no Supervised Fine-Tuning) that was trained purely through Reinforcement Learning with rule-based rewards. The result: the model spontaneously develops Chain-of-Thought reasoning, self-verification, and reflection.
Key findings:
- No manual reasoning examples needed: R1-Zero was NOT trained on example CoT outputs
- Emergence from RL: only the objective (a correct result) and the feedback signal (reward) were necessary
- Practical rewards: math: right/wrong; code: runs/fails; format errors: penalty (see the sketch below the list)
- Dramatic improvement: AIME 2024 pass@1 from 15.6% (base model) to 71.0% (after RL)
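
To make the reward design concrete, here is a minimal Python sketch of such a rule-based reward function. The `<think>`/`<answer>` tags follow the paper's output template, but the function name, the regex, and the exact reward values are illustrative assumptions, not DeepSeek's actual implementation:

```python
import re

# Expected output template: reasoning inside <think>, result inside <answer>.
TEMPLATE = re.compile(r"<think>(.+?)</think>\s*<answer>(.+?)</answer>", re.DOTALL)

def rule_based_reward(completion: str, ground_truth: str) -> float:
    """Score a completion with simple, verifiable rules (values are illustrative)."""
    match = TEMPLATE.fullmatch(completion.strip())
    if match is None:
        # Format reward: penalize outputs that skip the required template.
        return -1.0
    answer = match.group(2).strip()
    # Accuracy reward: exact match against the known result, e.g. the final
    # number of a math problem. For code tasks, one would instead execute the
    # answer against predefined test cases.
    return 1.0 if answer == ground_truth else 0.0

# A correctly formatted, correct answer earns the full reward:
print(rule_based_reward("<think>7 * 6 = 42</think> <answer>42</answer>", "42"))  # 1.0
# A bare answer without the template is penalized:
print(rule_based_reward("42", "42"))  # -1.0
```

Because every rule is deterministically checkable, no learned reward model is needed, which sidesteps reward hacking against a neural judge.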