Performance During Training

Phase 1: SFT (supervised fine-tuning). Output length: ~300 tokens. Accuracy: ~60%.
Phase 2: GRPO (Group Relative Policy Optimization). Output length: ~500 tokens. Accuracy: ~65%.
Phase 3: CoT emergence. Chain-of-Thought reasoning appears spontaneously. Output length: ~2,000-5,000 tokens. Accuracy: ~72-78%.
Phase 4: Verification. Self-verification and refinement. Output length: ~5,000-10,000 tokens. Accuracy: ~82-85%.

Spontaneous Emergence

DeepSeek R1 develops its reasoning behavior not through explicit programming but spontaneously during RL training. Around epochs 5-7, output length explodes and extended thinking becomes active.

GRPO is Critical

Group Relative Policy Optimization lets the model explore different solution strategies and reinforces whichever ones score best within each sampled group. With SFT alone: no reasoning emerges. With GRPO: reasoning emerges after 1-2 weeks of training.
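
To make the mechanism concrete, here is a minimal Python sketch of the group-relative advantage at the core of GRPO: for each prompt, a group of outputs is sampled and every output's reward is normalized against the group's mean and standard deviation, so no learned value function is needed. The function name and the toy verifier rewards are illustrative, not DeepSeek's actual code; in the full objective these advantages feed a PPO-style clipped loss with a KL penalty toward a reference policy.

```python
# A minimal sketch of GRPO's group-relative advantage, assuming reward
# scores (e.g. from a rule-based verifier) are already available.
# Names and toy numbers are illustrative, not DeepSeek's implementation.
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sampled output's reward against its own group.

    rewards: array of shape (num_prompts, group_size), one row per prompt.
    Returns (r - group_mean) / group_std, which replaces PPO's learned
    value-function baseline.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

# Example: one prompt, a group of 4 sampled answers scored 0/1 by a verifier.
print(group_relative_advantages([[1.0, 0.0, 0.0, 1.0]]))
# -> approximately [[ 1. -1. -1.  1.]]: correct answers get pushed up,
#    incorrect ones pushed down, relative to the group average.
```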

Training Costs

Phase 1 (SFT): 1-2 days on 8×H100. Phase 2 (GRPO): 7-10 days. Phases 3-4: continuous improvement. Total: roughly two weeks from start to SOTA-level performance.
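
For a rough sense of scale, the sketch below turns that schedule into GPU-hours. It assumes the quoted 8×H100 node is used for both phases (the source only states the hardware for the SFT phase explicitly); the numbers are bounds derived from the day counts above, not reported figures.

```python
# A back-of-the-envelope GPU-hour estimate for the schedule above. It
# assumes the quoted 8xH100 node is used for both phases (an assumption;
# the source only states the hardware for the SFT phase explicitly).
NUM_GPUS = 8

phase_days = {
    "Phase 1 (SFT)":  (1, 2),    # (lower bound, upper bound) in days
    "Phase 2 (GRPO)": (7, 10),
}

lo_days = sum(low for low, _ in phase_days.values())
hi_days = sum(high for _, high in phase_days.values())
print(f"Wall-clock: {lo_days}-{hi_days} days (~2 weeks with Phases 3-4 ongoing)")
print(f"GPU-hours:  {lo_days * 24 * NUM_GPUS}-{hi_days * 24 * NUM_GPUS}")
# -> 1536-2304 GPU-hours on the assumed single 8-GPU node.
```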

Output Length Explosion

SFT: at most ~300 tokens. After Phase 2: 500-1K. After emergence: 2K-10K. The model learns: for hard problems, think longer.
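
One simple way to observe this in a training run is to log completion-length statistics on a fixed prompt set after every epoch; a minimal sketch follows. The helper name and the dummy completions are illustrative assumptions, and a real run would pass the model's actual tokenizer.

```python
# A small sketch of how the length explosion could be tracked: log mean and
# max completion length on a fixed prompt set after every epoch. The helper
# name and the dummy completions below are illustrative assumptions.
from statistics import mean

def length_stats(completions, tokenizer=None):
    """Return (mean, max) token counts for a batch of completions.

    Uses the given tokenizer's .encode() if provided, otherwise falls back
    to a crude whitespace split.
    """
    lengths = [
        len(tokenizer.encode(text)) if tokenizer else len(text.split())
        for text in completions
    ]
    return mean(lengths), max(lengths)

# Usage: call once per epoch and watch the curve; in the run described
# above it jumps from roughly 300 tokens to 2K-10K after emergence.
epoch_completions = ["The answer is 42.", "Let me check this step by step ..."]
print(length_stats(epoch_completions))
```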

Similar to Humans

Humans solve simple tasks quickly but need longer thinking for complex problems. DeepSeek R1 likewise varies its output length with task difficulty, yielding human-like reasoning.

Future: Test-Time Scaling

Thinking is compute spent at test time: more thinking tokens tend to yield better solutions. Models like OpenAI o1: 10K-100K+ thinking tokens. Next generation: flexible compute allocation per problem.
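
The claim that more tokens yield better solutions can be checked with a simple budget sweep: run the same problems under increasing thinking-token budgets and compare accuracy. In the sketch below, generate and is_correct are hypothetical placeholders for a real model call and answer checker; the dummy stand-ins exist only so the example runs end to end.

```python
# A sketch of a test-time scaling sweep: give the model progressively larger
# thinking budgets and measure accuracy on a fixed problem set. `generate`
# and `is_correct` are hypothetical placeholders, not part of any real API.
from typing import Callable

def accuracy_vs_budget(
    problems: list[str],
    generate: Callable[[str, int], str],     # (prompt, thinking_budget) -> answer
    is_correct: Callable[[str, str], bool],  # (prompt, answer) -> verdict
    budgets: tuple[int, ...] = (500, 2_000, 10_000, 50_000),
) -> dict[int, float]:
    """Return accuracy at each thinking-token budget."""
    return {
        budget: sum(is_correct(p, generate(p, budget)) for p in problems) / len(problems)
        for budget in budgets
    }

# Dummy stand-ins so the sketch runs end to end: the "model" only solves the
# harder problem when it is allowed at least 2,000 thinking tokens.
answers = {"2 + 2 = ?": "4", "17 * 23 = ?": "391"}
dummy_generate = lambda p, budget: answers[p] if budget >= 2_000 or p == "2 + 2 = ?" else "unsure"
dummy_check = lambda p, ans: ans == answers[p]

print(accuracy_vs_budget(list(answers), dummy_generate, dummy_check))
# Expected under the scaling hypothesis: accuracy rises as the budget grows.
```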