How reasoning capabilities emerge spontaneously during RL training
Emergent reasoning capabilities arise spontaneously during RL training – not through explicit programming. This animation shows how DeepSeek R1's reasoning performance jumps from 0% to 90%+ over the course of RL training.
After Scaling & Complexity (part 1/2), this part (2/2) covers emergent capabilities – what happens when models grow larger.
Emergence is the most surprising phenomenon in LLMs: capabilities that weren't explicitly trained suddenly appear. This defines the limits of what we can predict.
DeepSeek R1 develops reasoning not through explicit programming, but spontaneously during RL training. Around epochs 5-7, output length explodes and extended thinking kicks in.
Group Relative Policy Optimization (GRPO) lets the model explore different solution strategies and reinforces the ones that score better than their group average. With SFT alone: no emergent reasoning. With GRPO: emergence after 1-2 weeks of training.
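A minimal sketch of the group-relative advantage at the heart of GRPO, assuming a simple 0/1 correctness reward per sampled completion (function and variable names are illustrative):

```python
# Sketch of GRPO's group-relative advantage: each completion's reward is
# normalized against the other completions sampled for the same prompt,
# so no learned value network is needed as a baseline.
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Normalize rewards by the group mean and std (one group = one prompt)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 8 completions sampled for the same math prompt,
# reward 1.0 if the final answer is correct, 0.0 otherwise.
rewards = np.array([0., 0., 1., 0., 1., 0., 0., 0.])
advantages = group_relative_advantages(rewards)

# Correct completions get a positive advantage, incorrect ones a negative one,
# so the policy update pushes probability toward strategies that solved the task.
print(advantages.round(2))
```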
Phase 1 (SFT): 1-2 days on 8×H100. Phase 2 (GRPO): 7-10 days. Phase 3-4: continuous improvement. Total: ~2 weeks from start to SOTA performance.
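For orientation, the same schedule written out as a plain Python plan (purely illustrative; the field names are assumptions, not a real config format):

```python
# Training schedule from the text above; only phases 1-2 have quoted durations.
TRAINING_PLAN = [
    {"phase": 1, "method": "SFT",  "duration_days": (1, 2), "hardware": "8xH100"},
    {"phase": 2, "method": "GRPO", "duration_days": (7, 10)},
    {"phase": 3, "method": "continuous improvement"},
    {"phase": 4, "method": "continuous improvement"},
]

spans = [p["duration_days"] for p in TRAINING_PLAN if "duration_days" in p]
low, high = sum(lo for lo, hi in spans), sum(hi for lo, hi in spans)
print(f"core phases: {low}-{high} days")  # roughly the ~2 weeks quoted above
```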
SFT: max 300 tokens. After Phase 2: 500-1K. After emergence: 2K-10K! The model learns: "For hard problems, think longer."
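A minimal sketch of how one might spot this length explosion in training logs; the per-epoch completion lengths below are invented for illustration:

```python
# Track mean completion length per epoch and flag sudden jumps (>2x),
# which is where the "thinking" behavior becomes visible.
from statistics import mean

epoch_lengths = {
    1: [250, 280, 300],     # SFT regime: short answers
    4: [450, 600, 900],     # early GRPO: somewhat longer
    6: [2100, 3500, 8000],  # emergence: long chains of thought
}

prev = None
for epoch, lengths in sorted(epoch_lengths.items()):
    avg = mean(lengths)
    flag = " (>2x jump: emergence?)" if prev is not None and avg > 2 * prev else ""
    print(f"epoch {epoch}: mean length {avg:.0f} tokens{flag}")
    prev = avg
```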
Humans solve simple tasks quickly and spend longer thinking about complex problems. DeepSeek R1 likewise varies its output length with task difficulty → human-like reasoning behavior.
Thinking = compute at test time. More tokens = better solutions. Models like DeepSeek R1 and OpenAI o1: 10K-100K+ thinking tokens. Next generation: flexible compute allocation.
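A rough sketch of what flexible compute allocation could look like: map an estimated task difficulty to a thinking-token budget (the function name, budget range, and difficulty scores are assumptions):

```python
# Harder prompts get a larger thinking-token budget; easy prompts stay cheap.
def thinking_budget(difficulty: float, min_tokens: int = 512, max_tokens: int = 32_768) -> int:
    """Map an estimated difficulty in [0, 1] to a maximum number of thinking tokens."""
    difficulty = min(max(difficulty, 0.0), 1.0)
    return int(min_tokens + difficulty * (max_tokens - min_tokens))

for d in (0.1, 0.5, 0.9):
    print(d, thinking_budget(d))  # easy -> few tokens, hard -> many tokens
```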