How reasoning capabilities suddenly emerge during GRPO training – from 0% to 90%+ in just a few iterations
DeepSeek R1 shows how reasoning capabilities emerge during GRPO training: the model moves from initial incompetence to complex multi-step thinking, without any explicit Chain-of-Thought training.
Part 2 of 2 of a deep dive into emergent capabilities, using DeepSeek R1 as an example.
DeepSeek R1 (January 2025) shows that open-source models can compete with OpenAI's o1. Its methodology, GRPO (Group Relative Policy Optimization) in place of conventional RLHF, marks a paradigm shift for reasoning models.
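To make the GRPO idea more concrete, here is a minimal sketch of its central step: several answers are sampled for the same prompt, each is scored, and every answer is judged relative to its own group instead of by a separately trained value model. The function name `group_relative_advantages`, the 0/1 rewards, and the use of the population standard deviation are illustrative assumptions, not DeepSeek's actual code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group.

    Hypothetical helper for illustration: `rewards` holds one scalar score
    per sampled answer to the same prompt.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    # eps avoids division by zero when all answers score the same.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Assumed example: four sampled answers, scored 1.0 if correct, 0.0 if not.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Answers that beat the group average get a positive advantage and are reinforced, while below-average answers are pushed down. If every answer in a group receives the same reward, all advantages are zero and that group contributes no learning signal.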