The R1-Zero Breakthrough Experiment

DeepSeek R1-Zero is remarkable: a base model (no Supervised Fine-Tuning) trained purely through Reinforcement Learning with rule-based rewards. The result: the model spontaneously develops Chain-of-Thought reasoning, self-verification, and reflection.

Key findings:

  • No manual reasoning examples needed: R1-Zero was NOT trained on example CoT outputs
  • Emergence from RL: Only the goal (correct result) and feedback (reward) were necessary
  • Practical rewards: Math: right/wrong. Code: runs/crashes. Format errors: penalty
  • Dramatic improvement: AIME 2024 from 15.6% (base) to 71.0% (after RL)
Fig. 1 | AIME 2024 Performance: Base (green) 15.6%, R1-Zero after RL (purple) 71.0%. A 4.5× jump through pure Reinforcement Learning without Supervised Fine-Tuning.
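The rule-based rewards described above (math: right/wrong, format errors: penalty) can be sketched as a simple scoring function. This is a minimal illustration, not DeepSeek's actual reward code; the boxed-answer convention and the concrete penalty value are assumptions.

```python
import re

def rule_based_reward(output: str, expected: str) -> float:
    """Toy rule-based reward: correctness check plus a format penalty.

    Assumption (illustrative, not DeepSeek's real implementation):
    the final answer must appear inside \\boxed{...}.
    """
    match = re.search(r"\\boxed\{([^}]*)\}", output)
    if match is None:
        return -0.5  # format error: no parsable answer -> penalty
    answer = match.group(1).strip()
    return 1.0 if answer == expected else 0.0

print(rule_based_reward(r"Thus \boxed{1645}", "1645"))   # correct answer
print(rule_based_reward("The answer is 1645", "1645"))   # format penalty
```

Because the reward is computed by a fixed rule rather than a learned reward model, it is cheap to evaluate and cannot be "gamed" in the way learned reward models can.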

GRPO – Group Relative Policy Optimization

The key to DeepSeek R1's training efficiency is the GRPO (Group Relative Policy Optimization) algorithm, a simplification of PPO (Proximal Policy Optimization).

The core problem: PPO requires a separate Value Network (Critic) for stability. This doubles memory requirements and training time.

GRPO solution: Instead of a Critic network, generate a group (e.g., G=8) of outputs per prompt. The advantage is calculated relative to the group (z-score normalization), not absolutely.

GRPO Advantage:

Advantage_i = (reward_i - mean(rewards)) / std(rewards)

Loss = -min(A_i × r_t, A_i × clip(r_t, 1-ε, 1+ε))

where r_t = π_new / π_old is the probability ratio and ε the clipping range.
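The two formulas above can be sketched in a few lines of NumPy. This is a minimal numerical illustration of the group-relative advantage and the clipped surrogate loss, not the full GRPO objective (the paper's version also includes a KL penalty toward a reference policy).

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Group-relative advantage: z-score of each reward within its group."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def clipped_loss(advantages: np.ndarray, ratios: np.ndarray, eps: float = 0.2) -> float:
    """PPO-style clipped surrogate loss, averaged over the group.

    `ratios` are the per-output probability ratios pi_new / pi_old.
    """
    unclipped = advantages * ratios
    clipped = advantages * np.clip(ratios, 1 - eps, 1 + eps)
    return -np.minimum(unclipped, clipped).mean()

# One prompt, a group of G=8 sampled outputs: 3 correct (reward 1), 5 wrong (0).
rewards = np.array([1, 0, 0, 1, 0, 0, 1, 0], dtype=float)
adv = grpo_advantages(rewards)
print(adv.round(2))  # correct outputs get positive advantage, wrong ones negative
print(clipped_loss(adv, ratios=np.ones(8)))  # at ratio 1, loss = -mean(adv) ≈ 0
```

Note that no value network appears anywhere: the group mean plays the role of the baseline that PPO's critic would otherwise estimate.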

Benefits of GRPO:

  • No Value Network needed → 50% memory savings
  • More stable convergence through normalization
  • Simpler implementation
  • Better scaling to large models (671B DeepSeek R1)
Fig. 2 | Architecture comparison: PPO (top) with Policy and Value Networks, GRPO (bottom) with only Policy and group-based normalization. GRPO is more memory-efficient and stable.

Emergence of Reasoning During Training

The most remarkable aspect of the R1-Zero experiment is the emergence of reasoning structures without the model ever being trained on example Chain-of-Thought outputs.

Training phases:

  • Phase 0 (Early RL): The model generates random outputs, sometimes correct, often wrong
  • Phase 1 (~5B Steps): Spontaneously, the model starts generating "Thinking" blocks with hypotheses and checks
  • Phase 2 (~15B Steps): CoT format becomes consistent. Self-verification emerges ("Wait, that's wrong, let me recalculate")
  • Phase 3 (Final): Reflective reasoning, multiple verifications, robust error handling
Fig. 3 | Emergence timeline: Training progress from left (chaotic) to right (structured CoT). Output length increases, structure becomes consistent, accuracy jumps after Phase 1.

Entropy Collapse Problem

During RL training, a critical problem can occur: Entropy Collapse. The model's output distribution becomes too narrow (low entropy), leading to repetitive answers and poor generalization.

⚠️ Problem: Distribution becomes too narrow

Symptom: The model converges on a few repetitive answer patterns.
Result: Performance stagnates despite continued training.
Cause: The reward signal too strongly incentivizes a few "safe" outputs.

Solution: Entropy Monitoring & Schedule Adjustment

Skywork-OR1 Paper (arxiv:2505.22312): Monitor entropy during training.
On entropy drop: Adjust RL schedule (Learning Rate, Reward Clipping).
Result: DeepSeek-R1-Distill-32B gains +15.0 percentage points (57.8% → 72.8%)
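The monitoring idea can be sketched as follows. The entropy computation is standard; the threshold and learning-rate decay are illustrative assumptions, not the schedule from the Skywork-OR1 paper.

```python
import math

def token_entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def adjust_schedule(mean_entropy: float, lr: float,
                    threshold: float = 0.5, decay: float = 0.5) -> float:
    """Illustrative schedule tweak: if entropy collapses below the
    threshold, cut the learning rate to slow the collapse.
    Threshold and decay values are assumptions, not from the paper."""
    return lr * decay if mean_entropy < threshold else lr

# Healthy vs. collapsed distribution over 4 candidate tokens:
healthy = token_entropy([0.25, 0.25, 0.25, 0.25])    # maximum: ln(4) ≈ 1.386
collapsed = token_entropy([0.97, 0.01, 0.01, 0.01])  # close to 0
print(round(healthy, 3), round(collapsed, 3))
print(adjust_schedule(collapsed, lr=1e-5))  # LR halved after the entropy drop
```

In practice the entropy would be averaged over sampled tokens per training batch, and reward clipping could be adjusted alongside the learning rate.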

Example: R1-Zero Output on Mathematical Problem

Problem: "What is 47 × 35?"

Base output (no RL):

# Wrong or too short
47 × 35 = 1645

R1-Zero output (after RL):

<thinking> I need to calculate 47 × 35.

Let me use the standard multiplication algorithm.

47 × 35
= 47 × (30 + 5)
= 47 × 30 + 47 × 5
= 1410 + 235
= 1645

Let me verify: 47 × 35
47 × 30 = 1410 ✓
47 × 5 = 235 ✓
1410 + 235 = 1645 ✓
</thinking>

The answer is 1645.
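The model's distributive decomposition can be checked mechanically, mirroring the verification step in the CoT above:

```python
# Verify the decomposition 47 * 35 = 47 * (30 + 5), as in the thinking block.
partial_1 = 47 * 30
partial_2 = 47 * 5
total = partial_1 + partial_2
assert total == 47 * 35  # the decomposition is exact
print(partial_1, partial_2, total)  # 1410 235 1645
```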

Observations:

  • Spontaneous structure: Thinking blocks were never explicitly trained; they emerge on their own
  • Multiple verification: The model calculates and verifies multiple times
  • Step breakdown: Decomposition into smaller steps for traceability

Key Insights

🧠 Emergence is Real

Chain-of-Thought reasoning emerges spontaneously from RL, without any reasoning examples being shown to the model.

💡 RL > SFT for Reasoning

Reinforcement Learning is more efficient for reasoning than Supervised Fine-Tuning with examples.

⚙️ GRPO Efficiency

Group-based normalization is more stable and memory-efficient than Critic networks.

📊 Verifiable Rewards

Rule-based rewards (right/wrong, runs/crashes) are practical and scalable.

🚀 Paradigm Shift

Test-Time Compute (more thinking) can be just as important as model size. New scaling axis.

🎯 Practical Impact

AIME went from 15.6% to 71.0%, an example of scaling test-time compute (more thinking) rather than model size.

Cognitive Behaviors: Why Do Some Models Learn Faster?

Research shows: Models with higher "Exploration Tendency" learn +40% faster under RL. Four identified Cognitive Behaviors are decisive:

🔍 1. Exploration

Trying diverse solution approaches instead of early convergence. Models with higher exploration find better strategies and learn faster under RL.

✅ 2. Verification

Self-checking intermediate steps. Verification correlates with +35% final performance — models that validate intermediate steps make fewer errors.

♻️ 3. Refinement

Iterative improvement based on feedback. Refinement behavior enables faster adaptation to new reward signals during training.

🎯 4. Adaptation

Adapting strategy for different task types. Critical for generalization across domains — prevents over-specialization on specific task types.

Paper: arxiv:2503.01307 (March 2025)
Key Finding: These behaviors are emergent — they arise during training and are not explicitly programmed. Models with strong Exploration + Verification tendency benefit most from RL-based reasoning optimizations.