The R1-Zero Breakthrough Experiment

DeepSeek R1-Zero is remarkable: a base model (no Supervised Fine-Tuning) trained purely through Reinforcement Learning with rule-based rewards. The result: the model spontaneously develops Chain-of-Thought reasoning, self-verification, and reflection.

Key findings:

  • No manual reasoning examples needed: R1-Zero was NOT trained on example CoT outputs
  • Emergence from RL: Only the goal (correct result) and feedback (reward) were necessary
  • Practical rewards: Math: right/wrong. Code: runs/crashes. Format errors: penalty
  • Dramatic improvement: AIME 2024 from 15.6% (base) to 71.0% (after RL)
Fig. 1 | AIME 2024 Performance: Base (green) 15.6%, R1-Zero after RL (purple) 71.0%. A 4.5× jump through pure Reinforcement Learning without Supervised Fine-Tuning.
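The rule-based rewards described above (math: right/wrong, format errors: penalty) can be sketched as a simple scoring function. This is a minimal illustration, not DeepSeek's actual reward code; the boxed-answer convention and the concrete penalty value are assumptions.

```python
import re

def rule_based_reward(output: str, expected: str) -> float:
    """Toy rule-based reward: correctness check plus a format penalty.

    Assumption (illustrative, not DeepSeek's real implementation):
    the final answer must appear inside \\boxed{...}.
    """
    match = re.search(r"\\boxed\{([^}]*)\}", output)
    if match is None:
        return -0.5  # format error: no parsable answer -> penalty
    answer = match.group(1).strip()
    return 1.0 if answer == expected else 0.0

print(rule_based_reward(r"Thus \boxed{1645}", "1645"))   # correct answer
print(rule_based_reward("The answer is 1645", "1645"))   # format penalty
```

Because the reward is computed by a fixed rule rather than a learned reward model, it is cheap to evaluate and cannot be "gamed" in the way learned reward models can.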

GRPO – Group Relative Policy Optimization

The key to DeepSeek R1's training efficiency is the GRPO (Group Relative Policy Optimization) algorithm, a simplification of PPO (Proximal Policy Optimization).

The core problem: PPO requires a separate Value Network (Critic) for stability. This doubles memory requirements and training time.

GRPO solution: Instead of a Critic network, generate a group (e.g., G=8) of outputs per prompt. The advantage is calculated relative to the group (z-score normalization), not absolutely.

GRPO Advantage:

Advantage_i = (reward_i - mean(rewards)) / std(rewards)

Loss = -min(A_i × r_t, A_i × clip(r_t, 1-ε, 1+ε))

where r_t = π_new / π_old is the probability ratio and ε the clipping range.
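The two formulas above can be sketched in a few lines of NumPy. This is a minimal numerical illustration of the group-relative advantage and the clipped surrogate loss, not the full GRPO objective (the paper's version also includes a KL penalty toward a reference policy).

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Group-relative advantage: z-score of each reward within its group."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def clipped_loss(advantages: np.ndarray, ratios: np.ndarray, eps: float = 0.2) -> float:
    """PPO-style clipped surrogate loss, averaged over the group.

    `ratios` are the per-output probability ratios pi_new / pi_old.
    """
    unclipped = advantages * ratios
    clipped = advantages * np.clip(ratios, 1 - eps, 1 + eps)
    return -np.minimum(unclipped, clipped).mean()

# One prompt, a group of G=8 sampled outputs: 3 correct (reward 1), 5 wrong (0).
rewards = np.array([1, 0, 0, 1, 0, 0, 1, 0], dtype=float)
adv = grpo_advantages(rewards)
print(adv.round(2))  # correct outputs get positive advantage, wrong ones negative
print(clipped_loss(adv, ratios=np.ones(8)))  # at ratio 1, loss = -mean(adv) ≈ 0
```

Note that no value network appears anywhere: the group mean plays the role of the baseline that PPO's critic would otherwise estimate.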

Benefits of GRPO:

  • No Value Network needed → 50% memory savings
  • More stable convergence through normalization
  • Simpler implementation
  • Better scaling to large models (671B DeepSeek R1)
Fig. 2 | Architecture comparison: PPO (top) with Policy and Value Networks, GRPO (bottom) with only Policy and group-based normalization. GRPO is more memory-efficient and stable.

Emergence of Reasoning During Training

The most remarkable aspect of the R1-Zero experiment is the emergence of reasoning structures without the model ever being trained on example Chain-of-Thought outputs.

Training phases:

  • Phase 0 (Early RL): The model generates random outputs, sometimes correct, often wrong
  • Phase 1 (~5B Steps): Spontaneously, the model starts generating "Thinking" blocks with hypotheses and checks
  • Phase 2 (~15B Steps): CoT format becomes consistent. Self-verification emerges ("Wait, that's wrong, let me recalculate")
  • Phase 3 (Final): Reflective reasoning, multiple verifications, robust error handling
Fig. 3 | Emergence timeline: Training progress from left (chaotic) to right (structured CoT). Output length increases, structure becomes consistent, accuracy jumps after Phase 1.

Entropy Collapse Problem

During RL training, a critical problem can occur: Entropy Collapse. The model's output distribution becomes too narrow (low entropy), leading to repetitive answers and poor generalization.

⚠️ Problem: Distribution becomes too narrow

Symptom: The model converges on a few repetitive answer patterns.
Result: Performance stagnates despite continued training.
Cause: The reward signal too strongly incentivizes a few "safe" outputs.

Solution: Entropy Monitoring & Schedule Adjustment

Skywork-OR1 Paper (arxiv:2505.22312): Monitor entropy during training.
On entropy drop: Adjust RL schedule (Learning Rate, Reward Clipping).
Result: DeepSeek-R1-Distill-32B gains +15.0 percentage points (57.8% → 72.8%)
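The monitoring idea can be sketched as follows. The entropy computation is standard; the threshold and learning-rate decay are illustrative assumptions, not the schedule from the Skywork-OR1 paper.

```python
import math

def token_entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def adjust_schedule(mean_entropy: float, lr: float,
                    threshold: float = 0.5, decay: float = 0.5) -> float:
    """Illustrative schedule tweak: if entropy collapses below the
    threshold, cut the learning rate to slow the collapse.
    Threshold and decay values are assumptions, not from the paper."""
    return lr * decay if mean_entropy < threshold else lr

# Healthy vs. collapsed distribution over 4 candidate tokens:
healthy = token_entropy([0.25, 0.25, 0.25, 0.25])    # maximum: ln(4) ≈ 1.386
collapsed = token_entropy([0.97, 0.01, 0.01, 0.01])  # close to 0
print(round(healthy, 3), round(collapsed, 3))
print(adjust_schedule(collapsed, lr=1e-5))  # LR halved after the entropy drop
```

In practice the entropy would be averaged over sampled tokens per training batch, and reward clipping could be adjusted alongside the learning rate.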

Example: R1-Zero Output on Mathematical Problem

Problem: "What is 47 × 35?"

Base output (no RL):

# Wrong or too short
47 × 35 = 1645

R1-Zero output (after RL):

<thinking> I need to calculate 47 × 35.

Let me use the standard multiplication algorithm.

47 × 35
= 47 × (30 + 5)
= 47 × 30 + 47 × 5
= 1410 + 235
= 1645

Let me verify: 47 × 35
47 × 30 = 1410 ✓
47 × 5 = 235 ✓
1410 + 235 = 1645 ✓
</thinking>

The answer is 1645.
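The model's distributive decomposition can be checked mechanically, mirroring the verification step in the CoT above:

```python
# Verify the decomposition 47 * 35 = 47 * (30 + 5), as in the thinking block.
partial_1 = 47 * 30
partial_2 = 47 * 5
total = partial_1 + partial_2
assert total == 47 * 35  # the decomposition is exact
print(partial_1, partial_2, total)  # 1410 235 1645
```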

Observations:

  • Spontaneous structure: Thinking blocks were never explicitly trained; they emerge on their own
  • Multiple verification: The model calculates and verifies multiple times
  • Step breakdown: Decomposition into smaller steps for traceability

Key Insights

🧠 Emergence is Real

Chain-of-Thought reasoning emerges spontaneously from RL, without any reasoning examples being shown to the model.

💡 RL > SFT for Reasoning

Reinforcement Learning is more efficient for reasoning than Supervised Fine-Tuning with examples.

⚙️ GRPO Efficiency

Group-based normalization is more stable and memory-efficient than Critic networks.

📊 Verifiable Rewards

Rule-based rewards (right/wrong, runs/crashes) are practical and scalable.

🚀 Paradigm Shift

Test-Time Compute (more thinking) can be just as important as model size. New scaling axis.

🎯 Practical Impact

AIME went from 15.6% to 71.0%, an example of scaling test-time compute (more thinking) rather than model size.

Cognitive Behaviors: Why Do Some Models Learn Faster?

Research shows: Models with higher "Exploration Tendency" learn +40% faster under RL. Four identified Cognitive Behaviors are decisive:

🔍 1. Exploration

Trying diverse solution approaches instead of early convergence. Models with higher exploration find better strategies and learn faster under RL.

✅ 2. Verification

Self-checking intermediate steps. Verification correlates with +35% final performance — models that validate intermediate steps make fewer errors.

♻️ 3. Refinement

Iterative improvement based on feedback. Refinement behavior enables faster adaptation to new reward signals during training.

🎯 4. Adaptation

Adapting strategy for different task types. Critical for generalization across domains — prevents over-specialization on specific task types.

Paper: arxiv:2503.01307 (March 2025)
Key Finding: These behaviors are emergent — they arise during training and are not explicitly programmed. Models with strong Exploration + Verification tendency benefit most from RL-based reasoning optimizations.