The RLHF Pipeline: 3 Phases

RLHF is a three-stage process that gradually transforms a pre-trained language model into a helpful, harmless, and honest system.

1. Supervised Fine-Tuning (SFT)

Input: Pre-trained model
Data: ~100K high-quality demonstration examples
Duration: ~2-4 weeks

The model learns to follow instructions through examples of good responses. This forms the foundation for later RL.
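To make this concrete, here is a minimal sketch of what the SFT step can look like in code: a toy single-example update using Hugging Face Transformers. The "gpt2" base model and the inline demonstration text are placeholders; a real pipeline would mask prompt tokens in the labels and iterate over ~100K demonstrations.

```python
# Minimal SFT sketch: fine-tune a causal LM on a demonstration with
# cross-entropy loss. For brevity, the full sequence is supervised here;
# in practice the prompt tokens in `labels` are masked with -100.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")   # placeholder base model
tokenizer = AutoTokenizer.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# One toy demonstration (prompt + desired response) instead of a real dataset:
text = "Instruction: Say hello politely.\nResponse: Hello! How can I help you today?"
batch = tokenizer(text, return_tensors="pt")
labels = batch["input_ids"].clone()

model.train()
outputs = model(**batch, labels=labels)   # cross-entropy on demonstration tokens
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```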

2. Reward Model (RM)

Input: SFT Model
Data: ~50K-100K human preference pairs
Duration: ~2-3 weeks

A separate model is trained to predict response quality. It gives scores for (prompt, response) pairs.

3. PPO Optimization

Input: SFT Model + RM
Data: Generated responses + RM scores
Duration: ~2-4 weeks

The model is optimized with RL to achieve higher RM scores while staying true to the original version.

💡 Key Insight: Recent research (DeepSeek R1) shows that SFT is not strictly necessary – reasoning can emerge directly from RL. DeepSeek trained the base model directly with RL and achieved 71% on AIME (from 15.6% without RL).

Reward Model: Learning Preferences

The Reward Model is the heart of RLHF. It is a trained neural network that learns to predict human preferences.

Fig. 1 | How the Reward Model evaluates responses. ORM: One score at the end. PRM: Scores for each step.

Outcome Reward (ORM)

  • ✅ Easier to train
  • ✅ One reward signal per response
  • ❌ Weak credit assignment for long reasoning chains
  • ❌ Can reward wrong paths if final result is correct

Process Reward (PRM)

  • ✅ Better performance for mathematical reasoning
  • ✅ Scores for EVERY step
  • ✅ Strong credit assignment
  • ❌ Hard to scale (every step must be annotated)
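A toy example (with made-up rewards) of the credit-assignment difference: the ORM only sees that the final answer is wrong, while the PRM pinpoints the step where the chain went off track.

```python
# Illustrative only: dummy rewards showing ORM vs. PRM credit assignment.
steps = [
    "Step 1: 12 * 4 = 48",                 # correct
    "Step 2: 48 + 7 = 54",                 # arithmetic error (should be 55)
    "Step 3: therefore the answer is 54",  # propagates the error
]

orm_reward = 0.0               # ORM: one score for the whole response (final answer wrong)
prm_rewards = [1.0, 0.0, 0.0]  # PRM: one score per step, localizing the mistake

print("ORM:", orm_reward)
print("PRM:", list(zip(steps, prm_rewards)))
```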
Reward Model Input/Output:
Input: (prompt, response)
Output: r(prompt, response) ∈ ℝ (scalar reward)

Trained with: Comparison data (Response A > Response B for Prompt X)
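The standard way to train on such comparison data is a pairwise (Bradley-Terry) loss that pushes the score of the preferred response above the score of the rejected one. A minimal sketch, assuming the reward model has already produced scalar scores for both responses of each pair:

```python
# Pairwise (Bradley-Terry) reward-model loss: a minimal sketch.
# Assumes scalar scores for the chosen (preferred) and rejected responses.
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage with made-up scores for two comparison pairs:
r_chosen = torch.tensor([1.2, 0.3])
r_rejected = torch.tensor([0.1, 0.5])
print(reward_model_loss(r_chosen, r_rejected))  # smaller when chosen > rejected
```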

PPO: Policy Optimization with KL-Constraint

PPO (Proximal Policy Optimization) is the core algorithm of RLHF. It optimizes the model based on RM scores while keeping it close to the original version.

PPO Objective (maximized):
J_PPO(θ) = E[r(x, y) - β · KL(π_θ || π_ref)]

Where:
• r(x, y) = Reward from the Reward Model
• β = KL penalty strength (controls trade-off)
• π_θ = Current policy (model)
• π_ref = Reference policy (original model)

β too small: Model maximizes rewards but loses knowledge
β too large: Too little reward signal, minimal behavior change

Fig. 2 | PPO trade-off: Reward scores vs. KL divergence. The β parameter controls the compromise between both.
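A minimal sketch of how this objective shows up in practice: the reward fed to the RL update is the RM score minus the estimated KL penalty. The tensors below are placeholders for per-token log-probs of one sampled response under the current policy and the frozen reference model.

```python
# KL-shaped reward: a minimal sketch (not a full PPO implementation).
import torch

def shaped_reward(rm_score: torch.Tensor,
                  logprobs_policy: torch.Tensor,
                  logprobs_ref: torch.Tensor,
                  beta: float = 0.1) -> torch.Tensor:
    # Monte-Carlo estimate of KL(pi_theta || pi_ref) for this sample:
    # sum over tokens of (log pi_theta - log pi_ref)
    kl = (logprobs_policy - logprobs_ref).sum()
    return rm_score - beta * kl   # r(x, y) - beta * KL

# Toy usage with made-up numbers:
rm_score = torch.tensor(2.3)
lp_policy = torch.tensor([-1.2, -0.8, -2.0])
lp_ref = torch.tensor([-1.3, -0.9, -1.5])
print(shaped_reward(rm_score, lp_policy, lp_ref))
```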

Why KL-Constraint?

The KL term prevents the model from drifting too far from the original: without it, the policy can exploit weaknesses in the Reward Model (reward hacking), produce degenerate text, and lose the general abilities it acquired during pre-training.

Policy Gradients & Advantage Estimation

The mathematical foundation of RLHF is based on Policy Gradients – a technique for optimizing models with RL signals.

Policy Gradient Theorem:
∇_θ J(θ) ∝ E[∇_θ log π_θ(y|x) · A(x, y)]

A(x, y) = Advantage: How much better is this action than average?
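A minimal REINFORCE-style sketch of this update for one sampled response, assuming we already have its summed log-probability under the current policy and an advantage estimate (both placeholder values here):

```python
# Policy-gradient update: a minimal REINFORCE-with-advantage sketch.
import torch

def policy_gradient_loss(logprob_sum: torch.Tensor,
                         advantage: torch.Tensor) -> torch.Tensor:
    # Gradient of -logprob * A matches the policy-gradient theorem:
    # grad_theta log pi_theta(y|x) * A(x, y)
    return -(logprob_sum * advantage.detach())

# Toy usage: summed log pi_theta(y|x) of one response and its advantage
logprob_sum = torch.tensor(-12.4, requires_grad=True)
advantage = torch.tensor(0.8)
loss = policy_gradient_loss(logprob_sum, advantage)
loss.backward()
print(logprob_sum.grad)  # equals -advantage
```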

Alternative: GRPO (Group Relative Policy Optimization)

DeepSeek R1 uses GRPO instead of classical PPO – more efficient and stable:

GRPO Advantage:
A_i = (reward_i - mean(rewards)) / std(rewards)

Advantages:
• Relative advantage within a group
• No separate Value network needed
• More stable training
• Samples G outputs per prompt and compares them against each other

🔍 What's the difference? PPO needs a separate critic network for value estimation. GRPO computes advantages only relative to the other samples generated for the same prompt – simpler and more efficient!
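A minimal sketch of the group-relative advantage computation, assuming `rewards` holds the scalar rewards of the G responses sampled for one prompt:

```python
# Group-relative advantage (GRPO-style): a minimal sketch.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # A_i = (r_i - mean(r)) / std(r), computed within one prompt's group
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Toy usage: G = 4 responses for the same prompt
rewards = torch.tensor([1.0, 0.0, 0.5, 1.0])
print(group_relative_advantages(rewards))
```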

Real-World Impact: o1, o3, DeepSeek R1

RLHF has led to revolutionary breakthroughs in modern LLMs:

| Model | Release | RLHF Technique | AIME 2024/2025 | SWE-Bench | Special Feature |
|---|---|---|---|---|---|
| GPT-4 | March 2023 | Standard RLHF | 80.7% | – | Baseline before RL reasoning |
| o1 | Sept 2024 | RL for internal reasoning | 83.3% | 51.7% | First reasoning model |
| o3 | April 2025 | Improved RL | 88.9% | 69.1% | Massive improvements |
| DeepSeek-R1 | Jan 2025 | Pure RL (GRPO, no SFT) | 71.0% | – | Reasoning from pure RL! |

DeepSeek R1 Breakthrough: RL instead of SFT

The revolutionary experiment: DeepSeek trained a base model directly with GRPO RL without SFT:

• Before (without RL): 15.6% AIME 2024 performance
• After (with pure RL): 71.0% AIME 2024 performance

The model learned to reason through RL rewards, not through SFT examples! This was a fundamental insight: Reasoning abilities can emerge directly from RL signals when the reward function is properly designed (mathematical correctness, code execution, etc.).
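As an illustration of such a verifiable reward (not DeepSeek's actual implementation), a rule-based math reward can simply compare the extracted final answer against the known ground truth:

```python
# Rule-based (verifiable) reward for math problems: an illustrative sketch.
import re

def math_reward(response: str, ground_truth: str) -> float:
    # Assumes the model is prompted to end its output with "Answer: <value>"
    match = re.search(r"Answer:\s*(.+)", response)
    if match is None:
        return 0.0                        # no parsable final answer
    answer = match.group(1).strip()
    return 1.0 if answer == ground_truth else 0.0

print(math_reward("... so 48 + 7 = 55. Answer: 55", "55"))  # 1.0 (correct)
print(math_reward("... Answer: 54", "55"))                   # 0.0 (wrong)
```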

Alternatives & Variations

Direct Preference Optimization (DPO)

DPO is a more modern alternative to RLHF that eliminates the separate Reward Model:

DPO Loss:
L_DPO = -E[log σ(β log(π_θ(y_w|x)/π_ref(y_w|x)) - β log(π_θ(y_l|x)/π_ref(y_l|x)))]

• y_w = Preferred response
• y_l = Non-preferred response
• σ = Sigmoid function
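A minimal sketch of this loss in code, assuming the summed log-probabilities of y_w and y_l under the current policy and the frozen reference model are already available:

```python
# DPO loss: a minimal sketch over a batch of preference pairs.
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta: float = 0.1):
    # beta * [log-ratio of preferred response - log-ratio of non-preferred]
    logits = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(logits).mean()

# Toy usage with made-up log-probs for a batch of 2 pairs:
logp_w = torch.tensor([-10.0, -12.0])
logp_l = torch.tensor([-11.0, -11.5])
ref_logp_w = torch.tensor([-10.5, -12.5])
ref_logp_l = torch.tensor([-10.8, -11.0])
print(dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l))
```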

RLHF Pipeline

  • ✅ Proven method
  • ✅ Good performance
  • ❌ 3 separate training stages
  • ❌ Separate RM requires resources

DPO Alternative

  • ✅ Direct training without RM
  • ✅ Simpler and faster
  • ✅ Less memory
  • ✅ Comparable or better performance

Constitutional AI (Anthropic)

Anthropic's approach: AI-generated feedback instead of just human labels.

Process:
1. The model critiques and revises its own responses according to a written set of principles (the "constitution").
2. An AI model, guided by the same principles, labels which of two responses is better; these AI preference labels (RLAIF) replace most human comparisons.
3. Training then proceeds as in standard RLHF, using the AI-generated preference data.

Advantage: Significantly reduces human annotation costs while maintaining high quality!
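A schematic sketch of the AI-feedback step; the `generate` callable, the constitution text, and the prompt wording are illustrative placeholders, not Anthropic's actual implementation:

```python
# RLAIF-style preference labeling: a schematic sketch.
# `generate(prompt) -> str` stands in for any instruction-following LLM call.

CONSTITUTION = "Choose the response that is more helpful, honest, and harmless."

def ai_preference_label(generate, prompt: str, response_a: str, response_b: str) -> str:
    judge_prompt = (
        f"{CONSTITUTION}\n\n"
        f"Prompt: {prompt}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        "Which response is better? Answer with 'A' or 'B'."
    )
    verdict = generate(judge_prompt).strip().upper()
    return "A" if verdict.startswith("A") else "B"

# Toy usage with a stub judge that always answers "A"; the resulting
# (prompt, chosen, rejected) pairs are then used like human preference data.
print(ai_preference_label(lambda p: "A", "How do I boil an egg?", "resp1", "resp2"))
```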

Key Insights

1. RLHF is Alignment, not Capability

The classical view: RLHF makes the model safer and more helpful, but doesn't create new abilities – it redirects existing capabilities.

2. RL Can Create New Capabilities (!)

DeepSeek R1 disproves the classical view: reasoning abilities emerged directly from RL without SFT. The model learned to think through rewards.

3. Reward Design is Critical

The form of the reward function determines what the model learns. Poorly designed rewards lead to undesired behavior.

4. KL-Constraint is Essential

The KL term prevents the RL process from destroying the model. It's the safety net of the entire approach.