The RLHF Pipeline: 3 Phases

RLHF is a three-stage process that gradually transforms a pre-trained language model into a helpful, harmless, and honest system.

1. Supervised Fine-Tuning (SFT)

Input: Pre-trained model
Data: ~100K high-quality demonstration examples
Duration: ~2-4 weeks

The model learns to follow instructions through examples of good responses. This forms the foundation for later RL.
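To make this concrete, here is a minimal sketch of what the SFT step can look like in code: a toy single-example update using Hugging Face Transformers. The "gpt2" base model and the inline demonstration text are placeholders; a real pipeline would mask prompt tokens in the labels and iterate over ~100K demonstrations.

```python
# Minimal SFT sketch: fine-tune a causal LM on a demonstration with
# cross-entropy loss. For brevity, the full sequence is supervised here;
# in practice the prompt tokens in `labels` are masked with -100.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")   # placeholder base model
tokenizer = AutoTokenizer.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# One toy demonstration (prompt + desired response) instead of a real dataset:
text = "Instruction: Say hello politely.\nResponse: Hello! How can I help you today?"
batch = tokenizer(text, return_tensors="pt")
labels = batch["input_ids"].clone()

model.train()
outputs = model(**batch, labels=labels)   # cross-entropy on demonstration tokens
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```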

2. Reward Model (RM)

Input: SFT Model
Data: ~50K-100K human preference pairs
Duration: ~2-3 weeks

A separate model is trained to predict response quality. It gives scores for (prompt, response) pairs.

3. PPO Optimization

Input: SFT Model + RM
Data: Generated responses + RM scores
Duration: ~2-4 weeks

The model is optimized with RL to achieve higher RM scores while staying true to the original version.

💡 Key Insight: Recent research (DeepSeek R1) shows that SFT is not strictly necessary – reasoning can emerge directly from RL. DeepSeek trained the base model directly with RL and achieved 71% on AIME (from 15.6% without RL).

Reward Model: Learning Preferences

The Reward Model is the heart of RLHF. It is a trained neural network that learns to predict human preferences.

Fig. 1 | How the Reward Model evaluates responses. ORM: One score at the end. PRM: Scores for each step.

Outcome Reward (ORM)

  • ✅ Easier to train
  • ✅ One reward signal per response
  • ❌ Weak credit assignment for long reasoning chains
  • ❌ Can reward wrong paths if final result is correct

Process Reward (PRM)

  • ✅ Better performance for mathematical reasoning
  • ✅ Scores for EVERY step
  • ✅ Strong credit assignment
  • ❌ Hard to scale (every step must be annotated)
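A toy example (with made-up rewards) of the credit-assignment difference: the ORM only sees that the final answer is wrong, while the PRM pinpoints the step where the chain went off track.

```python
# Illustrative only: dummy rewards showing ORM vs. PRM credit assignment.
steps = [
    "Step 1: 12 * 4 = 48",                 # correct
    "Step 2: 48 + 7 = 54",                 # arithmetic error (should be 55)
    "Step 3: therefore the answer is 54",  # propagates the error
]

orm_reward = 0.0               # ORM: one score for the whole response (final answer wrong)
prm_rewards = [1.0, 0.0, 0.0]  # PRM: one score per step, localizing the mistake

print("ORM:", orm_reward)
print("PRM:", list(zip(steps, prm_rewards)))
```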
Reward Model Input/Output:
Input: (prompt, response)
Output: r(prompt, response) ∈ ℝ (scalar reward)

Trained with: Comparison data (Response A > Response B for Prompt X)
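The standard way to train on such comparison data is a pairwise (Bradley-Terry) loss that pushes the score of the preferred response above the score of the rejected one. A minimal sketch, assuming the reward model has already produced scalar scores for both responses of each pair:

```python
# Pairwise (Bradley-Terry) reward-model loss: a minimal sketch.
# Assumes scalar scores for the chosen (preferred) and rejected responses.
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage with made-up scores for two comparison pairs:
r_chosen = torch.tensor([1.2, 0.3])
r_rejected = torch.tensor([0.1, 0.5])
print(reward_model_loss(r_chosen, r_rejected))  # smaller when chosen > rejected
```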

PPO: Policy Optimization with KL-Constraint

PPO (Proximal Policy Optimization) is the core algorithm of RLHF. It optimizes the model based on RM scores while keeping it close to the original version.

PPO Objective (maximized):
J_PPO(θ) = E[r(x, y) - β · KL(π_θ || π_ref)]

Where:
• r(x, y) = Reward from the Reward Model
• β = KL penalty strength (controls trade-off)
• π_θ = Current policy (model)
• π_ref = Reference policy (original model)

β too small: Model maximizes rewards but loses knowledge
β too large: Too little reward signal, minimal behavior change

Fig. 2 | PPO trade-off: Reward scores vs. KL divergence. The β parameter controls the compromise between both.
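A minimal sketch of how this objective shows up in practice: the reward fed to the RL update is the RM score minus the estimated KL penalty. The tensors below are placeholders for per-token log-probs of one sampled response under the current policy and the frozen reference model.

```python
# KL-shaped reward: a minimal sketch (not a full PPO implementation).
import torch

def shaped_reward(rm_score: torch.Tensor,
                  logprobs_policy: torch.Tensor,
                  logprobs_ref: torch.Tensor,
                  beta: float = 0.1) -> torch.Tensor:
    # Monte-Carlo estimate of KL(pi_theta || pi_ref) for this sample:
    # sum over tokens of (log pi_theta - log pi_ref)
    kl = (logprobs_policy - logprobs_ref).sum()
    return rm_score - beta * kl   # r(x, y) - beta * KL

# Toy usage with made-up numbers:
rm_score = torch.tensor(2.3)
lp_policy = torch.tensor([-1.2, -0.8, -2.0])
lp_ref = torch.tensor([-1.3, -0.9, -1.5])
print(shaped_reward(rm_score, lp_policy, lp_ref))
```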

Why KL-Constraint?

The KL term prevents the model from drifting too far from the original: without it, the policy can exploit weaknesses in the Reward Model (reward hacking), produce degenerate text, and lose the general abilities it acquired during pre-training.

Policy Gradients & Advantage Estimation

The mathematical foundation of RLHF is based on Policy Gradients – a technique for optimizing models with RL signals.

Policy Gradient Theorem:
∇_θ J(θ) ∝ E[∇_θ log π_θ(y|x) · A(x, y)]

A(x, y) = Advantage: How much better is this action than average?
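A minimal REINFORCE-style sketch of this update for one sampled response, assuming we already have its summed log-probability under the current policy and an advantage estimate (both placeholder values here):

```python
# Policy-gradient update: a minimal REINFORCE-with-advantage sketch.
import torch

def policy_gradient_loss(logprob_sum: torch.Tensor,
                         advantage: torch.Tensor) -> torch.Tensor:
    # Gradient of -logprob * A matches the policy-gradient theorem:
    # grad_theta log pi_theta(y|x) * A(x, y)
    return -(logprob_sum * advantage.detach())

# Toy usage: summed log pi_theta(y|x) of one response and its advantage
logprob_sum = torch.tensor(-12.4, requires_grad=True)
advantage = torch.tensor(0.8)
loss = policy_gradient_loss(logprob_sum, advantage)
loss.backward()
print(logprob_sum.grad)  # equals -advantage
```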

Alternative: GRPO (Group Relative Policy Optimization)

DeepSeek R1 uses GRPO instead of classical PPO – more efficient and stable:

GRPO Advantage:
A_i = (reward_i - mean(rewards)) / std(rewards)

Advantages:
• Relative advantage within a group
• No separate Value network needed
• More stable training
• Samples G outputs per prompt and compares them against each other

🔍 What's the difference? PPO needs a separate critic network for value estimation. GRPO computes advantages only relative to the other samples generated for the same prompt – simpler and more efficient!
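A minimal sketch of the group-relative advantage computation, assuming `rewards` holds the scalar rewards of the G responses sampled for one prompt:

```python
# Group-relative advantage (GRPO-style): a minimal sketch.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # A_i = (r_i - mean(r)) / std(r), computed within one prompt's group
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Toy usage: G = 4 responses for the same prompt
rewards = torch.tensor([1.0, 0.0, 0.5, 1.0])
print(group_relative_advantages(rewards))
```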

Real-World Impact: o1, o3, DeepSeek R1

RLHF has led to revolutionary breakthroughs in modern LLMs:

| Model | Release | RLHF Technique | AIME 2024/2025 | SWE-Bench | Special Feature |
|---|---|---|---|---|---|
| GPT-4 | March 2023 | Standard RLHF | 80.7% | – | Baseline before RL reasoning |
| o1 | Sept 2024 | RL for internal reasoning | 83.3% | 51.7% | First reasoning model |
| o3 | April 2025 | Improved RL | 88.9% | 69.1% | Massive improvements |
| DeepSeek-R1 | Jan 2025 | Pure RL (GRPO, no SFT) | 71.0% | – | Reasoning from pure RL! |

DeepSeek R1 Breakthrough: RL instead of SFT

The revolutionary experiment: DeepSeek trained a base model directly with GRPO RL without SFT:

• Before (without RL): 15.6% AIME 2024 performance
• After (with pure RL): 71.0% AIME 2024 performance

The model learned to reason through RL rewards, not through SFT examples! This was a fundamental insight: Reasoning abilities can emerge directly from RL signals when the reward function is properly designed (mathematical correctness, code execution, etc.).
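As an illustration of such a verifiable reward (not DeepSeek's actual implementation), a rule-based math reward can simply compare the extracted final answer against the known ground truth:

```python
# Rule-based (verifiable) reward for math problems: an illustrative sketch.
import re

def math_reward(response: str, ground_truth: str) -> float:
    # Assumes the model is prompted to end its output with "Answer: <value>"
    match = re.search(r"Answer:\s*(.+)", response)
    if match is None:
        return 0.0                        # no parsable final answer
    answer = match.group(1).strip()
    return 1.0 if answer == ground_truth else 0.0

print(math_reward("... so 48 + 7 = 55. Answer: 55", "55"))  # 1.0 (correct)
print(math_reward("... Answer: 54", "55"))                   # 0.0 (wrong)
```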

Alternatives & Variations

Direct Preference Optimization (DPO)

DPO is a more modern alternative to RLHF that eliminates the separate Reward Model:

DPO Loss:
L_DPO = -E[log σ(β log(π_θ(y_w|x)/π_ref(y_w|x)) - β log(π_θ(y_l|x)/π_ref(y_l|x)))]

• y_w = Preferred response
• y_l = Non-preferred response
• σ = Sigmoid function
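A minimal sketch of this loss in code, assuming the summed log-probabilities of y_w and y_l under the current policy and the frozen reference model are already available:

```python
# DPO loss: a minimal sketch over a batch of preference pairs.
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta: float = 0.1):
    # beta * [log-ratio of preferred response - log-ratio of non-preferred]
    logits = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(logits).mean()

# Toy usage with made-up log-probs for a batch of 2 pairs:
logp_w = torch.tensor([-10.0, -12.0])
logp_l = torch.tensor([-11.0, -11.5])
ref_logp_w = torch.tensor([-10.5, -12.5])
ref_logp_l = torch.tensor([-10.8, -11.0])
print(dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l))
```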

RLHF Pipeline

  • ✅ Proven method
  • ✅ Good performance
  • ❌ 3 separate training stages
  • ❌ Separate RM requires resources

DPO Alternative

  • ✅ Direct training without RM
  • ✅ Simpler and faster
  • ✅ Less memory
  • ✅ Comparable or better performance

Constitutional AI (Anthropic)

Anthropic's approach: AI-generated feedback instead of just human labels.

Process:
1. The model critiques and revises its own responses according to a written set of principles (the "constitution").
2. An AI model, guided by the same principles, labels which of two responses is better; these AI preference labels (RLAIF) replace most human comparisons.
3. Training then proceeds as in standard RLHF, using the AI-generated preference data.

Advantage: Significantly reduces human annotation costs while maintaining high quality!
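A schematic sketch of the AI-feedback step; the `generate` callable, the constitution text, and the prompt wording are illustrative placeholders, not Anthropic's actual implementation:

```python
# RLAIF-style preference labeling: a schematic sketch.
# `generate(prompt) -> str` stands in for any instruction-following LLM call.

CONSTITUTION = "Choose the response that is more helpful, honest, and harmless."

def ai_preference_label(generate, prompt: str, response_a: str, response_b: str) -> str:
    judge_prompt = (
        f"{CONSTITUTION}\n\n"
        f"Prompt: {prompt}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        "Which response is better? Answer with 'A' or 'B'."
    )
    verdict = generate(judge_prompt).strip().upper()
    return "A" if verdict.startswith("A") else "B"

# Toy usage with a stub judge that always answers "A"; the resulting
# (prompt, chosen, rejected) pairs are then used like human preference data.
print(ai_preference_label(lambda p: "A", "How do I boil an egg?", "resp1", "resp2"))
```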

Key Insights

1. RLHF is Alignment, not Capability

The classical view: RLHF makes the model safer and more helpful, but doesn't create new abilities – it redirects existing capabilities.

2. RL Can Create New Capabilities (!)

DeepSeek R1 disproves the classical view: reasoning abilities emerged directly from RL without SFT. The model learned to think through rewards.

3. Reward Design is Critical

The form of the reward function determines what the model learns. Poorly designed rewards lead to undesired behavior.

4. KL-Constraint is Essential

The KL term prevents the RL process from destroying the model. It's the safety net of the entire approach.