Architecture Comparison

RLHF (Traditional)

Policy π
Value Network V
PPO Training
⚙️ 3 Networks: Policy, Reward, Value
📊 3 Training Phases: SFT, RM, PPO
🎯 Explicit Reward Model: Separate architecture

DPO (Modern)

Reference π_ref
Policy π_θ
Direct Preference Optimization
⚙️ 1 Trained Network: Policy (plus a frozen reference)
📊 1 Training Phase: Direct on preferences
🎯 Implicit Reward: In loss function
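
The "implicit reward" above can be made concrete with a minimal, framework-free sketch of the DPO loss. The function name and scalar interface are illustrative; real implementations operate on batched, per-token log-probabilities:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss (Rafailov et al., 2023), scalar sketch.

    Inputs are the summed log-probabilities of the chosen and rejected
    responses under the trainable policy and the frozen reference model.
    beta controls the strength of the implicit KL penalty.
    """
    # Implicit reward: r(x, y) = beta * log(pi_theta(y|x) / pi_ref(y|x))
    r_chosen = beta * (policy_chosen_logp - ref_chosen_logp)
    r_rejected = beta * (policy_rejected_logp - ref_rejected_logp)
    # Negative log-sigmoid of the reward margin (Bradley-Terry model)
    margin = r_chosen - r_rejected
    return math.log(1.0 + math.exp(-margin))
```

Note that no reward model appears anywhere: the loss only compares how much more likely the policy makes the chosen response than the rejected one, relative to the reference.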

Detailed Comparison

| Aspect | RLHF | DPO |
|---|---|---|
| Number of networks | 3 (Policy, Reward, Value) | 1 trained (Policy, plus frozen Reference) |
| Training phases | 3 (SFT → RM → PPO) | 2 (SFT → DPO) |
| Complexity | High (complex pipeline) | Low (simple) |
| Stability | Medium (KL-divergence tuning needed) | High (stable by default) |
| Hyperparameters | Many (KL coefficient, clip range, learning rate, etc.) | Few (β, learning rate) |
| Compute efficiency | Moderate (generates samples online) | Efficient (typically 3–5× faster) |
| Reward model quality | Separately trained; error-prone | Implicit in the loss; more robust |
| Reward hacking | Possible (RM exploitable) | Less susceptible |
| Alignment quality | Very good (proven) | Very good (direct) |
| Empirical performance | SOTA (ChatGPT, Claude, Llama 2-Chat) | Strong (Zephyr, Llama 3) |
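
As a rough illustration of the compute-efficiency row, consider only the weight memory for the model copies each method must hold during training. The numbers are hypothetical: a 7B-parameter model in bf16 at ~2 bytes per parameter is about 14 GB of weights per copy.

```python
WEIGHTS_GB = 14  # hypothetical: ~7B parameters in bf16 (~2 bytes/param)

# Model copies each method keeps in memory during training
rlhf_models = ["policy", "reward_model", "value_network"]
dpo_models = ["policy", "frozen_reference"]

rlhf_gb = len(rlhf_models) * WEIGHTS_GB  # 42 GB of weights alone
dpo_gb = len(dpo_models) * WEIGHTS_GB    # 28 GB
print(f"RLHF: ~{rlhf_gb} GB of weights, DPO: ~{dpo_gb} GB")
```

Optimizer states, gradients, and activations add substantially more in practice; the point is only the relative footprint, and that DPO's reference model is frozen and needs no optimizer state at all.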

Key Insights

1. RLHF is tried and tested: developed for ChatGPT and refined for Claude and GPT-4. The method is proven but complex: it requires a separate reward model that must be trained and monitored.
2. DPO is the more modern paradigm: Rafailov et al. (2023) show that preference optimization can be done directly, without a reward model. The reward function is implicitly encoded in the loss function.
3. Practical comparison: RLHF must keep several model copies (policy, reward model, value network) in memory and run multiple training loops; DPO needs only the policy plus a frozen reference and a single loop. For comparable results, DPO is typically 3–5× faster.
4. The reward-hacking problem: in RLHF, the policy can learn to "hack" the reward model, generating text that scores highly but is not judged as good by human evaluators. DPO reduces this risk by optimizing on preferences directly.
5. Hybrid approaches: modern research combines both, e.g. DPO for fast initial alignment followed by iterative RLHF for refinement, or Constitutional AI with automatic critique prompts in place of an explicit reward model.
6. Future trend: DPO and its variants (IPO, KTO) are becoming the new standard, since they are simpler, faster, and less error-prone. Large labs (Anthropic, OpenAI) continue to use RLHF for maximum quality, while smaller teams choose DPO.