Architecture Comparison

RLHF (Traditional)

Policy π
Value Network V
PPO Training
⚙️ 3 Networks: Policy, Reward, Value
📊 3 Training Phases: SFT, RM, PPO
🎯 Explicit Reward Model: Separate architecture

DPO (Modern)

Reference π_ref
Policy π_θ
Direct Preference Optimization
⚙️ 1 Trained Network: Policy (plus a frozen reference)
📊 1 Training Phase: Direct on preferences
🎯 Implicit Reward: In loss function
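
The "implicit reward" above can be made concrete with a minimal, framework-free sketch of the DPO loss. The function name and scalar interface are illustrative; real implementations operate on batched, per-token log-probabilities:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss (Rafailov et al., 2023), scalar sketch.

    Inputs are the summed log-probabilities of the chosen and rejected
    responses under the trainable policy and the frozen reference model.
    beta controls the strength of the implicit KL penalty.
    """
    # Implicit reward: r(x, y) = beta * log(pi_theta(y|x) / pi_ref(y|x))
    r_chosen = beta * (policy_chosen_logp - ref_chosen_logp)
    r_rejected = beta * (policy_rejected_logp - ref_rejected_logp)
    # Negative log-sigmoid of the reward margin (Bradley-Terry model)
    margin = r_chosen - r_rejected
    return math.log(1.0 + math.exp(-margin))
```

Note that no reward model appears anywhere: the loss only compares how much more likely the policy makes the chosen response than the rejected one, relative to the reference.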

Detailed Comparison

| Aspect | RLHF | DPO |
|---|---|---|
| Number of networks | 3 (Policy, Reward, Value) | 1 trained (Policy, plus frozen Reference) |
| Training phases | 3 (SFT → RM → PPO) | 2 (SFT → DPO) |
| Complexity | High (complex pipeline) | Low (simple) |
| Stability | Medium (KL-divergence tuning needed) | High (stable by default) |
| Hyperparameters | Many (KL coefficient, clip range, learning rate, etc.) | Few (β, learning rate) |
| Compute efficiency | Moderate (generates samples online) | Efficient (typically 3–5× faster) |
| Reward model quality | Separately trained; error-prone | Implicit in the loss; more robust |
| Reward hacking | Possible (RM exploitable) | Less susceptible |
| Alignment quality | Very good (proven) | Very good (direct) |
| Empirical performance | SOTA (ChatGPT, Claude, Llama 2-Chat) | Strong (Zephyr, Llama 3) |
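
As a rough illustration of the compute-efficiency row, consider only the weight memory for the model copies each method must hold during training. The numbers are hypothetical: a 7B-parameter model in bf16 at ~2 bytes per parameter is about 14 GB of weights per copy.

```python
WEIGHTS_GB = 14  # hypothetical: ~7B parameters in bf16 (~2 bytes/param)

# Model copies each method keeps in memory during training
rlhf_models = ["policy", "reward_model", "value_network"]
dpo_models = ["policy", "frozen_reference"]

rlhf_gb = len(rlhf_models) * WEIGHTS_GB  # 42 GB of weights alone
dpo_gb = len(dpo_models) * WEIGHTS_GB    # 28 GB
print(f"RLHF: ~{rlhf_gb} GB of weights, DPO: ~{dpo_gb} GB")
```

Optimizer states, gradients, and activations add substantially more in practice; the point is only the relative footprint, and that DPO's reference model is frozen and needs no optimizer state at all.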

Key Insights

1. RLHF is tried and tested: developed for ChatGPT and refined for Claude and GPT-4. The method is proven but complex: it requires a separate reward model that must be trained and monitored.
2. DPO is the more modern paradigm: Rafailov et al. (2023) show that preference optimization can be done directly, without a reward model. The reward function is implicitly encoded in the loss function.
3. Practical comparison: RLHF must keep several model copies (policy, reward model, value network) in memory and run multiple training loops; DPO needs only the policy plus a frozen reference and a single loop. For comparable results, DPO is typically 3–5× faster.
4. The reward-hacking problem: in RLHF, the policy can learn to "hack" the reward model, generating text that scores highly but is not judged as good by human evaluators. DPO reduces this risk by optimizing on preferences directly.
5. Hybrid approaches: modern research combines both, e.g. DPO for fast initial alignment followed by iterative RLHF for refinement, or Constitutional AI with automatic critique prompts in place of an explicit reward model.
6. Future trend: DPO and its variants (IPO, KTO) are becoming the new standard, since they are simpler, faster, and less error-prone. Large labs (Anthropic, OpenAI) continue to use RLHF for maximum quality, while smaller teams choose DPO.