Two paradigms for LLM alignment: Reinforcement Learning from Human Feedback (RLHF) vs Direct Preference Optimization (DPO)
RLHF and DPO represent two generations of alignment methods. RLHF trains a separate reward model and then optimizes the policy with RL (typically PPO); DPO optimizes directly on preference data, which is simpler but comes with different trade-offs.
Deep dive into RLHF & Alignment (2/4) – comparing alignment paradigms.
DPO (Rafailov et al., 2023) simplifies alignment by eliminating the explicitly trained reward model. Models such as Zephyr and Tülu 2 were aligned with DPO, while Llama 2-Chat and ChatGPT used RLHF, so understanding both methods is practically relevant.
| Aspect | RLHF | DPO |
|---|---|---|
| Networks during training | 4 (policy, reference, reward, value/critic) | 2 (policy, frozen reference) |
| Training Phases | 3 (SFT → RM → PPO) | 2 (SFT → DPO) |
| Complexity | High | Low |
| Stability | Medium (KL divergence tuning needed) | High (stable by default) |
| Hyperparameters | Many (KL coefficient, clip range, learning rate, etc.) | Few (β, learning rate) |
| Compute Efficiency | Moderate (generates samples online during training) | Efficient (offline, no sampling) |
| Reward Model Quality | Separately trained, error-prone | Implicit, more robust |
| Reward Hacking | Possible (RM exploitable) | Less susceptible |
| Alignment Quality | Very good (proven) | Very good (direct) |
| Empirical Performance | Strong (Claude, ChatGPT, Llama 2-Chat) | Strong (Zephyr, Tülu 2) |
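To make the "implicit reward model" row concrete, here is a minimal sketch of the DPO objective for a single preference pair: the loss is the negative log-sigmoid of the scaled difference in log-probability margins between policy and reference model. The function name, argument names, and the example log-probabilities are illustrative assumptions, not part of any library API.

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair (illustrative sketch).

    Each argument is a summed log-probability of the full response under
    the trainable policy or the frozen reference model. beta scales the
    implicit reward, i.e. how strongly the policy may deviate from the
    reference.
    """
    # Implicit rewards: how much the policy upweights each answer vs. the reference
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log sigmoid(logits): small when the policy prefers the chosen answer
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# Policy already prefers the chosen answer relative to the reference:
loss_good = dpo_loss(-5.0, -10.0, -6.0, -9.0)
# Policy is indifferent (same margins): loss equals log(2)
loss_flat = dpo_loss(-6.0, -9.0, -6.0, -9.0)
```

Note that only two networks appear: the trainable policy and a frozen reference. The reward model of RLHF is replaced by the log-ratio terms, which is exactly the "implicit" reward in the table above.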