How Reinforcement Learning from Human Feedback aligns LLMs with human values – from SFT through Reward Models to PPO
RLHF (Reinforcement Learning from Human Feedback) is the key technology that makes ChatGPT and Claude so useful. It transforms a pre-trained model into a helpful assistant – through human feedback rather than pure text prediction.
After Training Basics (1/4), we dive into RLHF & Alignment (2/4) – the key to helpful LLMs.
RLHF is the reason why modern LLMs follow instructions rather than just completing text. Without RLHF, GPT-4 would be a brilliant but uncontrollable text generator.
RLHF is a three-stage process that gradually transforms a pre-trained language model into a helpful, harmless, and honest system.
1. Supervised Fine-Tuning (SFT): The model learns to follow instructions from examples of good responses. This forms the foundation for the later RL stages (a minimal loss sketch follows after this list).
2. Reward Model (RM): A separate model is trained to predict response quality, assigning a scalar score to each (prompt, response) pair.
3. RL Optimization (PPO): The model is optimized with RL to achieve higher RM scores while staying close to the original version.
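For stage 1, the key implementation detail is that only the response tokens are supervised, not the prompt. A minimal PyTorch-style sketch (the function name and toy tensors are illustrative, not from any specific library):

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, input_ids, prompt_len):
    """Cross-entropy on the response tokens only; prompt tokens are masked out.

    logits:     (seq_len, vocab_size) model outputs for one example
    input_ids:  (seq_len,) prompt tokens followed by response tokens
    prompt_len: number of prompt tokens (toy value below)
    """
    # Predict token t+1 from position t (standard causal LM shift).
    shift_logits = logits[:-1]
    shift_labels = input_ids[1:].clone()
    # Ignore positions that would predict prompt tokens: only the response is supervised.
    shift_labels[: prompt_len - 1] = -100
    return F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)

# Toy usage with random tensors (no real model involved).
vocab, seq_len, prompt_len = 100, 12, 5
logits = torch.randn(seq_len, vocab)
input_ids = torch.randint(0, vocab, (seq_len,))
print(sft_loss(logits, input_ids, prompt_len))
```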
The Reward Model is the heart of RLHF. It is a trained neural network that learns to predict human preferences.
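A common way to train such a reward model is a pairwise (Bradley-Terry) loss over human preference pairs: the preferred response should score higher than the rejected one. A minimal sketch, assuming the reward head already produces scalar scores (the numbers below are made up):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss: push the score of the preferred
    response above the score of the rejected one for the same prompt.

    r_chosen, r_rejected: (batch,) scalar scores from the reward head.
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage with made-up scores.
r_chosen = torch.tensor([1.2, 0.3, 2.0])
r_rejected = torch.tensor([0.4, 0.9, -0.5])
print(reward_model_loss(r_chosen, r_rejected))
```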
PPO (Proximal Policy Optimization) is the core algorithm of RLHF. It optimizes the model based on RM scores while keeping it close to the original version.
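The "proximal" part of PPO is a clipped surrogate objective that limits how far a single update can move the policy away from the policy that generated the samples. A minimal sketch of that clipping step (the log-probs, advantages, and 0.2 clip range are illustrative toy values; a full RLHF loop also adds the KL penalty and a value loss):

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped PPO surrogate: cap the probability ratio so one update
    cannot move the policy too far from the behavior policy.

    logp_new, logp_old: (batch,) log-probs of sampled responses under the
    current / behavior policy; advantages: (batch,) RM-based advantages.
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Toy usage with made-up numbers.
logp_new = torch.tensor([-1.0, -0.5, -2.0])
logp_old = torch.tensor([-1.2, -0.7, -1.5])
advantages = torch.tensor([0.8, -0.3, 1.1])
print(ppo_clip_loss(logp_new, logp_old, advantages))
```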
The KL term prevents the model from drifting too far from the original; its strength is set by the coefficient β (see the objective written out below):

- β too small: the KL penalty is weak, so the model maximizes reward but drifts from the reference and loses knowledge.
- β too large: the KL penalty dominates, the reward signal carries too little weight, and behavior barely changes.
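Written out, this is the standard KL-regularized objective from the RLHF literature, with $\pi_{\text{ref}}$ the frozen SFT model, $r_\phi$ the reward model, and $\beta$ the KL coefficient:

$$
\max_{\theta}\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[\, r_\phi(x, y) \,\big] \;-\; \beta\, \mathrm{KL}\big(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x)\big)
$$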
The mathematical foundation of RLHF is based on Policy Gradients – a technique for optimizing models with RL signals.
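The core result is the policy gradient theorem in its REINFORCE-with-baseline form: the gradient of the expected reward can be estimated from sampled responses, where $A(x, y)$ is the advantage (reward minus a baseline):

$$
\nabla_\theta J(\theta) \;=\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[\, \nabla_\theta \log \pi_\theta(y \mid x)\; A(x, y) \,\big]
$$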
DeepSeek-R1 uses GRPO (Group Relative Policy Optimization) instead of classical PPO – more efficient (no separate value network) and stable:
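The core idea: sample a group of responses per prompt and normalize each reward against the group statistics, which replaces the critic's value estimates. A minimal sketch of that advantage computation (simplified; the full GRPO loss also applies PPO-style clipping and a KL penalty):

```python
import torch

def group_relative_advantages(rewards):
    """GRPO's core trick: normalize each response's reward against the
    mean/std of its sampled group, so no value network (critic) is needed.

    rewards: (num_prompts, group_size) scalar rewards for the sampled groups.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)

# Toy usage: 2 prompts, 4 sampled responses each, made-up rewards.
rewards = torch.tensor([[1.0, 0.0, 0.5, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
print(group_relative_advantages(rewards))
```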
RLHF has led to revolutionary breakthroughs in modern LLMs:
| Model | Release | RLHF Technique | AIME 2024/2025 | SWE-Bench | Special Feature |
|---|---|---|---|---|---|
| GPT-4 | March 2023 | Standard RLHF | — | — | Baseline before RL reasoning |
| o1 | Sept 2024 | RL for internal reasoning | 83.3% | 51.7% | First reasoning model |
| o3 | April 2025 | Improved RL | 88.9% | 69.1% | Massive improvements |
| DeepSeek-R1-Zero | Jan 2025 | Pure RL (GRPO, no SFT) | 71.0% | — | Reasoning from pure RL! |
The revolutionary experiment, DeepSeek-R1-Zero, trained a base model directly with GRPO RL, without any SFT:
The model learned to reason through RL rewards, not through SFT examples! This was a fundamental insight: Reasoning abilities can emerge directly from RL signals when the reward function is properly designed (mathematical correctness, code execution, etc.).
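To make "properly designed reward function" concrete, here is a toy rule-based accuracy reward in that spirit. The format check, answer matching, and reward values are illustrative assumptions, not DeepSeek's actual rules:

```python
import re

def math_accuracy_reward(response: str, ground_truth: str) -> float:
    """Toy rule-based reward: small bonus for producing an explicit final
    answer, full credit if it matches the known ground truth.
    """
    reward = 0.0
    # Format check: did the model give a final boxed answer?
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match is None:
        return reward
    reward += 0.1  # bonus for following the answer format
    # Accuracy check: exact match against the known solution.
    if match.group(1).strip() == ground_truth.strip():
        reward += 1.0
    return reward

print(math_accuracy_reward(r"... so the answer is \boxed{42}", "42"))  # 1.1
print(math_accuracy_reward("I think it's 41", "42"))                   # 0.0
```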
DPO (Direct Preference Optimization) is a more modern alternative to RLHF that eliminates the separate Reward Model:
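Instead of training a reward model and running RL, DPO (as introduced in the DPO paper) optimizes the policy directly on preference pairs $(x, y_w, y_l)$, i.e. preferred vs. rejected response, with a single classification-style loss:

$$
\mathcal{L}_{\text{DPO}}(\theta) \;=\; -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
$$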
Anthropic's approach (Constitutional AI / RLAIF): AI-generated feedback instead of only human labels.
Process:

1. The model generates responses, then critiques and revises them according to a set of written principles (the "constitution"); the revised responses are used for supervised fine-tuning.
2. In the RL phase, an AI model compares response pairs against the constitution and produces preference labels, which train the reward model in place of human preference data (RLAIF).
Advantage: Significantly reduces human annotation costs while maintaining high quality!
RLHF makes the model safer and more helpful, but doesn't create new abilities – it redirects existing capabilities.
DeepSeek-R1-Zero challenges this view: reasoning abilities emerged directly from RL without SFT; the model learned to reason through rewards alone.
The design of the reward function determines what the model learns. Poorly designed rewards lead to undesired behavior (reward hacking).
The KL term prevents the RL process from destroying the model. It's the safety net of the entire approach.