Step 0 · Base

Base Model (Pre-Trained)

We start with a large language model (e.g., GPT-3 175B) that was pre-trained on massive amounts of text. This model can already generate coherent text, but it is not specifically "aligned" with human preferences.

Status: Generates fluent text, but outputs are sometimes toxic, factually incorrect, or unhelpful
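
For orientation, here is a minimal sketch of what "base model" means in code, using Hugging Face transformers with GPT-2 as a small stand-in for a model like GPT-3 175B; the model name and prompt are placeholders, not part of the original pipeline.

```python
# Minimal sketch: sampling from a pre-trained base model.
# "gpt2" is a small stand-in for a large base model such as GPT-3 175B.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Explain why the sky is blue:"
inputs = tokenizer(prompt, return_tensors="pt")

# Plain sampling: the base model continues the text fluently, but nothing
# constrains it to be helpful, harmless, or factually correct.
output_ids = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
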
Phase 1 · Supervised Fine-Tuning (SFT)

SFT: Learning from Demonstrations

Human annotators write high-quality responses (demonstrations) to tens of thousands of prompts. We fine-tune the base model on these demonstrations with standard supervised learning (next-token prediction).

Data: ~10k-100k high-quality input-output pairs
Goal: Model learns to give helpful responses
Result: An SFT model with noticeably better response quality
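
A minimal sketch of what this step looks like in code, assuming a Hugging Face causal LM and a toy list of demonstrations (model name, data, and hyperparameters are placeholders; a real run iterates over 10k-100k pairs with batching and a learning-rate schedule):

```python
# SFT sketch: fine-tune a causal LM on prompt + human-written response pairs
# with standard next-token prediction (cross-entropy on every token).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

demonstrations = [  # in practice: ~10k-100k human-written pairs
    {"prompt": "Explain photosynthesis in one sentence.",
     "response": "Plants use sunlight, water, and CO2 to make sugar and oxygen."},
]

model.train()
for example in demonstrations:
    text = example["prompt"] + "\n" + example["response"] + tokenizer.eos_token
    batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    # Using input_ids as labels trains the model to predict each next token;
    # many implementations additionally mask out the prompt tokens in the loss.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```
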
Phase 2 · Reward Model Training

RM: Training a Preference Classifier

For thousands of prompts, we have the SFT model generate multiple responses. Human annotators rank these (e.g., "Response A is better than Response B"). A separate reward network learns to predict these preferences.

Training data: Prompt + two responses generated by the SFT model, with a human preference label
Output: A scalar reward per response (higher = better)
Cost: 1.5-2× cost of SFT phase
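
A minimal sketch of the pairwise loss this phase typically uses (a Bradley-Terry style objective, -log σ(r_chosen - r_rejected)); ToyRewardModel is a made-up placeholder for the real scalar-head transformer:

```python
# Reward-model sketch: learn a scalar score such that human-preferred
# responses get higher scores than rejected ones.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyRewardModel(nn.Module):
    """Placeholder: in practice this is the SFT transformer with a scalar head."""
    def __init__(self, vocab_size=50257, hidden=128):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, hidden)  # mean-pools token embeddings
        self.head = nn.Linear(hidden, 1)

    def forward(self, token_ids):                             # (batch, seq_len)
        return self.head(self.embed(token_ids)).squeeze(-1)   # (batch,) scalar rewards

def preference_loss(reward_model, chosen_ids, rejected_ids):
    """Pairwise loss: -log sigmoid(r_chosen - r_rejected), minimized when the
    reward model ranks the human-preferred response above the rejected one."""
    return -F.logsigmoid(reward_model(chosen_ids) - reward_model(rejected_ids)).mean()

# Usage sketch with random token IDs standing in for tokenized responses.
rm = ToyRewardModel()
chosen, rejected = torch.randint(0, 50257, (4, 32)), torch.randint(0, 50257, (4, 32))
preference_loss(rm, chosen, rejected).backward()
```
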
Phase 3 · Proximal Policy Optimization (PPO)

PPO: Reinforcement Learning Training

With the trained Reward Model in place, we run RL training: the SFT model (now the policy) generates responses, receives scalar rewards from the RM, and is updated with PPO policy-gradient steps. A KL-divergence penalty keeps the policy from drifting too far from the reference (SFT) model.

Objective: maximize r(x,y) - β·KL(π_θ || π_ref)
Batching: 512-2048 prompts per update
Hyperparameter: β (the KL coefficient) is critical to tune
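
A simplified sketch of the reward shaping behind the objective above: the RM score for the full response is combined with a per-token KL penalty against the frozen reference (SFT) model. The full PPO machinery (clipped surrogate loss, value head, GAE) is omitted, and all names here are illustrative:

```python
# RLHF reward shaping sketch: reward-model score minus a per-token KL penalty
# that keeps the policy close to the reference (SFT) model.
import torch

def shaped_rewards(rm_score, policy_logprobs, ref_logprobs, beta=0.05):
    """
    rm_score:        scalar score from the reward model for the whole response
    policy_logprobs: (seq_len,) log-probs of the sampled tokens under pi_theta
    ref_logprobs:    (seq_len,) log-probs of the same tokens under pi_ref
    beta:            KL coefficient (typically 0.01-0.1)
    """
    kl = policy_logprobs - ref_logprobs      # per-token KL estimate
    rewards = -beta * kl                     # penalize drifting from the SFT model
    rewards[-1] = rewards[-1] + rm_score     # RM score is credited at the final token
    return rewards                           # fed into the PPO advantage computation

# Usage sketch with made-up numbers.
print(shaped_rewards(torch.tensor(1.7), torch.randn(12), torch.randn(12)))
```
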
Result · Aligned Model

The Final Output

After all three phases, we have a model that is:

  • More helpful (SFT taught good format)
  • More honest (RM punishes hallucinations)
  • Safer (RM punishes toxic content)
  • Instruction-following (all of the above)

Key Insights about the RLHF Pipeline

1. Three phases are not optional: Each phase serves a purpose. SFT teaches the model the response format, the RM learns human preferences (the "taste" classifier), and PPO optimizes the policy against those preferences. You cannot skip phases.
2. Reward Model quality is critical: If the RM is trained on bad preference data, PPO will optimize the model in the wrong direction. A bad RM is worse than no RM.
3. KL divergence tuning: The β parameter is crucial. Too high: the model doesn't change (PPO becomes useless). Too low: the model diverges too far from the original (quality deteriorates). Typical: β = 0.01-0.1.
4. Costs are enormous: RLHF training for large models requires tens of thousands to hundreds of thousands of annotated preference pairs. OpenAI & Anthropic employ hundreds of annotators. This is a massive engineering investment.
5. Newer alternatives exist: DPO (Direct Preference Optimization) trains directly on preference pairs, removing the need for a separate reward model and PPO loop; IPO and KTO are further simplifications (see the loss sketch after this list). But for state-of-the-art results, you still need full RLHF.
6. Alignment is never "done": New jailbreak techniques and safety issues keep emerging, so RLHF models need continuous updates. It's an ongoing process.
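
Since DPO comes up in insight 5, here is a minimal sketch of its loss for contrast: the separate reward model disappears, and the policy is trained directly on preference pairs against a frozen reference model (all tensor names are placeholders):

```python
# DPO loss sketch: optimize the policy directly on (chosen, rejected) pairs,
# using the frozen reference model instead of an explicit reward model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Each argument: (batch,) summed log-probs of a response under pi_theta or pi_ref."""
    # Implicit reward of a response: beta * log(pi_theta(y|x) / pi_ref(y|x))
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Same -log sigmoid form as the reward-model loss, applied to the policy itself.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```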