Step 0 · Base

Base Model (Pre-Trained)

We start with a large language model (e.g., GPT-3 175B) that was pre-trained on massive amounts of text. This model can already generate coherent text, but it is not specifically "aligned" with human preferences.

Status: Generates fluent text, but outputs are sometimes toxic, factually incorrect, or unhelpful
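
For orientation, here is a minimal sketch of what "base model" means in code, using Hugging Face transformers with GPT-2 as a small stand-in for a model like GPT-3 175B; the model name and prompt are placeholders, not part of the original pipeline.

```python
# Minimal sketch: sampling from a pre-trained base model.
# "gpt2" is a small stand-in for a large base model such as GPT-3 175B.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Explain why the sky is blue:"
inputs = tokenizer(prompt, return_tensors="pt")

# Plain sampling: the base model continues the text fluently, but nothing
# constrains it to be helpful, harmless, or factually correct.
output_ids = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
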
Phase 1 · Supervised Fine-Tuning (SFT)

SFT: Learning from Demonstrations

Human annotators write high-quality responses (demonstrations) to tens of thousands of prompts. We fine-tune the base model on these demonstrations with standard supervised learning (next-token prediction).

Data: ~10k-100k high-quality input-output pairs
Goal: Model learns to give helpful responses
Result: An SFT model with noticeably better response quality
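
A minimal sketch of what this step looks like in code, assuming a Hugging Face causal LM and a toy list of demonstrations (model name, data, and hyperparameters are placeholders; a real run iterates over 10k-100k pairs with batching and a learning-rate schedule):

```python
# SFT sketch: fine-tune a causal LM on prompt + human-written response pairs
# with standard next-token prediction (cross-entropy on every token).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

demonstrations = [  # in practice: ~10k-100k human-written pairs
    {"prompt": "Explain photosynthesis in one sentence.",
     "response": "Plants use sunlight, water, and CO2 to make sugar and oxygen."},
]

model.train()
for example in demonstrations:
    text = example["prompt"] + "\n" + example["response"] + tokenizer.eos_token
    batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    # Using input_ids as labels trains the model to predict each next token;
    # many implementations additionally mask out the prompt tokens in the loss.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```
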
Phase 2 · Reward Model Training

RM: Training a Preference Classifier

For thousands of prompts, we have the SFT model generate multiple responses. Human annotators rank these (e.g., "Response A is better than Response B"). A separate reward network learns to predict these preferences.

Training data: Prompt + two responses generated by the SFT model, with a human preference label
Output: A scalar reward per response (higher = better)
Cost: 1.5-2× cost of SFT phase
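
A minimal sketch of the pairwise loss this phase typically uses (a Bradley-Terry style objective, -log σ(r_chosen - r_rejected)); ToyRewardModel is a made-up placeholder for the real scalar-head transformer:

```python
# Reward-model sketch: learn a scalar score such that human-preferred
# responses get higher scores than rejected ones.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyRewardModel(nn.Module):
    """Placeholder: in practice this is the SFT transformer with a scalar head."""
    def __init__(self, vocab_size=50257, hidden=128):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, hidden)  # mean-pools token embeddings
        self.head = nn.Linear(hidden, 1)

    def forward(self, token_ids):                             # (batch, seq_len)
        return self.head(self.embed(token_ids)).squeeze(-1)   # (batch,) scalar rewards

def preference_loss(reward_model, chosen_ids, rejected_ids):
    """Pairwise loss: -log sigmoid(r_chosen - r_rejected), minimized when the
    reward model ranks the human-preferred response above the rejected one."""
    return -F.logsigmoid(reward_model(chosen_ids) - reward_model(rejected_ids)).mean()

# Usage sketch with random token IDs standing in for tokenized responses.
rm = ToyRewardModel()
chosen, rejected = torch.randint(0, 50257, (4, 32)), torch.randint(0, 50257, (4, 32))
preference_loss(rm, chosen, rejected).backward()
```
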
Phase 3 · Proximal Policy Optimization (PPO)

PPO: Reinforcement Learning Training

With the trained Reward Model in place, we run RL training: the SFT model (now the policy) generates responses, receives scalar rewards from the RM, and is updated with PPO policy-gradient steps. A KL-divergence penalty keeps the policy from drifting too far from the reference (SFT) model.

Objective: maximize r(x,y) - β·KL(π_θ || π_ref)
Batching: 512-2048 prompts per update
Hyperparameter: β (the KL coefficient) is critical to tune
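
A simplified sketch of the reward shaping behind the objective above: the RM score for the full response is combined with a per-token KL penalty against the frozen reference (SFT) model. The full PPO machinery (clipped surrogate loss, value head, GAE) is omitted, and all names here are illustrative:

```python
# RLHF reward shaping sketch: reward-model score minus a per-token KL penalty
# that keeps the policy close to the reference (SFT) model.
import torch

def shaped_rewards(rm_score, policy_logprobs, ref_logprobs, beta=0.05):
    """
    rm_score:        scalar score from the reward model for the whole response
    policy_logprobs: (seq_len,) log-probs of the sampled tokens under pi_theta
    ref_logprobs:    (seq_len,) log-probs of the same tokens under pi_ref
    beta:            KL coefficient (typically 0.01-0.1)
    """
    kl = policy_logprobs - ref_logprobs      # per-token KL estimate
    rewards = -beta * kl                     # penalize drifting from the SFT model
    rewards[-1] = rewards[-1] + rm_score     # RM score is credited at the final token
    return rewards                           # fed into the PPO advantage computation

# Usage sketch with made-up numbers.
print(shaped_rewards(torch.tensor(1.7), torch.randn(12), torch.randn(12)))
```
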
Result · Aligned Model

The Final Output

After all three phases, we have a model that is:

  • More helpful (SFT taught good format)
  • More honest (RM punishes hallucinations)
  • Safer (RM punishes toxic content)
  • Instruction-following (all of the above)

Key Insights about the RLHF Pipeline

1. Three phases are not optional: Each phase serves a purpose. SFT teaches the model the response format, the RM learns human preferences (the "taste" classifier), and PPO optimizes the policy against those preferences. You cannot skip phases.
2. Reward Model quality is critical: If the RM is trained on bad preference data, PPO will optimize the model in the wrong direction. A bad RM is worse than no RM.
3. KL divergence tuning: The β parameter is crucial. Too high: the model doesn't change (PPO becomes useless). Too low: the model diverges too far from the original (quality deteriorates). Typical: β = 0.01-0.1.
4. Costs are enormous: RLHF training for large models requires tens of thousands to hundreds of thousands of annotated preference pairs. OpenAI & Anthropic employ hundreds of annotators. This is a massive engineering investment.
5. Newer alternatives exist: DPO (Direct Preference Optimization) trains directly on preference pairs, removing the need for a separate reward model and PPO loop; IPO and KTO are further simplifications (see the loss sketch after this list). But for state-of-the-art results, you still need full RLHF.
6. Alignment is never "done": New jailbreak techniques and safety issues keep emerging, so RLHF models need continuous updates. It's an ongoing process.
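
Since DPO comes up in insight 5, here is a minimal sketch of its loss for contrast: the separate reward model disappears, and the policy is trained directly on preference pairs against a frozen reference model (all tensor names are placeholders):

```python
# DPO loss sketch: optimize the policy directly on (chosen, rejected) pairs,
# using the frozen reference model instead of an explicit reward model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Each argument: (batch,) summed log-probs of a response under pi_theta or pi_ref."""
    # Implicit reward of a response: beta * log(pi_theta(y|x) / pi_ref(y|x))
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Same -log sigmoid form as the reward-model loss, applied to the policy itself.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```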