Scroll through the three phases of Reinforcement Learning from Human Feedback: SFT, Reward Model Training, and PPO Optimization
The RLHF pipeline consists of three sequential phases that gradually transform a base model into an aligned assistant. This scrollytelling visualization guides you through each phase in detail.
Deep dive into RLHF & Alignment (2/4) – with detailed pipeline visualization.
The pipeline view shows how different models and data streams work together. This system understanding is essential for debugging and optimizing RLHF.
We start with a large language model (e.g., GPT-3 175B) that was pre-trained on massive amounts of text. This model can already generate fluent text, but it is not specifically "aligned" with human preferences.
Human annotators write high-quality demonstration responses for thousands of prompts. We fine-tune the base model on these demonstrations with standard supervised learning (next-token prediction).
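A minimal sketch of this SFT step, assuming Hugging Face transformers and a tiny in-memory set of (prompt, response) pairs; the model name, batch size, and learning rate are illustrative placeholders, not the values used in practice.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for a much larger base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Human-written demonstrations (toy examples).
demonstrations = [
    {"prompt": "Explain photosynthesis simply.",
     "response": "Plants use sunlight to turn water and CO2 into sugar and oxygen."},
    {"prompt": "Write a polite email declining a meeting.",
     "response": "Thank you for the invitation. Unfortunately I cannot attend..."},
]

def collate(batch):
    # Concatenate prompt and response; training is plain next-token prediction.
    texts = [ex["prompt"] + "\n" + ex["response"] + tokenizer.eos_token for ex in batch]
    enc = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    enc["labels"] = enc["input_ids"].clone()
    enc["labels"][enc["attention_mask"] == 0] = -100  # ignore padding in the loss
    return enc

loader = DataLoader(demonstrations, batch_size=2, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for batch in loader:
    out = model(**batch)   # cross-entropy loss over next tokens
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```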
For thousands of prompts, we have the SFT model generate multiple responses. Human annotators rank these (e.g., "Response A is better than Response B"). A separate reward model (RM) learns to predict these preferences.
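The core of this phase is a pairwise preference loss: the reward model should assign a higher scalar score to the preferred response. The sketch below shows that loss on its own, with a hypothetical `RewardHead` standing in for the scalar head placed on top of a transformer; the hidden states are random tensors, not real model outputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardHead(nn.Module):
    """Maps the final hidden state of a transformer to a scalar reward (illustrative)."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.value = nn.Linear(hidden_size, 1)

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        # Use the hidden state of the last token as the sequence summary.
        return self.value(last_hidden_state[:, -1, :]).squeeze(-1)

def pairwise_preference_loss(reward_chosen: torch.Tensor,
                             reward_rejected: torch.Tensor) -> torch.Tensor:
    # Push the preferred response's score above the rejected one:
    # loss = -log(sigmoid(r_chosen - r_rejected))
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage with random hidden states standing in for transformer outputs.
hidden = 768
head = RewardHead(hidden)
h_chosen = torch.randn(4, 10, hidden)    # batch of 4 ranked pairs, sequence length 10
h_rejected = torch.randn(4, 10, hidden)
loss = pairwise_preference_loss(head(h_chosen), head(h_rejected))
loss.backward()
```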
With the trained reward model, we perform RL training. The model generates responses, receives rewards from the RM, and is optimized with PPO, a policy-gradient algorithm. A KL divergence term prevents the policy from drifting too far from the original SFT model.
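One way this is commonly implemented is to shape the per-token reward: a KL penalty at every token plus the RM score on the final token, and then to hand that shaped reward to a PPO update. The sketch below shows only the reward shaping; the function name, `beta`, and the random tensors are illustrative assumptions.

```python
import torch

def kl_shaped_reward(rm_score: torch.Tensor,
                     logprobs_policy: torch.Tensor,
                     logprobs_ref: torch.Tensor,
                     beta: float = 0.1) -> torch.Tensor:
    """Per-token reward: KL penalty at every token, RM score added at the last token."""
    # Approximate per-token KL between the current policy and the frozen SFT reference.
    kl = logprobs_policy - logprobs_ref          # shape: (batch, seq_len)
    rewards = -beta * kl                         # penalize drifting away from the SFT model
    rewards[:, -1] = rewards[:, -1] + rm_score   # reward model score for the full response
    return rewards

# Toy usage with random log-probabilities standing in for real model outputs.
batch, seq_len = 2, 8
rewards = kl_shaped_reward(
    rm_score=torch.tensor([0.7, -0.2]),
    logprobs_policy=torch.randn(batch, seq_len),
    logprobs_ref=torch.randn(batch, seq_len),
)
print(rewards.shape)  # torch.Size([2, 8])
```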
After all three phases, we have a model that is: