Aligning language models with human preferences through reward modeling
RLHF (Reinforcement Learning from Human Feedback) is the technique that turned base GPT models from raw text predictors into helpful assistants. By training on human preferences rather than next-token prediction alone, RLHF aligns model behavior with human values.
The Alignment Problem
Pre-trained LLMs optimize for next-token prediction:

$$\max_\theta \sum_{t} \log p_\theta(x_t \mid x_{<t})$$

This produces fluent text but not necessarily helpful, honest, or harmless responses. RLHF bridges this gap.
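The pre-training objective is just the summed log-probability the model assigns to each observed next token. A minimal sketch in plain Python, using made-up per-token probabilities rather than a real model:

```python
import math

def next_token_log_likelihood(token_probs):
    """Sum of log-probabilities the model assigned to each observed
    next token -- the quantity pre-training maximizes."""
    return sum(math.log(p) for p in token_probs)

# Hypothetical probabilities a model assigned to the true next tokens
# of one short sequence.
probs = [0.9, 0.5, 0.8]
ll = next_token_log_likelihood(probs)
```

Maximizing this quantity rewards fluency on the training distribution, but says nothing about whether a response is helpful or honest.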
The Three-Stage Pipeline
Stage 1: Supervised Fine-Tuning (SFT)
Fine-tune the base model on high-quality demonstrations:

$$\mathcal{L}_{\text{SFT}}(\theta) = -\mathbb{E}_{(x, y) \sim \mathcal{D}}\left[\log \pi_\theta(y \mid x)\right]$$

where $(x, y)$ are human-written prompt-response pairs.
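The SFT loss is ordinary cross-entropy over the demonstration tokens. A toy sketch (hypothetical per-token probabilities standing in for a model's output):

```python
import math

def sft_loss(batch_token_probs):
    """Negative mean log-likelihood of the demonstration tokens,
    i.e. standard cross-entropy on (prompt, response) pairs."""
    total, count = 0.0, 0
    for seq in batch_token_probs:   # one inner list per demonstration
        for p in seq:               # probability assigned to each true token
            total -= math.log(p)
            count += 1
    return total / count

# Two hypothetical demonstrations of lengths 2 and 1.
batch = [[0.9, 0.7], [0.6]]
loss = sft_loss(batch)
```

In practice this is computed by a framework's cross-entropy over logits; the sketch only shows the quantity being minimized.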
Stage 2: Reward Model Training
Collect comparison data: humans rank multiple responses to the same prompt.
Train a reward model $r_\phi$ using the Bradley-Terry preference model:

$$P(y_w \succ y_l \mid x) = \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)$$

Loss function:

$$\mathcal{L}_R(\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\right]$$

where $y_w$ is the preferred response and $y_l$ is the rejected one.
Stage 3: RL Fine-Tuning (PPO)
Optimize the policy $\pi_\theta$ to maximize reward while staying close to the SFT model:

$$\max_\theta \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\left[r_\phi(x, y)\right] - \beta \, \mathbb{D}_{\text{KL}}\big(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{SFT}}(\cdot \mid x)\big)$$
The KL penalty prevents the model from gaming the reward model with unnatural outputs.
Why KL Regularization?
Without the KL term, the model finds adversarial responses that score high on the reward model but are gibberish to humans. The constraint keeps outputs within the distribution of sensible language.
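In PPO-based RLHF the KL term is typically folded into the reward signal, using the log-probability gap between the current policy and the frozen SFT model as a per-sample KL estimate. A toy sketch under that assumption (the scores and log-probabilities are made up):

```python
def kl_penalty(policy_logprob, sft_logprob):
    """Single-sample KL estimate: log pi_theta(y|x) - log pi_sft(y|x)."""
    return policy_logprob - sft_logprob

def shaped_reward(rm_score, policy_logprob, sft_logprob, beta=0.1):
    """Reward the RL step actually optimizes: reward-model score
    minus the beta-weighted KL penalty."""
    return rm_score - beta * kl_penalty(policy_logprob, sft_logprob)
```

With $\beta = 0$ the policy is free to drift into reward-model exploits; larger $\beta$ pins outputs ever closer to the SFT distribution.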
Practical Considerations
| Challenge | Solution |
|---|---|
| Reward hacking | KL penalty, reward model ensembles |
| Label noise | Multiple annotators, calibration |
| Distribution shift | Online data collection |
| Cost | Efficient preference collection UI |
Impact
RLHF powers:
- ChatGPT and GPT-4
- Claude
- Gemini
It’s the key technique that made LLMs usable as assistants rather than just text generators.