Teaching language models to prefer responses that people rank higher
RLHF (reinforcement learning from human feedback) is a training pipeline for making language models more helpful. The base model already knows how to predict plausible text; RLHF adds a second question: which responses do humans actually prefer?
Read Pre-training and Reinforcement Learning first if you are new to the setup. This page covers the classic RLHF pipeline; many modern systems also use related preference-optimization methods.
A Concrete Example
Suppose a user asks:
How do I prepare for my calculus exam?
The model might generate three answers:
- one vague answer
- one correct but overly long answer
- one clear, accurate, and well-structured answer
Humans rank those responses. RLHF trains the model to produce more of the third kind.
Why Pre-training Is Not Enough
A pre-trained language model is optimized purely for next-token prediction.
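In conventional notation (the symbols below are the standard ones from the literature, not taken from this page), pre-training minimizes the negative log-likelihood of each token given the tokens before it:

```latex
\mathcal{L}_{\text{pretrain}}(\theta) = -\sum_{t} \log p_\theta(x_t \mid x_{<t})
```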
That objective makes the model fluent, but fluency is not the same as being helpful, safe, or well-aligned with user intent.
The Three-Stage Pipeline
1. Supervised Fine-Tuning (SFT)
Start with high-quality prompt-response examples written by humans or curated by trainers.
The model is fine-tuned to imitate those examples.
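In the same conventional notation, the SFT stage minimizes the negative log-likelihood of the demonstration response $y$ given the prompt $x$:

```latex
\mathcal{L}_{\text{SFT}}(\theta) = -\sum_{t} \log \pi_\theta(y_t \mid x, y_{<t})
```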
This gives the model a reasonable baseline style.
2. Reward Model Training
Next, humans compare multiple answers to the same prompt. A separate model learns to score responses so that preferred ones get higher scores.
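A standard choice in the RLHF literature is a Bradley-Terry-style pairwise loss; writing $y_w$ for the preferred response and $y_l$ for the rejected one (conventional symbols, not from this page), the reward model $r_\phi$ is trained to minimize

```latex
\mathcal{L}_{\text{RM}}(\phi) = -\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)
```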
You do not need a formula to follow the idea: the reward model should simply assign a higher score to the answer humans liked more.
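For readers who prefer code, here is a minimal sketch of that pairwise preference loss in pure Python. The function name and the toy scores are hypothetical; the point is only that the loss shrinks as the preferred answer's score margin grows.

```python
import math

def preference_loss(score_preferred, score_rejected):
    """Pairwise (Bradley-Terry style) loss: -log(sigmoid(margin)).

    Small when the reward model already scores the preferred
    answer well above the rejected one; large when it does not.
    """
    margin = score_preferred - score_rejected
    return math.log(1.0 + math.exp(-margin))

# A wider margin between preferred and rejected scores means a smaller loss.
print(preference_loss(2.0, 0.5))
print(preference_loss(1.0, 0.5))
```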
3. RL Fine-Tuning
Finally, use reinforcement learning to make the policy produce higher-scoring answers while staying close to the supervised model.
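In the standard formulation, with $r_\phi$ the reward model, $\pi_\theta$ the policy being trained, and $\pi_{\text{SFT}}$ the frozen supervised model (conventional symbols, not from this page), the objective is:

```latex
\max_{\theta} \; \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\big[ r_\phi(x, y) \big]
  \;-\; \beta \, D_{\mathrm{KL}}\!\big(\pi_\theta(\cdot \mid x) \,\big\|\, \pi_{\text{SFT}}(\cdot \mid x)\big)
```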
In plain English:
- maximize the reward model score
- but do not drift too far from the sensible SFT policy
The second term is a KL penalty. It acts like a leash that keeps the model from chasing weird high-reward hacks.
Why the KL Penalty Matters
Without a constraint, the policy can learn to exploit mistakes in the reward model. That is called reward hacking.
The KL term reduces that risk by keeping the updated policy close to the distribution of answers the supervised fine-tuned model produces.
Interactive Visualization
Explore the RLHF pipeline interactively: SFT, reward modeling, and PPO optimization.
Where PPO Fits In
Classic RLHF often uses PPO for the reinforcement learning stage. PPO is popular because it improves the policy in small, constrained steps instead of allowing destructively large updates.
If you want the RL details, read PPO after this page.
Practical Challenges
| Challenge | Why it happens | Typical fix |
|---|---|---|
| Reward hacking | The policy exploits reward model shortcuts | KL penalty, better reward data |
| Label noise | Humans do not always agree | Multiple raters, calibration |
| High cost | Preference data is expensive | Better collection tools, active sampling |
| Distribution shift | The model changes during training | Refresh or expand feedback data |
Where RLHF Shows Up
- Chat assistants
- Instruction-following models
- Safety tuning
- Preference-aware generation systems
What To Remember
- Pre-training teaches language patterns
- RLHF teaches preference-aware behavior
- The classic pipeline is SFT -> reward model -> RL
- The hardest part is not optimization alone; it is getting reliable human preference data