Teaching language models to prefer responses that people rank higher
RLHF (reinforcement learning from human feedback) is a training pipeline for making language models more helpful. The base model already knows how to predict plausible text; RLHF adds a second question: which responses do humans actually prefer?
Read Pre-training and Reinforcement Learning first if you are new to the setup. This page covers the classic RLHF pipeline; many modern systems also use related preference-optimization methods.
A Concrete Example
Suppose a user asks:
How do I prepare for my calculus exam?
The model might generate three answers:
- one vague answer
- one correct but overly long answer
- one clear, accurate, and well-structured answer
Humans rank those responses. RLHF trains the model to produce more of the third kind.
Why Pre-training Is Not Enough
A pre-trained language model is optimized purely for next-token prediction.
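In conventional notation (the symbols below are the standard ones from the literature, not taken from this page), pre-training minimizes the negative log-likelihood of each token given the tokens before it:

```latex
\mathcal{L}_{\text{pretrain}}(\theta) = -\sum_{t} \log p_\theta(x_t \mid x_{<t})
```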
That objective makes the model fluent, but fluency is not the same as being helpful, safe, or well-aligned with user intent.
The Three-Stage Pipeline
1. Supervised Fine-Tuning (SFT)
Start with high-quality prompt-response examples written by humans or curated by trainers.
The model is fine-tuned to imitate those examples.
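In the same conventional notation, the SFT stage minimizes the negative log-likelihood of the demonstration response $y$ given the prompt $x$:

```latex
\mathcal{L}_{\text{SFT}}(\theta) = -\sum_{t} \log \pi_\theta(y_t \mid x, y_{<t})
```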
This gives the model a reasonable baseline style.
2. Reward Model Training
Next, humans compare multiple answers to the same prompt. A separate model learns to score responses so that preferred ones get higher scores.
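A standard choice in the RLHF literature is a Bradley-Terry-style pairwise loss; writing $y_w$ for the preferred response and $y_l$ for the rejected one (conventional symbols, not from this page), the reward model $r_\phi$ is trained to minimize

```latex
\mathcal{L}_{\text{RM}}(\phi) = -\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)
```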
You do not need a formula to follow the idea: the reward model should simply assign a higher score to the answer humans liked more.
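For readers who prefer code, here is a minimal sketch of that pairwise preference loss in pure Python. The function name and the toy scores are hypothetical; the point is only that the loss shrinks as the preferred answer's score margin grows.

```python
import math

def preference_loss(score_preferred, score_rejected):
    """Pairwise (Bradley-Terry style) loss: -log(sigmoid(margin)).

    Small when the reward model already scores the preferred
    answer well above the rejected one; large when it does not.
    """
    margin = score_preferred - score_rejected
    return math.log(1.0 + math.exp(-margin))

# A wider margin between preferred and rejected scores means a smaller loss.
print(preference_loss(2.0, 0.5))
print(preference_loss(1.0, 0.5))
```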
3. RL Fine-Tuning
Finally, use reinforcement learning to make the policy produce higher-scoring answers while staying close to the supervised model.
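In the standard formulation, with $r_\phi$ the reward model, $\pi_\theta$ the policy being trained, and $\pi_{\text{SFT}}$ the frozen supervised model (conventional symbols, not from this page), the objective is:

```latex
\max_{\theta} \; \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\big[ r_\phi(x, y) \big]
  \;-\; \beta \, D_{\mathrm{KL}}\!\big(\pi_\theta(\cdot \mid x) \,\big\|\, \pi_{\text{SFT}}(\cdot \mid x)\big)
```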
In plain English:
- maximize the reward model score
- but do not drift too far from the sensible SFT policy
The second term is a KL penalty. It acts like a leash that keeps the model from chasing weird high-reward hacks.
Why the KL Penalty Matters
Without a constraint, the policy can learn to exploit mistakes in the reward model. That is called reward hacking.
The KL term reduces that risk by keeping the updated policy close to the distribution of answers the supervised fine-tuned model produces.
Interactive Visualization
Explore the RLHF pipeline interactively: SFT, reward modeling, and PPO optimization.
Where PPO Fits In
Classic RLHF often uses PPO for the reinforcement learning stage. PPO is popular because it improves the policy in small, constrained steps instead of allowing destructively large updates.
If you want the RL details, read PPO after this page.
Practical Challenges
| Challenge | Why it happens | Typical fix |
|---|---|---|
| Reward hacking | The policy exploits reward model shortcuts | KL penalty, better reward data |
| Label noise | Humans do not always agree | Multiple raters, calibration |
| High cost | Preference data is expensive | Better collection tools, active sampling |
| Distribution shift | The model changes during training | Refresh or expand feedback data |
Where RLHF Shows Up
- Chat assistants
- Instruction-following models
- Safety tuning
- Preference-aware generation systems
What To Remember
- Pre-training teaches language patterns
- RLHF teaches preference-aware behavior
- The classic pipeline is SFT -> reward model -> RL
- The hardest part is not optimization alone; it is getting reliable human preference data