RLHF: Reinforcement Learning from Human Feedback

Teaching language models to prefer responses that people rank higher

RLHF is a training pipeline for making language models more helpful. The base model already knows how to predict text. RLHF adds a second question: which responses do humans actually prefer?

Read Pre-training and Reinforcement Learning first if you are new to the setup. This page covers the classic RLHF pipeline; many modern systems also use related preference-optimization methods.

A Concrete Example

Suppose a user asks:

How do I prepare for my calculus exam?

The model might generate three answers:

  • one vague answer
  • one correct but overly long answer
  • one clear, accurate, and well-structured answer

Humans rank those responses. RLHF trains the model to produce more of the third kind.

Why Pre-training Is Not Enough

A pre-trained language model is optimized for next-token prediction:

\mathcal{L}_{\text{pretrain}} = -\sum_i \log P(x_i \mid x_{<i})

That objective makes the model fluent, but fluency is not the same as being helpful, safe, or well-aligned with user intent.
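As a minimal sketch of what that loss measures (not a real training loop), suppose we already have the model's probability for each true next token. The loss is just the summed negative log-likelihood, so a model that assigns high probability to the actual continuation scores lower:

```python
import math

def pretrain_loss(token_probs):
    """Next-token prediction loss: negative log-likelihood summed over
    positions. token_probs[i] is the model's probability for the true
    token x_i given the preceding context x_{<i}."""
    return -sum(math.log(p) for p in token_probs)

# A confident model (high probabilities on the true tokens) gets a
# lower loss than an uncertain one -- that is all "fluency" rewards.
loss_fluent = pretrain_loss([0.9, 0.8, 0.95])
loss_unsure = pretrain_loss([0.2, 0.1, 0.3])
```

Nothing in this objective asks whether the continuation is helpful; it only asks whether it is probable.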

The Three-Stage Pipeline

1. Supervised Fine-Tuning (SFT)

Start with high-quality prompt-response examples written by humans or curated by trainers.

The model is fine-tuned to imitate those examples:

\mathcal{L}_{\text{SFT}} = -\sum_i \log P_\theta(y_i \mid x, y_{<i})

This gives the model a reasonable baseline style.
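The SFT loss is the same negative log-likelihood as pre-training, with one practical twist that the formula's conditioning on x implies: the loss is computed only over response tokens, not the prompt. A toy sketch (assuming per-token probabilities are already available):

```python
import math

def sft_loss(token_probs, loss_mask):
    """Supervised fine-tuning loss: next-token NLL over the response
    only. loss_mask is 1 for response tokens y_i and 0 for prompt
    tokens x, which are conditioned on but not scored."""
    return -sum(m * math.log(p) for p, m in zip(token_probs, loss_mask))

# Probabilities for [prompt tokens..., response tokens...]:
probs = [0.5, 0.6, 0.9, 0.8]
mask  = [0,   0,   1,   1]   # only the last two tokens are the response
loss  = sft_loss(probs, mask)
```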

2. Reward Model Training

Next, humans compare multiple answers to the same prompt. A separate model learns to score responses so that preferred ones get higher scores.

P(y_1 \succ y_2 \mid x) = \sigma(R_\phi(x, y_1) - R_\phi(x, y_2))

You do not need the formula to follow the idea. It just says: the reward model should assign a higher score to the answer humans liked more.
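Concretely, the reward model is trained by minimizing the negative log of that probability for each human comparison. A minimal sketch of the per-pair loss (reward values here are made up for illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def preference_loss(r_chosen, r_rejected):
    """Pairwise reward-model loss: -log sigma(r_chosen - r_rejected).
    It is small when the human-preferred answer scores higher, and
    large when the reward model gets the ranking backwards."""
    return -math.log(sigmoid(r_chosen - r_rejected))

good_margin = preference_loss(2.0, -1.0)   # chosen clearly scores higher
bad_margin  = preference_loss(-1.0, 2.0)   # ranking reversed: high loss
```

Note that only score differences matter: shifting every reward by a constant leaves the loss unchanged.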

3. RL Fine-Tuning

Finally, use reinforcement learning to make the policy produce higher-scoring answers while staying close to the supervised model:

\mathcal{L}_{\text{RLHF}} = \mathbb{E}_{x \sim D,\, y \sim \pi_\theta}[R_\phi(x, y)] - \beta \, D_{KL}(\pi_\theta \,\|\, \pi_{\text{SFT}})

In plain English:

  • maximize the reward model score
  • but do not drift too far from the sensible SFT policy

The second term is a KL penalty. It acts like a leash that keeps the model from chasing weird high-reward hacks.
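In practice the leash is often applied per token: the reward is shaped by subtracting beta times the log-ratio between the current policy and the frozen SFT policy, which is a sample-based estimate of the KL term. A sketch under that assumption (log-probabilities here are illustrative, not from a real model):

```python
def kl_penalized_reward(reward, logp_policy, logp_sft, beta=0.1):
    """Shaped reward used during RL fine-tuning: the reward model's
    score minus beta times log(pi_theta / pi_SFT), a per-sample
    estimate of the KL penalty."""
    return reward - beta * (logp_policy - logp_sft)

# Same raw reward, different amounts of drift from the SFT policy.
# Staying close (small log-ratio) keeps most of the reward; drifting
# far eats into it, even if the reward model is fooled.
close = kl_penalized_reward(1.0, logp_policy=-2.0, logp_sft=-2.1)
drift = kl_penalized_reward(1.0, logp_policy=-0.5, logp_sft=-5.0)
```

Larger beta means a shorter leash: more of the SFT behavior survives, at the cost of slower reward improvement.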

Why the KL Penalty Matters

Without a constraint, the policy can learn to exploit mistakes in the reward model. That is called reward hacking.

The KL term reduces that risk by asking the updated model to stay reasonably close to the distribution of answers it produced after supervised fine-tuning.

Interactive Visualization

Explore the RLHF pipeline: SFT, reward modeling, and PPO optimization:

[Interactive visualization: RLHF Pipeline — stages for demonstration data (SFT), reward modeling, and PPO optimization]

Where PPO Fits In

Classic RLHF often uses PPO for the reinforcement learning stage. PPO is popular because it improves the policy gradually instead of allowing destructive jumps.

If you want the RL details, read PPO after this page.
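The key PPO idea, sketched in miniature: each update is scored through a clipped probability ratio, so even a sample with a very large advantage cannot push the policy arbitrarily far in one step. This is a toy version of the clipped surrogate objective, not a full PPO implementation:

```python
def ppo_clipped_objective(ratio, advantage, eps=0.2):
    """PPO's clipped surrogate for one sample. `ratio` is
    pi_new(a|s) / pi_old(a|s); taking the min of the raw and clipped
    terms caps how much credit any single update step can claim."""
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped * advantage)

# With a positive advantage, a ratio far above 1 + eps earns no more
# than a ratio right at the clip boundary -- the "no destructive
# jumps" property in action.
modest = ppo_clipped_objective(1.1, advantage=1.0)
huge   = ppo_clipped_objective(3.0, advantage=1.0)
```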

Practical Challenges

| Challenge | Why it happens | Typical fix |
| --- | --- | --- |
| Reward hacking | The policy exploits reward model shortcuts | KL penalty, better reward data |
| Label noise | Humans do not always agree | Multiple raters, calibration |
| High cost | Preference data is expensive | Better collection tools, active sampling |
| Distribution shift | The model changes during training | Refresh or expand feedback data |

Where RLHF Shows Up

  • Chat assistants
  • Instruction-following models
  • Safety tuning
  • Preference-aware generation systems

What To Remember

  • Pre-training teaches language patterns
  • RLHF teaches preference-aware behavior
  • The classic pipeline is SFT -> reward model -> RL
  • The hardest part is not optimization alone; it is getting reliable human preference data