RLHF: Reinforcement Learning from Human Feedback

Aligning language models with human preferences through reward modeling

RLHF (Reinforcement Learning from Human Feedback) is the technique that transformed GPT from a text predictor into a helpful assistant. By training on human preferences rather than just next-token prediction, RLHF aligns models with human values.

The Alignment Problem

Pre-trained LLMs optimize for:

\mathcal{L}_{\text{pretrain}} = -\sum_i \log P(x_i \mid x_{<i})

This produces fluent text but not necessarily helpful, honest, or harmless responses. RLHF bridges this gap.
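The pretraining objective above can be sketched in a few lines: the loss is just the summed negative log-probability the model assigns to each actual next token. The probability values here are hypothetical stand-ins for a language model's outputs.

```python
import math

# Minimal sketch of the pretraining objective: negative log-likelihood of
# each token given its prefix. `probs` stands in for P(x_i | x_<i) as a
# language model would assign them (hypothetical values).
probs = [0.9, 0.5, 0.25, 0.8]  # model probability of each actual next token
loss = -sum(math.log(p) for p in probs)
print(round(loss, 4))  # total negative log-likelihood over the sequence
```

Note that confidently correct predictions (0.9) contribute little to the loss, while uncertain ones (0.25) dominate it.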

The Three-Stage Pipeline

Stage 1: Supervised Fine-Tuning (SFT)

Fine-tune the base model on high-quality demonstrations:

\mathcal{L}_{\text{SFT}} = -\sum_i \log P_\theta(y_i \mid x, y_{<i})

where (x, y) are human-written prompt-response pairs.
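The SFT loss is the same next-token objective, with one practical detail worth making explicit: the loss is typically computed only over the response tokens y, with the prompt x serving as context. A minimal sketch, with hypothetical token probabilities:

```python
import math

# Sketch of the SFT loss: next-token objective restricted to response
# tokens, with the prompt given as context. Probabilities are hypothetical.
prompt_probs = [0.7, 0.6]          # P of prompt tokens -- excluded from the loss
response_probs = [0.8, 0.9, 0.75]  # P(y_i | x, y_<i) for each response token

# Mask out the prompt: only response tokens contribute to L_SFT
loss_sft = -sum(math.log(p) for p in response_probs)
print(round(loss_sft, 4))
```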

Stage 2: Reward Model Training

Collect comparison data: humans rank multiple responses to the same prompt.

Train a reward model R_\phi using the Bradley-Terry preference model:

P(y_1 \succ y_2 \mid x) = \sigma(R_\phi(x, y_1) - R_\phi(x, y_2))

Loss function:

\mathcal{L}_{\text{RM}} = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma(R_\phi(x, y_w) - R_\phi(x, y_l))\right]

where y_w is the preferred response and y_l is the rejected one.
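The reward-model loss on a single comparison pair reduces to a small calculation. The two scalar scores below are hypothetical stand-ins for R_\phi(x, y_w) and R_\phi(x, y_l):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Sketch of the Bradley-Terry loss on one comparison pair. The scalar
# scores stand in for R_phi(x, y_w) and R_phi(x, y_l) (hypothetical values).
r_chosen, r_rejected = 1.2, -0.3

# P(y_w > y_l | x) = sigma(R_w - R_l); loss = -log of that probability
p_prefer = sigmoid(r_chosen - r_rejected)
loss_rm = -math.log(p_prefer)
print(round(loss_rm, 4))
```

Only the margin between the two scores matters, so the reward model's absolute scale is unconstrained; in practice implementations often normalize it.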

Stage 3: RL Fine-Tuning (PPO)

Optimize the policy to maximize reward while staying close to the SFT model:

\mathcal{L}_{\text{RLHF}} = \mathbb{E}_{x \sim D,\, y \sim \pi_\theta}\left[R_\phi(x, y)\right] - \beta \cdot D_{\text{KL}}(\pi_\theta \,\|\, \pi_{\text{SFT}})

The KL penalty prevents the model from gaming the reward model with unnatural outputs.

Why KL Regularization?

Without the KL term, the model finds adversarial responses that score high on the reward model but are gibberish to humans. The constraint keeps outputs within the distribution of sensible language.
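In practice the KL term is often folded into the per-sample reward handed to the RL algorithm: the reward-model score minus a per-token log-probability gap between the policy and the SFT model. A minimal sketch with hypothetical log-probabilities and a hypothetical reward score:

```python
import math

# Sketch of the KL-regularized reward used during RL fine-tuning. For each
# sampled token we penalize the gap log pi_theta - log pi_SFT; the log-probs
# and reward-model score below are hypothetical.
beta = 0.1
logp_policy = [-1.1, -0.8, -2.0]  # log pi_theta(y_t | x, y_<t)
logp_sft    = [-1.3, -0.9, -1.5]  # log pi_SFT(y_t | x, y_<t)
reward_model_score = 0.9          # R_phi(x, y) for the full response

# Per-sample KL estimate: sum_t [log pi_theta - log pi_SFT]
kl = sum(p - q for p, q in zip(logp_policy, logp_sft))
total_reward = reward_model_score - beta * kl
print(round(kl, 4), round(total_reward, 4))
```

A single sample's KL estimate can be negative (as here, where the policy assigns lower probability than the SFT model overall); only its expectation over samples is guaranteed non-negative.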

Interactive Visualization

Explore the RLHF pipeline: SFT, reward modeling, and PPO optimization:


Practical Considerations

Challenge            Solution
Reward hacking       KL penalty, reward model ensembles
Label noise          Multiple annotators, calibration
Distribution shift   Online data collection
Cost                 Efficient preference collection UI

Impact

RLHF powers:

  • ChatGPT and GPT-4
  • Claude
  • Gemini

It’s the key technique that made LLMs usable as assistants rather than just text generators.
