Maximum Likelihood Reinforcement Learning (MaxRL)

A recent idea for training models on pass-fail tasks when sampling matters

Maximum Likelihood Reinforcement Learning (MaxRL) is a framework for pass-fail tasks such as math or code generation. Its central claim is that standard RL objectives do not match the real goal very well when success is measured by whether at least one sampled answer is correct.

Read Reinforcement Learning and Policy Gradient first. This page is about a recent paper and is more advanced than the core RL pages.

A Concrete Motivation

Imagine a model solving coding problems. For one prompt, it gets the right answer only 5% of the time.

Now ask two different questions:

  • RL-style question: did this single sampled answer get reward 1 or 0?
  • Sampling question: if I sample 8 answers, what is the chance that at least one is correct?

Those are not the same training target. MaxRL is about reducing that mismatch.

The Main Idea

For a prompt $x$, let $p_\theta(x)$ be the probability that the model produces a correct answer:

p_\theta(x) = \sum_y \pi_\theta(y \mid x) \, \mathbb{1}[\text{correct}(y, x)]

If you care about correctness, the natural objective is to increase $\log p_\theta(x)$. The problem is that correctness is observed only after sampling, so people usually fall back to policy-gradient style surrogates.
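In practice $p_\theta(x)$ is never computed exactly; it is estimated by sampling answers and checking them. A minimal Monte Carlo sketch, using a hypothetical stand-in for the model-plus-verifier loop:

```python
import random

def estimate_pass_rate(sample_correct, n=20000, seed=0):
    """Monte Carlo estimate of p_theta(x): fraction of sampled answers that pass."""
    rng = random.Random(seed)
    return sum(sample_correct(rng) for _ in range(n)) / n

# Hypothetical stand-in: a "model + verifier" that succeeds 5% of the time.
p_hat = estimate_pass_rate(lambda rng: rng.random() < 0.05)
```

With enough samples the estimate concentrates around the true pass rate, but each individual sample only reveals a 0/1 outcome, which is exactly why RL-style surrogates are used.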

MaxRL shows that common RL updates capture only the first part of the maximum-likelihood gradient.

Why Standard RL Can Be Misaligned

The paper derives:

\log p = -\sum_{k=1}^{\infty} \frac{(1-p)^k}{k}

That expansion matters because each term corresponds to a different sampling depth. Standard RL effectively optimizes only the first term. MaxRL keeps more of the structure that matters when pass rates are low and test-time sampling is important.
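A quick numeric check of the expansion (plain Python, no framework assumed): truncating at $K = 1$ recovers the first-order view, while keeping many terms converges to the exact $\log p$:

```python
import math

def truncated_log(p, K):
    """Approximate log(p) by the truncated series -sum_{k=1}^K (1-p)^k / k."""
    return -sum((1 - p) ** k / k for k in range(1, K + 1))

p = 0.30
approx_k1 = truncated_log(p, 1)      # the "standard RL" first-order term
approx_k200 = truncated_log(p, 200)  # many terms: essentially exact
exact = math.log(p)
```

At $p = 0.30$ the first-order approximation is $-0.700$ while the exact value is about $-1.204$; the gap widens further as $p$ shrinks.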

If the derivation feels abstract, keep this simpler idea in mind:

  • standard RL mostly cares about improving one-shot success
  • MaxRL tries to better match sample-until-success style evaluation

Interactive Visualization

Explore how the truncated series approximates $\log p$ at different orders:

MaxRL: Maclaurin Expansion of log(p)

The ML objective $\log p$ can be expanded as $-\sum_{k=1}^{\infty} (1-p)^k / k$. Standard RL (REINFORCE) only optimizes the $k=1$ term; MaxRL adds the higher-order terms.

[Interactive chart: exact $\log p$ plotted against the truncated sum for expansion orders $K = 1$ through $K = 5$, with controls for the order $K$ and the pass rate $p$. At $p = 0.30$, the $K=1$ approximation is $-(1-p) = -0.700$, while the exact value is $\log(0.30) \approx -1.204$.]

Key insight: when $p$ is small (hard problems), the $k=1$ term alone is a poor approximation, and the higher-order terms matter most. MaxRL captures these terms, producing stronger gradients on hard prompts where standard RL struggles. At low $p$, the $K=1$ curve diverges significantly from $\log p$, while adding more terms closes the gap.

The MaxRL Objective

MaxRL defines a family of objectives indexed by the number of samples $K$:

J_{\text{MaxRL}}^{(K)}(\theta) = -\sum_{k=1}^{K} \frac{(1 - p_\theta)^k}{k}

Two endpoints matter most:

  1. At $K = 1$: it reduces to the familiar first-order RL view
  2. As $K \to \infty$: it approaches exact maximum likelihood

That is why the framework is useful conceptually: it connects ordinary RL training to a clearer maximum-likelihood target.
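Differentiating the objective with respect to $p_\theta$ makes the connection concrete. By the chain rule, $\frac{d}{dp}\left[-\sum_{k=1}^{K} \frac{(1-p)^k}{k}\right] = \sum_{k=1}^{K} (1-p)^{k-1} = \frac{1 - (1-p)^K}{p}$ (a geometric sum), which multiplies $\nabla_\theta p_\theta$. A small sketch of this gradient weight (not from the paper's code):

```python
def gradient_weight(p, K):
    """Weight multiplying dp/dtheta in the MaxRL gradient:
    sum_{k=1}^K (1-p)^(k-1) = (1 - (1-p)^K) / p  (geometric sum)."""
    return (1 - (1 - p) ** K) / p

# K=1 gives weight 1 for every prompt, the standard RL view...
w_easy_k1 = gradient_weight(0.9, 1)
w_hard_k1 = gradient_weight(0.05, 1)
# ...while larger K up-weights hard prompts toward the exact ML weight 1/p.
w_hard_k8 = gradient_weight(0.05, 8)
```

At $K = 1$ every prompt gets weight 1; as $K \to \infty$ the weight tends to $1/p$, exactly the gradient weight of $\log p$, so hard prompts (small $p$) receive much larger updates.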

What Changes in Practice

The paper’s practical message is that you can reweight the policy-gradient signal by an estimate of the prompt’s pass rate. Hard prompts then get larger updates, and easy prompts get smaller ones.

import numpy as np

# One prompt's group of binary rewards (e.g. 8 sampled answers)
reward = np.array([0., 0., 1., 0., 0., 0., 0., 0.])
baseline, std = reward.mean(), reward.std() + 1e-6

# Standard GRPO-style advantage
advantage = (reward - baseline) / std

# MaxRL-style idea: divide by the estimated pass rate
mean_reward = reward.mean()  # estimated pass rate
advantage = (reward - baseline) / (std * mean_reward)

You do not need to memorize the formula. The intuition is:

  • if a prompt is already easy, do not spend too much gradient on it
  • if a prompt is very hard, a small success signal is especially valuable
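To see the reweighting's effect numerically, here is a toy comparison (hypothetical reward groups, not the paper's implementation) of GRPO-style and pass-rate-reweighted advantages on an easy versus a hard prompt:

```python
import numpy as np

def advantages(rewards, eps=1e-6):
    """GRPO-style normalized advantage, and the MaxRL-style reweighted version."""
    p_hat = rewards.mean()                      # estimated pass rate
    base = (rewards - p_hat) / (rewards.std() + eps)
    return base, base / (p_hat + eps)           # MaxRL divides by p-hat

# Hypothetical groups of 8 binary rewards for an easy and a hard prompt
easy_base, easy_maxrl = advantages(np.array([1., 1., 1., 0., 1., 1., 1., 1.]))
hard_base, hard_maxrl = advantages(np.array([0., 0., 0., 0., 1., 0., 0., 0.]))
```

The single success in the hard group ends up with a far larger reweighted advantage than any success in the easy group, matching the intuition above.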

Why This Helps

Pass rate | Standard RL emphasis   | MaxRL emphasis
High      | Similar across prompts | Slightly reduced relative priority
Medium    | Moderate               | Stronger
Very low  | Often too weak         | Much stronger

This makes MaxRL most interesting for problems where the model rarely succeeds on the first try but can improve through repeated sampling.

When To Use This Lens

MaxRL is most relevant when:

  • rewards are basically pass/fail
  • test-time sampling matters
  • you care about pass@K, not just pass@1
  • the tasks are hard enough that first-try success is low
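The pass@K quantity itself follows directly from the one-shot pass rate, assuming independent samples:

```python
def pass_at_k(p, k):
    """Probability that at least one of k independent samples is correct."""
    return 1 - (1 - p) ** k

# A prompt with 5% one-shot success still passes roughly a third of the time at k=8
p8 = pass_at_k(0.05, 8)
```

This is the quantity from the "sampling question" in the motivation: a 5% model is far from hopeless once you sample repeatedly, which is what the MaxRL objective tries to reflect.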

Key Results

The paper reports that MaxRL improves sampling efficiency and often beats GRPO-style baselines on mathematical reasoning tasks.

For students, the higher-level takeaway is stronger than the benchmark table: the paper argues that the objective itself should better reflect how sampled generation is evaluated.

What To Remember

  • MaxRL is a bridge between RL training and maximum-likelihood training
  • It matters most on pass-fail tasks with low initial success rates
  • The key conceptual move is to care about more than the first-order RL term

Key Papers

  • Maximum Likelihood Reinforcement Learning - Tajwar, Zeng, Zhou, Song, Arora, Jiang, Schneider, Salakhutdinov, Feng, Zanette, 2026
    https://arxiv.org/abs/2602.02710
  • Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning (REINFORCE) - Williams, 1992
  • GRPO: Group Relative Policy Optimization - Shao et al., 2024