Maximum Likelihood Reinforcement Learning (MaxRL)

A recent idea for training models on pass-fail tasks when sampling matters

Maximum Likelihood Reinforcement Learning (MaxRL) is a framework for pass-fail tasks such as math or code generation. Its central claim is that standard RL objectives do not match the real goal very well when success is measured by whether at least one sampled answer is correct.

Read Reinforcement Learning and Policy Gradient first. This page is about a recent paper and is more advanced than the core RL pages.

A Concrete Motivation

Imagine a model solving coding problems. For one prompt, it gets the right answer only 5% of the time.

Now ask two different questions:

  • RL-style question: did this single sampled answer get reward 1 or 0?
  • Sampling question: if I sample 8 answers, what is the chance that at least one is correct?

Those are not the same training target. MaxRL is about reducing that mismatch.

The Main Idea

For a prompt $x$, let $p_\theta(x)$ be the probability that the model produces a correct answer:

p_\theta(x) = \sum_y \pi_\theta(y \mid x) \, \mathbb{1}[\text{correct}(y, x)]

If you care about correctness, the natural objective is to increase $\log p_\theta(x)$. The problem is that correctness is observed only after sampling, so people usually fall back to policy-gradient style surrogates.
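In practice $p_\theta(x)$ is never computed exactly; it is estimated by sampling answers and checking them. A minimal Monte Carlo sketch, using a hypothetical stand-in for the model-plus-verifier loop:

```python
import random

def estimate_pass_rate(sample_correct, n=20000, seed=0):
    """Monte Carlo estimate of p_theta(x): fraction of sampled answers that pass."""
    rng = random.Random(seed)
    return sum(sample_correct(rng) for _ in range(n)) / n

# Hypothetical stand-in: a "model + verifier" that succeeds 5% of the time.
p_hat = estimate_pass_rate(lambda rng: rng.random() < 0.05)
```

With enough samples the estimate concentrates around the true pass rate, but each individual sample only reveals a 0/1 outcome, which is exactly why RL-style surrogates are used.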

MaxRL shows that common RL updates capture only the first part of the maximum-likelihood gradient.

Why Standard RL Can Be Misaligned

The paper derives:

\log p = -\sum_{k=1}^{\infty} \frac{(1-p)^k}{k}

That expansion matters because each term corresponds to a different sampling depth. Standard RL effectively optimizes only the first term. MaxRL keeps more of the structure that matters when pass rates are low and test-time sampling is important.
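A quick numeric check of the expansion (plain Python, no framework assumed): truncating at $K = 1$ recovers the first-order view, while keeping many terms converges to the exact $\log p$:

```python
import math

def truncated_log(p, K):
    """Approximate log(p) by the truncated series -sum_{k=1}^K (1-p)^k / k."""
    return -sum((1 - p) ** k / k for k in range(1, K + 1))

p = 0.30
approx_k1 = truncated_log(p, 1)      # the "standard RL" first-order term
approx_k200 = truncated_log(p, 200)  # many terms: essentially exact
exact = math.log(p)
```

At $p = 0.30$ the first-order approximation is $-0.700$ while the exact value is about $-1.204$; the gap widens further as $p$ shrinks.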

If the derivation feels abstract, keep this simpler idea in mind:

  • standard RL mostly cares about improving one-shot success
  • MaxRL tries to better match sample-until-success style evaluation

Interactive Visualization

Explore how the truncated series approximates $\log p$ at different orders:

MaxRL: Maclaurin Expansion of log(p)

The ML objective $\log p$ can be expanded as $-\sum_{k=1}^{\infty} (1-p)^k / k$. Standard RL (REINFORCE) only optimizes the $k=1$ term; MaxRL adds the higher-order terms.

[Interactive chart: exact $\log p$ plotted against the truncated sum for expansion orders $K = 1$ through $K = 5$, with controls for the order $K$ and the pass rate $p$. At $p = 0.30$, the $K=1$ approximation is $-(1-p) = -0.700$, while the exact value is $\log(0.30) \approx -1.204$.]

Key insight: when $p$ is small (hard problems), the $k=1$ term alone is a poor approximation, and the higher-order terms matter most. MaxRL captures these terms, producing stronger gradients on hard prompts where standard RL struggles. At low $p$, the $K=1$ curve diverges significantly from $\log p$, while adding more terms closes the gap.

The MaxRL Objective

MaxRL defines a family of objectives indexed by the number of samples $K$:

J_{\text{MaxRL}}^{(K)}(\theta) = -\sum_{k=1}^{K} \frac{(1 - p_\theta)^k}{k}

Two endpoints matter most:

  1. At $K = 1$: it reduces to the familiar first-order RL view
  2. As $K \to \infty$: it approaches exact maximum likelihood

That is why the framework is useful conceptually: it connects ordinary RL training to a clearer maximum-likelihood target.
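Differentiating the objective with respect to $p_\theta$ makes the connection concrete. By the chain rule, $\frac{d}{dp}\left[-\sum_{k=1}^{K} \frac{(1-p)^k}{k}\right] = \sum_{k=1}^{K} (1-p)^{k-1} = \frac{1 - (1-p)^K}{p}$ (a geometric sum), which multiplies $\nabla_\theta p_\theta$. A small sketch of this gradient weight (not from the paper's code):

```python
def gradient_weight(p, K):
    """Weight multiplying dp/dtheta in the MaxRL gradient:
    sum_{k=1}^K (1-p)^(k-1) = (1 - (1-p)^K) / p  (geometric sum)."""
    return (1 - (1 - p) ** K) / p

# K=1 gives weight 1 for every prompt, the standard RL view...
w_easy_k1 = gradient_weight(0.9, 1)
w_hard_k1 = gradient_weight(0.05, 1)
# ...while larger K up-weights hard prompts toward the exact ML weight 1/p.
w_hard_k8 = gradient_weight(0.05, 8)
```

At $K = 1$ every prompt gets weight 1; as $K \to \infty$ the weight tends to $1/p$, exactly the gradient weight of $\log p$, so hard prompts (small $p$) receive much larger updates.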

What Changes in Practice

The paper’s practical message is that you can reweight the policy-gradient signal by an estimate of the prompt’s pass rate. Hard prompts then get larger updates, and easy prompts get smaller ones.

import numpy as np

# One prompt's group of binary rewards (e.g. 8 sampled answers)
reward = np.array([0., 0., 1., 0., 0., 0., 0., 0.])
baseline, std = reward.mean(), reward.std() + 1e-6

# Standard GRPO-style advantage
advantage = (reward - baseline) / std

# MaxRL-style idea: divide by the estimated pass rate
mean_reward = reward.mean()  # estimated pass rate
advantage = (reward - baseline) / (std * mean_reward)

You do not need to memorize the formula. The intuition is:

  • if a prompt is already easy, do not spend too much gradient on it
  • if a prompt is very hard, a small success signal is especially valuable
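To see the reweighting's effect numerically, here is a toy comparison (hypothetical reward groups, not the paper's implementation) of GRPO-style and pass-rate-reweighted advantages on an easy versus a hard prompt:

```python
import numpy as np

def advantages(rewards, eps=1e-6):
    """GRPO-style normalized advantage, and the MaxRL-style reweighted version."""
    p_hat = rewards.mean()                      # estimated pass rate
    base = (rewards - p_hat) / (rewards.std() + eps)
    return base, base / (p_hat + eps)           # MaxRL divides by p-hat

# Hypothetical groups of 8 binary rewards for an easy and a hard prompt
easy_base, easy_maxrl = advantages(np.array([1., 1., 1., 0., 1., 1., 1., 1.]))
hard_base, hard_maxrl = advantages(np.array([0., 0., 0., 0., 1., 0., 0., 0.]))
```

The single success in the hard group ends up with a far larger reweighted advantage than any success in the easy group, matching the intuition above.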

Why This Helps

Pass rate | Standard RL emphasis   | MaxRL emphasis
High      | Similar across prompts | Slightly reduced relative priority
Medium    | Moderate               | Stronger
Very low  | Often too weak         | Much stronger

This makes MaxRL most interesting for problems where the model rarely succeeds on the first try but can improve through repeated sampling.

When To Use This Lens

MaxRL is most relevant when:

  • rewards are basically pass/fail
  • test-time sampling matters
  • you care about pass@K, not just pass@1
  • the tasks are hard enough that first-try success is low
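The pass@K quantity itself follows directly from the one-shot pass rate, assuming independent samples:

```python
def pass_at_k(p, k):
    """Probability that at least one of k independent samples is correct."""
    return 1 - (1 - p) ** k

# A prompt with 5% one-shot success still passes roughly a third of the time at k=8
p8 = pass_at_k(0.05, 8)
```

This is the quantity from the "sampling question" in the motivation: a 5% model is far from hopeless once you sample repeatedly, which is what the MaxRL objective tries to reflect.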

Key Results

The paper reports that MaxRL improves sampling efficiency and often beats GRPO-style baselines on mathematical reasoning tasks.

For students, the higher-level takeaway is stronger than the benchmark table: the paper argues that the objective itself should better reflect how sampled generation is evaluated.

What To Remember

  • MaxRL is a bridge between RL training and maximum-likelihood training
  • It matters most on pass-fail tasks with low initial success rates
  • The key conceptual move is to care about more than the first-order RL term

Key Papers

  • Maximum Likelihood Reinforcement Learning - Tajwar, Zeng, Zhou, Song, Arora, Jiang, Schneider, Salakhutdinov, Feng, Zanette, 2026
    https://arxiv.org/abs/2602.02710
  • Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning (REINFORCE) - Williams, 1992
  • GRPO: Group Relative Policy Optimization - Shao et al., 2024