Proximal Policy Optimization (PPO)

A stable, sample-efficient policy gradient algorithm for reinforcement learning

Proximal Policy Optimization (PPO) is one of the most widely used deep reinforcement learning algorithms. It achieves strong performance while being simpler and more stable than prior methods like TRPO.

The Challenge

Policy gradient methods can be unstable:

  • Too large an update → policy collapses
  • Too small an update → slow learning

PPO solves this with a clipped objective that limits update size.

The PPO-Clip Objective

$$L^{CLIP}(\theta) = \mathbb{E}_t \left[ \min\left( r_t(\theta) \hat{A}_t,\ \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right]$$

where:

  • $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$ is the probability ratio
  • $\hat{A}_t$ is the advantage estimate
  • $\epsilon$ is the clip range (typically 0.1-0.2)
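The clipped objective above can be sketched as a small NumPy function. This is an illustrative per-sample version (the function name and signature are my own, not from a specific library):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, epsilon=0.2):
    """Per-sample PPO-clip surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - epsilon, 1 + epsilon) * advantage
    # Taking the element-wise minimum makes the objective a pessimistic
    # (lower) bound on the unclipped surrogate.
    return np.minimum(unclipped, clipped)
```

In practice this is computed over a batch and negated to form a loss that gradient descent minimizes.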

How Clipping Works

The clip function limits how much the ratio can change:

$$\text{clip}(r, 1-\epsilon, 1+\epsilon) = \begin{cases} 1-\epsilon & \text{if } r < 1-\epsilon \\ r & \text{if } 1-\epsilon \leq r \leq 1+\epsilon \\ 1+\epsilon & \text{if } r > 1+\epsilon \end{cases}$$

Taking the minimum ensures:

  • If $\hat{A} > 0$: don't increase $r$ beyond $1+\epsilon$
  • If $\hat{A} < 0$: don't decrease $r$ below $1-\epsilon$
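A quick numeric check makes this concrete. For a positive advantage, the objective plateaus once the ratio exceeds $1+\epsilon$, so pushing the policy further yields no additional gain:

```python
def clip(r, lo, hi):
    """Plain-Python clip: bound r to the interval [lo, hi]."""
    return max(lo, min(r, hi))

eps, A = 0.2, 1.0  # clip range and a positive advantage
objs = [min(r * A, clip(r, 1 - eps, 1 + eps) * A) for r in (1.0, 1.2, 1.5, 2.0)]
print(objs)  # [1.0, 1.2, 1.2, 1.2]: flat beyond 1 + eps
```

Because the objective is flat past the clip boundary, its gradient there is zero, which is exactly what removes the incentive for destructively large updates.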

Insight: When advantage is positive, PPO prevents the ratio from going above 1+ε. When negative, it prevents going below 1-ε. This keeps updates "proximal" to the old policy.

Advantage Estimation

PPO typically uses Generalized Advantage Estimation (GAE):

$$\hat{A}_t = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}$$

where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the TD error.

The $\lambda$ parameter trades off bias against variance: $\lambda = 0$ reduces to low-variance but biased one-step TD errors, while $\lambda = 1$ recovers high-variance Monte Carlo advantage estimates.
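The GAE sum above is usually computed with a backward recursion, $\hat{A}_t = \delta_t + \gamma\lambda\,\hat{A}_{t+1}$. A minimal sketch (assuming `values` carries one extra entry for the bootstrap value $V(s_T)$):

```python
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation via backward recursion.

    rewards: length-T sequence of rewards r_t
    values:  length-(T+1) sequence of value estimates, V(s_0)..V(s_T)
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Accumulate: A_t = delta_t + gamma * lam * A_{t+1}
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```

With `lam=0` this returns the raw one-step TD errors; with `lam=1` it sums discounted TD errors over the whole trajectory.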

Full Algorithm

```python
for iteration in range(num_iterations):
    # Collect trajectories with current policy
    trajectories = collect_rollouts(policy, env)

    # Compute advantages
    advantages = compute_gae(trajectories, value_fn)

    # Multiple epochs of updates on the same rollout data
    for epoch in range(num_epochs):
        for batch in trajectories.batches():
            # PPO clipped-surrogate update
            loss = ppo_clip_loss(batch, advantages)
            optimizer.step(loss)
```

Hyperparameters

| Parameter | Typical Value | Effect |
| --- | --- | --- |
| $\epsilon$ (clip) | 0.1-0.2 | Update constraint |
| $\gamma$ (discount) | 0.99 | Future reward weighting |
| $\lambda$ (GAE) | 0.95 | Advantage bias-variance trade-off |
| Epochs per update | 3-10 | Sample efficiency |
| Batch size | 32-4096 | Gradient stability |
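These hyperparameters are often bundled into a single config object. A minimal sketch (the class and field names here are illustrative, not from a specific library), using defaults within the typical ranges above:

```python
from dataclasses import dataclass

@dataclass
class PPOConfig:
    # Defaults chosen from the typical ranges; tune per task.
    clip_epsilon: float = 0.2     # update constraint
    gamma: float = 0.99           # discount factor
    gae_lambda: float = 0.95      # GAE bias-variance trade-off
    epochs_per_update: int = 4    # passes over each rollout
    batch_size: int = 64          # minibatch size per gradient step
```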
Why PPO?

  1. Simple: Easier to implement than TRPO
  2. Stable: Clipping prevents catastrophic updates
  3. Sample efficient: Multiple epochs per rollout
  4. General: Works on continuous and discrete actions
  5. Scalable: Parallelizes well across workers

Applications

PPO powers:

  • RLHF: Aligning language models (ChatGPT, Claude)
  • Game AI: OpenAI Five (Dota 2)
  • Robotics: Manipulation, locomotion
  • Autonomous driving: Decision making