A stable, sample-efficient policy gradient algorithm for reinforcement learning
Proximal Policy Optimization (PPO) is one of the most widely used deep reinforcement learning algorithms. It achieves strong performance while being simpler and more stable than prior methods like TRPO.
The Challenge
Policy gradient methods can be unstable:
- Too large an update → policy collapses
- Too small an update → slow learning
PPO solves this with a clipped objective that limits update size.
The PPO-Clip Objective

$$L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \text{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\right)\right]$$

where:
- $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ is the probability ratio
- $\hat{A}_t$ is the advantage estimate
- $\epsilon$ is the clip range (typically 0.1–0.2)
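The objective can be sketched per time step with NumPy. This is a minimal sketch, not a full loss: `ratios` and `advantages` are hypothetical arrays standing in for $r_t(\theta)$ and $\hat{A}_t$, and in practice the result is averaged and negated for gradient ascent.

```python
import numpy as np

def ppo_clip_objective(ratios, advantages, epsilon=0.2):
    """Per-step PPO-Clip objective: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    unclipped = ratios * advantages
    clipped = np.clip(ratios, 1 - epsilon, 1 + epsilon) * advantages
    return np.minimum(unclipped, clipped)

# A ratio of 1.5 with positive advantage is capped at 1.2 * A;
# a ratio of 0.7 with negative advantage is floored at 0.8 * A.
obj = ppo_clip_objective(np.array([1.5, 0.7]), np.array([1.0, -1.0]))
```

The element-wise `minimum` is what makes the bound one-sided: the clipped term only wins when the update would otherwise move the ratio too far in the direction the advantage favors.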
How Clipping Works
The clip function limits how much the ratio can change:

$$\text{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)$$

Taking the minimum ensures:
- If $\hat{A}_t > 0$: don't increase $r_t(\theta)$ beyond $1+\epsilon$
- If $\hat{A}_t < 0$: don't decrease $r_t(\theta)$ below $1-\epsilon$
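A quick numeric check of the two cases, with $\epsilon = 0.2$ (the ratios and advantages are illustrative numbers, not from any real rollout):

```python
epsilon = 0.2

def clip(x, lo, hi):
    return max(lo, min(hi, x))

# Positive advantage: ratio 1.5 would overshoot, so the clipped term wins.
# min(1.5 * 2.0, 1.2 * 2.0) = 2.4
A, r = 2.0, 1.5
positive_case = min(r * A, clip(r, 1 - epsilon, 1 + epsilon) * A)

# Negative advantage: ratio 0.5 would undershoot, so the clipped term wins.
# min(0.5 * -2.0, 0.8 * -2.0) = -1.6
A, r = -2.0, 0.5
negative_case = min(r * A, clip(r, 1 - epsilon, 1 + epsilon) * A)
```

In both cases the objective stops rewarding moves past the $[1-\epsilon, 1+\epsilon]$ band, so the gradient there vanishes and the policy has no incentive to drift further from the old one.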
[Interactive visualization: PPO clipping mechanism — how clipping constrains policy updates]
Insight: When advantage is positive, PPO prevents the ratio from going above 1+ε. When negative, it prevents going below 1-ε. This keeps updates "proximal" to the old policy.
Advantage Estimation
PPO typically uses Generalized Advantage Estimation (GAE):

$$\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}$$

where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the TD error.

The parameter $\lambda \in [0, 1]$ trades off bias vs variance: $\lambda = 0$ gives the low-variance but biased one-step TD estimate, while $\lambda = 1$ gives the unbiased but high-variance Monte Carlo return.
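Over a finite rollout the GAE sum is usually computed with a backward recursion, $\hat{A}_t = \delta_t + \gamma\lambda\,\hat{A}_{t+1}$. A sketch under assumed conventions (`values` carries one extra bootstrap entry for the state after the rollout ends; episode-termination masking is omitted for brevity):

```python
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """GAE via the backward recursion A_t = delta_t + gamma * lam * A_{t+1}.

    rewards: shape (T,); values: shape (T+1,), including a bootstrap value
    for the final state.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```

Running backward means each step reuses the already-accumulated tail of the sum, so the whole rollout is processed in O(T) rather than O(T²).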
Full Algorithm
```python
for iteration in range(num_iterations):
    # Collect trajectories with the current policy
    trajectories = collect_rollouts(policy, env)

    # Compute advantages with GAE
    advantages = compute_gae(trajectories, value_fn)

    # Multiple epochs of minibatch updates on the same rollout
    for epoch in range(num_epochs):
        for batch in trajectories.batches():
            # Clipped surrogate loss and gradient step
            loss = ppo_clip_loss(batch, advantages)
            optimizer.step(loss)
```
Hyperparameters
| Parameter | Typical Value | Effect |
|---|---|---|
| ε (clip range) | 0.1-0.2 | Update constraint |
| γ (discount) | 0.99 | Future reward weighting |
| λ (GAE) | 0.95 | Advantage bias-variance trade-off |
| Epochs per update | 3-10 | Sample efficiency |
| Batch size | 32-4096 | Gradient stability |
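The table can be collected into a starting-point configuration. These are typical defaults drawn from the ranges above, not universal settings; the dictionary keys are illustrative names, not from any specific library:

```python
# Typical PPO hyperparameters -- starting points to tune per task
ppo_config = {
    "clip_epsilon": 0.2,   # update constraint (clip range)
    "gamma": 0.99,         # discount for future rewards
    "gae_lambda": 0.95,    # advantage bias-variance trade-off
    "num_epochs": 4,       # optimization epochs per rollout
    "batch_size": 64,      # minibatch size for gradient steps
}
```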
Why PPO is Popular
- Simple: Easier to implement than TRPO
- Stable: Clipping prevents catastrophic updates
- Sample efficient: Multiple epochs per rollout
- General: Works on continuous and discrete actions
- Scalable: Parallelizes well across workers
Applications
PPO powers:
- RLHF: Aligning language models (ChatGPT, Claude)
- Game AI: OpenAI Five (Dota 2)
- Robotics: Manipulation, locomotion
- Autonomous driving: Decision making