Directly optimizing policies through gradient ascent on expected returns
Policy Gradient Methods directly optimize the policy by computing gradients of the expected return. Unlike value-based methods, they naturally handle continuous actions and stochastic policies.
The Objective
Maximize the expected return over trajectories sampled from the policy:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_t \gamma^t r_t\right]$$

We optimize by gradient ascent on the policy parameters: $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$.
The Policy Gradient Theorem
The key result that makes this tractable:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]$$

where $G_t = \sum_{k \ge t} \gamma^{k-t} r_k$ is the return from step $t$.
Why Log Probability?
The “log-derivative trick”:

$$\nabla_\theta \pi_\theta(a \mid s) = \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)$$
This lets us estimate gradients through sampling without knowing environment dynamics.
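The trick can be checked numerically. The sketch below (names are illustrative) uses a Bernoulli "policy" over two actions and verifies that the sampled score-function estimate matches the analytic gradient of the expected reward:

```python
import numpy as np

# Check the log-derivative trick for a ~ Bernoulli(p):
# d/dp E[f(a)] = E[f(a) * d/dp log P(a)], estimated purely from samples.
rng = np.random.default_rng(0)
p = 0.3
a = (rng.random(200_000) < p).astype(float)    # sampled actions
f = a                                          # reward 1 for action 1, else 0
score = np.where(a == 1, 1 / p, -1 / (1 - p))  # d/dp log P(a)

estimate = np.mean(f * score)
# Analytic value: d/dp (p * 1 + (1 - p) * 0) = 1
print(estimate)  # ≈ 1.0
```

Note that the estimator never differentiates through the sampling step itself, which is exactly why it works when the "sampling" is an unknown environment.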
REINFORCE Algorithm
The simplest policy gradient:
```python
for episode in episodes:
    # Sample one trajectory with the current policy
    states, actions, rewards = collect_trajectory(policy)
    returns = compute_returns(rewards, gamma)  # discounted returns G_t

    # Negative log-likelihood weighted by returns: minimizing this
    # performs gradient ascent on the expected return
    loss = 0
    for s, a, R in zip(states, actions, returns):
        loss -= log_prob(policy(s), a) * R

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
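The `compute_returns` helper used above can be implemented by accumulating the discounted sum backwards over the reward list:

```python
def compute_returns(rewards, gamma):
    """Return G_t = sum over k >= t of gamma^(k - t) * r_k, for every t."""
    returns = []
    G = 0.0
    for r in reversed(rewards):   # accumulate from the end of the episode
        G = r + gamma * G
        returns.append(G)
    returns.reverse()             # restore chronological order
    return returns

print(compute_returns([1, 0, 1], gamma=0.5))  # -> [1.25, 0.5, 1.0]
```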
Interactive Visualization
Watch how policy gradients update a simple policy:
(Interactive demo, "Policy Gradient Learning": starting from a uniform policy, the agent learns that action "→" yields reward.)
Update rule: increase π(a|s) when action a receives a positive advantage, decrease it when negative, proportional to ∇log π(a|s) × advantage.
High Variance Problem
REINFORCE has high variance because:
- Returns vary widely across episodes
- Credit assignment is imprecise (which action caused the reward?)
Variance Reduction: Baselines
Subtract a baseline $b(s_t)$ from the returns:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(G_t - b(s_t)\big)\right]$$

This doesn’t change the expected gradient but reduces variance. Common choice: $b(s_t) = V(s_t)$.
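This claim can be verified numerically on a two-armed bandit (a sketch with illustrative names, not a library API): subtracting b = E[R] leaves the mean gradient estimate unchanged while shrinking its variance.

```python
import numpy as np

# Sigmoid policy over two actions with fixed rewards; compare the REINFORCE
# gradient estimate for the logit theta with and without a baseline b = E[R].
rng = np.random.default_rng(1)
theta = 0.0
p1 = 1 / (1 + np.exp(-theta))          # pi(a=1) = 0.5
rewards = np.array([0.5, 1.0])         # reward for action 0 and action 1
baseline = p1 * rewards[1] + (1 - p1) * rewards[0]   # E[R] = 0.75

a = (rng.random(100_000) < p1).astype(int)
r = rewards[a]
score = a - p1                         # d/dtheta log pi(a) for a sigmoid policy
g_plain = score * r
g_baselined = score * (r - baseline)

print(g_plain.mean(), g_baselined.mean())   # both ≈ 0.125, the true gradient
print(g_plain.var(), g_baselined.var())     # variance drops with the baseline
```

In this toy case the baselined estimate is nearly deterministic, which is the extreme version of the variance reduction a learned V(s) provides in practice.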
Actor-Critic Methods
Use a learned value function as the baseline:
- Actor: the policy $\pi_\theta(a \mid s)$, updated with policy gradients
- Critic: the value function $V_\phi(s)$, updated by regression toward observed returns

The advantage $A(s, a) = Q(s, a) - V(s)$ tells us whether the action was better or worse than average.
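A minimal tabular sketch of this loop, on a one-state bandit (illustrative, not a library API): the critic collapses to a single scalar V used as the baseline, and the advantage reduces to r − V.

```python
import numpy as np

# Tabular actor-critic on a one-state, two-action bandit.
rng = np.random.default_rng(0)
theta = np.zeros(2)        # actor: softmax preferences over the 2 actions
V = 0.0                    # critic: value of the single state
true_rewards = np.array([0.2, 1.0])
alpha_actor, alpha_critic = 0.1, 0.1

for _ in range(2000):
    pi = np.exp(theta - theta.max()); pi /= pi.sum()
    a = rng.choice(2, p=pi)
    r = true_rewards[a]
    advantage = r - V                   # A ≈ r - V(s) in a one-step problem
    grad_log = -pi; grad_log[a] += 1.0  # ∇_theta log pi(a) for a softmax
    theta += alpha_actor * advantage * grad_log
    V += alpha_critic * advantage       # critic tracks the expected reward

print(pi)  # probability mass shifts onto the better action (index 1)
```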
Key Algorithms
| Algorithm | Key Idea |
|---|---|
| REINFORCE | Vanilla policy gradient |
| A2C | Actor-critic with advantage |
| A3C | Asynchronous parallel training |
| PPO | Clipped surrogate objective |
| TRPO | Trust region constraint |
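As one concrete example from the table, PPO's clipped surrogate objective fits in a few lines. This is a simplified batch-level sketch (the function name is ours); `ratio` is π_new(a|s) / π_old(a|s), and clipping keeps each update inside an implicit trust region:

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """Negative PPO clipped surrogate objective over a batch."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    # Take the pessimistic (smaller) objective, then negate for minimization
    return -np.mean(np.minimum(unclipped, clipped))

# A large ratio with positive advantage is capped at 1 + eps:
print(ppo_clip_loss(np.array([2.0]), np.array([1.0])))   # ≈ -1.2
# With negative advantage, the min picks the more pessimistic term:
print(ppo_clip_loss(np.array([0.5]), np.array([-1.0])))  # ≈ 0.8
```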
Advantages of Policy Gradients
- Continuous actions: Natural parameterization
- Stochastic policies: Better exploration
- Direct optimization: Optimize what you care about
- Convergence: converges to a local optimum under standard step-size conditions
When to Use
- Continuous action spaces
- When you need stochastic policies
- When state representation is rich enough