Policy Gradient Methods

Directly optimizing policies through gradient ascent on expected returns

Policy Gradient Methods directly optimize the policy $\pi_\theta(a|s)$ by computing gradients of the expected return. Unlike value-based methods, they naturally handle continuous action spaces and stochastic policies.

The Objective

Maximize expected return:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^t r_t\right]$$

We need $\nabla_\theta J(\theta)$ to perform gradient ascent.

The Policy Gradient Theorem

The key result that makes this tractable:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot R_t\right]$$

where $R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$ is the discounted return from step $t$.
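The returns $R_t$ for every step can be computed in a single backward pass over the episode's rewards, using the recursion $R_t = r_t + \gamma R_{t+1}$. A minimal sketch (the function name `compute_returns` is illustrative):

```python
def compute_returns(rewards, gamma):
    """Discounted return R_t = sum_{t' >= t} gamma^(t'-t) * r_{t'} for each step t."""
    returns = []
    running = 0.0
    for r in reversed(rewards):        # walk backward from the end of the episode
        running = r + gamma * running  # R_t = r_t + gamma * R_{t+1}
        returns.append(running)
    returns.reverse()
    return returns

print(compute_returns([1.0, 0.0, 2.0], gamma=0.9))  # approximately [2.62, 1.8, 2.0]
```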

Why Log Probability?

The “log-derivative trick”:

$$\nabla_\theta \pi_\theta(a|s) = \pi_\theta(a|s) \, \nabla_\theta \log \pi_\theta(a|s)$$

This lets us estimate the gradient from sampled trajectories without knowing the environment dynamics.
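The identity turns $\nabla_\theta \mathbb{E}_{a \sim \pi_\theta}[f(a)]$ into $\mathbb{E}_{a \sim \pi_\theta}[f(a)\,\nabla_\theta \log \pi_\theta(a)]$, which can be estimated by sampling. The pure-Python sketch below (all values illustrative) checks this for a softmax policy over three actions, using the fact that for softmax logits $\theta$, $\partial \log \pi(a)/\partial \theta_i = \mathbb{1}[i{=}a] - \pi_i$:

```python
import math
import random

random.seed(0)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

theta = [0.5, -0.3, 0.1]   # policy logits (illustrative)
f = [1.0, 0.0, 2.0]        # reward for each action
pi = softmax(theta)

# Exact gradient: d E[f] / d theta_i = pi_i * (f_i - E[f])
ef = sum(p * r for p, r in zip(pi, f))
exact = [p * (r - ef) for p, r in zip(pi, f)]

# Score-function estimate: average of f(a) * grad_theta log pi(a) over samples
n = 200_000
est = [0.0, 0.0, 0.0]
for _ in range(n):
    a = random.choices(range(3), weights=pi)[0]
    for i in range(3):
        est[i] += f[a] * ((1.0 if i == a else 0.0) - pi[i]) / n

print(exact)
print(est)  # matches exact up to Monte Carlo noise
```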

REINFORCE Algorithm

The simplest policy gradient:

for episode in episodes:
    # Roll out one trajectory with the current policy
    states, actions, rewards = collect_trajectory(policy)
    returns = compute_returns(rewards, gamma)  # discounted returns R_t

    # Negative log-likelihood weighted by return:
    # minimizing this loss performs gradient ascent on J(theta)
    loss = 0
    for s, a, R in zip(states, actions, returns):
        loss -= log_prob(policy(s), a) * R

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Interactive Visualization

Watch how policy gradients update a simple policy:

(Interactive demo: a four-action policy starts uniform at 25% per action and learns that action "→" yields reward.)

Update rule: increase π(a|s) when action a receives a positive advantage and decrease it when the advantage is negative; the update direction is ∇log π(a|s) × advantage.
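The demo above can be reproduced with a few lines of tabular REINFORCE: a four-action softmax policy in a one-step bandit where only "→" is rewarded. All names and constants here are illustrative choices, not the page's actual implementation.

```python
import math
import random

random.seed(1)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

ACTIONS = ["↑", "→", "↓", "←"]
theta = [0.0] * 4   # uniform policy: 25% per action
lr = 0.1

for episode in range(500):
    pi = softmax(theta)
    a = random.choices(range(4), weights=pi)[0]
    reward = 1.0 if ACTIONS[a] == "→" else 0.0
    advantage = reward - 0.25  # fixed baseline: expected reward under uniform
    # REINFORCE update: theta_i += lr * advantage * d log pi(a) / d theta_i
    for i in range(4):
        grad_log = (1.0 if i == a else 0.0) - pi[i]
        theta[i] += lr * advantage * grad_log

pi = softmax(theta)
print({name: round(p, 3) for name, p in zip(ACTIONS, pi)})  # "→" dominates
```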

High Variance Problem

REINFORCE has high variance because:

  1. Returns vary widely across episodes
  2. Credit assignment is imprecise (which action caused the reward?)

Variance Reduction: Baselines

Subtract a baseline $b(s)$ from the returns:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a_t|s_t) \cdot (R_t - b(s_t))\right]$$

This doesn’t change the expected gradient but reduces variance. Common choice: $b(s) = V(s)$.
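Both claims can be checked numerically. The sketch below (pure Python, all names illustrative) draws single-sample gradient estimates for one logit of a uniform four-action softmax policy in a one-step bandit, with and without the baseline $b = \mathbb{E}[r]$:

```python
import random
from statistics import pvariance

random.seed(0)

pi = [0.25, 0.25, 0.25, 0.25]      # uniform 4-action policy
rewards = [0.0, 1.0, 0.0, 0.0]     # only action 1 pays off
baseline = sum(p * r for p, r in zip(pi, rewards))  # b = E[r] = 0.25

# Single-sample gradient estimates for theta_1, with and without baseline
plain, with_b = [], []
for _ in range(100_000):
    a = random.choices(range(4), weights=pi)[0]
    grad_log = (1.0 if a == 1 else 0.0) - pi[1]  # d log pi(a) / d theta_1
    plain.append(rewards[a] * grad_log)
    with_b.append((rewards[a] - baseline) * grad_log)

# Same mean (the true gradient, 0.1875), but the baseline shrinks the variance
print(sum(plain) / len(plain), sum(with_b) / len(with_b))
print(pvariance(plain), pvariance(with_b))
```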

Actor-Critic Methods

Use a learned value function as baseline:

  • Actor: the policy $\pi_\theta(a|s)$
  • Critic: the value function $V_\phi(s)$

$$\nabla_\theta J(\theta) \approx \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot (r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t))$$

The advantage $A_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ tells us whether the action was better or worse than average.
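Given the critic's value estimates along a trajectory, the TD advantages are a one-liner. A minimal sketch (the function name `td_advantages` is illustrative; `values` carries one extra entry for the state after the last reward, 0 if terminal):

```python
def td_advantages(rewards, values, gamma):
    """A_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    return [r + gamma * values[t + 1] - values[t]
            for t, r in enumerate(rewards)]

# Example: critic values [1.0, 0.5, 0.0] over a 2-step episode that terminates
adv = td_advantages([0.0, 1.0], [1.0, 0.5, 0.0], gamma=0.9)
print(adv)  # approximately [-0.55, 0.5]
```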

Key Algorithms

| Algorithm | Key Idea |
| --- | --- |
| REINFORCE | Vanilla policy gradient |
| A2C | Actor-critic with advantage |
| A3C | Asynchronous parallel training |
| PPO | Clipped surrogate objective |
| TRPO | Trust region constraint |
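PPO's clipped surrogate objective is easy to state in code. A minimal per-sample sketch (the function name and the default `eps` are illustrative; `ratio` is $\pi_\text{new}(a|s)/\pi_\text{old}(a|s)$):

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate: min(ratio * A, clip(ratio, 1-eps, 1+eps) * A)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# Positive advantage: the gain is capped once the ratio exceeds 1 + eps
print(ppo_clip_objective(1.5, advantage=1.0))   # 1.2, not 1.5
# Negative advantage: the unclipped term dominates, so large ratios stay penalized
print(ppo_clip_objective(1.5, advantage=-1.0))  # -1.5
```

The `min` makes the objective a pessimistic bound: the policy gains nothing from moving the ratio outside `[1 - eps, 1 + eps]`, which discourages destructively large updates.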

Advantages of Policy Gradients

  1. Continuous actions: Natural parameterization of continuous action spaces
  2. Stochastic policies: Better exploration
  3. Direct optimization: Optimize the objective you actually care about
  4. Convergence: Converges to a local optimum under standard step-size conditions

When to Use

  • Continuous action spaces
  • When you need stochastic policies
  • When state representation is rich enough