Reinforcement Learning

Learning by trial and error through rewards

Reinforcement Learning (RL) is about learning by trial and error. An agent takes actions, sees what happens, and gets rewards or penalties. Over time, it learns which actions lead to better long-term outcomes.

Good follow-up pages: Policy Gradient, DQN, PPO, and RLHF.

A Simple Example

Imagine training a robot to escape a maze:

  • every move costs -1
  • reaching the exit gives +100
  • falling into a trap gives -50

The robot is not told the correct path in advance. It has to explore, make mistakes, and gradually discover a strategy that earns the highest total reward.
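The reward structure above can be sketched in a few lines of Python. This is a minimal illustration, not a full environment; the grid layout, cell symbols, and helper names are assumptions made for the example:

```python
# Illustrative maze: S = start, T = trap, E = exit, . = open floor
GRID = [
    "S..",
    ".T.",
    "..E",
]
STEP_REWARD, EXIT_REWARD, TRAP_REWARD = -1, 100, -50

def reward(cell: str) -> int:
    """Reward received upon entering a cell."""
    if cell == "E":            # reaching the exit
        return EXIT_REWARD
    if cell == "T":            # falling into the trap
        return TRAP_REWARD
    return STEP_REWARD         # any ordinary move costs -1

# Total reward for one hypothetical path that skirts the trap:
path = [".", ".", ".", "E"]            # three moves, then the exit
total = sum(reward(c) for c in path)   # -1 - 1 - 1 + 100 = 97
```

A path through the trap would score far worse, which is exactly the signal the robot learns from.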

The RL Loop

At each step:

  1. The agent observes the current state $s_t$
  2. It chooses an action $a_t$
  3. The environment returns a reward $r_t$ and a new state $s_{t+1}$
  4. The process repeats
  4. The process repeats

The goal is to learn a policy $\pi(a|s)$: a rule for choosing actions that maximizes long-term reward.
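The loop above can be written directly as code. The toy line-walk environment and uniformly random policy below are illustrative assumptions, not part of any RL library:

```python
import random

random.seed(0)  # reproducible run

def policy(state):
    """pi(a|s): here, simply a uniformly random choice between two actions."""
    return random.choice(["left", "right"])

def env_step(state, action):
    """Toy environment: walk along a line; reaching position 3 ends the episode."""
    next_state = state + 1 if action == "right" else max(0, state - 1)
    reward = 100 if next_state == 3 else -1
    return reward, next_state

state, total = 0, 0
while state != 3:
    action = policy(state)                    # 1-2. observe s_t, choose a_t
    reward, state = env_step(state, action)   # 3. environment returns r_t, s_{t+1}
    total += reward                           # 4. repeat until terminal
```

A learning algorithm would replace the random `policy` with one that improves from the rewards it collects.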

Markov Decision Process (MDP)

RL problems are usually written as a Markov Decision Process:

  • States $\mathcal{S}$: all situations the agent might be in
  • Actions $\mathcal{A}$: what the agent can do
  • Transition $P(s'|s,a)$: how the world changes after an action
  • Reward $R(s,a,s')$: how good or bad that outcome is
  • Discount $\gamma$: how much future rewards matter

If that notation is new, the main idea is still simple: state -> action -> consequence -> reward.
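An MDP is just data, and can be written down as plain dictionaries. The two-state chain below is an illustrative assumption chosen to keep the example tiny:

```python
# A minimal MDP: two states, two actions
S = ["s0", "s1"]                 # states
A = ["stay", "go"]               # actions
GAMMA = 0.9                      # discount factor

# P[(s, a)] -> list of (s', probability); R[(s, a, s')] -> reward
P = {
    ("s0", "stay"): [("s0", 1.0)],
    ("s0", "go"):   [("s1", 0.8), ("s0", 0.2)],   # 'go' sometimes fails
    ("s1", "stay"): [("s1", 1.0)],
    ("s1", "go"):   [("s1", 1.0)],
}
R = {("s0", "go", "s1"): 10.0}   # all unlisted transitions reward 0

# Expected immediate reward of taking 'go' in s0: sum over s' of P * R
expected = sum(p * R.get(("s0", "go", sp), 0.0)
               for sp, p in P[("s0", "go")])      # 0.8 * 10 = 8.0
```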

The Objective

RL tries to maximize the expected discounted return:

$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$$

That expression just means: add up rewards over time, discounted so that near-term rewards count more than distant ones (choosing $\gamma < 1$ also keeps the infinite sum finite).
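The discounted return of a single sampled trajectory is a one-liner. The reward sequence below reuses the maze episode from earlier and is an illustrative assumption:

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma^t * r_t over one episode."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

rewards = [-1, -1, -1, 100]                       # three moves, then the exit
g = discounted_return(rewards, gamma=0.9)         # -1 - 0.9 - 0.81 + 72.9 = 70.19
undiscounted = discounted_return(rewards, gamma=1.0)   # plain sum: 97
```

Note how discounting shrinks the late +100 more than the early step costs, which is the point: rewards far in the future count for less.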

Value Functions

Two functions show up everywhere in RL:

  • State value $V^\pi(s)$: how good it is to be in state $s$
  • Action value $Q^\pi(s, a)$: how good it is to take action $a$ in state $s$

They help the agent estimate long-term consequences before it has seen every possible future.

$$V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s\right]$$

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, a_0 = a\right]$$

Bellman Equations

The Bellman equations say that the value of a state equals:

  • the immediate reward you expect now
  • plus the value of where you are likely to end up next

That recursive structure is what makes dynamic programming and many RL algorithms possible.

$$V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma V^\pi(s')\right]$$
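The recursion lends itself directly to iterative policy evaluation: start with all values at zero and repeatedly apply the Bellman equation until the values stop changing. The two-state chain and uniform-random policy below are illustrative assumptions (transitions are deterministic, so the sum over $s'$ collapses):

```python
GAMMA = 0.9
STATES = [0, 1]
# transitions[s][a] = (next_state, reward); state 1 is absorbing with reward 0
transitions = {
    0: {"stay": (0, 0.0), "go": (1, 10.0)},
    1: {"stay": (1, 0.0), "go": (1, 0.0)},
}

V = {s: 0.0 for s in STATES}
for _ in range(200):                      # sweep until values converge
    for s in STATES:
        # uniform policy: pi(a|s) = 0.5 for each action, so V(s) is the
        # average over actions of [immediate reward + gamma * V(next state)]
        V[s] = sum(0.5 * (r + GAMMA * V[sp])
                   for sp, r in transitions[s].values())

# Fixed point: V[1] = 0 (absorbing, no reward) and
# V[0] solves V[0] = 0.5*(0.9*V[0]) + 0.5*(10), i.e. V[0] = 5 / 0.55
```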

Interactive Visualization

Watch an agent learn to navigate through trial and error:

[Interactive demo: a Q-learning agent learns Q-values through trial and error, displaying the episode number, episode reward, and exploration rate ε. Brighter cells = higher expected reward.]

Two Big Families of Methods

Value-based methods

  • learn a score for actions, such as $Q(s,a)$
  • then pick the action with the highest score
  • examples: Q-learning, DQN

Policy-based methods

  • learn the policy directly
  • work well for continuous or stochastic actions
  • examples: Policy Gradient, PPO

Key Algorithms

  Algorithm    Type          Main idea
  Q-Learning   Value         Update action values from experience
  DQN          Value         Use a neural network to approximate Q-values
  REINFORCE    Policy        Push up actions that led to high reward
  PPO          Policy        Improve the policy while limiting unstable updates
  SAC          Actor-Critic  Add entropy to encourage exploration
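The first row of the table, tabular Q-learning, fits in a few lines. This sketch runs it on a tiny deterministic chain (the environment, learning rate, and episode count are illustrative assumptions):

```python
import random

ALPHA, GAMMA = 0.5, 0.9
Q = {(s, a): 0.0 for s in range(4) for a in ("left", "right")}

def step(s, a):
    """Move along the chain; reaching state 3 pays +100, every other move -1."""
    s2 = min(3, s + 1) if a == "right" else max(0, s - 1)
    r = 100 if s2 == 3 else -1
    return r, s2

random.seed(0)
for _ in range(500):                           # episodes
    s = 0
    while s != 3:
        a = random.choice(("left", "right"))   # pure exploration for simplicity
        r, s2 = step(s, a)
        best_next = max(Q[(s2, b)] for b in ("left", "right"))
        # Q-learning update: nudge Q(s,a) toward r + gamma * max_a' Q(s',a')
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        s = s2

# After training, 'right' scores higher than 'left' in the start state.
```

Even though the behavior here is completely random, Q-learning still recovers the values of the greedy policy: that is what makes it an off-policy method.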

Exploration vs Exploitation

The hardest practical tradeoff in RL is:

  • Explore: try uncertain actions that might be better
  • Exploit: use the best strategy found so far

If you only exploit, you may miss better strategies. If you only explore, you never settle on one.
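A standard way to strike this balance is epsilon-greedy action selection: explore with a small probability ε, exploit otherwise. The Q-values below are illustrative assumptions:

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action; otherwise pick the best."""
    if random.random() < epsilon:
        return random.choice(list(q_values))     # explore: any action
    return max(q_values, key=q_values.get)       # exploit: argmax_a Q(s,a)

random.seed(0)
q = {"left": 1.0, "right": 3.0}
picks = [epsilon_greedy(q, epsilon=0.1) for _ in range(1000)]
# Mostly 'right', with occasional exploratory 'left'.
```

In practice ε is often decayed over training, as in the demo above: explore heavily at first (ε near 1), then settle into exploitation as the value estimates improve.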

Where RL Shows Up

  • Game playing
  • Robotics
  • Recommender systems
  • Resource allocation
  • Language model alignment methods such as RLHF

What To Remember

  • RL is learning from reward, not labeled answers
  • The objective is long-term reward, not immediate reward
  • Most methods learn values, a policy, or both