Reinforcement Learning

Learning by trial and error through rewards

Reinforcement Learning (RL) is about learning by trial and error. An agent takes actions, sees what happens, and gets rewards or penalties. Over time, it learns which actions lead to better long-term outcomes.

Good follow-up pages: Policy Gradient, DQN, PPO, and RLHF.

A Simple Example

Imagine training a robot to escape a maze:

  • every move costs -1
  • reaching the exit gives +100
  • falling into a trap gives -50

The robot is not told the correct path in advance. It has to explore, make mistakes, and gradually discover a strategy that earns the highest total reward.
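The reward structure above can be sketched in a few lines of Python. This is a minimal illustration, not a full environment; the grid layout, cell symbols, and helper names are assumptions made for the example:

```python
# Illustrative maze: S = start, T = trap, E = exit, . = open floor
GRID = [
    "S..",
    ".T.",
    "..E",
]
STEP_REWARD, EXIT_REWARD, TRAP_REWARD = -1, 100, -50

def reward(cell: str) -> int:
    """Reward received upon entering a cell."""
    if cell == "E":            # reaching the exit
        return EXIT_REWARD
    if cell == "T":            # falling into the trap
        return TRAP_REWARD
    return STEP_REWARD         # any ordinary move costs -1

# Total reward for one hypothetical path that skirts the trap:
path = [".", ".", ".", "E"]            # three moves, then the exit
total = sum(reward(c) for c in path)   # -1 - 1 - 1 + 100 = 97
```

A path through the trap would score far worse, which is exactly the signal the robot learns from.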

The RL Loop

At each step:

  1. The agent observes the current state $s_t$
  2. It chooses an action $a_t$
  3. The environment returns a reward $r_t$ and a new state $s_{t+1}$
  4. The process repeats
  4. The process repeats

The goal is to learn a policy $\pi(a|s)$: a rule for choosing actions that maximizes long-term reward.
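The loop above can be written directly as code. The toy line-walk environment and uniformly random policy below are illustrative assumptions, not part of any RL library:

```python
import random

random.seed(0)  # reproducible run

def policy(state):
    """pi(a|s): here, simply a uniformly random choice between two actions."""
    return random.choice(["left", "right"])

def env_step(state, action):
    """Toy environment: walk along a line; reaching position 3 ends the episode."""
    next_state = state + 1 if action == "right" else max(0, state - 1)
    reward = 100 if next_state == 3 else -1
    return reward, next_state

state, total = 0, 0
while state != 3:
    action = policy(state)                    # 1-2. observe s_t, choose a_t
    reward, state = env_step(state, action)   # 3. environment returns r_t, s_{t+1}
    total += reward                           # 4. repeat until terminal
```

A learning algorithm would replace the random `policy` with one that improves from the rewards it collects.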

Markov Decision Process (MDP)

RL problems are usually written as a Markov Decision Process:

  • States $\mathcal{S}$: all situations the agent might be in
  • Actions $\mathcal{A}$: what the agent can do
  • Transition $P(s'|s,a)$: how the world changes after an action
  • Reward $R(s,a,s')$: how good or bad that outcome is
  • Discount $\gamma$: how much future rewards matter

If that notation is new, the main idea is still simple: state -> action -> consequence -> reward.
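An MDP is just data, and can be written down as plain dictionaries. The two-state chain below is an illustrative assumption chosen to keep the example tiny:

```python
# A minimal MDP: two states, two actions
S = ["s0", "s1"]                 # states
A = ["stay", "go"]               # actions
GAMMA = 0.9                      # discount factor

# P[(s, a)] -> list of (s', probability); R[(s, a, s')] -> reward
P = {
    ("s0", "stay"): [("s0", 1.0)],
    ("s0", "go"):   [("s1", 0.8), ("s0", 0.2)],   # 'go' sometimes fails
    ("s1", "stay"): [("s1", 1.0)],
    ("s1", "go"):   [("s1", 1.0)],
}
R = {("s0", "go", "s1"): 10.0}   # all unlisted transitions reward 0

# Expected immediate reward of taking 'go' in s0: sum over s' of P * R
expected = sum(p * R.get(("s0", "go", sp), 0.0)
               for sp, p in P[("s0", "go")])      # 0.8 * 10 = 8.0
```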

The Objective

RL tries to maximize the expected discounted return:

$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$$

That expression just means: add up rewards over time, discounted so that near-term rewards count more than distant ones (choosing $\gamma < 1$ also keeps the infinite sum finite).
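The discounted return of a single sampled trajectory is a one-liner. The reward sequence below reuses the maze episode from earlier and is an illustrative assumption:

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma^t * r_t over one episode."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

rewards = [-1, -1, -1, 100]                       # three moves, then the exit
g = discounted_return(rewards, gamma=0.9)         # -1 - 0.9 - 0.81 + 72.9 = 70.19
undiscounted = discounted_return(rewards, gamma=1.0)   # plain sum: 97
```

Note how discounting shrinks the late +100 more than the early step costs, which is the point: rewards far in the future count for less.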

Value Functions

Two functions show up everywhere in RL:

  • State value $V^\pi(s)$: how good it is to be in state $s$
  • Action value $Q^\pi(s, a)$: how good it is to take action $a$ in state $s$

They help the agent estimate long-term consequences before it has seen every possible future.

$$V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s\right]$$

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, a_0 = a\right]$$

Bellman Equations

The Bellman equations say that the value of a state equals:

  • the immediate reward you expect now
  • plus the value of where you are likely to end up next

That recursive structure is what makes dynamic programming and many RL algorithms possible.

$$V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma V^\pi(s')\right]$$
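The recursion lends itself directly to iterative policy evaluation: start with all values at zero and repeatedly apply the Bellman equation until the values stop changing. The two-state chain and uniform-random policy below are illustrative assumptions (transitions are deterministic, so the sum over $s'$ collapses):

```python
GAMMA = 0.9
STATES = [0, 1]
# transitions[s][a] = (next_state, reward); state 1 is absorbing with reward 0
transitions = {
    0: {"stay": (0, 0.0), "go": (1, 10.0)},
    1: {"stay": (1, 0.0), "go": (1, 0.0)},
}

V = {s: 0.0 for s in STATES}
for _ in range(200):                      # sweep until values converge
    for s in STATES:
        # uniform policy: pi(a|s) = 0.5 for each action, so V(s) is the
        # average over actions of [immediate reward + gamma * V(next state)]
        V[s] = sum(0.5 * (r + GAMMA * V[sp])
                   for sp, r in transitions[s].values())

# Fixed point: V[1] = 0 (absorbing, no reward) and
# V[0] solves V[0] = 0.5*(0.9*V[0]) + 0.5*(10), i.e. V[0] = 5 / 0.55
```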

Interactive Visualization

Watch an agent learn to navigate through trial and error:

[Interactive demo: a Q-learning agent learns Q-values through trial and error, displaying the episode number, episode reward, and exploration rate ε. Brighter cells = higher expected reward.]

Two Big Families of Methods

Value-based methods

  • learn a score for actions, such as $Q(s,a)$
  • then pick the action with the highest score
  • examples: Q-learning, DQN

Policy-based methods

  • learn the policy directly
  • work well for continuous or stochastic actions
  • examples: Policy Gradient, PPO

Key Algorithms

  Algorithm    Type          Main idea
  Q-Learning   Value         Update action values from experience
  DQN          Value         Use a neural network to approximate Q-values
  REINFORCE    Policy        Push up actions that led to high reward
  PPO          Policy        Improve the policy while limiting unstable updates
  SAC          Actor-Critic  Add entropy to encourage exploration
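The first row of the table, tabular Q-learning, fits in a few lines. This sketch runs it on a tiny deterministic chain (the environment, learning rate, and episode count are illustrative assumptions):

```python
import random

ALPHA, GAMMA = 0.5, 0.9
Q = {(s, a): 0.0 for s in range(4) for a in ("left", "right")}

def step(s, a):
    """Move along the chain; reaching state 3 pays +100, every other move -1."""
    s2 = min(3, s + 1) if a == "right" else max(0, s - 1)
    r = 100 if s2 == 3 else -1
    return r, s2

random.seed(0)
for _ in range(500):                           # episodes
    s = 0
    while s != 3:
        a = random.choice(("left", "right"))   # pure exploration for simplicity
        r, s2 = step(s, a)
        best_next = max(Q[(s2, b)] for b in ("left", "right"))
        # Q-learning update: nudge Q(s,a) toward r + gamma * max_a' Q(s',a')
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        s = s2

# After training, 'right' scores higher than 'left' in the start state.
```

Even though the behavior here is completely random, Q-learning still recovers the values of the greedy policy: that is what makes it an off-policy method.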

Exploration vs Exploitation

The hardest practical tradeoff in RL is:

  • Explore: try uncertain actions that might be better
  • Exploit: use the best strategy found so far

If you only exploit, you may miss better strategies. If you only explore, you never settle on one.
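A standard way to strike this balance is epsilon-greedy action selection: explore with a small probability ε, exploit otherwise. The Q-values below are illustrative assumptions:

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action; otherwise pick the best."""
    if random.random() < epsilon:
        return random.choice(list(q_values))     # explore: any action
    return max(q_values, key=q_values.get)       # exploit: argmax_a Q(s,a)

random.seed(0)
q = {"left": 1.0, "right": 3.0}
picks = [epsilon_greedy(q, epsilon=0.1) for _ in range(1000)]
# Mostly 'right', with occasional exploratory 'left'.
```

In practice ε is often decayed over training, as in the demo above: explore heavily at first (ε near 1), then settle into exploitation as the value estimates improve.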

Where RL Shows Up

  • Game playing
  • Robotics
  • Recommender systems
  • Resource allocation
  • Language model alignment methods such as RLHF

What To Remember

  • RL is learning from reward, not labeled answers
  • The objective is long-term reward, not immediate reward
  • Most methods learn values, a policy, or both