Reinforcement Learning

Learning optimal behavior through interaction with an environment

Reinforcement Learning (RL) is the study of agents learning to make decisions by interacting with an environment. Unlike supervised learning, which trains on labeled examples, an RL agent learns from the rewards and consequences of its own actions.

The RL Framework

An agent interacts with an environment in discrete timesteps:

  1. Agent observes state $s_t$
  2. Agent takes action $a_t$
  3. Environment returns reward $r_t$ and next state $s_{t+1}$
  4. Repeat

The goal: learn a policy $\pi(a|s)$ that maximizes cumulative reward.
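The interaction loop above can be sketched in a few lines of Python. `GridEnv` is a made-up toy environment (walk from cell 0 to cell 4), not a standard library, and the policy here is uniformly random:

```python
import random

class GridEnv:
    """Hypothetical toy environment: reach cell 4 starting from cell 0."""
    def reset(self):
        self.s = 0
        return self.s

    def step(self, a):  # a: 0 = left, 1 = right
        self.s = max(0, min(4, self.s + (1 if a == 1 else -1)))
        reward = 1.0 if self.s == 4 else 0.0
        return self.s, reward, self.s == 4  # next state, reward, done

random.seed(0)
env = GridEnv()
s, done, total = env.reset(), False, 0.0
while not done:                 # steps 1-4: observe, act, receive reward, repeat
    a = random.choice([0, 1])   # a uniformly random policy pi(a|s)
    s, r, done = env.step(a)
    total += r
print(total)  # 1.0 once the goal cell is reached
```

Learning amounts to replacing the `random.choice` line with a policy that improves from experience.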

Markov Decision Process (MDP)

RL problems are formalized as MDPs:

  • States $\mathcal{S}$: Possible situations
  • Actions $\mathcal{A}$: Available choices
  • Transition $P(s'|s,a)$: Environment dynamics
  • Reward $R(s,a,s')$: Feedback signal
  • Discount $\gamma \in [0,1]$: Future reward weighting

The Objective

Maximize expected discounted return:

$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$$

where $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)$ is a trajectory.
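As a quick numeric check of the objective, here is the discounted return of a short reward sequence (the reward values are made up for illustration):

```python
gamma = 0.9
rewards = [1.0, 0.0, 0.0, 10.0]  # r_0 ... r_3, an illustrative trajectory
G = sum(gamma**t * r for t, r in enumerate(rewards))
print(round(G, 2))  # 8.29: the reward of 10 three steps out is worth 10 * 0.9^3 = 7.29 now
```

The discount makes distant rewards count for less, which is exactly the tradeoff $\gamma$ controls.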

Value Functions

State value: Expected return from state $s$ under policy $\pi$:

$$V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s\right]$$

Action value (Q-function): Expected return from taking action $a$ in state $s$:

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, a_0 = a\right]$$

Bellman Equations

Value functions satisfy recursive relationships:

$$V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma V^\pi(s')\right]$$

$$Q^\pi(s,a) = \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma \sum_{a'} \pi(a'|s') Q^\pi(s',a')\right]$$
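The Bellman expectation equation turns into an algorithm directly: iterative policy evaluation applies it as an update rule until $V^\pi$ stops changing. The two-state MDP below is invented for illustration:

```python
gamma = 0.9
S, A = [0, 1], [0, 1]
# P[s][a] = list of (prob, next_state, reward) triples -- a made-up MDP
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
pi = {s: {a: 0.5 for a in A} for s in S}  # uniform random policy

V = {s: 0.0 for s in S}
for _ in range(500):  # repeated Bellman backups contract toward V^pi
    V = {s: sum(pi[s][a] * p * (r + gamma * V[s2])
                for a in A for (p, s2, r) in P[s][a])
         for s in S}
print(V[0] < V[1])  # True: state 1 can keep collecting the +2 reward
```

Because the backup is a $\gamma$-contraction, the sweep converges to the unique fixed point of the first equation above.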

Interactive Visualization

Interactive demo (on the original page): a Q-learning agent learns to navigate a grid world through trial and error. Brighter cells indicate higher expected reward, and the exploration rate ε decays from 1.00 as episodes progress.

Two Approaches

Value-based: Learn $Q^*(s,a)$, act greedily

  • Q-learning, DQN
  • Works well for discrete actions
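A minimal sketch of the tabular Q-learning update, the prototypical value-based method (the step size and the transition values below are arbitrary choices for illustration):

```python
from collections import defaultdict

alpha, gamma = 0.1, 0.9
Q = defaultdict(float)  # Q[(state, action)] -> estimated return, defaults to 0

def q_update(s, a, r, s2, actions):
    """Off-policy TD step: move Q(s,a) toward r + gamma * max_a' Q(s',a')."""
    target = r + gamma * max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

q_update(s=0, a=1, r=1.0, s2=1, actions=[0, 1])
print(Q[(0, 1)])  # 0.1: one step of size alpha toward the target of 1.0
```

The max over next actions is what makes the update off-policy: it bootstraps from the greedy action regardless of how the data was collected.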

Policy-based: Learn $\pi_\theta(a|s)$ directly

  • Policy gradient, PPO
  • Handles continuous actions
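A policy-based method can be sketched with a softmax policy over two actions. The one-state "bandit" reward below is made up, and the update is the REINFORCE score-function estimator $G \, \nabla_\theta \log \pi_\theta(a)$:

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_step(theta, lr=0.1):
    """Sample a ~ pi_theta, observe return G, ascend G * grad log pi_theta(a)."""
    probs = softmax(theta)
    a = random.choices([0, 1], weights=probs)[0]
    G = 1.0 if a == 1 else 0.0             # made-up reward: only action 1 pays
    # d/d theta_i of log softmax(theta)[a] = 1[i == a] - probs[i]
    return [t + lr * G * ((1.0 if i == a else 0.0) - probs[i])
            for i, t in enumerate(theta)]

random.seed(0)
theta = [0.0, 0.0]                         # one logit per action
for _ in range(2000):
    theta = reinforce_step(theta)
print(softmax(theta)[1])  # close to 1: the policy concentrates on action 1
```

Note the update never touches a value table: the parameters of the distribution itself are adjusted, which is why the same recipe extends to continuous action spaces.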

Key Algorithms

| Algorithm  | Type         | Key Idea                        |
|------------|--------------|---------------------------------|
| Q-Learning | Value        | Off-policy TD learning          |
| DQN        | Value        | Neural network Q-function       |
| REINFORCE  | Policy       | Monte Carlo policy gradient     |
| A2C/A3C    | Actor-Critic | Value baseline reduces variance |
| PPO        | Policy       | Clipped surrogate objective     |
| SAC        | Actor-Critic | Maximum entropy RL              |

Exploration vs Exploitation

A fundamental tradeoff:

  • Explore: Try new actions to discover better strategies
  • Exploit: Use current knowledge to maximize reward

Solutions: ε-greedy, UCB, entropy bonuses, curiosity-driven exploration.
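As a concrete instance, ε-greedy action selection over a small set of made-up Q-values:

```python
import random

def epsilon_greedy(q_values, eps):
    """With probability eps pick a uniformly random action; else the greedy one."""
    if random.random() < eps:
        return random.randrange(len(q_values))                  # explore
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit

random.seed(0)
picks = [epsilon_greedy([0.1, 0.5, 0.2], eps=0.1) for _ in range(1000)]
print(picks.count(1) / 1000)  # about 0.93 = 0.9 + 0.1/3: mostly greedy, sometimes random
```

Annealing `eps` from 1.0 toward a small floor, as in the demo above, shifts the agent from exploration to exploitation as its estimates improve.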
