Learning optimal behavior through interaction with an environment
Reinforcement Learning (RL) is the study of agents that learn to make decisions by interacting with an environment. Unlike supervised learning, which fits labeled examples, an RL agent learns from the rewards and consequences of its own actions.
The RL Framework
An agent interacts with an environment in discrete timesteps:
- Agent observes state
- Agent takes action
- Environment returns reward and next state
- Repeat
The goal: learn a policy that maximizes cumulative reward.
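The loop above can be sketched in a few lines. The environment here is a hypothetical five-state corridor (all names and numbers are illustrative, not a standard API), and a random policy stands in for the agent:

```python
import random

class ToyEnv:
    """A tiny 1-D corridor: states 0..4, reward +1 on reaching state 4."""
    def __init__(self):
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):  # action: -1 (left) or +1 (right)
        self.state = max(0, min(4, self.state + action))
        done = self.state == 4
        reward = 1.0 if done else 0.0
        return self.state, reward, done

env = ToyEnv()
state = env.reset()
total_reward, done = 0.0, False
while not done:                      # observe, act, receive reward and next state
    action = random.choice([-1, 1])  # a random policy stands in for the agent
    state, reward, done = env.step(action)
    total_reward += reward
print(total_reward)
```

Even this blind policy eventually reaches the goal; learning is about getting there reliably and quickly.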
Markov Decision Process (MDP)
RL problems are formalized as MDPs:
- States $\mathcal{S}$: the set of possible situations
- Actions $\mathcal{A}$: the choices available to the agent
- Transition function $P(s' \mid s, a)$: the environment dynamics
- Reward function $R(s, a)$: the feedback signal
- Discount factor $\gamma \in [0, 1)$: how heavily future rewards are weighted
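These five components can be written down as plain data. Here is a hypothetical two-state MDP encoded with dictionaries (every state, action, and number below is illustrative):

```python
# A minimal MDP spec as plain data structures (illustrative two-state example).
states = ["s0", "s1"]
actions = ["stay", "go"]

# P[s][a] -> list of (next_state, probability): the transition function
P = {
    "s0": {"stay": [("s0", 1.0)], "go": [("s1", 0.9), ("s0", 0.1)]},
    "s1": {"stay": [("s1", 1.0)], "go": [("s0", 1.0)]},
}

# R[s][a] -> expected immediate reward
R = {
    "s0": {"stay": 0.0, "go": 1.0},
    "s1": {"stay": 0.5, "go": 0.0},
}

gamma = 0.9  # discount factor

# sanity check: transition probabilities sum to 1 for every (state, action)
for s in states:
    for a in actions:
        assert abs(sum(p for _, p in P[s][a]) - 1.0) < 1e-9
```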
The Objective
Maximize the expected discounted return:

$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$$

where $\tau = (s_0, a_0, s_1, a_1, \ldots)$ is a trajectory generated by following policy $\pi$.
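Given the reward sequence $r_0, r_1, \ldots$ from one trajectory, the discounted return can be computed back-to-front in a few lines:

```python
def discounted_return(rewards, gamma):
    """G = sum_t gamma^t * r_t, accumulated from the last reward backward."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1 + 0.5 + 0.25 = 1.75
```

Working backward avoids recomputing powers of $\gamma$; the same trick appears in most policy-gradient implementations.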
Value Functions
State value $V^\pi(s)$: expected return from state $s$ under policy $\pi$:

$$V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s\right]$$

Action value (Q-function) $Q^\pi(s, a)$: expected return from taking action $a$ in state $s$, then following $\pi$:

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s,\, a_0 = a\right]$$
Bellman Equations
Value functions satisfy recursive relationships:

$$V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\left[R(s, a) + \gamma V^\pi(s')\right]$$

$$Q^\pi(s, a) = \sum_{s'} P(s' \mid s, a)\left[R(s, a) + \gamma \sum_{a'} \pi(a' \mid s') Q^\pi(s', a')\right]$$
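Replacing the sum over the policy with a max gives the Bellman optimality backup, which can be iterated to a fixed point (value iteration). A sketch on a hypothetical two-state MDP (all numbers illustrative):

```python
# Value iteration: repeatedly apply the Bellman optimality backup
#   V(s) <- max_a sum_{s'} P(s'|s,a) [R(s,a) + gamma * V(s')]
# on a hypothetical two-state MDP.
P = {  # P[s][a] -> list of (next_state, probability)
    0: {0: [(0, 1.0)], 1: [(1, 1.0)]},
    1: {0: [(0, 1.0)], 1: [(1, 1.0)]},
}
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 2.0}}  # R[s][a]
gamma = 0.9

V = {0: 0.0, 1: 0.0}
for _ in range(200):  # each sweep shrinks the error by a factor of gamma
    V = {
        s: max(
            sum(p * (R[s][a] + gamma * V[s2]) for s2, p in P[s][a])
            for a in P[s]
        )
        for s in P
    }
print(V)  # converges toward V* = {0: 19.0, 1: 20.0}
```

Here state 1's repeatable reward of 2 gives $V^*(1) = 2/(1-\gamma) = 20$, and state 0's best move is to jump there: $V^*(0) = 1 + \gamma \cdot 20 = 19$.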
Interactive Visualization
Watch an agent learn to navigate through trial and error:
*(Interactive demo: a Q-learning agent with a live episode counter.)*
Two Approaches
Value-based: Learn $Q(s, a)$, then act greedily with respect to it
- Q-learning, DQN
- Works well for discrete actions
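The core of tabular Q-learning is the update $Q(s,a) \leftarrow Q(s,a) + \alpha\,[\,r + \gamma \max_{a'} Q(s',a') - Q(s,a)\,]$. A self-contained sketch on a toy five-state corridor (environment and hyperparameters are illustrative):

```python
import random
from collections import defaultdict

random.seed(0)
N = 5                              # corridor states 0..4; state 4 is terminal
ACTIONS = (-1, +1)                 # step left / step right
Q = defaultdict(float)             # Q[(state, action)], default 0.0
alpha, gamma, eps = 0.5, 0.9, 0.1

def greedy(s):
    """Greedy action with random tie-breaking."""
    best = max(Q[(s, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[(s, a)] == best])

for _ in range(500):
    s = 0
    while s != N - 1:
        # epsilon-greedy behavior policy
        a = random.choice(ACTIONS) if random.random() < eps else greedy(s)
        s2 = max(0, min(N - 1, s + a))
        r = 1.0 if s2 == N - 1 else 0.0
        # off-policy TD update: bootstrap from the *best* next action
        target = r + gamma * max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2

policy = [greedy(s) for s in range(N - 1)]
print(policy)  # the learned policy moves right in every state
```

"Off-policy" shows up in the target: it bootstraps from the max over next actions regardless of which action the ε-greedy behavior policy actually takes next.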
Policy-based: Learn the policy $\pi_\theta(a \mid s)$ directly
- Policy gradient, PPO
- Handles continuous actions
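A policy-gradient method adjusts policy parameters in the direction $G \,\nabla_\theta \log \pi_\theta(a)$. A minimal REINFORCE-style sketch on a hypothetical two-armed bandit with a softmax policy (all numbers illustrative; real problems use neural networks and multi-step returns):

```python
import math
import random

random.seed(0)
# Two-armed bandit: arm 1 pays +1, arm 0 pays 0 (illustrative).
theta = [0.0, 0.0]                 # one logit per arm (softmax policy)
lr = 0.1

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

for _ in range(2000):
    probs = softmax(theta)
    a = 0 if random.random() < probs[0] else 1  # sample an action from pi
    reward = 1.0 if a == 1 else 0.0             # return of this one-step episode
    # REINFORCE: theta += lr * G * grad log pi(a); for a softmax policy,
    # d log pi(a) / d theta_k = (1 if k == a else 0) - pi(k)
    for k in range(2):
        grad_log = (1.0 if k == a else 0.0) - probs[k]
        theta[k] += lr * reward * grad_log

print(softmax(theta)[1])  # probability of the rewarding arm, close to 1
```

Because nothing here requires enumerating actions to take a max, the same update extends directly to continuous action spaces (e.g. a Gaussian policy), which is the advantage noted above.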
Key Algorithms
| Algorithm | Type | Key Idea |
|---|---|---|
| Q-Learning | Value | Off-policy TD learning |
| DQN | Value | Neural network Q-function |
| REINFORCE | Policy | Monte Carlo policy gradient |
| A2C/A3C | Actor-Critic | Value baseline reduces variance |
| PPO | Policy | Clipped surrogate objective |
| SAC | Actor-Critic | Maximum entropy RL |
Exploration vs Exploitation
A fundamental tradeoff:
- Explore: Try new actions to discover better strategies
- Exploit: Use current knowledge to maximize reward
Solutions: ε-greedy, UCB, entropy bonuses, curiosity-driven exploration.
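ε-greedy is the simplest of these: with probability ε take a uniformly random action, otherwise the currently best one. A small sketch with illustrative Q-values:

```python
import random

def epsilon_greedy(q_values, eps):
    """With probability eps pick a random action, else the greedy one."""
    if random.random() < eps:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

random.seed(0)
q = [0.1, 0.5, 0.2]        # hypothetical action-value estimates
counts = [0, 0, 0]
for _ in range(1000):
    counts[epsilon_greedy(q, eps=0.1)] += 1
print(counts)  # the greedy action (index 1) dominates, but every action is tried
```

Tuning ε trades off how quickly the agent commits to its current best guess against how thoroughly it keeps testing the alternatives; a common refinement is to decay ε over time.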