Learning by trial and error through rewards
Reinforcement Learning (RL) is about learning by trial and error. An agent takes actions, sees what happens, and gets rewards or penalties. Over time, it learns which actions lead to better long-term outcomes.
Good follow-up pages: Policy Gradient, DQN, PPO, and RLHF.
A Simple Example
Imagine training a robot to leave a maze:
- every move costs -1
- reaching the exit gives +100
- falling into a trap gives -50
The robot is not told the correct path in advance. It has to explore, make mistakes, and gradually discover a strategy that earns the highest total reward.
The RL Loop
At each step:
- The agent observes the current state
- It chooses an action
- The environment returns a reward and a new state
- The process repeats
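The loop above can be sketched in a few lines of Python. This is a toy illustration, not a real RL library API: `CorridorEnv` is a made-up one-dimensional environment, and `choose_action` is any function the agent supplies.

```python
class CorridorEnv:
    """Toy environment: a 1-D corridor of length 5.
    Each step costs -1; reaching the right end gives +100.
    Actions: 0 = move left, 1 = move right."""
    def __init__(self, length=5):
        self.length = length

    def reset(self):
        self.pos = 0
        return self.pos  # the initial state

    def step(self, action):
        self.pos = max(0, self.pos + (1 if action == 1 else -1))
        done = self.pos == self.length - 1
        reward = 100.0 if done else -1.0
        return self.pos, reward, done  # new state, reward, episode over?

def run_episode(env, choose_action, max_steps=100):
    """One pass of the RL loop: observe, act, receive reward, repeat."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = choose_action(state)            # agent picks an action
        state, reward, done = env.step(action)   # environment responds
        total_reward += reward
        if done:
            break
    return total_reward
```

A policy that always moves right collects -1 three times and then +100, for a total of 97.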
The goal is to learn a policy π: a rule for choosing actions that maximizes long-term reward.
Markov Decision Process (MDP)
RL problems are usually written as a Markov Decision Process:
- States S: all situations the agent might be in
- Actions A: what the agent can do
- Transition P(s' | s, a): how the world changes after an action
- Reward R(s, a): how good or bad that outcome is
- Discount γ: how much future rewards matter
If that notation is new, the main idea is still simple: state -> action -> consequence -> reward.
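A tiny MDP can be written out explicitly. This is a hypothetical two-state example, just to make the pieces concrete; the names and encoding are illustrative.

```python
# A two-state MDP. From "start" the agent can "go" (reach the terminal
# state "goal", reward +10) or "wait" (stay in "start", reward -1).
states = ["start", "goal"]
actions = ["go", "wait"]

# transition[state][action] = (next_state, reward)
# (deterministic here; in general this would be a probability distribution)
transition = {
    "start": {"go": ("goal", 10.0), "wait": ("start", -1.0)},
}

gamma = 0.9  # discount: future rewards count slightly less than immediate ones
```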
The Objective
RL tries to maximize the expected discounted return:

G = E[ r_0 + γ r_1 + γ² r_2 + ... ] = E[ Σ_t γ^t r_t ]

That expression just means: add up rewards over time, with optional discounting so near-term rewards count more than distant ones.
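Computing a discounted return from a list of rewards is a one-pass loop. A minimal sketch:

```python
def discounted_return(rewards, gamma=0.99):
    """Return r_0 + gamma*r_1 + gamma^2*r_2 + ...
    Computed backwards so each reward is multiplied by gamma once."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

With gamma = 0.5, a reward of 8 arriving two steps in the future is worth only 0.5² × 8 = 2 today.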
Value Functions
Two functions show up everywhere in RL:
- State value V(s): how good it is to be in state s
- Action value Q(s, a): how good it is to take action a in state s
They help the agent estimate long-term consequences before it has seen every possible future.
Bellman Equations
The Bellman equations say that the value of a state equals:
- the immediate reward you expect now
- plus the value of where you are likely to end up next
That recursive structure is what makes dynamic programming and many RL algorithms possible.
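That recursive structure can be sketched as one sweep of tabular value iteration. This is a minimal illustration under an assumed encoding: `P` maps each state to its actions, and each action to a list of `(probability, next_state, reward)` outcomes; terminal states are simply omitted from `P`.

```python
def bellman_backup(V, P, gamma=0.9):
    """One sweep of value iteration: V(s) <- max over actions of
    the expected immediate reward plus the discounted value of
    where you are likely to end up next."""
    new_V = {}
    for s, action_map in P.items():
        new_V[s] = max(
            sum(p * (r + gamma * V.get(s2, 0.0)) for p, s2, r in outcomes)
            for outcomes in action_map.values()
        )
    return new_V
```

Repeating this backup until the values stop changing yields the optimal state values for a known MDP.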
Interactive Visualization
Watch an agent learn to navigate through trial and error:
Two Big Families of Methods
Value-based methods
- learn a score for actions, such as Q(s, a)
- then pick the action with the highest score
- examples: Q-learning, DQN
Policy-based methods
- learn the policy directly
- work well for continuous or stochastic actions
- examples: Policy Gradient, PPO
Key Algorithms
| Algorithm | Type | Main idea |
|---|---|---|
| Q-Learning | Value | Update action values from experience |
| DQN | Value | Use a neural network to approximate Q-values |
| REINFORCE | Policy | Push up actions that led to high reward |
| PPO | Policy | Improve the policy while limiting unstable updates |
| SAC | Actor-Critic | Add entropy to encourage exploration |
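As a concrete instance from the table, the tabular Q-learning update is essentially one line: nudge Q(s, a) toward the reward plus the discounted value of the best next action. A sketch, where `alpha` is the learning rate and `Q` is a plain dictionary:

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Q-learning: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)
    return Q
```

Starting from Q = 0 everywhere, a single reward of 1.0 moves the estimate to alpha × 1.0 = 0.1; repeated experience keeps pulling the values toward their long-run targets.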
Exploration vs Exploitation
The hardest practical tradeoff in RL is:
- Explore: try uncertain actions that might be better
- Exploit: use the best strategy found so far
If you only exploit, you may miss better strategies. If you only explore, you never settle on one.
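The standard compromise between the two is epsilon-greedy action selection: explore with a small probability, exploit otherwise. A minimal sketch using the same dictionary-style Q-table as above:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon, try a random action (explore);
    otherwise pick the action with the highest Q-value (exploit)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```

In practice, epsilon often starts high and decays over training, shifting the agent from exploration toward exploitation as its estimates improve.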
Where RL Shows Up
- Game playing
- Robotics
- Recommender systems
- Resource allocation
- Language model alignment methods such as RLHF
What To Remember
- RL is learning from reward, not labeled answers
- The objective is long-term reward, not immediate reward
- Most methods learn values, policies, or both