Deep Q-Networks (DQN)

Combining Q-learning with deep neural networks for Atari-level game playing

Deep Q-Networks (DQN) demonstrated that deep reinforcement learning could achieve superhuman performance on Atari games, learning directly from pixels. It was a landmark result that sparked the modern deep RL revolution.

Q-Learning Refresher

Learn the optimal action-value function:

Q^*(s, a) = \mathbb{E}\left[r + \gamma \max_{a'} Q^*(s', a') \mid s, a\right]

The optimal policy: \pi^*(s) = \arg\max_a Q^*(s, a)
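The Bellman update above can be sketched as a one-step tabular Q-learning rule. This is a minimal illustration on a toy 2-state, 2-action MDP; the names (`n_states`, `alpha`, `q_update`) and the sample transition are invented for the example, not from the text.

```python
import numpy as np

n_states, n_actions = 2, 2
alpha, gamma = 0.5, 0.99  # step size and discount (illustrative values)

Q = np.zeros((n_states, n_actions))

def q_update(Q, s, a, r, s_next):
    # TD target: r + γ max_a' Q(s', a')
    target = r + gamma * Q[s_next].max()
    # Move Q(s, a) a fraction alpha of the way toward the target
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

Q = q_update(Q, s=0, a=1, r=1.0, s_next=1)
# Starting from all-zero Q, one update gives Q[0, 1] = alpha * r = 0.5
```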

The Challenge: Function Approximation

For large state spaces (like images), we can’t store Q-values in a table. Solution: approximate with a neural network:

Q(s, a; \theta) \approx Q^*(s, a)

But naively combining deep learning with Q-learning is unstable!

Key Innovations

1. Experience Replay

Store transitions (s, a, r, s') in a replay buffer and sample random mini-batches:

  • Breaks correlation between consecutive samples
  • Reuses experience efficiently
  • Stabilizes training
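A replay buffer with these properties can be sketched in a few lines. This is a minimal version (the class name, capacity, and field order are assumptions for illustration): a bounded FIFO store with uniform random sampling.

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=1000):
        # Bounded deque: oldest transitions are evicted first when full
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks correlation between
        # consecutive transitions
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=100)
for i in range(50):
    buf.store(i, 0, 1.0, i + 1, False)
batch = buf.sample(8)
```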

2. Target Network

Use a separate network for computing targets:

y = r + \gamma \max_{a'} Q(s', a'; \theta^-)

where \theta^- is a copy of \theta updated periodically (e.g., every 10k steps).

  • Prevents the moving-target problem
  • Dramatically improves stability
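The periodic copy θ⁻ ← θ can be sketched for a network stored as a dict of weight arrays (the layer names and shapes here are invented for illustration). The key point is that the target parameters are a deep copy, so they stay frozen while the online network keeps training.

```python
import numpy as np

# Online network parameters (illustrative two-layer shapes)
theta = {"W1": np.random.randn(4, 8), "W2": np.random.randn(8, 2)}

def hard_update(theta):
    # Copy every array so θ⁻ stops tracking θ until the next sync
    return {name: w.copy() for name, w in theta.items()}

theta_target = hard_update(theta)
theta["W1"] += 1.0  # online network keeps changing...
# ...but the target copy is unaffected until the next hard_update call
```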

The DQN Loss

\mathcal{L}(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\left[(y - Q(s, a; \theta))^2\right]

where y = r + \gamma \max_{a'} Q(s', a'; \theta^-).
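The loss can be computed concretely on a small batch. All numbers below are invented for illustration; note the `(1 - done)` mask, which stops bootstrapping past terminal states.

```python
import numpy as np

gamma = 0.99
r = np.array([1.0, 0.0])
done = np.array([0.0, 1.0])            # second transition is terminal
q_next_target = np.array([[0.5, 0.8],  # Q(s', ·; θ⁻) for each sample
                          [0.2, 0.1]])
q_sa = np.array([0.4, 0.3])            # Q(s, a; θ) for the taken actions

# y = r + γ max_a' Q(s', a'; θ⁻), masked at terminal states
y = r + gamma * q_next_target.max(axis=1) * (1.0 - done)
loss = np.mean((y - q_sa) ** 2)
# y = [1.792, 0.0]; the squared errors average to the scalar loss
```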

Interactive Visualization

Watch DQN learn to estimate Q-values:


DQN innovations: Experience replay (random sampling) + target network (stable targets) = stable deep Q-learning.

Algorithm

Initialize replay buffer D, Q-network θ, target network θ⁻ = copy(θ)
total_steps = 0
for episode in range(num_episodes):
    s = env.reset()
    for t in range(max_steps):
        # ε-greedy action selection
        if random() < ε:
            a = random_action()
        else:
            a = argmax_a Q(s, a; θ)
        
        s', r, done = env.step(a)
        D.store(s, a, r, s', done)
        s = s'
        total_steps += 1
        
        # Train on a sampled mini-batch (not the current transition)
        if len(D) >= batch_size:
            (S, A, R, S', dones) = D.sample(batch_size)
            targets = R + γ * max_a' Q(S', a'; θ⁻) * (1 - dones)
            loss = MSE(Q(S, A; θ), targets)
            θ.update(loss)
        
        # Periodic target update, counted in total steps across episodes
        if total_steps % target_update_freq == 0:
            θ⁻ = copy(θ)
        
        if done:
            break

Hyperparameters

| Parameter | Typical Value |
| --- | --- |
| Replay buffer size | 1M transitions |
| Batch size | 32 |
| Learning rate | 0.00025 |
| Discount γ | 0.99 |
| Target update freq | 10,000 steps |
| ε decay | 1.0 → 0.1 over 1M steps |
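The ε decay schedule in the table is typically implemented as a linear ramp that is clamped after the decay window. A minimal sketch (the function name and keyword defaults are assumptions matching the table's values):

```python
def epsilon(step, eps_start=1.0, eps_end=0.1, decay_steps=1_000_000):
    # Linear interpolation from eps_start to eps_end, then held constant
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

epsilon(0)           # 1.0 at the start
epsilon(500_000)     # ≈ 0.55 halfway through the window
epsilon(2_000_000)   # ≈ 0.1, clamped after the decay window
```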

Improvements (Rainbow DQN)

| Enhancement | Benefit |
| --- | --- |
| Double DQN | Reduces overestimation bias |
| Prioritized replay | Focus on important transitions |
| Dueling networks | Separate value and advantage |
| Multi-step returns | Better credit assignment |
| Distributional RL | Model return distribution |
| Noisy networks | Better exploration |
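As an example of one row of the table, the Double DQN target decouples action *selection* (online network θ) from action *evaluation* (target network θ⁻). The Q-values below are invented to illustrate the mechanics:

```python
import numpy as np

gamma = 0.99
r = np.array([1.0])
q_next_online = np.array([[0.9, 0.7]])  # Q(s', ·; θ)  — selects the action
q_next_target = np.array([[0.5, 0.8]])  # Q(s', ·; θ⁻) — evaluates it

# Online network picks a* = argmax_a' Q(s', a'; θ)
a_star = q_next_online.argmax(axis=1)
# Target network supplies the value of that action
y_double = r + gamma * q_next_target[np.arange(1), a_star]

# Vanilla DQN instead maxes over the target network directly,
# which tends to overestimate
y_vanilla = r + gamma * q_next_target.max(axis=1)
```

With these numbers the double target (1.495) is lower than the vanilla target (1.792), illustrating the reduced overestimation bias.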

Impact

DQN proved that:

  1. Deep RL can learn from high-dimensional sensory input
  2. Experience replay + target networks stabilize training
  3. A single architecture can master diverse tasks

This opened the door to AlphaGo, robotic manipulation, and modern RL research.
