Christopher Olah's visual guide to Long Short-Term Memory networks
“Understanding LSTM Networks” is Christopher Olah’s influential blog post providing an intuitive, visual explanation of how LSTMs work. It has become a standard introduction to LSTMs and recurrent architectures more broadly.
The Problem with Vanilla RNNs
Standard RNNs struggle with long-range dependencies. To predict the final word in “I grew up in France… I speak fluent ___”, the model needs “France”, which appeared many steps earlier.
In theory, RNNs can learn such dependencies. In practice, gradients backpropagated through many timesteps either (see the sketch after this list):
- Vanish: Shrink exponentially, preventing learning of long-range patterns
- Explode: Grow exponentially, causing unstable training
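A brief sketch of why (a standard argument, not specific to the post): backpropagation through time multiplies one Jacobian per step,

∂h_t/∂h_k = ∏_{i=k+1}^{t} ∂h_i/∂h_{i−1}

and if these factors are consistently smaller than 1 in norm, the product shrinks exponentially in t − k; if they are consistently larger than 1, it grows exponentially.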
The LSTM Solution
LSTMs introduce a cell state—a highway that allows information to flow across many timesteps with minimal modification. Gates control what information enters, leaves, or stays in this cell state.
The Four Components
1. Forget Gate
Decides what to discard from cell state:
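In the standard formulation (the same notation the post uses), the forget gate looks at the previous hidden state h_{t−1} and the current input x_t:

f_t = σ(W_f · [h_{t−1}, x_t] + b_f)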
Output near 0 = forget, near 1 = keep.
2. Input Gate
Decides which values to update:
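Again in the standard formulation, the input gate is a sigmoid over the same inputs:

i_t = σ(W_i · [h_{t−1}, x_t] + b_i)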
3. Cell Candidate
Creates new candidate values to potentially add:
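A tanh layer produces the candidate vector (standard formulation):

C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C)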
4. Output Gate
Decides what to output based on cell state:
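In the standard formulation, a sigmoid decides which parts of the cell state to expose, and the hidden state is that filter applied to a squashed cell state:

o_t = σ(W_o · [h_{t−1}, x_t] + b_o)
h_t = o_t * tanh(C_t)

Here * denotes element-wise multiplication.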
Cell State Update
The cell state update combines forgetting and adding:
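In equation form (standard formulation):

C_t = f_t * C_{t−1} + i_t * C̃_t

Pulling the four components together, here is a minimal NumPy sketch of a single LSTM step. The parameter names, shapes, and dict layout are choices made for this example, not something prescribed by the post.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One forward step of an LSTM cell.

    p holds weight matrices W_f, W_i, W_c, W_o of shape (hidden, hidden + input)
    and bias vectors b_f, b_i, b_c, b_o of shape (hidden,).
    """
    z = np.concatenate([h_prev, x_t])            # [h_{t-1}, x_t]

    f_t = sigmoid(p["W_f"] @ z + p["b_f"])       # forget gate
    i_t = sigmoid(p["W_i"] @ z + p["b_i"])       # input gate
    c_tilde = np.tanh(p["W_c"] @ z + p["b_c"])   # candidate values
    c_t = f_t * c_prev + i_t * c_tilde           # cell state update
    o_t = sigmoid(p["W_o"] @ z + p["b_o"])       # output gate
    h_t = o_t * np.tanh(c_t)                     # new hidden state

    return h_t, c_t

# Example with random parameters (hidden size 4, input size 3):
rng = np.random.default_rng(0)
p = {name: rng.standard_normal((4, 7)) for name in ["W_f", "W_i", "W_c", "W_o"]}
p.update({name: np.zeros(4) for name in ["b_f", "b_i", "b_c", "b_o"]})
h, c = lstm_step(rng.standard_normal(3), np.zeros(4), np.zeros(4), p)
```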
Why LSTMs Work
The cell state acts as a gradient highway. During backpropagation, if the forget gate is close to 1, gradients flow along the cell state nearly unchanged, which mitigates the vanishing gradient problem for information the network chooses to keep.
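To make that concrete, using the update equation above: since C_t = f_t * C_{t−1} + i_t * C̃_t, the direct derivative of C_t with respect to C_{t−1} is just f_t, element-wise (ignoring the gates’ indirect dependence on earlier states). When f_t ≈ 1, this factor neither shrinks nor amplifies the gradient, however many steps it travels.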
The Intuition
Think of an LSTM as a conveyor belt (cell state) with workers (gates):
- Forget gate: Workers removing items from the belt
- Input gate: Workers adding new items
- Output gate: Workers taking items off for current use
The belt keeps moving; workers only modify what’s necessary.
Variants
- GRU (Gated Recurrent Unit): Merges the forget and input gates into a single update gate and drops the separate cell state, making it simpler (see the equations after this list)
- Peephole Connections: Let the gates look at the cell state directly
- Bidirectional LSTM: Processes the sequence in both directions and combines the results
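For reference, the GRU mentioned above collapses the gating into an update gate z_t and a reset gate r_t, with no separate cell state (standard formulation, biases omitted):

z_t = σ(W_z · [h_{t−1}, x_t])
r_t = σ(W_r · [h_{t−1}, x_t])
h̃_t = tanh(W · [r_t * h_{t−1}, x_t])
h_t = (1 − z_t) * h_{t−1} + z_t * h̃_t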