Understanding LSTM Networks

Christopher Olah's visual guide to Long Short-Term Memory networks

Understanding LSTM Networks is Christopher Olah’s influential 2015 blog post that builds up an intuitive, visual explanation of how LSTMs work. It has become a standard introduction to gated recurrent architectures.

The Problem with Vanilla RNNs

Standard RNNs struggle with long-range dependencies. To predict the final word in “I grew up in France… I speak fluent ___”, the model needs information (“France”) from many steps earlier.

In theory, RNNs can learn such dependencies. In practice, gradients either:

  • Vanish: Shrink exponentially, preventing learning of long-range patterns
  • Explode: Grow exponentially, causing unstable training
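
A toy calculation makes the two failure modes concrete. The snippet below is only an illustration: it uses a single scalar factor as a stand-in for the per-step Jacobian (an assumption for simplicity; real RNNs multiply full Jacobian matrices during backpropagation through time):

```python
# Toy illustration: backpropagation through time multiplies one factor per step.
# A single scalar stands in for the per-step Jacobian norm (an assumption made
# purely for illustration).
def gradient_after(steps, per_step_factor, initial_grad=1.0):
    grad = initial_grad
    for _ in range(steps):
        grad *= per_step_factor
    return grad

print(gradient_after(50, 0.9))  # ~0.005 -> the gradient has effectively vanished
print(gradient_after(50, 1.1))  # ~117   -> the gradient has exploded
```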

The LSTM Solution

LSTMs introduce a cell state—a highway that allows information to flow across many timesteps with minimal modification. Gates control what information enters, leaves, or stays in this cell state.

The Four Components

1. Forget Gate

Decides what to discard from the cell state:

f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)

Output near 0 = forget, near 1 = keep.
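
In code, the forget gate is a single sigmoid layer over the concatenated previous hidden state and current input. The NumPy sketch below is a minimal illustration; the sizes and variable names (hidden_size, W_f, b_f) are assumptions for the example, not taken from the post:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden_size, input_size = 4, 3
rng = np.random.default_rng(0)
W_f = rng.normal(size=(hidden_size, hidden_size + input_size))  # forget-gate weights
b_f = np.zeros(hidden_size)

h_prev = np.zeros(hidden_size)     # h_{t-1}, previous hidden state
x_t = rng.normal(size=input_size)  # x_t, current input

f_t = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)  # each entry in (0, 1)
```

The input and output gates below follow exactly the same pattern with their own weights and biases (W_i, b_i and W_o, b_o).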

2. Input Gate

Decides which values to update:

i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)

3. Cell Candidate

Creates new candidate values to potentially add:

\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)

4. Output Gate

Decides what to output based on the cell state:

o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)

The new hidden state is the output gate applied to the squashed cell state: h_t = o_t \ast \tanh(C_t)

Cell State Update

The cell state update combines forgetting and adding:

C_t = f_t \ast C_{t-1} + i_t \ast \tilde{C}_t
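
Putting the pieces together, one forward step of an LSTM cell is only a few lines. The following NumPy sketch mirrors the equations above; the weight shapes, small random initialization, and helper names are assumptions for illustration, not a reference implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o):
    """One LSTM timestep, mirroring the four components above."""
    z = np.concatenate([h_prev, x_t])    # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)         # forget gate
    i_t = sigmoid(W_i @ z + b_i)         # input gate
    C_tilde = np.tanh(W_C @ z + b_C)     # cell candidate
    o_t = sigmoid(W_o @ z + b_o)         # output gate
    C_t = f_t * C_prev + i_t * C_tilde   # cell state update
    h_t = o_t * np.tanh(C_t)             # new hidden state
    return h_t, C_t

# Unroll over a short random sequence with small random weights.
H, D = 4, 3
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(H, H + D)) * 0.1 for _ in range(4)]
bs = [np.zeros(H) for _ in range(4)]
h, C = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, D)):
    h, C = lstm_step(x, h, C, *Ws, *bs)
```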

The LSTM Cell

[Diagram from the original interactive demo: the cell state (Cₜ₋₁ → Cₜ) runs along the top of the cell, passing through the forget gate (×), receiving the gated input candidate (tanh, ×, +), and feeding the output gate (×) that produces the hidden state (hₜ₋₁ → hₜ).]

  • Cell State: The "memory highway"; information can flow unchanged across many timesteps
  • Gates: Sigmoid (σ) outputs between 0 and 1, controlling how much information passes through

Why LSTMs Work

The cell state acts as a gradient highway. During backpropagation:

\frac{\partial C_t}{\partial C_{t-1}} = f_t

If the forget gate is close to 1, gradients flow back through the cell state nearly unchanged. This mitigates the vanishing gradient problem for information the network chooses to keep.
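
Because the gates are computed from h_{t-1} and x_t rather than from C_{t-1} itself (the standard, peephole-free formulation), the update is linear in C_{t-1}, so its Jacobian with respect to C_{t-1} is just the forget gate on the diagonal. Continuing the hypothetical lstm_step sketch from earlier, a quick finite-difference check illustrates this:

```python
# Continuing the lstm_step sketch above: perturb one component of C_{t-1}
# while holding h_{t-1} and x_t fixed. Because the gates do not see C_{t-1},
# the change in C_t is exactly f_t * delta for that component.
x_t = rng.normal(size=D)
delta = 1e-6
C_bumped = C.copy()
C_bumped[0] += delta

_, C_next = lstm_step(x_t, h, C, *Ws, *bs)
_, C_next_bumped = lstm_step(x_t, h, C_bumped, *Ws, *bs)

print((C_next_bumped[0] - C_next[0]) / delta)  # ≈ f_t[0], the forget-gate value
```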

The Intuition

Think of an LSTM as a conveyor belt (cell state) with workers (gates):

  • Forget gate: Workers removing items from the belt
  • Input gate: Workers adding new items
  • Output gate: Workers taking items off for current use

The belt keeps moving; workers only modify what’s necessary.

Variants

  • GRU (Gated Recurrent Unit): Merges the forget and input gates into a single update gate; simpler than an LSTM (see the sketch after this list)
  • Peephole Connections: Let the gates look at the cell state directly
  • Bidirectional LSTM: Processes the sequence in both directions
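
For comparison, here is a minimal sketch of a standard GRU step, which merges the forget/input pair into one update gate and keeps no separate cell state. The names and shapes, and the omission of biases, are simplifications of my own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU timestep: a single update gate z_t replaces the forget/input pair."""
    zx = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ zx)                                       # update gate
    r_t = sigmoid(W_r @ zx)                                       # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))  # candidate
    return (1.0 - z_t) * h_prev + z_t * h_tilde                   # new hidden state

# Tiny usage example with random weights.
H, D = 4, 3
rng = np.random.default_rng(0)
W_z, W_r, W_h = (rng.normal(size=(H, H + D)) * 0.1 for _ in range(3))
h = gru_step(rng.normal(size=D), np.zeros(H), W_z, W_r, W_h)
```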

Key Resource

Christopher Olah, “Understanding LSTM Networks”, colah's blog (August 2015): https://colah.github.io/posts/2015-08-Understanding-LSTMs/