Recurrent Neural Network Regularization

How to apply dropout to LSTMs without disrupting memory dynamics

Recurrent Neural Network Regularization solved a critical problem: standard dropout doesn’t work well in RNNs. This paper by Zaremba, Sutskever, and Vinyals showed how to apply dropout correctly to LSTMs.

The Problem

Naively applying dropout to recurrent connections causes issues:

h_t = f(W_h \cdot \text{dropout}(h_{t-1}) + W_x \cdot x_t)

The dropout noise accumulates over time, destroying the network’s ability to retain long-term information.
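
A toy simulation (not from the paper; the identity recurrence, p = 0.5, and T = 50 steps are illustrative choices) makes the accumulation concrete: a unit's value survives T independent dropout masks only with probability (1 − p)^T.

```python
import numpy as np

# Toy illustration: carry one memory value through T recurrent steps,
# applying a fresh inverted-dropout mask to the recurrent path each step.
rng = np.random.default_rng(0)
p, T, trials = 0.5, 50, 10_000

survived = 0
for _ in range(trials):
    h = 1.0                                   # the long-term information to keep
    for _ in range(T):
        mask = (rng.random() > p) / (1 - p)   # per-step dropout on h_{t-1}
        h = h * mask                          # identity "recurrence" for clarity
    survived += (h != 0)

print(f"trials with memory intact: {survived / trials:.4f}")   # ~0.0000
print(f"theoretical survival prob: {(1 - p) ** T:.1e}")        # ~8.9e-16
```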

The Solution

Apply dropout only to non-recurrent connections:

h_t = f(W_h \cdot h_{t-1} + W_x \cdot \text{dropout}(x_t))
y_t = \text{dropout}(h_t)

The recurrent path h_{t-1} → h_t remains clean, preserving memory.
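
A minimal PyTorch-style sketch of this placement (the sizes, sequence length, and single-cell loop are assumptions for illustration, not the authors' code):

```python
import torch
import torch.nn as nn

input_size, hidden_size, p = 128, 256, 0.5
cell = nn.LSTMCell(input_size, hidden_size)
drop = nn.Dropout(p)

x = torch.randn(35, 20, input_size)   # (seq_len, batch, input_size)
h = torch.zeros(20, hidden_size)
c = torch.zeros(20, hidden_size)

outputs = []
for t in range(x.size(0)):
    x_t = drop(x[t])                  # input -> hidden: dropout applied
    h, c = cell(x_t, (h, c))          # hidden -> hidden: recurrent path left clean
    outputs.append(drop(h))           # hidden -> output: dropout applied

y = torch.stack(outputs)              # (seq_len, batch, hidden_size)
```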

Where to Apply Dropout

| Connection | Apply Dropout? |
| --- | --- |
| Input → Hidden | ✓ Yes |
| Hidden → Hidden (recurrent) | ✗ No |
| Hidden → Output | ✓ Yes |
| Between LSTM layers | ✓ Yes |

Interactive Demo

Visualize how dropout randomly masks neurons during training:

[Interactive demo: "Dropout Regularization" — a small network (Input → Hidden 1 → Hidden 2 → Output) with an adjustable dropout rate, shown at 50%]
Training: randomly zero neurons with probability p and scale the survivors by 1/(1-p).
Inference: use all neurons; no scaling is needed (inverted dropout).
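
A minimal NumPy sketch of that train/inference split (the function name and example shapes are illustrative):

```python
import numpy as np

def inverted_dropout(x, p=0.5, training=True, rng=None):
    """Training: zero each unit with probability p and scale survivors by
    1/(1-p). Inference: return x unchanged -- no scaling needed."""
    if not training or p == 0.0:
        return x
    rng = rng or np.random.default_rng()
    mask = (rng.random(x.shape) > p).astype(x.dtype) / (1.0 - p)
    return x * mask

h = np.ones((4, 8))
print(inverted_dropout(h, p=0.5))                  # survivors become 2.0
print(inverted_dropout(h, p=0.5, training=False))  # identical to h
```
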
Why Dropout Works
• Prevents co-adaptation: neurons can't rely on specific others
• Implicit ensemble: trains exponentially many sub-networks
• Noise injection: adds regularization similar to data augmentation

LSTM-Specific Details

For multi-layer LSTMs, dropout is applied:

Layer 1: dropout(x) → LSTM → h1
Layer 2: dropout(h1) → LSTM → h2
Layer 3: dropout(h2) → LSTM → h3
Output:  dropout(h3) → Linear → y

Within each LSTM cell, the recurrent connections remain untouched.
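
As a sketch (the class name and sizes are assumptions), the stack above maps directly onto PyTorch's nn.LSTM: its dropout argument drops the outputs of every layer except the last, i.e. exactly the between-layer connections, while the recurrent connections inside each layer stay untouched; input and final-output dropout are applied by hand.

```python
import torch
import torch.nn as nn

class RegularizedLSTM(nn.Module):
    def __init__(self, vocab_size=10_000, embed_size=650,
                 hidden_size=650, num_layers=3, p=0.5):
        super().__init__()
        self.drop = nn.Dropout(p)
        self.embed = nn.Embedding(vocab_size, embed_size)
        # `dropout=p` acts between layers only, never on recurrent connections.
        self.lstm = nn.LSTM(embed_size, hidden_size,
                            num_layers=num_layers, dropout=p)
        self.decoder = nn.Linear(hidden_size, vocab_size)

    def forward(self, tokens, state=None):
        x = self.drop(self.embed(tokens))        # dropout(x)  -> LSTM stack
        h, state = self.lstm(x, state)           # between-layer dropout inside
        y = self.decoder(self.drop(h))           # dropout(h3) -> Linear -> y
        return y, state
```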

Results

On Penn Treebank word-level language modeling:

| Model | Perplexity |
| --- | --- |
| LSTM (no dropout) | 120.7 |
| LSTM (naive dropout) | Diverges |
| LSTM (this paper) | 78.4 |

Proper dropout reduces perplexity by 35%.

Mathematical Formulation

During training with dropout rate p:

\tilde{h}_t^l = \text{dropout}(h_t^l, p) \cdot \frac{1}{1-p}

The scaling factor ensures expected values match at test time.
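
As a one-line check, write the keep mask as m ~ Bernoulli(1−p) (a symbol not used above); the expectation of the rescaled unit then equals its test-time value:

\mathbb{E}\left[\tilde{h}_t^l\right] = \mathbb{E}\left[\frac{m \cdot h_t^l}{1-p}\right] = \frac{(1-p) \cdot h_t^l}{1-p} = h_t^l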

Key Insights

  1. Dropout placement matters: recurrent connections must stay noise-free so that memory and gradients flow undisturbed
  2. Same mask per sequence: use a consistent dropout mask across timesteps (an idea later formalized as variational dropout); see the sketch after this list
  3. Higher dropout for larger networks: models with more capacity tolerate, and benefit from, stronger regularization
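
A minimal sketch of insight 2, sampling one inverted-dropout mask per sequence and broadcasting it across timesteps (the helper name and tensor layout are assumptions):

```python
import torch

def dropout_shared_over_time(x, p=0.5, training=True):
    """Inverted dropout with one mask per sequence: sample the mask once and
    broadcast it over the time dimension of a (seq_len, batch, features)
    tensor, so every timestep drops the same units."""
    if not training or p == 0.0:
        return x
    mask = (torch.rand(1, x.size(1), x.size(2), device=x.device) > p).to(x.dtype)
    return x * mask / (1.0 - p)

x = torch.randn(35, 20, 650)              # (seq_len, batch, features)
x_tilde = dropout_shared_over_time(x, p=0.5)
```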

Legacy

This paper established best practices for RNN regularization that remained standard until Transformers. The insight that different connection types need different treatment influenced later work on attention dropout and layer normalization.

Key Paper

Zaremba, W., Sutskever, I., & Vinyals, O. (2014). Recurrent Neural Network Regularization. arXiv:1409.2329.
