Recurrent Neural Network Regularization

How to apply dropout to LSTMs without disrupting memory dynamics

Recurrent Neural Network Regularization solved a critical problem: standard dropout doesn’t work well in RNNs. This paper by Zaremba, Sutskever, and Vinyals showed how to apply dropout correctly to LSTMs.

The Problem

Naively applying dropout to recurrent connections causes issues:

h_t = f(W_h \cdot \text{dropout}(h_{t-1}) + W_x \cdot x_t)

The dropout noise accumulates over time, destroying the network’s ability to retain long-term information.
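
A toy simulation (not from the paper; the identity recurrence, p = 0.5, and T = 50 steps are illustrative choices) makes the accumulation concrete: a unit's value survives T independent dropout masks only with probability (1 − p)^T.

```python
import numpy as np

# Toy illustration: carry one memory value through T recurrent steps,
# applying a fresh inverted-dropout mask to the recurrent path each step.
rng = np.random.default_rng(0)
p, T, trials = 0.5, 50, 10_000

survived = 0
for _ in range(trials):
    h = 1.0                                   # the long-term information to keep
    for _ in range(T):
        mask = (rng.random() > p) / (1 - p)   # per-step dropout on h_{t-1}
        h = h * mask                          # identity "recurrence" for clarity
    survived += (h != 0)

print(f"trials with memory intact: {survived / trials:.4f}")   # ~0.0000
print(f"theoretical survival prob: {(1 - p) ** T:.1e}")        # ~8.9e-16
```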

The Solution

Apply dropout only to non-recurrent connections:

h_t = f(W_h \cdot h_{t-1} + W_x \cdot \text{dropout}(x_t))
y_t = \text{dropout}(h_t)

The recurrent path h_{t-1} → h_t remains clean, preserving memory.
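
A minimal PyTorch-style sketch of this placement (the sizes, sequence length, and single-cell loop are assumptions for illustration, not the authors' code):

```python
import torch
import torch.nn as nn

input_size, hidden_size, p = 128, 256, 0.5
cell = nn.LSTMCell(input_size, hidden_size)
drop = nn.Dropout(p)

x = torch.randn(35, 20, input_size)   # (seq_len, batch, input_size)
h = torch.zeros(20, hidden_size)
c = torch.zeros(20, hidden_size)

outputs = []
for t in range(x.size(0)):
    x_t = drop(x[t])                  # input -> hidden: dropout applied
    h, c = cell(x_t, (h, c))          # hidden -> hidden: recurrent path left clean
    outputs.append(drop(h))           # hidden -> output: dropout applied

y = torch.stack(outputs)              # (seq_len, batch, hidden_size)
```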

Where to Apply Dropout

| Connection | Apply Dropout? |
| --- | --- |
| Input → Hidden | ✓ Yes |
| Hidden → Hidden (recurrent) | ✗ No |
| Hidden → Output | ✓ Yes |
| Between LSTM layers | ✓ Yes |

Interactive Demo

Visualize how dropout randomly masks neurons during training:

[Interactive demo: "Dropout Regularization" — a small network (Input → Hidden 1 → Hidden 2 → Output) with an adjustable dropout rate, shown at 50%]
Training: randomly zero neurons with probability p and scale the survivors by 1/(1-p).
Inference: use all neurons; no scaling is needed (inverted dropout).
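
A minimal NumPy sketch of that train/inference split (the function name and example shapes are illustrative):

```python
import numpy as np

def inverted_dropout(x, p=0.5, training=True, rng=None):
    """Training: zero each unit with probability p and scale survivors by
    1/(1-p). Inference: return x unchanged -- no scaling needed."""
    if not training or p == 0.0:
        return x
    rng = rng or np.random.default_rng()
    mask = (rng.random(x.shape) > p).astype(x.dtype) / (1.0 - p)
    return x * mask

h = np.ones((4, 8))
print(inverted_dropout(h, p=0.5))                  # survivors become 2.0
print(inverted_dropout(h, p=0.5, training=False))  # identical to h
```
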
Why Dropout Works
• Prevents co-adaptation: neurons can't rely on specific others
• Implicit ensemble: trains exponentially many sub-networks
• Noise injection: adds regularization similar to data augmentation

LSTM-Specific Details

For multi-layer LSTMs, dropout is applied:

Layer 1: dropout(x) → LSTM → h1
Layer 2: dropout(h1) → LSTM → h2
Layer 3: dropout(h2) → LSTM → h3
Output:  dropout(h3) → Linear → y

Within each LSTM cell, the recurrent connections remain untouched.
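
As a sketch (the class name and sizes are assumptions), the stack above maps directly onto PyTorch's nn.LSTM: its dropout argument drops the outputs of every layer except the last, i.e. exactly the between-layer connections, while the recurrent connections inside each layer stay untouched; input and final-output dropout are applied by hand.

```python
import torch
import torch.nn as nn

class RegularizedLSTM(nn.Module):
    def __init__(self, vocab_size=10_000, embed_size=650,
                 hidden_size=650, num_layers=3, p=0.5):
        super().__init__()
        self.drop = nn.Dropout(p)
        self.embed = nn.Embedding(vocab_size, embed_size)
        # `dropout=p` acts between layers only, never on recurrent connections.
        self.lstm = nn.LSTM(embed_size, hidden_size,
                            num_layers=num_layers, dropout=p)
        self.decoder = nn.Linear(hidden_size, vocab_size)

    def forward(self, tokens, state=None):
        x = self.drop(self.embed(tokens))        # dropout(x)  -> LSTM stack
        h, state = self.lstm(x, state)           # between-layer dropout inside
        y = self.decoder(self.drop(h))           # dropout(h3) -> Linear -> y
        return y, state
```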

Results

On Penn Treebank word-level language modeling:

| Model | Perplexity |
| --- | --- |
| LSTM (no dropout) | 120.7 |
| LSTM (naive dropout) | Diverges |
| LSTM (this paper) | 78.4 |

Proper dropout reduces perplexity by 35%.

Mathematical Formulation

During training with dropout rate p:

\tilde{h}_t^l = \text{dropout}(h_t^l, p) \cdot \frac{1}{1-p}

The scaling factor ensures expected values match at test time.
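
As a one-line check, write the keep mask as m ~ Bernoulli(1−p) (a symbol not used above); the expectation of the rescaled unit then equals its test-time value:

\mathbb{E}\left[\tilde{h}_t^l\right] = \mathbb{E}\left[\frac{m \cdot h_t^l}{1-p}\right] = \frac{(1-p) \cdot h_t^l}{1-p} = h_t^l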

Key Insights

  1. Dropout placement matters: recurrent connections must stay noise-free so that memory and gradients flow undisturbed
  2. Same mask per sequence: use a consistent dropout mask across timesteps (an idea later formalized as variational dropout); see the sketch after this list
  3. Higher dropout for larger networks: models with more capacity tolerate, and benefit from, stronger regularization
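
A minimal sketch of insight 2, sampling one inverted-dropout mask per sequence and broadcasting it across timesteps (the helper name and tensor layout are assumptions):

```python
import torch

def dropout_shared_over_time(x, p=0.5, training=True):
    """Inverted dropout with one mask per sequence: sample the mask once and
    broadcast it over the time dimension of a (seq_len, batch, features)
    tensor, so every timestep drops the same units."""
    if not training or p == 0.0:
        return x
    mask = (torch.rand(1, x.size(1), x.size(2), device=x.device) > p).to(x.dtype)
    return x * mask / (1.0 - p)

x = torch.randn(35, 20, 650)              # (seq_len, batch, features)
x_tilde = dropout_shared_over_time(x, p=0.5)
```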

Legacy

This paper established best practices for RNN regularization that remained standard until Transformers. The insight that different connection types need different treatment influenced later work on attention dropout and layer normalization.

Key Paper

Zaremba, W., Sutskever, I., & Vinyals, O. (2014). Recurrent Neural Network Regularization. arXiv:1409.2329.
