How to apply dropout to LSTMs without disrupting memory dynamics
The paper *Recurrent Neural Network Regularization* (Zaremba, Sutskever, and Vinyals, 2014) solved a critical problem: standard dropout doesn't work well in RNNs. It showed how to apply dropout correctly to LSTMs.
The Problem
Naively applying dropout to the recurrent (hidden-to-hidden) connections causes trouble: a fresh dropout mask perturbs the hidden state at every timestep, so the noise compounds over the length of the sequence and destroys the network's ability to retain long-term information.
The Solution
Apply dropout only to the non-recurrent connections: the input, the output, and the vertical connections between stacked layers. The recurrent hidden-to-hidden path remains clean, so information stored in the cell state survives across timesteps.
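A minimal NumPy sketch of one LSTM timestep with this placement (illustrative code, not the paper's implementation; `dropout`, `sigmoid`, and `lstm_step` are names chosen here):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p):
    """Inverted dropout: zero units with prob p, scale survivors by 1/(1-p)."""
    mask = (rng.random(x.shape) >= p) / (1.0 - p)
    return x * mask

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b, p=0.5):
    """One LSTM timestep. Dropout hits only the input x_t (a non-recurrent
    connection); h_prev and c_prev pass through unmasked, so the memory
    path stays noise-free."""
    z = W @ dropout(x_t, p) + U @ h_prev + b   # recurrent term U @ h_prev is clean
    i, f, o, g = np.split(z, 4)                # input/forget/output gates + candidate
    c_t = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
    h_t = sigmoid(o) * np.tanh(c_t)
    return h_t, c_t
```

Here `W` has shape `(4H, D)`, `U` has shape `(4H, H)`, and `b` has shape `(4H,)` for hidden size `H` and input size `D`.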
Where to Apply Dropout
| Connection | Apply Dropout? |
|---|---|
| Input → Hidden | ✓ Yes |
| Hidden → Hidden (recurrent) | ✗ No |
| Hidden → Output | ✓ Yes |
| Between LSTM layers | ✓ Yes |
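PyTorch's built-in `nn.LSTM` encodes the same table: its `dropout` argument applies dropout to the output of each stacked layer except the last (the vertical, non-recurrent connections) and never to the hidden-to-hidden recurrence:

```python
import torch
import torch.nn as nn

# dropout=0.5 masks the outputs between layers 1→2 and 2→3 only;
# the recurrent connections inside each layer are untouched.
lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=3, dropout=0.5)

x = torch.randn(7, 1, 10)          # (seq_len, batch, input_size)
out, (h_n, c_n) = lstm(x)          # out: (7, 1, 20), h_n/c_n: (3, 1, 20)
```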
LSTM-Specific Details
For multi-layer LSTMs, dropout is applied:
Layer 1: dropout(x) → LSTM → h1
Layer 2: dropout(h1) → LSTM → h2
Layer 3: dropout(h2) → LSTM → h3
Output: dropout(h3) → Linear → y
Within each LSTM cell, the recurrent connections remain untouched.
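The stack above can be sketched as follows. To keep the example short, a plain tanh RNN cell stands in for the full LSTM; only the dropout placement matters here, and all names (`dropout`, `make_rnn_layer`, `forward`) are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p):
    """Inverted dropout: zero with prob p, rescale survivors by 1/(1-p)."""
    mask = (rng.random(x.shape) >= p) / (1.0 - p)
    return x * mask

def make_rnn_layer(d_in, d_hid):
    """A tanh RNN cell as a stand-in for an LSTM layer."""
    W = rng.standard_normal((d_hid, d_in)) * 0.1
    U = rng.standard_normal((d_hid, d_hid)) * 0.1
    return (lambda x, h: np.tanh(W @ x + U @ h)), d_hid

def forward(x_seq, layers, W_out, p=0.5):
    seq = x_seq
    for step, d_hid in layers:
        h, out = np.zeros(d_hid), []
        for x_t in seq:
            h = step(dropout(x_t, p), h)   # vertical input: dropped
            out.append(h)                   # horizontal h -> h: untouched
        seq = out
    return W_out @ dropout(seq[-1], p)      # dropout(h3) -> Linear -> y

layers = [make_rnn_layer(4, 8), make_rnn_layer(8, 8), make_rnn_layer(8, 8)]
W_out = rng.standard_normal((2, 8)) * 0.1
y = forward([rng.standard_normal(4) for _ in range(5)], layers, W_out)
```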
Results
On Penn Treebank word-level language modeling:
| Model | Perplexity |
|---|---|
| LSTM (no dropout) | 120.7 |
| LSTM (naive dropout) | Diverges |
| LSTM (this paper) | 78.4 |
Proper dropout reduces perplexity by 35%.
Mathematical Formulation
During training with dropout rate $p$, each non-recurrent activation $h$ is masked:

$$\tilde{h} = m \odot h, \qquad m_i \sim \mathrm{Bernoulli}(1 - p)$$

At test time no units are dropped and the activation is scaled instead: $\tilde{h} = (1 - p)\,h$. Since $\mathbb{E}[m_i] = 1 - p$, this scaling factor ensures the expected value of each unit matches between training and test. (The common "inverted dropout" variant instead scales by $1/(1 - p)$ during training, leaving test time untouched.)
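A quick numerical check of the expectation-matching argument, using the inverted-dropout variant (rescale by 1/(1-p) at training time):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5
h = np.ones(10)                     # a toy activation vector

# Draw many rescaled dropout masks; their average recovers the untouched
# activation, so train-time and test-time expected values agree.
masks = (rng.random((100_000, 10)) >= p) / (1.0 - p)
est = (masks * h).mean()
print(est)                          # ≈ 1.0
```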
Key Insights
- Dropout placement matters: Recurrent connections need clean gradients
- Same mask per sequence: Use consistent dropout mask across timesteps
- Higher dropout for deeper networks: More layers allow more regularization
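The "same mask per sequence" point can be implemented by sampling the mask once and reusing it at every timestep (a hedged sketch; `sequence_dropout` is a name invented here):

```python
import numpy as np

rng = np.random.default_rng(0)

def sequence_dropout(x_seq, p=0.5):
    """Sample ONE inverted-dropout mask and reuse it at every timestep,
    so each unit is either kept or dropped for the whole sequence."""
    mask = (rng.random(x_seq[0].shape) >= p) / (1.0 - p)
    return [x * mask for x in x_seq]

seq = [np.ones(8) for _ in range(5)]
out = sequence_dropout(seq)
# The zero pattern is identical at every timestep:
print(all(np.array_equal(out[0] == 0, o == 0) for o in out))  # True
```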
Legacy
This paper established best practices for RNN regularization that remained standard until Transformers. The insight that different connection types need different treatment influenced later work on attention dropout and layer normalization.
Key Paper
- Recurrent Neural Network Regularization — Zaremba, Sutskever, Vinyals (2014)
https://arxiv.org/abs/1409.2329