The paper that introduced the attention mechanism for sequence-to-sequence models
This landmark paper introduced the attention mechanism for neural machine translation, solving the bottleneck problem of fixed-size encoder representations. It’s one of the most influential papers in modern AI.
The Bottleneck Problem
In standard seq2seq models, the encoder compresses the entire input into a fixed-size context vector $c$:

$$
c = q(h_1, \dots, h_{T_x}), \qquad \text{often simply } c = h_{T_x}
$$
This becomes a bottleneck for long sequences—all information must squeeze through a single vector.
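To make the bottleneck concrete, here is a minimal NumPy sketch of a vanilla RNN encoder that squeezes an arbitrarily long input into one fixed-size vector. It is not the paper's exact architecture; the names `W_x` and `W_h` and all shapes are illustrative placeholders.

```python
# Minimal sketch (not the paper's exact encoder): a plain tanh RNN that
# compresses the whole input sequence into a single fixed-size vector.
import numpy as np

def encode_to_fixed_vector(inputs, W_x, W_h):
    """inputs: (T, input_dim); W_x: (hidden, input_dim); W_h: (hidden, hidden)."""
    h = np.zeros(W_h.shape[0])
    for x_t in inputs:                    # walk over the source tokens
        h = np.tanh(W_x @ x_t + W_h @ h)  # update the single hidden state
    return h                              # all the decoder ever sees

rng = np.random.default_rng(0)
T, d_in, d_h = 30, 8, 16
c = encode_to_fixed_vector(rng.normal(size=(T, d_in)),
                           0.1 * rng.normal(size=(d_h, d_in)),
                           0.1 * rng.normal(size=(d_h, d_h)))
print(c.shape)  # (16,) regardless of how long the input sentence was
```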
The Attention Solution
Instead of a single context vector, attention computes a different context $c_i$ for each output step $i$:

$$
c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j
$$

where $\alpha_{ij}$ is the attention weight that output position $i$ places on input position $j$, and $h_j$ is the encoder state at position $j$.
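As a quick illustration of that sum, the sketch below computes every $c_i$ at once as a matrix product, assuming the encoder states `H` and attention weights `alpha` already exist (the uniform weights are placeholders, not something the model would learn).

```python
# Sketch: context vectors c_i = sum_j alpha[i, j] * H[j] for every output step.
import numpy as np

def context_vectors(alpha, H):
    """alpha: (T_y, T_x) attention weights; H: (T_x, hidden) encoder states."""
    return alpha @ H  # row i is c_i, a different context for each output step

T_x, T_y, d_h = 5, 3, 4
H = np.arange(T_x * d_h, dtype=float).reshape(T_x, d_h)
alpha = np.full((T_y, T_x), 1.0 / T_x)   # placeholder uniform weights
print(context_vectors(alpha, H).shape)   # (3, 4): one context per output step
```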
Computing Attention Weights
The attention weights are computed by scoring each (output, input) pair with an alignment model and normalizing with a softmax over the input positions:

$$
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \qquad e_{ij} = a(s_{i-1}, h_j)
$$

The alignment function $a$ (often a small neural network) scores how well the input around position $j$ matches the output at position $i$.
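The sketch below shows just the softmax step for a single output position, assuming the alignment scores $e_{ij}$ have already been produced by some scoring function (the example scores are made up).

```python
# Sketch: turn alignment scores e_ij into attention weights alpha_ij
# with a softmax over the input positions j.
import numpy as np

def attention_weights(scores):
    """scores: (T_x,) alignment scores for one output step i."""
    scores = scores - scores.max()   # subtract the max for numerical stability
    exp = np.exp(scores)
    return exp / exp.sum()           # weights are positive and sum to 1

e = np.array([2.0, 0.5, -1.0, 0.0])  # made-up scores for four input positions
alpha = attention_weights(e)
print(alpha.round(3), alpha.sum())   # the highest score gets the largest weight
```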
Interactive Demo
Watch attention align source and target words during translation:

[Interactive demo: Attention Alignment]
The Alignment Model
Bahdanau attention uses an additive alignment function:

$$
e_{ij} = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)
$$

where $s_{i-1}$ is the previous decoder state, $h_j$ is the encoder state at position $j$, and $W_a$, $U_a$, $v_a$ are learned parameters. This learnable function discovers which source words are relevant for generating each target word.
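Putting the pieces together, here is a sketch of one decoder step of additive attention, with randomly initialized `W_a`, `U_a`, `v_a` standing in for the learned parameters and illustrative dimensions throughout.

```python
# Sketch of Bahdanau-style additive attention for one decoder step:
# e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j), softmax over j, then c_i = sum_j alpha_ij h_j.
import numpy as np

def additive_attention(s_prev, H, W_a, U_a, v_a):
    """s_prev: (d_s,) previous decoder state; H: (T_x, d_h) encoder states."""
    scores = np.tanh(H @ U_a.T + W_a @ s_prev) @ v_a  # (T_x,) alignment scores
    scores -= scores.max()                            # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()     # softmax over inputs
    context = alpha @ H                               # weighted sum of h_j
    return alpha, context

rng = np.random.default_rng(0)
T_x, d_h, d_s, d_a = 6, 8, 8, 10
H = rng.normal(size=(T_x, d_h))          # encoder states h_1 .. h_Tx
s_prev = rng.normal(size=d_s)            # previous decoder state s_{i-1}
alpha, c_i = additive_attention(s_prev, H,
                                rng.normal(size=(d_a, d_s)),   # W_a
                                rng.normal(size=(d_a, d_h)),   # U_a
                                rng.normal(size=d_a))          # v_a
print(alpha.round(2), c_i.shape)         # weights over the source, context (8,)
```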
Why This Changed Everything
- No bottleneck: Each output step accesses all encoder states
- Interpretable: Attention weights show what the model “looks at”
- Handles long sequences: Translation quality no longer collapses as sentences grow longer, unlike the no-attention baseline
- Generalizable: The mechanism applies far beyond translation
Results
| Model | BLEU (all test sentences) |
|---|---|
| RNNencdec-50 (no attention) | 17.82 |
| RNNsearch-50 (attention) | 26.75 |
| RNNsearch-50* (attention, longer training) | 28.45 |
| Moses (phrase-based SMT) | 33.30 |
Attention dramatically outperformed the no-attention baseline and closed much of the gap with phrase-based statistical MT (matching it on sentences without unknown words), paving the way for later improvements.
Legacy
This paper’s attention mechanism became:
- The foundation of Transformers (self-attention)
- Used in image captioning (visual attention)
- Core to end-to-end speech recognition (attention-based and hybrid CTC/attention models)
- Essential for modern LLMs
Key Paper
- Neural Machine Translation by Jointly Learning to Align and Translate — Bahdanau, Cho & Bengio (2014): https://arxiv.org/abs/1409.0473