Neural Machine Translation by Jointly Learning to Align and Translate

The paper that introduced the attention mechanism for sequence-to-sequence models

This landmark paper introduced the attention mechanism for neural machine translation, solving the bottleneck problem of fixed-size encoder representations. It’s one of the most influential papers in modern AI.

The Bottleneck Problem

In standard seq2seq models, the encoder compresses the entire input into a fixed-size context vector c:

c = f(x_1, x_2, ..., x_T)

This becomes a bottleneck for long sequences—all information must squeeze through a single vector.
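To make the bottleneck concrete, here is a minimal NumPy sketch (toy dimensions, not the paper's architecture) of a vanilla RNN encoder reducing a 50-step input to a single fixed-size vector:

```python
import numpy as np

# Minimal sketch (toy dimensions, assumed): a vanilla RNN encoder that squeezes
# the whole input sequence into ONE fixed-size context vector c, regardless of length.
rng = np.random.default_rng(0)
d_in, d_h, T = 8, 16, 50                  # input dim, hidden dim, source length

W_xh = rng.normal(scale=0.1, size=(d_h, d_in))
W_hh = rng.normal(scale=0.1, size=(d_h, d_h))

x = rng.normal(size=(T, d_in))            # source sequence x_1 ... x_T
h = np.zeros(d_h)
for t in range(T):                        # h_t = tanh(W_xh x_t + W_hh h_{t-1})
    h = np.tanh(W_xh @ x[t] + W_hh @ h)

c = h                                     # c = f(x_1, ..., x_T): a single 16-dim vector
print(c.shape)                            # (16,) -- everything must fit through here
```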

The Attention Solution

Instead of a single context vector, attention computes a different context for each output step:

c_i = \sum_{j=1}^{T} \alpha_{ij} h_j

where \alpha_{ij} is the attention weight that output position i places on input position j.
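A minimal sketch of this weighted sum in NumPy, with illustrative shapes and hand-picked weights (echoing the demo below) rather than values from the paper:

```python
import numpy as np

# Sketch: the context for decoder step i is a weighted average of ALL encoder
# states h_1 ... h_T. Shapes and weights are illustrative assumptions.
T, d_h = 4, 16
H = np.random.default_rng(1).normal(size=(T, d_h))   # encoder states h_j, one row per source word

alpha_i = np.array([0.78, 0.14, 0.02, 0.06])          # attention weights for output step i
assert np.isclose(alpha_i.sum(), 1.0)                 # weights form a distribution over source positions

c_i = alpha_i @ H                                     # c_i = sum_j alpha_ij * h_j
print(c_i.shape)                                      # (16,) -- a different c_i at every output step
```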

Computing Attention Weights

The attention weights are computed by an alignment model:

e_{ij} = a(s_{i-1}, h_j)

\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T} \exp(e_{ik})}

The alignment function a (often a small neural network) scores how well input position j and output position i match.
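For concreteness, a small NumPy sketch of the softmax normalization step, using made-up alignment scores:

```python
import numpy as np

# Sketch of the normalization step only: raw alignment scores e_ij become
# attention weights alpha_ij via a softmax over the source positions j.
def softmax(e):
    e = e - e.max()                      # shift by the max for numerical stability
    w = np.exp(e)
    return w / w.sum()

e_i = np.array([2.1, 0.3, -1.5, 0.9])    # e_ij = a(s_{i-1}, h_j) for one output step i (made-up values)
alpha_i = softmax(e_i)
print(alpha_i, alpha_i.sum())            # non-negative weights that sum to 1
```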

Interactive Demo

Watch attention align source and target words during translation:

[Interactive demo: attention alignment while translating the French sentence "Je suis étudiant ." into "I am a student ." Each step shows the attention weights over the source words for the current output word; for the first output word "I", most of the weight falls on "Je" (78%), with smaller amounts on "suis" (14%), "." (6%), and "étudiant" (2%). With attention, the model learns which source words are relevant for each output word; without attention, the entire source must be compressed into a single fixed-size vector, the bottleneck described above.]

The Alignment Model

Bahdanau attention uses an additive alignment function:

e_{ij} = v^T \tanh(W_s s_{i-1} + W_h h_j)

This learnable function discovers which source words are relevant for generating each target word.
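Putting the pieces together, here is a hedged NumPy sketch of one decoder step of additive attention; the dimensions are toy values and W_s, W_h, v are random stand-ins for learned parameters:

```python
import numpy as np

# Hedged sketch of one decoder step of additive (Bahdanau-style) attention.
# Toy dimensions; W_s, W_h, v would normally be learned jointly with the rest of the model.
rng = np.random.default_rng(2)
T, d_h, d_s, d_a = 4, 16, 16, 32        # source length, encoder/decoder/attention dims

H = rng.normal(size=(T, d_h))           # encoder states h_j
s_prev = rng.normal(size=d_s)           # previous decoder state s_{i-1}

W_s = rng.normal(scale=0.1, size=(d_a, d_s))
W_h = rng.normal(scale=0.1, size=(d_a, d_h))
v = rng.normal(scale=0.1, size=d_a)

# e_ij = v^T tanh(W_s s_{i-1} + W_h h_j), computed for all source positions j at once
e_i = np.tanh(s_prev @ W_s.T + H @ W_h.T) @ v        # shape (T,)

alpha_i = np.exp(e_i - e_i.max())
alpha_i /= alpha_i.sum()                             # softmax over source positions
c_i = alpha_i @ H                                    # context vector fed to the decoder at step i

print(alpha_i.round(3), c_i.shape)
```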

Why This Changed Everything

  1. No bottleneck: Each output step accesses all encoder states
  2. Interpretable: Attention weights show what the model “looks at”
  3. Handles long sequences: Performance doesn’t degrade with length
  4. Generalizable: The mechanism applies far beyond translation

Results

Model                                        BLEU Score
RNNencdec-50 (no attention)                  17.82
RNNsearch-50 (attention)                     26.75
RNNsearch-50* (attention, longer training)   28.45
Moses (phrase-based SMT)                     33.30

Attention dramatically improved over the plain encoder-decoder baseline, nearly closing the gap with phrase-based SMT (and matching it on the subset of sentences without unknown words), and paved the way for the improvements that followed.

Legacy

This paper’s attention mechanism became:

  • The foundation of Transformers (self-attention)
  • Used in image captioning (visual attention)
  • Core to speech recognition (attention-based and hybrid CTC/attention models)
  • Essential for modern LLMs

Key Paper

Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural Machine Translation by Jointly Learning to Align and Translate. ICLR 2015. arXiv:1409.0473.
