Encoder-decoder architecture for mapping sequences to sequences
Sequence-to-Sequence (Seq2Seq), introduced by Sutskever et al. at Google in 2014, established the encoder-decoder paradigm for mapping variable-length input sequences to variable-length output sequences. It enabled neural machine translation and became the foundation for modern language models.
The Problem
Traditional neural networks require fixed-size inputs and outputs. But many tasks have variable lengths:
- Translation: “Hello” (1 word) → “Bonjour” (1 word)
- Translation: “How are you?” (3 words) → “Comment allez-vous?” (2 words)
The Solution: Encoder-Decoder
Split the problem into two parts:
- Encoder: Read the entire input sequence, compress into a fixed-size vector (the “thought vector” or “context”)
- Decoder: Generate the output sequence from this context vector
The Architecture
Encoder (LSTM)
Processes the input sequence x_1, …, x_T:

h_t = LSTM(x_t, h_{t-1})

The final hidden state becomes the context vector c = h_T.
Decoder (LSTM)
Generates the output sequence y_1, …, y_{T'}:

s_t = LSTM(y_{t-1}, s_{t-1})

The decoder is initialized with s_0 = c.
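The wiring can be sketched directly with PyTorch's `nn.LSTM` (the dimensions below are illustrative, not the paper's; the real model also needs embeddings and an output projection):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb, hid = 8, 16  # illustrative sizes

encoder = nn.LSTM(input_size=emb, hidden_size=hid)
decoder = nn.LSTM(input_size=emb, hidden_size=hid)

src = torch.randn(5, 1, emb)  # (src_len, batch, emb)
trg = torch.randn(4, 1, emb)  # (trg_len, batch, emb)

# Encoder: the final (hidden, cell) state is the fixed-size context
_, (h, c) = encoder(src)

# Decoder: initialized with the context, consumes target-side inputs
out, _ = decoder(trg, (h, c))
print(out.shape)  # torch.Size([4, 1, 16])
```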
Key Innovations
1. Reversing Input Sequence
Reversing the source sentence improved translation significantly:
```
Original: "A B C" → "X Y Z"
Reversed: "C B A" → "X Y Z"
```
This brings the beginning of the source sentence much closer to the beginning of the target sentence, shortening the dependencies the RNN must learn for the first output words and making optimization easier.
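One way to see the effect is to count RNN steps between reading a source token and emitting its translation (this distance measure is a simplification for illustration):

```python
def rnn_distance(src, trg, i, j):
    # Steps from reading src[i] until emitting trg[j], assuming the RNN
    # consumes the whole source before producing any target tokens
    return (len(src) - i) + j

src, trg = ["A", "B", "C"], ["X", "Y", "Z"]
print(rnn_distance(src, trg, 0, 0))        # "A" -> "X", normal order: 3 steps
print(rnn_distance(src[::-1], trg, 2, 0))  # reversed: "A" is read last, 1 step
```

The average distance is unchanged, but the first few target words now have very short dependencies, which the paper credits for the improvement.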
2. Deep LSTMs
Using 4-layer LSTMs significantly outperformed shallow networks:
| Depth | BLEU Score |
|---|---|
| 1 layer | 25.9 |
| 2 layers | 29.6 |
| 4 layers | 34.8 |
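In PyTorch, stacking LSTM layers is a single argument; a minimal sketch (sizes are illustrative, not the paper's configuration):

```python
import torch.nn as nn

# 4-layer stacked LSTM; input/hidden sizes are illustrative
deep_lstm = nn.LSTM(input_size=8, hidden_size=16, num_layers=4)
print(deep_lstm.num_layers)  # 4
```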
3. Beam Search Decoding
Instead of greedy decoding (taking the most likely word at each step), keep the top-k candidates:
```
Beam size = 3:
Step 1: [The, A, This]
Step 2: [The cat, The dog, A cat, ...]
...select the top 3 sequences by total probability
```
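The search above can be sketched in a few lines of Python. The toy "model" below returns the same next-token distribution at every step, purely to make the mechanics visible; it is not part of the original paper:

```python
import math

def beam_search(step_logprobs, beam_size=3, max_len=3):
    """Toy beam search: step_logprobs(seq) returns {token: log-probability}."""
    beams = [([], 0.0)]  # (sequence so far, total log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, lp in step_logprobs(seq).items():
                candidates.append((seq + [tok], score + lp))
        # Keep only the top-k partial sequences by total log-probability
        beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_size]
    return beams

# Hypothetical model: same distribution at every step, for illustration only
vocab = {"the": 0.6, "a": 0.3, "cat": 0.05, "sat": 0.05}
def fake_model(seq):
    return {t: math.log(p) for t, p in vocab.items()}

for seq, score in beam_search(fake_model):
    print(seq, round(score, 2))  # top beam: ['the', 'the', 'the']
```

Summing log-probabilities rather than multiplying raw probabilities avoids numerical underflow on long sequences.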
Training
Teacher forcing during training:
Feed the ground-truth previous token as the decoder's input rather than the model's own prediction, which stabilizes and speeds up training by preventing early mistakes from compounding.
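The choice of decoder input can be written as a one-line helper. Note that the probabilistic mixing shown here (a "teacher forcing ratio", i.e. scheduled sampling) is a later extension for illustration, not part of the original recipe, which always used the ground truth:

```python
import random

def next_decoder_input(trg_token, model_token, teacher_forcing_ratio=1.0):
    # With probability `teacher_forcing_ratio`, feed the ground-truth token;
    # otherwise feed the model's own previous prediction
    if random.random() < teacher_forcing_ratio:
        return trg_token
    return model_token

print(next_decoder_input("chat", "chien", teacher_forcing_ratio=1.0))  # chat
print(next_decoder_input("chat", "chien", teacher_forcing_ratio=0.0))  # chien
```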
The Bottleneck Problem
The fixed-size context vector must encode the entire input:
- Works well for short sequences
- Degrades for long sequences (information compression)
- Led to the invention of attention mechanisms
From Seq2Seq to Attention
The limitation of a single context vector motivated Bahdanau attention (2014):
Each decoder step attends to different parts of the encoder output, solving the bottleneck.
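The core of one attention step can be sketched with dot-product scoring (the Luong-style variant; Bahdanau's original uses a small additive network, and all sizes here are illustrative):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
enc_outputs = torch.randn(5, 16)  # one vector per source position
dec_state = torch.randn(16)       # current decoder hidden state

scores = enc_outputs @ dec_state     # similarity of each source position, (5,)
weights = F.softmax(scores, dim=0)   # attention distribution over the source
context = weights @ enc_outputs      # per-step context vector, (16,)
print(context.shape)  # torch.Size([16])
```

Because the context is recomputed at every decoder step, the model is no longer limited to a single fixed-size summary of the input.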
Applications
Seq2Seq enabled:
| Task | Input | Output |
|---|---|---|
| Translation | "Hello world" | "Bonjour le monde" |
| Summarization | Long article | Short summary |
| Dialogue | User message | Response |
| Code generation | Description | Code |
| Speech recognition | Audio features | Text |
Evolution
| Year | Model | Innovation |
|---|---|---|
| 2014 | Seq2Seq | Encoder-decoder RNNs |
| 2014 | Bahdanau | Attention mechanism |
| 2015 | Luong | Simplified attention variants |
| 2017 | Transformer | Self-attention, no recurrence |
| 2018+ | BERT, GPT | Pre-trained transformers |
Why It Mattered
Seq2Seq established:
- End-to-end learning: No hand-crafted features or alignment
- Encoder-decoder paradigm: Used by transformers today
- Variable-length I/O: Fundamental for language tasks
- Transfer learning path: Pre-trained encoders/decoders
Code Example
```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, src, trg):
        # Encode: the context is the encoder's final (hidden, cell) state
        _, (hidden, cell) = self.encoder(src)
        # Decode: generate the output sequence one token at a time
        outputs = []
        dec_input = trg[0]  # <SOS> token
        for t in range(1, len(trg)):
            output, (hidden, cell) = self.decoder(dec_input, hidden, cell)
            outputs.append(output)
            dec_input = trg[t]  # teacher forcing: feed the ground truth
        return torch.stack(outputs)
```
Key Papers
- Sequence to Sequence Learning with Neural Networks – Sutskever et al., 2014: https://arxiv.org/abs/1409.3215
- Learning Phrase Representations using RNN Encoder-Decoder – Cho et al., 2014: https://arxiv.org/abs/1406.1078
- Neural Machine Translation by Jointly Learning to Align and Translate – Bahdanau et al., 2014: https://arxiv.org/abs/1409.0473
- Effective Approaches to Attention-based Neural Machine Translation – Luong et al., 2015: https://arxiv.org/abs/1508.04025