Sequence to Sequence Learning

Encoder-decoder architecture for mapping sequences to sequences

Sequence-to-Sequence (Seq2Seq), introduced by Sutskever et al. at Google in 2014, established the encoder-decoder paradigm for mapping variable-length input sequences to variable-length output sequences. It enabled neural machine translation and became the foundation for modern language models.

The Problem

Traditional neural networks require fixed-size inputs and outputs. But many tasks have variable lengths:

  • Translation: “Hello” (1 word) → “Bonjour” (1 word)
  • Translation: “How are you?” (3 words) → “Comment allez-vous?” (2 words)

The Solution: Encoder-Decoder

Split the problem into two parts:

  1. Encoder: Read the entire input sequence, compress into a fixed-size vector (the “thought vector” or “context”)
  2. Decoder: Generate the output sequence from this context vector

\text{Input} \xrightarrow{\text{Encoder}} \mathbf{c} \xrightarrow{\text{Decoder}} \text{Output}

Interactive Demo

Watch how Seq2Seq encodes a sentence and decodes its translation:

[Interactive demo: the encoder LSTM reads "The cat sat <EOS>" step by step (h1…h4) into a fixed-size context, the "thought vector"; the decoder LSTM (s1…s5) then emits "Le chat assis <EOS>". The input is often reversed ("sat cat The") to reduce the distance between corresponding words, and the fixed-size context becomes a bottleneck for long sequences, which led to attention mechanisms.]

The Architecture

Encoder (LSTM)

Processes input sequence x_1, x_2, ..., x_T:

h_t = \text{LSTM}(x_t, h_{t-1})

The final hidden state h_T becomes the context vector \mathbf{c}.

Decoder (LSTM)

Generates output sequence y_1, y_2, ..., y_{T'}:

s_t = \text{LSTM}(y_{t-1}, s_{t-1})

P(y_t | y_{<t}, \mathbf{c}) = \text{softmax}(W_s \cdot s_t)

The decoder is initialized with s_0 = \mathbf{c}.
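The softmax projection above can be sketched in plain Python (a toy sketch: the matrix W_s, the state s_t, and the three-word vocabulary are made up for illustration):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def output_distribution(W_s, s_t):
    """P(y_t | y_<t, c) = softmax(W_s · s_t); one row of W_s per vocab word."""
    logits = [sum(w * s for w, s in zip(row, s_t)) for row in W_s]
    return softmax(logits)

# Toy example: 3-word vocabulary, 2-dimensional decoder state
W_s = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
s_t = [1.0, 0.0]
probs = output_distribution(W_s, s_t)
```

The result is a proper probability distribution over the vocabulary; the decoder samples or argmaxes from it at each step.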

Key Innovations

1. Reversing Input Sequence

Reversing the source sentence improved translation significantly:

Original: "A B C" → "X Y Z"
Reversed: "C B A" → "X Y Z"

This shortens the recurrent path between the first source words and the first target words; Sutskever et al. found that these short-term dependencies made optimization much easier.
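Assuming a word-for-word monotonic alignment (an illustrative simplification), the number of recurrent steps between reading a source word and emitting its translation can be computed directly:

```python
def step_distance(n, i, reversed_input=False):
    """Steps between reading source word i and emitting target word i.

    The encoder occupies timeline positions 0..n-1, and the decoder
    emits target word i at position n + i.
    """
    enc_pos = (n - 1 - i) if reversed_input else i
    return (n + i) - enc_pos

n = 3  # "A B C" -> "X Y Z"
original = [step_distance(n, i) for i in range(n)]                        # [3, 3, 3]
reversed_ = [step_distance(n, i, reversed_input=True) for i in range(n)]  # [1, 3, 5]
```

Note that the average distance is unchanged; reversing only moves the early source words next to the early target words, which is exactly where generation starts.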

2. Deep LSTMs

Using 4-layer LSTMs significantly outperformed shallow networks:

Depth      BLEU Score
1 layer    25.9
2 layers   29.6
4 layers   34.8
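In PyTorch terms, stacking layers is a single flag (a hypothetical configuration; the sizes here are illustrative, not the paper's exact hyperparameters):

```python
import torch.nn as nn

# A deep, 4-layer LSTM encoder; num_layers stacks LSTM layers vertically
encoder = nn.LSTM(input_size=256, hidden_size=512, num_layers=4)
```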

3. Beam Search Decoding

Instead of greedy decoding (taking the most likely word at each step), keep the top-k candidates:

Beam size = 3:
Step 1: [The, A, This]
Step 2: [The cat, The dog, A cat, ...]
...select top 3 sequences by total probability
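A minimal beam search over a generic next-token model might look like this; `toy_model` is a hypothetical stand-in for a trained decoder, hard-coding a few next-token distributions:

```python
import math

def beam_search(next_probs, sos, eos, beam_size=3, max_len=10):
    """Keep the top-k partial sequences, scored by cumulative log-probability."""
    beams = [([sos], 0.0)]  # (sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, p in next_probs(seq).items():
                candidates.append((seq + [tok], score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            (finished if seq[-1] == eos else beams).append((seq, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])

def toy_model(seq):
    """Hypothetical next-token distributions for a tiny 'translator'."""
    table = {
        ("<sos>",): {"le": 0.6, "un": 0.4},
        ("<sos>", "le"): {"chat": 0.9, "<eos>": 0.1},
        ("<sos>", "un"): {"chien": 1.0},
        ("<sos>", "le", "chat"): {"<eos>": 1.0},
        ("<sos>", "un", "chien"): {"<eos>": 1.0},
    }
    return table[tuple(seq)]

best, logprob = beam_search(toy_model, "<sos>", "<eos>", beam_size=2)
```

Unlike greedy decoding, the search can recover from a locally suboptimal first token, because lower-ranked prefixes stay in the beam until a better-scoring completion displaces them.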

Training

Teacher forcing during training:

\mathcal{L} = -\sum_{t=1}^{T'} \log P(y_t^* | y_1^*, ..., y_{t-1}^*, \mathbf{c})

Use the ground-truth previous token y_{t-1}^* rather than the model's prediction.
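The loss reduces to summing negative log-probabilities of the ground-truth tokens; a toy sketch (the per-step probabilities are made up):

```python
import math

def teacher_forcing_loss(step_probs):
    """Negative log-likelihood of the ground-truth sequence.

    step_probs[t] is the probability the model assigns to the correct
    token y_t*, conditioned on the *true* prefix y_1*..y_{t-1}* and c.
    """
    return -sum(math.log(p) for p in step_probs)

# If the model assigns 0.5 and 0.25 to the two correct tokens:
loss = teacher_forcing_loss([0.5, 0.25])  # -(ln 0.5 + ln 0.25) = ln 8 ≈ 2.079
```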

The Bottleneck Problem

The fixed-size context vector \mathbf{c} must encode the entire input:

  • Works well for short sequences
  • Degrades for long sequences (information compression)
  • Led to the invention of attention mechanisms

From Seq2Seq to Attention

The limitation of a single context vector motivated Bahdanau attention (2014):

\mathbf{c}_t = \sum_i \alpha_{ti} h_i

Each decoder step attends to different parts of the encoder output, solving the bottleneck.
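The attention-weighted context can be sketched in a few lines (toy encoder states; the alignment scores would come from comparing the decoder state to each h_i, which is elided here):

```python
import math

def attention_context(scores, encoder_states):
    """c_t = sum_i alpha_ti * h_i, with alpha = softmax(scores)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(encoder_states[0])
    return [sum(a * h[d] for a, h in zip(alphas, encoder_states))
            for d in range(dim)]

# Equal scores -> uniform attention -> a simple average of the states
c_t = attention_context([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])  # [0.5, 0.5]
```

Because c_t is recomputed at every decoder step, no single vector has to carry the whole sentence.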

Applications

Seq2Seq enabled:

Task                  Input             Output
Translation           "Hello world"     "Bonjour le monde"
Summarization         Long article      Short summary
Dialogue              User message      Response
Code generation       Description       Code
Speech recognition    Audio features    Text

Evolution

Year     Model         Innovation
2014     Seq2Seq       Encoder-decoder RNNs
2014     Bahdanau      Attention mechanism
2015     Luong         Simplified attention variants
2017     Transformer   Self-attention, no recurrence
2018+    BERT, GPT     Pre-trained transformers

Why It Mattered

Seq2Seq established:

  • End-to-end learning: No hand-crafted features or alignment
  • Encoder-decoder paradigm: Used by transformers today
  • Variable-length I/O: Fundamental for language tasks
  • Transfer learning path: Pre-trained encoders/decoders

Code Example

import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()  # required before registering submodules
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, src, trg):
        # Encode: the final hidden/cell states summarize the whole input
        _, (hidden, cell) = self.encoder(src)

        # Decode: generate the output sequence one token at a time
        outputs = []
        inp = trg[0]  # <SOS> token
        for t in range(1, len(trg)):
            output, (hidden, cell) = self.decoder(inp, hidden, cell)
            outputs.append(output)
            inp = trg[t]  # Teacher forcing: feed the ground-truth token

        return torch.stack(outputs)

Key Papers

  • Sutskever, Vinyals & Le (2014), "Sequence to Sequence Learning with Neural Networks"
  • Bahdanau, Cho & Bengio (2014), "Neural Machine Translation by Jointly Learning to Align and Translate"
  • Luong, Pham & Manning (2015), "Effective Approaches to Attention-based Neural Machine Translation"
