The Annotated Transformer

Line-by-line PyTorch implementation of the Transformer architecture

The Annotated Transformer is a line-by-line guide to implementing the Transformer architecture in PyTorch. Created by Harvard NLP, it makes the seminal “Attention Is All You Need” paper concrete and reproducible.

Why Code Matters

The original Transformer paper describes the architecture mathematically. The Annotated Transformer shows exactly how those equations become working code:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

becomes:

d_k = query.size(-1)
scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
p_attn = scores.softmax(dim=-1)
return torch.matmul(p_attn, value)

Core Components

The full Transformer in ~400 lines breaks down into six key pieces:

  1. Embeddings + Positional Encoding — Token lookup + position information
  2. Multi-Head Attention — Parallel attention heads
  3. Feed-Forward Network — Position-wise MLP
  4. Encoder Layer — Self-attention + FFN with residuals
  5. Decoder Layer — Masked self-attention + cross-attention + FFN
  6. Generator — Project to vocabulary for prediction
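
Taken together, these pieces wire up into a single encoder-decoder module. The sketch below is illustrative only: the article assembles the model in a make_model helper, and the constructor arguments and module interfaces shown here are assumptions rather than its exact API.

import torch.nn as nn

class EncoderDecoder(nn.Module):
    """Illustrative wiring of the six components (interfaces assumed, not the article's exact code)."""
    def __init__(self, src_embed, tgt_embed, encoder, decoder, generator):
        super().__init__()
        self.src_embed = src_embed    # 1. embeddings + positional encoding (source side)
        self.tgt_embed = tgt_embed    #    embeddings + positional encoding (target side)
        self.encoder = encoder        # 4. stack of N encoder layers (items 2-3 live inside each layer)
        self.decoder = decoder        # 5. stack of N decoder layers
        self.generator = generator    # 6. projection to vocabulary probabilities

    def forward(self, src, tgt, src_mask, tgt_mask):
        memory = self.encoder(self.src_embed(src), src_mask)
        return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)

The generator is applied to the decoder output when computing the loss or decoding, mapping each position to a distribution over the vocabulary.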

Why This Resource Matters
The Annotated Transformer bridges theory and practice—seeing the actual code alongside explanations makes the architecture concrete and reproducible.

Key Implementation Details

Scaled Dot-Product Attention

import math
import torch

def attention(query, key, value, mask=None, dropout=None):
    "Compute scaled dot-product attention; returns the output and the attention weights."
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)  # hide masked (e.g. future) positions
    p_attn = scores.softmax(dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn
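
As a quick sanity check on shapes, here is a hypothetical call to the function above, including the causal mask a decoder would pass in. The subsequent_mask helper and the tensor sizes are illustrative assumptions, not the article's verbatim code:

import torch

def subsequent_mask(size):
    # True where a position may attend: itself and earlier positions only
    return torch.triu(torch.ones(1, size, size), diagonal=1) == 0

q = k = v = torch.randn(2, 8, 10, 64)   # (batch, heads, seq_len, d_k) -- arbitrary sizes
out, p_attn = attention(q, k, v, mask=subsequent_mask(10))
print(out.shape)     # torch.Size([2, 8, 10, 64])
print(p_attn.shape)  # torch.Size([2, 8, 10, 10])

Without the mask, every position could attend to later positions, letting the decoder see the tokens it is supposed to predict.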

Sublayer Connection (Residual + LayerNorm)

class SublayerConnection(nn.Module):
    def forward(self, x, sublayer):
        return x + self.dropout(sublayer(self.norm(x)))
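
The excerpt shows only the forward pass; the norm and dropout modules it uses are created in the constructor. A minimal self-contained sketch, substituting nn.LayerNorm for the LayerNorm module the article writes by hand, might look like this:

import torch.nn as nn

class SublayerConnection(nn.Module):
    """Pre-norm residual block: x + dropout(sublayer(norm(x)))."""
    def __init__(self, size, dropout):
        super().__init__()
        self.norm = nn.LayerNorm(size)      # stand-in for the article's hand-written LayerNorm
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # normalize, run the sublayer (attention or feed-forward), apply dropout, add the residual
        return x + self.dropout(sublayer(self.norm(x)))

Each encoder layer uses two of these connections (around self-attention and the feed-forward network); each decoder layer uses three.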

Learning Path

The Annotated Transformer is best approached in order:

  1. Embeddings — How tokens become vectors
  2. Attention — The core mechanism
  3. Multi-Head — Parallel attention
  4. Encoder/Decoder — Full architecture
  5. Training — Label smoothing, optimizer
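
The training step is where the article introduces the paper's optimizer schedule: the learning rate warms up linearly for a fixed number of steps, then decays with the inverse square root of the step number. A minimal sketch, using the paper's defaults of d_model = 512 and 4000 warmup steps purely for illustration:

def rate(step, d_model=512, warmup=4000):
    # lrate = d_model^(-0.5) * min(step^(-0.5), step * warmup^(-1.5))
    step = max(step, 1)  # avoid step 0 blowing up the inverse square root
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# Typically paired with Adam and a per-step LambdaLR scheduler, e.g.:
#   optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)
#   scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=rate)

Label smoothing is the other training detail: the article implements it as its own module, and in current PyTorch nn.CrossEntropyLoss(label_smoothing=0.1) gives a similar effect.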

Why Ilya Included This

Understanding Transformers at the code level reveals:

  • How attention actually computes
  • Why certain design choices matter (scaling, masking)
  • The elegance of the architecture

This hands-on understanding is essential for working with modern AI systems.
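
As a concrete illustration of the scaling point above: without the 1/√d_k factor, dot products grow with the key dimension and push the softmax toward a one-hot distribution with vanishing gradients. A small self-contained demo (dimensions chosen arbitrarily):

import torch

torch.manual_seed(0)
d_k = 512
q = torch.randn(1, d_k)       # one query
k = torch.randn(8, d_k)       # eight keys

raw = q @ k.T                 # unscaled dot products: standard deviation grows like sqrt(d_k)
scaled = raw / d_k ** 0.5     # the 1/sqrt(d_k) factor restores roughly unit scale

print(raw.softmax(dim=-1))    # typically concentrates almost all mass on one key
print(scaled.softmax(dim=-1)) # a softer distribution that keeps gradients flowing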
