The Annotated Transformer

Line-by-line PyTorch implementation of the Transformer architecture

The Annotated Transformer is a line-by-line guide to implementing the Transformer architecture in PyTorch. Created by Harvard NLP, it makes the seminal “Attention Is All You Need” paper concrete and reproducible.

Why Code Matters

The original Transformer paper describes the architecture mathematically. The Annotated Transformer shows exactly how those equations become working code:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

becomes:

d_k = query.size(-1)
scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
p_attn = scores.softmax(dim=-1)
return torch.matmul(p_attn, value)

Core Components

The full Transformer in ~400 lines breaks down into six key pieces:

  1. Embeddings + Positional Encoding — Token lookup + position information
  2. Multi-Head Attention — Parallel attention heads
  3. Feed-Forward Network — Position-wise MLP
  4. Encoder Layer — Self-attention + FFN with residuals
  5. Decoder Layer — Masked self-attention + cross-attention + FFN
  6. Generator — Project to vocabulary for prediction
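
Taken together, these pieces wire up into a single encoder-decoder module. The sketch below is illustrative only: the article assembles the model in a make_model helper, and the constructor arguments and module interfaces shown here are assumptions rather than its exact API.

import torch.nn as nn

class EncoderDecoder(nn.Module):
    """Illustrative wiring of the six components (interfaces assumed, not the article's exact code)."""
    def __init__(self, src_embed, tgt_embed, encoder, decoder, generator):
        super().__init__()
        self.src_embed = src_embed    # 1. embeddings + positional encoding (source side)
        self.tgt_embed = tgt_embed    #    embeddings + positional encoding (target side)
        self.encoder = encoder        # 4. stack of N encoder layers (items 2-3 live inside each layer)
        self.decoder = decoder        # 5. stack of N decoder layers
        self.generator = generator    # 6. projection to vocabulary probabilities

    def forward(self, src, tgt, src_mask, tgt_mask):
        memory = self.encoder(self.src_embed(src), src_mask)
        return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)

The generator is applied to the decoder output when computing the loss or decoding, mapping each position to a distribution over the vocabulary.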

Why This Resource Matters
The Annotated Transformer bridges theory and practice—seeing the actual code alongside explanations makes the architecture concrete and reproducible.

Key Implementation Details

Scaled Dot-Product Attention

import math
import torch

def attention(query, key, value, mask=None, dropout=None):
    "Compute scaled dot-product attention; returns the output and the attention weights."
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)  # hide masked (e.g. future) positions
    p_attn = scores.softmax(dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn
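
As a quick sanity check on shapes, here is a hypothetical call to the function above, including the causal mask a decoder would pass in. The subsequent_mask helper and the tensor sizes are illustrative assumptions, not the article's verbatim code:

import torch

def subsequent_mask(size):
    # True where a position may attend: itself and earlier positions only
    return torch.triu(torch.ones(1, size, size), diagonal=1) == 0

q = k = v = torch.randn(2, 8, 10, 64)   # (batch, heads, seq_len, d_k) -- arbitrary sizes
out, p_attn = attention(q, k, v, mask=subsequent_mask(10))
print(out.shape)     # torch.Size([2, 8, 10, 64])
print(p_attn.shape)  # torch.Size([2, 8, 10, 10])

Without the mask, every position could attend to later positions, letting the decoder see the tokens it is supposed to predict.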

Sublayer Connection (Residual + LayerNorm)

class SublayerConnection(nn.Module):
    def forward(self, x, sublayer):
        return x + self.dropout(sublayer(self.norm(x)))
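
The excerpt shows only the forward pass; the norm and dropout modules it uses are created in the constructor. A minimal self-contained sketch, substituting nn.LayerNorm for the LayerNorm module the article writes by hand, might look like this:

import torch.nn as nn

class SublayerConnection(nn.Module):
    """Pre-norm residual block: x + dropout(sublayer(norm(x)))."""
    def __init__(self, size, dropout):
        super().__init__()
        self.norm = nn.LayerNorm(size)      # stand-in for the article's hand-written LayerNorm
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # normalize, run the sublayer (attention or feed-forward), apply dropout, add the residual
        return x + self.dropout(sublayer(self.norm(x)))

Each encoder layer uses two of these connections (around self-attention and the feed-forward network); each decoder layer uses three.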

Learning Path

The Annotated Transformer is best approached in order:

  1. Embeddings — How tokens become vectors
  2. Attention — The core mechanism
  3. Multi-Head — Parallel attention
  4. Encoder/Decoder — Full architecture
  5. Training — Label smoothing, optimizer
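
The training step is where the article introduces the paper's optimizer schedule: the learning rate warms up linearly for a fixed number of steps, then decays with the inverse square root of the step number. A minimal sketch, using the paper's defaults of d_model = 512 and 4000 warmup steps purely for illustration:

def rate(step, d_model=512, warmup=4000):
    # lrate = d_model^(-0.5) * min(step^(-0.5), step * warmup^(-1.5))
    step = max(step, 1)  # avoid step 0 blowing up the inverse square root
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# Typically paired with Adam and a per-step LambdaLR scheduler, e.g.:
#   optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)
#   scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=rate)

Label smoothing is the other training detail: the article implements it as its own module, and in current PyTorch nn.CrossEntropyLoss(label_smoothing=0.1) gives a similar effect.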

Why Ilya Included This

Understanding Transformers at the code level reveals:

  • How attention actually computes
  • Why certain design choices matter (scaling, masking)
  • The elegance of the architecture

This hands-on understanding is essential for working with modern AI systems.
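
As a concrete illustration of the scaling point above: without the 1/√d_k factor, dot products grow with the key dimension and push the softmax toward a one-hot distribution with vanishing gradients. A small self-contained demo (dimensions chosen arbitrarily):

import torch

torch.manual_seed(0)
d_k = 512
q = torch.randn(1, d_k)       # one query
k = torch.randn(8, d_k)       # eight keys

raw = q @ k.T                 # unscaled dot products: standard deviation grows like sqrt(d_k)
scaled = raw / d_k ** 0.5     # the 1/sqrt(d_k) factor restores roughly unit scale

print(raw.softmax(dim=-1))    # typically concentrates almost all mass on one key
print(scaled.softmax(dim=-1)) # a softer distribution that keeps gradients flowing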
