Line-by-line PyTorch implementation of the Transformer architecture
The Annotated Transformer is a line-by-line guide to implementing the Transformer architecture in PyTorch. Created by Harvard NLP, it makes the seminal “Attention Is All You Need” paper concrete and reproducible.
Why Code Matters
The original Transformer paper describes the architecture mathematically. The Annotated Transformer shows exactly how those equations become working code. Scaled dot-product attention,

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

becomes:
d_k = query.size(-1)                        # dimension of each query/key vector
scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
p_attn = scores.softmax(dim=-1)             # attention weights
return torch.matmul(p_attn, value)          # weighted sum of the values
Core Components
The full Transformer in ~400 lines breaks down into six key pieces:
- Embeddings + Positional Encoding — Token lookup + position information (positional encoding sketched after this list)
- Multi-Head Attention — Parallel attention heads
- Feed-Forward Network — Position-wise MLP
- Encoder Layer — Self-attention + FFN with residuals
- Decoder Layer — Masked self-attention + cross-attention + FFN
- Generator — Project to vocabulary for prediction
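As a taste of the first piece, here is a minimal sketch of sinusoidal positional encoding in the spirit of the article (hyperparameters such as max_len are illustrative assumptions, and the article's dropout is omitted for brevity):

import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    "Add fixed sine/cosine position signals to the token embeddings."
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        position = torch.arange(0, max_len).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
        self.register_buffer("pe", pe.unsqueeze(0))    # shape (1, max_len, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model); add the matching slice of positions
        return x + self.pe[:, : x.size(1)]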
Key Implementation Details
Scaled Dot-Product Attention
import math
import torch

def attention(query, key, value, mask=None, dropout=None):
    "Compute scaled dot-product attention; returns the output and the attention weights."
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)   # block disallowed positions
    p_attn = scores.softmax(dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)                       # dropout on the attention weights
    return torch.matmul(p_attn, value), p_attn
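A quick usage sketch of the function above (the tensor shapes and the subsequent_mask helper are illustrative; the mask convention follows the article: 1 = may attend, 0 = blocked):

import torch

def subsequent_mask(size):
    "Causal mask: position i may attend only to positions j <= i."
    return torch.triu(torch.ones(1, size, size), diagonal=1) == 0

q = k = v = torch.randn(2, 5, 64)                      # (batch, seq_len, d_k)
out, p_attn = attention(q, k, v, mask=subsequent_mask(5))
print(out.shape, p_attn.shape)                         # (2, 5, 64) and (2, 5, 5)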
Sublayer Connection (Residual + LayerNorm)
class SublayerConnection(nn.Module):
    def __init__(self, size, dropout):
        super().__init__()
        self.norm, self.dropout = nn.LayerNorm(size), nn.Dropout(dropout)
    def forward(self, x, sublayer):
        return x + self.dropout(sublayer(self.norm(x)))  # pre-norm residual
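To see how this wrapper is used, here is a sketch of an encoder layer wiring two such connections around self-attention and the feed-forward network, following the article's pattern (the constructor arguments and the self_attn call signature are assumptions based on that pattern):

class EncoderLayer(nn.Module):
    "Self-attention and feed-forward, each wrapped in a SublayerConnection."
    def __init__(self, size, self_attn, feed_forward, dropout):
        super().__init__()
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        self.sublayer = nn.ModuleList(
            [SublayerConnection(size, dropout) for _ in range(2)]
        )

    def forward(self, x, mask):
        x = self.sublayer[0](x, lambda y: self.self_attn(y, y, y, mask))
        return self.sublayer[1](x, self.feed_forward)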
Learning Path
The Annotated Transformer is best approached in order:
- Embeddings — How tokens become vectors
- Attention — The core mechanism
- Multi-Head — Parallel attention
- Encoder/Decoder — Full architecture
- Training — Label smoothing, optimizer
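For the last step, a sketch of the warmup learning-rate schedule from the paper (linear warmup, then inverse-square-root decay). The d_model and warmup defaults below follow the paper's base configuration; the factor multiplier is an illustrative convenience:

def rate(step, d_model=512, factor=1.0, warmup=4000):
    "lrate = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)"
    step = max(step, 1)   # avoid 0 raised to a negative power on the first step
    return factor * d_model ** (-0.5) * min(step ** (-0.5), step * warmup ** (-1.5))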
Why Ilya Included This
Understanding Transformers at the code level reveals:
- How attention actually computes
- Why certain design choices matter (scaling, masking; see the scaling check below)
- The elegance of the architecture
This hands-on understanding is essential for working with modern AI systems.
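One of those choices, the 1/sqrt(d_k) scaling, can be checked numerically: unscaled dot products of random d_k-dimensional vectors have a standard deviation of about sqrt(d_k), which would push the softmax toward saturation as d_k grows. The setup below is purely illustrative:

import torch

torch.manual_seed(0)
for d_k in (16, 256):
    q, k = torch.randn(100_000, d_k), torch.randn(100_000, d_k)
    raw = (q * k).sum(-1)                  # unscaled dot-product scores
    print(d_k, raw.std().item(), (raw / d_k ** 0.5).std().item())
# the raw std grows like sqrt(d_k); the scaled scores stay near 1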
Resources
- Full Article: https://nlp.seas.harvard.edu/annotated-transformer/
- GitHub: https://github.com/harvardnlp/annotated-transformer
- Original Paper: “Attention Is All You Need” (Vaswani et al., 2017)