The Transformer: the architecture that replaced recurrence with self-attention
The Transformer is a neural network architecture introduced in “Attention Is All You Need” by Vaswani et al. (2017). It replaces recurrence and convolution with self-attention, enabling parallel sequence processing and strong modeling of long-range dependencies.
Core Attention
- Query (Q): what the current token is looking for
- Key (K): what each token offers
- Value (V): information carried by each token
Each token attends to all others by comparing queries to keys, then combining values using the resulting attention weights.
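In the paper this is scaled dot-product attention: Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V. Below is a minimal NumPy sketch of that formula; the shapes and random inputs are toy values chosen only for illustration.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q, K: (seq_len, d_k); V: (seq_len, d_v).
    Returns the attended values and the attention weight matrix.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # similarity of each query to every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # row-wise softmax over keys
    return weights @ V, weights                                # each output row is a weighted mix of value rows

# Toy example: 4 tokens, d_k = d_v = 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape, attn.shape)  # (4, 8) (4, 4) — every token attends to all 4 tokens
```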
Multi-Head Attention
Instead of a single attention operation, Transformers run several attention heads in parallel:

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O,  where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

Each head computes attention with its own learned projections. This allows the model to capture different relational patterns simultaneously.
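A minimal NumPy sketch of multi-head self-attention under the formula above; the function names, toy sizes, and random weights are illustrative stand-ins for learned parameters, not any library's API.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, heads, W_o):
    """Each (W_q, W_k, W_v) triple projects X into one head; head outputs are
    concatenated and mapped back to d_model by W_o."""
    outputs = []
    for W_q, W_k, W_v in heads:
        Q, K, V = X @ W_q, X @ W_k, X @ W_v            # per-head learned projections
        scores = Q @ K.T / np.sqrt(Q.shape[-1])        # scaled dot-product scores
        outputs.append(softmax(scores) @ V)            # head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
    return np.concatenate(outputs, axis=-1) @ W_o      # Concat(head_1, ..., head_h) W^O

# Toy sizes for illustration (the paper uses d_model=512 with 8 heads)
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 16, 4
d_head = d_model // num_heads
X = rng.normal(size=(seq_len, d_model))
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3)) for _ in range(num_heads)]
W_o = rng.normal(size=(num_heads * d_head, d_model))
print(multi_head_self_attention(X, heads, W_o).shape)  # (4, 16)
```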
Positional Encoding
Since attention is order-agnostic, positional encodings inject sequence order information. The original paper uses fixed sinusoids:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

These fixed encodings allow the model to reason about relative and absolute positions.
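A short NumPy sketch of the sinusoidal encoding above (the function name is illustrative):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Build the (max_len, d_model) table of fixed sinusoidal encodings."""
    pos = np.arange(max_len)[:, None]                  # positions 0..max_len-1 as a column
    i = np.arange(0, d_model, 2)[None, :]              # even dimension indices 2i
    angle = pos / np.power(10000.0, i / d_model)       # pos / 10000^(2i/d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                        # even dimensions get sine
    pe[:, 1::2] = np.cos(angle)                        # odd dimensions get cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
print(pe.shape)  # (50, 512) — added to the token embeddings before the first layer
```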
Key Papers
- Attention Is All You Need – Vaswani et al., 2017. https://arxiv.org/abs/1706.03762
- BERT: Pre-training of Deep Bidirectional Transformers – Devlin et al., 2018. https://arxiv.org/abs/1810.04805
- GPT: Improving Language Understanding by Generative Pre-Training – Radford et al., 2018. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
The Architecture
The full Transformer consists of:
Encoder: N layers of (Self-Attention → Add & Norm → FFN → Add & Norm)
Decoder: N layers of (Masked Self-Attention → Cross-Attention → FFN), each with Add & Norm
The original paper used N=6 layers, d_model=512, and 8 attention heads.
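A compact, single-head NumPy sketch of one encoder layer following this structure; the sizes and random weights are toy values, and dropout, masking, and multiple heads are omitted for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def encoder_layer(X, p):
    """One encoder layer: Self-Attention -> Add & Norm -> FFN -> Add & Norm."""
    attn = self_attention(X, p["W_q"], p["W_k"], p["W_v"])
    X = layer_norm(X + attn)                           # residual connection + layer norm
    ffn = np.maximum(0, X @ p["W1"]) @ p["W2"]         # position-wise FFN with ReLU
    return layer_norm(X + ffn)

# Illustrative sizes (the paper uses d_model=512 and a larger FFN width)
rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 32, 64, 4
shapes = {"W_q": (d_model, d_model), "W_k": (d_model, d_model), "W_v": (d_model, d_model),
          "W1": (d_model, d_ff), "W2": (d_ff, d_model)}
p = {k: rng.normal(size=s) * 0.1 for k, s in shapes.items()}
X = rng.normal(size=(seq_len, d_model))
for _ in range(6):                                     # N = 6 stacked layers (weights shared here only to keep the sketch short)
    X = encoder_layer(X, p)
print(X.shape)  # (4, 32)
```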
Why No Recurrence?
RNNs process sequentially: position t must wait for position t−1. Self-attention computes all positions in parallel (see the sketch after the table):
| Property | RNN | Transformer |
|---|---|---|
| Sequential ops | O(n) | O(1) |
| Max path length | O(n) | O(1) |
| Parallelizable | No | Yes |
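A small NumPy sketch of the contrast: the RNN needs a Python loop over positions, while self-attention covers every position pair with one matrix product. Shapes and random weights are toy values for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 6, 8
X = rng.normal(size=(seq_len, d))
W_h, W_x = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1

# RNN: O(n) sequential steps — the state at position t depends on the state at t-1
h = np.zeros(d)
for t in range(seq_len):
    h = np.tanh(h @ W_h + X[t] @ W_x)

# Self-attention: one matrix product scores all position pairs at once,
# so every token reaches every other token in a single step (O(1) sequential depth)
scores = X @ X.T / np.sqrt(d)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ X
print(h.shape, out.shape)  # (8,) (6, 8)
```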
Key Insight
Self-attention lets every token interact with every other token in a single layer, making Transformers both expressive and highly parallelizable.
Historical Impact
“Attention Is All You Need” is one of the most cited AI papers. The Transformer architecture underlies:
- GPT series (decoder-only)
- BERT (encoder-only)
- T5, BART (encoder-decoder)
- Vision Transformers (ViT)
- Modern LLMs (GPT-4, Claude, etc.)