Self-attention models that process sequences in parallel
The Transformer is a neural network architecture for working with sequences such as text, code, and audio. Its key idea is simple: instead of reading tokens one at a time like an RNN, let every token attend to all the other tokens at once and learn which ones matter most.
If this page feels too fast, read Sequence to Sequence and Bahdanau Attention first. This page teaches the general architecture. For the original 2017 paper, see Attention Is All You Need.
Why Students Should Care
Transformers are the foundation of:
- GPT-style models for generation
- BERT-style models for understanding
- Vision Transformers for images
- Most modern LLMs
If you understand the Transformer, you understand the core design behind much of modern AI.
A Quick Example
Take the sentence:
The animal did not cross the street because it was tired.
To understand what it refers to, the model should pay more attention to animal than to street. A Transformer does exactly this: it learns attention weights that tell each token which other tokens to focus on.
Core Attention
In plain English:
- Query (Q): what the current token is looking for
- Key (K): what each other token can offer
- Value (V): the information each token carries
Each token compares its query to the keys of the other tokens. Higher matches get higher attention weights, and those weights are used to combine the values.
You do not need to memorize the equation. The important idea is: compare, weight, combine.
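The compare-weight-combine loop can be sketched in a few lines. This is a minimal NumPy illustration of scaled dot-product attention, not a production implementation; real models apply learned projection matrices to produce Q, K, and V, which are omitted here for clarity.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compare queries to keys, softmax the scores, combine the values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # compare: similarity of each query to each key
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)  # weight: softmax over each row
    return weights @ V                      # combine: weighted sum of values

# Toy example: 3 tokens with 4-dimensional embeddings, attending to themselves
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
out = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V = x
print(out.shape)  # (3, 4) - one contextualized vector per token
```

The division by the square root of the key dimension keeps the dot products from growing too large, which would push the softmax into regions with tiny gradients.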
Multi-Head Attention
A single attention pattern is often too limited, so Transformers run several attention heads in parallel. Different heads can specialize: one head may focus on nearby words, another on subject-verb agreement, and another on longer-range references.
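Mechanically, the model dimension is split into chunks, each head attends within its chunk, and the results are concatenated. A minimal NumPy sketch (the per-head learned projections used in real models are omitted here, so this only shows the split/attend/concatenate shape bookkeeping):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(x, num_heads):
    """Split d_model into heads, attend independently in each, concatenate."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    # (seq_len, d_model) -> (num_heads, seq_len, d_head)
    heads = x.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    scores = heads @ heads.transpose(0, 2, 1) / np.sqrt(d_head)
    out = softmax(scores) @ heads           # each head attends on its own slice
    # concatenate heads back into (seq_len, d_model)
    return out.transpose(1, 0, 2).reshape(seq_len, d_model)

x = np.random.default_rng(1).normal(size=(5, 8))
print(multi_head_self_attention(x, num_heads=2).shape)  # (5, 8)
```

Because each head sees only a slice of the embedding, two heads can compute completely different attention patterns over the same tokens.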
Positional Encoding
Attention alone does not know word order. Without extra information, dog bites man and man bites dog would look like the same bag of tokens.
Transformers fix this by adding a position signal to each token embedding.
The equation is less important than the purpose: tell the model where each token sits in the sequence.
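For reference, the original paper uses sinusoidal encodings: even dimensions get a sine, odd dimensions a cosine, at wavelengths that grow with the dimension index. A small NumPy sketch:

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Sinusoidal positional encoding from the original Transformer paper:
    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    (assumes an even d_model)"""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # (1, d_model / 2)
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dimensions
    pe[:, 1::2] = np.cos(angles)               # odd dimensions
    return pe

pe = sinusoidal_positions(seq_len=10, d_model=16)
# The position signal is simply added to the token embeddings:
#   x = token_embeddings + pe
print(pe.shape)  # (10, 16)
```

Each position gets a unique pattern of values, so "dog" at position 0 and "dog" at position 5 enter the attention layers as different vectors. Many modern models use learned or rotary position embeddings instead, but the purpose is the same.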
The Architecture
The original Transformer had two parts:
- Encoder: reads the input and builds contextual representations
- Decoder: generates the output one token at a time
At a high level:
- Encoder block = self-attention -> feed-forward network, each sub-layer followed by add and norm
- Decoder block = masked self-attention -> cross-attention -> feed-forward network, each sub-layer followed by add and norm
Many modern LLMs use only the decoder side because text generation is the main task.
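The "masked" part is what makes generation work: during training, each token may attend only to itself and earlier tokens, never to the future. This is a minimal NumPy sketch of a causal mask applied to raw attention scores (illustrative only; frameworks provide this built in):

```python
import numpy as np

def causal_mask(seq_len):
    """-inf above the diagonal: position i may attend only to positions <= i."""
    upper = np.triu(np.ones((seq_len, seq_len)), k=1)
    return np.where(upper == 1, -np.inf, 0.0)

def masked_attention_weights(scores):
    """Softmax over scores after blocking attention to future tokens."""
    scores = scores + causal_mask(scores.shape[-1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))            # uniform raw scores for 4 tokens
w = masked_attention_weights(scores)
print(np.round(w[1], 2))  # token 1 splits attention over tokens 0 and 1 only
```

Adding negative infinity before the softmax drives the corresponding weights to exactly zero, so future tokens contribute nothing to the weighted combination of values.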
Why It Beat RNNs
RNNs must process tokens one after another. Transformers can process all tokens in parallel.
| Property | RNN | Transformer |
|---|---|---|
| Sequential ops | O(n) | O(1) |
| Max path length | O(n) | O(1) |
| Parallelizable | No | Yes |
That made Transformers faster to train and better at connecting distant parts of a sequence.
Common Confusion
- Transformer is the architecture.
- Attention Is All You Need is the paper that introduced it.
- Self-attention is one mechanism inside the architecture, not the whole model.
Where To Go Next
- Read Attention Is All You Need for the original paper and results
- Read BERT for encoder-only Transformers
- Read GPT for decoder-only Transformers
- Read Vision Transformer for image applications
Key Papers
- Attention Is All You Need - Vaswani et al., 2017 - https://arxiv.org/abs/1706.03762
- BERT: Pre-training of Deep Bidirectional Transformers - Devlin et al., 2018 - https://arxiv.org/abs/1810.04805
- GPT: Improving Language Understanding by Generative Pre-Training - Radford et al., 2018 - https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf