Transformer

Self-attention models that process sequences in parallel

The Transformer is a neural network architecture for working with sequences such as text, code, and audio. Its key idea is simple: instead of reading tokens one at a time like an RNN, let every token look at the other relevant tokens and decide what matters most.

If this page feels too fast, read Sequence to Sequence and Bahdanau Attention first. This page teaches the general architecture. For the original 2017 paper, see Attention Is All You Need.

Why Students Should Care

Transformers are the foundation of:

  • GPT-style models for generation
  • BERT-style models for understanding
  • Vision Transformers for images
  • Most modern LLMs

If you understand the Transformer, you understand the core design behind much of modern AI.

A Quick Example

Take the sentence:

The animal did not cross the street because it was tired.

To resolve what "it" refers to, the model should pay more attention to "animal" than to "street". A Transformer does exactly this: it learns attention weights that tell each token which other tokens to focus on.

Core Attention

In plain English:

  • Query (Q): what the current token is looking for
  • Key (K): what each other token can offer
  • Value (V): the information each token carries

Each token compares its query to the keys of the other tokens. Higher matches get higher attention weights, and those weights are used to combine the values.

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

You do not need to memorize the equation. The important idea is: compare, weight, combine.
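
The compare-weight-combine idea can be sketched in a few lines of numpy. This is an illustrative toy (random matrices, single head), not a production implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Compare: dot-product similarity between each query and every key,
    # scaled by sqrt(d_k) as in the equation above.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Weight: softmax turns each row of scores into weights that sum to 1.
    weights = softmax(scores)
    # Combine: weighted sum of the values.
    return weights @ V, weights

# Three toy tokens, each with 4-dimensional Q/K/V vectors.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out, w = attention(Q, K, V)
print(out.shape)  # (3, 4): one combined vector per token
```

Each row of `w` is one token's attention distribution over all tokens, which is exactly what the "it -> animal" example describes.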

Multi-Head Attention

A single attention pattern is often too limited. So Transformers use several attention heads in parallel:

\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O

Different heads can specialize. One head may focus on nearby words, another on subject-verb agreement, and another on longer-range references.
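
A minimal numpy sketch of the idea: run attention once per head with separate projection matrices, concatenate the results, and project back with W^O. The weight shapes and names here are illustrative, not taken from any particular library:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo):
    # X: (seq_len, d_model); Wq/Wk/Wv: one (d_model, d_k) matrix per head.
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(Wq, Wk, Wv):
        Q, K, V = X @ Wq_i, X @ Wk_i, X @ Wv_i
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        heads.append(softmax(scores) @ V)  # one attention pattern per head
    # Concat(head_1, ..., head_h) W^O projects back to d_model.
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(0)
d_model, d_k, h, n = 8, 4, 2, 5
X = rng.normal(size=(n, d_model))
Wq = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
Wk = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
Wv = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
Wo = rng.normal(size=(h * d_k, d_model))
out = multi_head_self_attention(X, Wq, Wk, Wv, Wo)
print(out.shape)  # (5, 8): same shape as the input
```

Because each head has its own projections, each can learn a different notion of relevance, which is what lets one head track nearby words while another tracks long-range references.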

Positional Encoding

Attention alone does not know word order. Without extra information, "dog bites man" and "man bites dog" would look like the same bag of tokens.

Transformers fix this by adding a position signal to each token embedding:

PE_{(pos,2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right), \quad PE_{(pos,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)

The equation is less important than the purpose: tell the model where each token sits in the sequence.
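
The sinusoidal scheme from the equation above is short enough to write out directly. A minimal sketch, assuming an even embedding dimension:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # One row per position, one column per embedding dimension
    # (d_model assumed even here for simplicity).
    pos = np.arange(seq_len)[:, None]       # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]    # (1, d_model/2)
    angle = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)  # even indices: sine
    pe[:, 1::2] = np.cos(angle)  # odd indices: cosine
    return pe

pe = positional_encoding(6, 8)
print(pe.shape)  # (6, 8)
```

The resulting matrix is simply added to the token embeddings, so two copies of the same word at different positions end up with different vectors.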

The Architecture

The original Transformer had two parts:

  • Encoder: reads the input and builds contextual representations
  • Decoder: generates the output one token at a time

At a high level:

  • Encoder block = self-attention -> add and norm -> feed-forward network -> add and norm
  • Decoder block = masked self-attention -> add and norm -> cross-attention over the encoder output -> add and norm -> feed-forward network -> add and norm

Many modern LLMs use only the decoder side because text generation is the main task.
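The "masked" part of decoder self-attention is what makes one-token-at-a-time generation work: position i may only attend to positions 0..i. A minimal numpy sketch of a causal mask (illustrative names, single head):

```python
import numpy as np

def causal_self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # Causal mask: set every score above the diagonal to -inf so a token
    # can never attend to tokens that come after it.
    scores[np.triu(np.ones((n, n), dtype=bool), k=1)] = -np.inf
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)  # masked entries become exactly 0
    return w @ V, w

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, w = causal_self_attention(X, Wq, Wk, Wv)
```

In the resulting weight matrix `w`, everything above the diagonal is zero: the first token attends only to itself, the last token to the whole prefix.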

Why It Beat RNNs

RNNs must process tokens one after another. Transformers can process all tokens in parallel.

Property          RNN    Transformer
Sequential ops    O(n)   O(1)
Max path length   O(n)   O(1)
Parallelizable    No     Yes

That made Transformers faster to train and better at connecting distant parts of a sequence.
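
The difference is visible even in toy code: an RNN's hidden state forces a loop with n dependent steps, while self-attention's token-to-token comparisons are one matrix product. A rough sketch (random weights, no training):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4
X = rng.normal(size=(n, d))
W = rng.normal(size=(d, d)) * 0.1

# RNN-style: n sequential steps, each depending on the previous state,
# so this loop cannot be parallelized across time.
h = np.zeros(d)
for t in range(n):
    h = np.tanh(X[t] + h @ W)

# Attention-style: all n*n token comparisons in one matrix product,
# computed for every position at once.
scores = X @ X.T / np.sqrt(d)
print(scores.shape)  # (6, 6)
```

The matrix product also gives every pair of tokens a direct connection (path length 1), instead of a chain of n recurrent steps.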

Common Confusion

  • Transformer is the architecture.
  • Attention Is All You Need is the paper that introduced it.
  • Self-attention is one mechanism inside the architecture, not the whole model.

