Transformer

Self-attention models that process sequences in parallel

The Transformer is a neural network architecture for working with sequences such as text, code, and audio. Its key idea is simple: instead of reading tokens one at a time like an RNN, let every token look at the other relevant tokens and decide what matters most.

If this page feels too fast, read Sequence to Sequence and Bahdanau Attention first. This page teaches the general architecture. For the original 2017 paper, see Attention Is All You Need.

Why Students Should Care

Transformers are the foundation of:

  • GPT-style models for generation
  • BERT-style models for understanding
  • Vision Transformers for images
  • Most modern LLMs

If you understand the Transformer, you understand the core design behind much of modern AI.

A Quick Example

Take the sentence:

The animal did not cross the street because it was tired.

To resolve what "it" refers to, the model should pay more attention to "animal" than to "street". A Transformer does exactly this: it learns attention weights that tell each token which other tokens to focus on.

Core Attention

In plain English:

  • Query (Q): what the current token is looking for
  • Key (K): what each other token can offer
  • Value (V): the information each token carries

Each token compares its query to the keys of the other tokens. Higher matches get higher attention weights, and those weights are used to combine the values.

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

You do not need to memorize the equation. The important idea is: compare, weight, combine.
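
The compare-weight-combine idea can be sketched in a few lines of numpy. This is an illustrative toy (random matrices, single head), not a production implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Compare: dot-product similarity between each query and every key,
    # scaled by sqrt(d_k) as in the equation above.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Weight: softmax turns each row of scores into weights that sum to 1.
    weights = softmax(scores)
    # Combine: weighted sum of the values.
    return weights @ V, weights

# Three toy tokens, each with 4-dimensional Q/K/V vectors.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out, w = attention(Q, K, V)
print(out.shape)  # (3, 4): one combined vector per token
```

Each row of `w` is one token's attention distribution over all tokens, which is exactly what the "it -> animal" example describes.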

Multi-Head Attention

A single attention pattern is often too limited. So Transformers use several attention heads in parallel:

\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O

Different heads can specialize. One head may focus on nearby words, another on subject-verb agreement, and another on longer-range references.
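
A minimal numpy sketch of the idea: run attention once per head with separate projection matrices, concatenate the results, and project back with W^O. The weight shapes and names here are illustrative, not taken from any particular library:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo):
    # X: (seq_len, d_model); Wq/Wk/Wv: one (d_model, d_k) matrix per head.
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(Wq, Wk, Wv):
        Q, K, V = X @ Wq_i, X @ Wk_i, X @ Wv_i
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        heads.append(softmax(scores) @ V)  # one attention pattern per head
    # Concat(head_1, ..., head_h) W^O projects back to d_model.
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(0)
d_model, d_k, h, n = 8, 4, 2, 5
X = rng.normal(size=(n, d_model))
Wq = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
Wk = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
Wv = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
Wo = rng.normal(size=(h * d_k, d_model))
out = multi_head_self_attention(X, Wq, Wk, Wv, Wo)
print(out.shape)  # (5, 8): same shape as the input
```

Because each head has its own projections, each can learn a different notion of relevance, which is what lets one head track nearby words while another tracks long-range references.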

Positional Encoding

Attention alone does not know word order. Without extra information, "dog bites man" and "man bites dog" would look like the same bag of tokens.

Transformers fix this by adding a position signal to each token embedding:

PE_{(pos,2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right), \quad PE_{(pos,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)

The equation is less important than the purpose: tell the model where each token sits in the sequence.
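
The sinusoidal scheme from the equation above is short enough to write out directly. A minimal sketch, assuming an even embedding dimension:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # One row per position, one column per embedding dimension
    # (d_model assumed even here for simplicity).
    pos = np.arange(seq_len)[:, None]       # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]    # (1, d_model/2)
    angle = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)  # even indices: sine
    pe[:, 1::2] = np.cos(angle)  # odd indices: cosine
    return pe

pe = positional_encoding(6, 8)
print(pe.shape)  # (6, 8)
```

The resulting matrix is simply added to the token embeddings, so two copies of the same word at different positions end up with different vectors.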

The Architecture

The original Transformer had two parts:

  • Encoder: reads the input and builds contextual representations
  • Decoder: generates the output one token at a time

At a high level:

  • Encoder block = self-attention -> add and norm -> feed-forward network -> add and norm
  • Decoder block = masked self-attention -> add and norm -> cross-attention over the encoder output -> add and norm -> feed-forward network -> add and norm

Many modern LLMs use only the decoder side because text generation is the main task.
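The "masked" part of decoder self-attention is what makes one-token-at-a-time generation work: position i may only attend to positions 0..i. A minimal numpy sketch of a causal mask (illustrative names, single head):

```python
import numpy as np

def causal_self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # Causal mask: set every score above the diagonal to -inf so a token
    # can never attend to tokens that come after it.
    scores[np.triu(np.ones((n, n), dtype=bool), k=1)] = -np.inf
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)  # masked entries become exactly 0
    return w @ V, w

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, w = causal_self_attention(X, Wq, Wk, Wv)
```

In the resulting weight matrix `w`, everything above the diagonal is zero: the first token attends only to itself, the last token to the whole prefix.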

Why It Beat RNNs

RNNs must process tokens one after another. Transformers can process all tokens in parallel.

Property          RNN    Transformer
Sequential ops    O(n)   O(1)
Max path length   O(n)   O(1)
Parallelizable    No     Yes

That made Transformers faster to train and better at connecting distant parts of a sequence.
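
The difference is visible even in toy code: an RNN's hidden state forces a loop with n dependent steps, while self-attention's token-to-token comparisons are one matrix product. A rough sketch (random weights, no training):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4
X = rng.normal(size=(n, d))
W = rng.normal(size=(d, d)) * 0.1

# RNN-style: n sequential steps, each depending on the previous state,
# so this loop cannot be parallelized across time.
h = np.zeros(d)
for t in range(n):
    h = np.tanh(X[t] + h @ W)

# Attention-style: all n*n token comparisons in one matrix product,
# computed for every position at once.
scores = X @ X.T / np.sqrt(d)
print(scores.shape)  # (6, 6)
```

The matrix product also gives every pair of tokens a direct connection (path length 1), instead of a chain of n recurrent steps.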

Common Confusion

  • Transformer is the architecture.
  • Attention Is All You Need is the paper that introduced it.
  • Self-attention is one mechanism inside the architecture, not the whole model.

