Attention Is All You Need

The Transformer architecture that replaced recurrence with self-attention

The Transformer is a neural network architecture introduced in “Attention Is All You Need” by Vaswani et al. (2017). It replaces recurrence and convolution with self-attention, enabling parallel sequence processing and strong modeling of long-range dependencies.

Core Attention

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
  • Query (Q): what the current token is looking for
  • Key (K): what each token offers
  • Value (V): information carried by each token

Each token attends to all others by comparing queries to keys, then combining values using the resulting attention weights.
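
A minimal NumPy sketch of scaled dot-product attention matching the formula above; the function and variable names are illustrative, not taken from any reference implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention.

    Q: (seq_q, d_k), K: (seq_k, d_k), V: (seq_k, d_v) -> (seq_q, d_v)
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # compare every query with every key
    weights = softmax(scores, axis=-1)   # each query's weights sum to 1
    return weights @ V                   # weighted combination of values
```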

Multi-Head Attention

Instead of a single attention operation, Transformers use multiple heads:

\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O

where each head computes attention using its own learned projections of Q, K, and V. This allows the model to capture different relational patterns simultaneously.
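
Continuing the sketch, a multi-head version that reuses the attention function above. Splitting one full-width projection into h slices is equivalent to using h smaller per-head projection matrices; the weight names here are assumptions for illustration:

```python
def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """X: (seq, d_model); W_q, W_k, W_v, W_o: (d_model, d_model); h: number of heads."""
    seq, d_model = X.shape
    d_head = d_model // h                    # assumes d_model is divisible by h
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = []
    for i in range(h):
        s = slice(i * d_head, (i + 1) * d_head)
        heads.append(attention(Q[:, s], K[:, s], V[:, s]))  # each head works in its own subspace
    return np.concatenate(heads, axis=-1) @ W_o             # Concat(head_1, ..., head_h) W^O
```

With d_model = 512 and h = 8, as in the original paper, each head operates in a 64-dimensional subspace.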

Positional Encoding

Since attention is order-agnostic, positional encodings inject sequence order information:

PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right), \quad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)

These fixed encodings allow the model to reason about relative and absolute positions.
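
The fixed encodings can be precomputed once for all positions; a sketch in the same NumPy setup, assuming an even model dimension d:

```python
def positional_encoding(max_len, d):
    """Sinusoidal positional encodings of shape (max_len, d); assumes d is even."""
    pos = np.arange(max_len)[:, None]            # positions 0 .. max_len-1
    i = np.arange(d // 2)[None, :]               # index of each sin/cos pair
    angles = pos / np.power(10000.0, 2 * i / d)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions: cosine
    return pe
```

The resulting encoding is added to the token embeddings before the first layer.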

The Architecture

The full Transformer consists of:

  • Encoder: N layers of (Self-Attention → Add & Norm → FFN → Add & Norm)
  • Decoder: N layers of (Masked Self-Attention → Cross-Attention → FFN), with Add & Norm after each sublayer

The original paper used N=6 layers, d_model=512, and 8 attention heads.
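
A minimal sketch of a single encoder layer wired together from the pieces above, using the post-norm arrangement described in the paper; layer norm is simplified (no learned scale or shift) and dropout is omitted:

```python
def layer_norm(x, eps=1e-5):
    # Simplified layer norm: normalize each position's features (no learned gain/bias).
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_layer(X, attn_weights, W1, b1, W2, b2, h=8):
    """One encoder layer: Self-Attention -> Add & Norm -> FFN -> Add & Norm.

    attn_weights is a (W_q, W_k, W_v, W_o) tuple for multi_head_attention.
    """
    X = layer_norm(X + multi_head_attention(X, *attn_weights, h=h))  # residual + norm
    ffn = np.maximum(0.0, X @ W1 + b1) @ W2 + b2                     # position-wise FFN with ReLU
    return layer_norm(X + ffn)                                       # residual + norm
```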

Why No Recurrence?

RNNs process sequentially: position t must wait for t-1. Self-attention computes all positions in parallel:

| Property | RNN | Transformer |
| --- | --- | --- |
| Sequential ops | O(n) | O(1) |
| Max path length | O(n) | O(1) |
| Parallelizable | No | Yes |
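
To make the contrast concrete, a toy sketch reusing the attention function above (weight names are hypothetical): the RNN must loop over positions because each hidden state depends on the previous one, while self-attention covers every position with a few matrix products.

```python
def rnn_forward(X, W_x, W_h):
    # O(n) sequential steps: the loop over t cannot be parallelized.
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in X:
        h = np.tanh(x_t @ W_x + h @ W_h)
        states.append(h)
    return np.stack(states)

def self_attention_forward(X, W_q, W_k, W_v):
    # O(1) sequential depth: all positions are computed in the same matrix products.
    return attention(X @ W_q, X @ W_k, X @ W_v)
```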

Key Insight

Self-attention lets every token interact with every other token in a single layer, making Transformers both expressive and highly parallelizable.

Historical Impact

“Attention Is All You Need” is one of the most cited AI papers. The Transformer architecture underlies:

  • GPT series (decoder-only)
  • BERT (encoder-only)
  • T5, BART (encoder-decoder)
  • Vision Transformers (ViT)
  • Modern LLMs (GPT-4, Claude, etc.)