Attention Is All You Need

The 2017 paper that introduced the Transformer architecture, replacing recurrence with self-attention

Attention Is All You Need (Vaswani et al., 2017) is arguably the most influential deep learning paper of the decade. It introduced the Transformer, an architecture built entirely on attention mechanisms, eliminating the recurrence and convolution that dominated sequence modeling.

The Problem

Before Transformers, sequence-to-sequence models relied on RNNs (LSTMs, GRUs). These had fundamental limitations:

  • Sequential computation: position t must wait for t-1, preventing parallelization
  • Long-range dependencies: information degrades over many steps despite gating mechanisms
  • Training bottleneck: O(n) sequential operations per layer

The paper asked: Can we build a sequence model using only attention?

Core Innovation: Scaled Dot-Product Attention

The fundamental operation of the Transformer:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

where:

  • Q (Query): what each position is looking for
  • K (Key): what each position offers to match against
  • V (Value): the actual information to aggregate
  • \sqrt{d_k}: scaling factor to prevent softmax saturation
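As an illustration of the formula above, here is a minimal NumPy sketch of scaled dot-product attention (not the paper's code; the toy shapes and random inputs are assumptions for demonstration):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # raw query-key similarities, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                 # weighted sum of values

# Toy example: 4 positions, d_k = d_v = 8 (arbitrary sizes)
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Each output row is a convex combination of the value rows, with the mixing weights given by the softmaxed query-key scores.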

Multi-Head Attention

Rather than a single attention function, the paper uses multiple parallel heads:

\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O

\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)

Each head can learn different relational patterns—syntactic, semantic, positional—simultaneously.
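A minimal NumPy sketch of the multi-head computation, using random (untrained) projection matrices purely to show the shapes; the `softmax` helper and `seed` parameter are assumptions for illustration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, h, seed=0):
    """Multi-head self-attention sketch with random, untrained weights."""
    n, d_model = X.shape
    d_k = d_model // h
    rng = np.random.default_rng(seed)
    heads = []
    for _ in range(h):
        # Per-head projections W_i^Q, W_i^K, W_i^V (learned in a real model)
        Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        heads.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)
    W_o = rng.standard_normal((h * d_k, d_model))  # output projection W^O
    return np.concatenate(heads, axis=-1) @ W_o

X = np.random.default_rng(42).standard_normal((5, 16))
print(multi_head_attention(X, h=8).shape)  # (5, 16)
```

Note that each head works in a reduced dimension d_k = d_model / h, so the total cost is similar to one full-dimension attention.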

Positional Encoding

Since attention is permutation-invariant, position information is injected via sinusoidal encodings:

PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)

PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)

The authors chose sinusoidal functions because they hypothesized the model could learn to attend by relative position, since PE_{pos+k} can be represented as a linear function of PE_{pos}.
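The two formulas above can be computed directly; this NumPy sketch assumes an even d_model (true for the paper's 512):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same)."""
    pos = np.arange(max_len)[:, None]        # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]  # even dimension indices 2i
    angle = pos / (10000 ** (two_i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)              # even dimensions get sine
    pe[:, 1::2] = np.cos(angle)              # odd dimensions get cosine
    return pe

pe = sinusoidal_positional_encoding(50, 512)
print(pe.shape)  # (50, 512)
```

The encoding is simply added to the token embeddings, so every position receives a unique, bounded signature whose wavelengths form a geometric progression.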

Interactive Demo

Explore scaled dot-product attention, multi-head patterns, positional encodings, and the full architecture:

Attention Is All You Need — Interactive Explorer

The explorer's first step computes raw attention scores by multiplying Q and Kᵀ. For the example sentence "I love deep learning", the score matrix is:

| Q \ K    | I    | love | deep | learning |
|----------|------|------|------|----------|
| I        | -1.0 | -0.1 | 1.2  | 0.6      |
| love     | -0.7 | 0.2  | -0.3 | 0.9      |
| deep     | 0.2  | -0.3 | 1.0  | -0.3     |
| learning | 0.7  | -0.3 | 0.1  | -0.5     |

The Architecture

The Transformer follows an encoder-decoder structure:

Encoder (6 identical layers):

  1. Multi-head self-attention
  2. Add & layer norm (residual connection)
  3. Position-wise feed-forward network (two linear layers, ReLU)
  4. Add & layer norm
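The position-wise feed-forward network in step 3 is two linear maps with a ReLU between them, applied identically and independently at every position. A NumPy sketch (the 0.02 weight scale is an arbitrary assumption, not from the paper):

```python
import numpy as np

def position_wise_ffn(X, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied per position."""
    return np.maximum(0, X @ W1 + b1) @ W2 + b2

# Paper's sizes: d_model = 512, d_ff = 2048; 10 positions for the toy input
rng = np.random.default_rng(0)
d_model, d_ff, n = 512, 2048, 10
X = rng.standard_normal((n, d_model))
W1, b1 = rng.standard_normal((d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)) * 0.02, np.zeros(d_model)
print(position_wise_ffn(X, W1, b1, W2, b2).shape)  # (10, 512)
```

Because the same weights are shared across positions, the FFN mixes information within each position's vector but never across positions; that mixing is attention's job.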

Decoder (6 identical layers):

  1. Masked multi-head self-attention (causal: prevent attending to future positions)
  2. Add & norm
  3. Multi-head cross-attention (queries from decoder, keys/values from encoder)
  4. Add & norm
  5. Feed-forward network
  6. Add & norm
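The causal masking in decoder step 1 is typically implemented by adding negative infinity to the score of every future position before the softmax, which zeroes its attention weight. A small NumPy sketch (uniform zero scores are assumed just to make the effect visible):

```python
import numpy as np

def causal_mask(n):
    """-inf strictly above the diagonal: position i may attend only to j <= i."""
    return np.triu(np.full((n, n), -np.inf), k=1)

scores = np.zeros((4, 4))                 # pretend raw scores, all equal
masked = scores + causal_mask(4)
w = np.exp(masked - masked.max(axis=-1, keepdims=True))
w /= w.sum(axis=-1, keepdims=True)
print(np.round(w, 2))  # row i is uniform over positions 0..i, zero after
```

With equal scores, row i of the weights is uniform over the first i+1 positions, confirming that no probability mass leaks to the future.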

Key Hyperparameters

| Parameter | Value | Description |
|-----------|-------|-------------|
| N | 6 | Encoder/decoder layers |
| d_model | 512 | Embedding dimension |
| h | 8 | Attention heads |
| d_k = d_v | 64 | Per-head dimension (d_model / h) |
| d_ff | 2048 | Feed-forward inner dimension |
| Dropout | 0.1 | Applied throughout |

Why It Worked

| Property | RNN | Transformer |
|----------|-----|-------------|
| Sequential ops per layer | O(n) | O(1) |
| Maximum path length | O(n) | O(1) |
| Computation per layer | O(n·d²) | O(n²·d) |
| Parallelizable | No | Yes |

For typical sequence lengths where n < d, the self-attention layer is faster than recurrent layers. The O(1) maximum path length means any two positions can interact directly, solving the long-range dependency problem.

Results from the Paper

The Transformer achieved state-of-the-art on English-to-German and English-to-French translation:

  • EN-DE: 28.4 BLEU (surpassing all previous ensembles)
  • EN-FR: 41.0 BLEU (new single-model SOTA)
  • Training cost: 3.5 days on 8 P100 GPUs (fraction of competing models)

Legacy

This single paper spawned the entire modern AI landscape:

  • BERT (2018): encoder-only Transformer for understanding
  • GPT series (2018–): decoder-only Transformer for generation
  • T5, BART (2019): encoder-decoder variants
  • Vision Transformer (2020): applied to images
  • Modern LLMs (GPT-4, Claude, Gemini): scaled Transformers with trillions of tokens

The title proved prophetic—attention really was all you need.
