Attention Is All You Need

The 2017 paper that introduced the Transformer architecture

Attention Is All You Need is the 2017 paper that introduced the Transformer. If you want the general concept first, start with Transformer. This page focuses on what the paper changed, why it mattered, and what results it achieved.

Before this paper, neural machine translation was dominated by recurrent models such as LSTMs and GRUs. The paper’s big claim was that a sequence model could work using attention alone.

The Problem the Paper Solved

Older sequence-to-sequence models had three recurring problems:

  • They were slow to train because tokens had to be processed one after another
  • They struggled with long-range dependencies because information had to travel through many recurrent steps
  • They compressed the whole input too aggressively in encoder-decoder setups

The paper asked: what if every token could directly look at every other token it needs?

The Core Innovation

The main building block was scaled dot-product attention:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

In plain English:

  • compare each token’s query to other tokens’ keys
  • turn those scores into weights
  • use the weights to mix the values

That gave the model a direct way to connect related words, even when they were far apart in the sentence.
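The three steps above can be sketched directly in NumPy. This is a minimal illustration of scaled dot-product attention, not the paper's implementation; the toy shapes and random inputs are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the row max for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # compare queries to keys
    weights = softmax(scores, axis=-1)  # turn scores into weights
    return weights @ V                  # use the weights to mix the values

# toy example: 4 tokens, key dimension 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8) — one mixed value vector per token
```

The 1/√d_k scaling keeps the dot products from growing with the key dimension, which would push the softmax into regions with vanishing gradients.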

Multi-Head Attention

Instead of computing a single attention pattern, multi-head attention runs several in parallel, each with its own learned projections:

\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O

\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)

This let different heads focus on different relationships, such as grammar, word alignment, or longer-range context.
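A minimal sketch of the project-attend-concatenate pattern, assuming single-matrix per-head projections stored in a plain dict (the parameter names `W_q`, `W_k`, `W_v`, `W_o` are illustrative, not from the paper's code):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, params, h):
    """Run h attention heads in parallel, concatenate, then project."""
    heads = []
    for i in range(h):
        Q = X @ params["W_q"][i]   # per-head query projection
        K = X @ params["W_k"][i]   # per-head key projection
        V = X @ params["W_v"][i]   # per-head value projection
        d_k = Q.shape[-1]
        w = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
        heads.append(w @ V)
    return np.concatenate(heads, axis=-1) @ params["W_o"]

# toy setup: d_model = 16, h = 4 heads of width d_k = 4
rng = np.random.default_rng(1)
d_model, h, d_k = 16, 4, 4
params = {
    "W_q": [rng.normal(size=(d_model, d_k)) for _ in range(h)],
    "W_k": [rng.normal(size=(d_model, d_k)) for _ in range(h)],
    "W_v": [rng.normal(size=(d_model, d_k)) for _ in range(h)],
    "W_o": rng.normal(size=(h * d_k, d_model)),
}
X = rng.normal(size=(5, d_model))                 # 5 tokens
print(multi_head_attention(X, params, h).shape)   # (5, 16)
```

Note how the total width is preserved: h heads of width d_k = d_model/h concatenate back to d_model, so multi-head attention costs roughly the same as one full-width head.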

Positional Encoding

Attention by itself does not encode order, so the paper added sinusoidal positional encodings:

PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)

PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)

You can think of these as extra signals that tell the model where each token appears in the sequence.
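The two formulas can be vectorized in a few lines. This is a straightforward rendering of the sinusoidal scheme above; the function name is our own.

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(n_positions)[:, None]        # shape (n, 1)
    i = np.arange(d_model // 2)[None, :]         # shape (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(50, 512)
print(pe.shape)    # (50, 512)
print(pe[0, :4])   # position 0: [0, 1, 0, 1] — sin(0)=0, cos(0)=1
```

Each dimension pair oscillates at a different wavelength, so every position gets a unique fingerprint, and relative offsets correspond to fixed linear transformations of the encoding.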

Worked Example

Raw attention scores, computed by multiplying Q and Kᵀ, for the sentence "I love deep learning":

              I      love   deep   learning
I           -1.0    -0.1    1.2    0.6
love        -0.7     0.2   -0.3    0.9
deep         0.2    -0.3    1.0   -0.3
learning     0.7    -0.3    0.1   -0.5
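A row-wise softmax turns raw scores like these into attention weights. A minimal sketch, using the example score matrix (scaling already assumed applied):

```python
import numpy as np

# raw attention scores for "I love deep learning" (from the example above)
scores = np.array([
    [-1.0, -0.1,  1.2,  0.6],
    [-0.7,  0.2, -0.3,  0.9],
    [ 0.2, -0.3,  1.0, -0.3],
    [ 0.7, -0.3,  0.1, -0.5],
])

# row-wise softmax: each token's weights over the other tokens sum to 1
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)
print(weights.round(2))
```

In the first row, "I" places most of its weight on "deep" (score 1.2), showing how larger scores dominate after the softmax.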

The Architecture in Words

The paper used an encoder-decoder Transformer:

Encoder:

  1. Multi-head self-attention
  2. Residual connection + layer normalization
  3. Feed-forward network
  4. Another residual connection + layer normalization

Decoder:

  1. Masked self-attention so future tokens stay hidden
  2. Cross-attention to the encoder outputs
  3. Feed-forward network

Each decoder sublayer is likewise wrapped in a residual connection and layer normalization.
The original model used 6 encoder layers, 6 decoder layers, model width 512, and 8 heads.
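The encoder recipe can be sketched end to end. This is a deliberately stripped-down illustration, assuming a single attention head with identity projections and omitting the learned scale/shift in layer norm; it shows the sublayer wiring, not the paper's full model.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each token vector to zero mean and unit variance
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def self_attention(X):
    # single head, identity Q/K/V projections, for brevity
    d_k = X.shape[-1]
    s = X @ X.T / np.sqrt(d_k)
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return (e / e.sum(axis=-1, keepdims=True)) @ X

def feed_forward(X, W1, W2):
    # position-wise FFN: linear -> ReLU -> linear
    return np.maximum(X @ W1, 0) @ W2

def encoder_layer(X, W1, W2):
    # sublayer 1: self-attention + residual + layer norm
    X = layer_norm(X + self_attention(X))
    # sublayer 2: feed-forward + residual + layer norm
    return layer_norm(X + feed_forward(X, W1, W2))

rng = np.random.default_rng(2)
d_model, d_ff = 8, 32
X = rng.normal(size=(5, d_model))
out = encoder_layer(X, rng.normal(size=(d_model, d_ff)),
                    rng.normal(size=(d_ff, d_model)))
print(out.shape)  # (5, 8) — same shape in and out, so layers stack
```

Because every layer maps (tokens, d_model) to the same shape, the original model simply stacks six of these in the encoder.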

Why It Worked

Property                    RNN     Transformer
Sequential ops per layer    O(n)    O(1)
Maximum path length         O(n)    O(1)
Parallelizable              No      Yes

Two practical consequences mattered:

  • training became much more parallel
  • any two positions could interact in one step instead of many

Results from the Paper

The paper reached state-of-the-art results on the WMT 2014 machine translation benchmarks:

  • English→German: 28.4 BLEU
  • English→French: 41.0 BLEU
  • Training time: about 3.5 days on 8 NVIDIA P100 GPUs

For students, the exact numbers matter less than the message: this was not just a neat idea; it won on real benchmarks.

Legacy

This paper directly led to:

  • BERT for bidirectional language understanding
  • GPT for autoregressive text generation
  • Vision Transformer for images
  • modern LLMs built on scaled Transformer variants

What To Remember

  • The paper introduced the Transformer
  • The key mechanism was self-attention
  • The breakthrough was both conceptual and practical: better long-range modeling and much better parallelism
