Attention Is All You Need

The 2017 paper that introduced the Transformer architecture

Attention Is All You Need is the 2017 paper that introduced the Transformer. If you want the general concept first, start with Transformer. This page focuses on what the paper changed, why it mattered, and what results it achieved.

Before this paper, neural machine translation was dominated by recurrent models such as LSTMs and GRUs. The paper’s big claim was that a sequence model could work using attention alone.

The Problem the Paper Solved

Older sequence-to-sequence models had three recurring problems:

  • They were slow to train because tokens had to be processed one after another
  • They struggled with long-range dependencies because information had to travel through many recurrent steps
  • They compressed the whole input too aggressively in encoder-decoder setups

The paper asked: what if every token could directly look at every other token it needs?

The Core Innovation

The main building block was scaled dot-product attention:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

In plain English:

  • compare each token’s query to other tokens’ keys
  • turn those scores into weights
  • use the weights to mix the values

That gave the model a direct way to connect related words, even when they were far apart in the sentence.
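The three steps above can be sketched directly in NumPy. This is a minimal illustration of scaled dot-product attention, not the paper's implementation; the toy shapes and random inputs are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the row max for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # compare queries to keys
    weights = softmax(scores, axis=-1)  # turn scores into weights
    return weights @ V                  # use the weights to mix the values

# toy example: 4 tokens, key dimension 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8) — one mixed value vector per token
```

The 1/√d_k scaling keeps the dot products from growing with the key dimension, which would push the softmax into regions with vanishing gradients.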

Multi-Head Attention

Instead of computing a single attention pattern, multi-head attention runs several in parallel, each with its own learned projections:

\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O

\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)

This let different heads focus on different relationships, such as grammar, word alignment, or longer-range context.
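A minimal sketch of the project-attend-concatenate pattern, assuming single-matrix per-head projections stored in a plain dict (the parameter names `W_q`, `W_k`, `W_v`, `W_o` are illustrative, not from the paper's code):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, params, h):
    """Run h attention heads in parallel, concatenate, then project."""
    heads = []
    for i in range(h):
        Q = X @ params["W_q"][i]   # per-head query projection
        K = X @ params["W_k"][i]   # per-head key projection
        V = X @ params["W_v"][i]   # per-head value projection
        d_k = Q.shape[-1]
        w = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
        heads.append(w @ V)
    return np.concatenate(heads, axis=-1) @ params["W_o"]

# toy setup: d_model = 16, h = 4 heads of width d_k = 4
rng = np.random.default_rng(1)
d_model, h, d_k = 16, 4, 4
params = {
    "W_q": [rng.normal(size=(d_model, d_k)) for _ in range(h)],
    "W_k": [rng.normal(size=(d_model, d_k)) for _ in range(h)],
    "W_v": [rng.normal(size=(d_model, d_k)) for _ in range(h)],
    "W_o": rng.normal(size=(h * d_k, d_model)),
}
X = rng.normal(size=(5, d_model))                 # 5 tokens
print(multi_head_attention(X, params, h).shape)   # (5, 16)
```

Note how the total width is preserved: h heads of width d_k = d_model/h concatenate back to d_model, so multi-head attention costs roughly the same as one full-width head.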

Positional Encoding

Attention by itself does not encode order, so the paper added sinusoidal positional encodings:

PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)

PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)

You can think of these as extra signals that tell the model where each token appears in the sequence.
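The two formulas can be vectorized in a few lines. This is a straightforward rendering of the sinusoidal scheme above; the function name is our own.

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(n_positions)[:, None]        # shape (n, 1)
    i = np.arange(d_model // 2)[None, :]         # shape (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(50, 512)
print(pe.shape)    # (50, 512)
print(pe[0, :4])   # position 0: [0, 1, 0, 1] — sin(0)=0, cos(0)=1
```

Each dimension pair oscillates at a different wavelength, so every position gets a unique fingerprint, and relative offsets correspond to fixed linear transformations of the encoding.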

Worked Example

Raw attention scores, computed by multiplying Q and Kᵀ, for the sentence "I love deep learning":

              I      love   deep   learning
I           -1.0    -0.1    1.2    0.6
love        -0.7     0.2   -0.3    0.9
deep         0.2    -0.3    1.0   -0.3
learning     0.7    -0.3    0.1   -0.5
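A row-wise softmax turns raw scores like these into attention weights. A minimal sketch, using the example score matrix (scaling already assumed applied):

```python
import numpy as np

# raw attention scores for "I love deep learning" (from the example above)
scores = np.array([
    [-1.0, -0.1,  1.2,  0.6],
    [-0.7,  0.2, -0.3,  0.9],
    [ 0.2, -0.3,  1.0, -0.3],
    [ 0.7, -0.3,  0.1, -0.5],
])

# row-wise softmax: each token's weights over the other tokens sum to 1
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)
print(weights.round(2))
```

In the first row, "I" places most of its weight on "deep" (score 1.2), showing how larger scores dominate after the softmax.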

The Architecture in Words

The paper used an encoder-decoder Transformer:

Encoder:

  1. Multi-head self-attention
  2. Residual connection + layer normalization
  3. Feed-forward network
  4. Another residual connection + layer normalization

Decoder:

  1. Masked self-attention so future tokens stay hidden
  2. Cross-attention to the encoder outputs
  3. Feed-forward network

Each decoder sublayer is likewise wrapped in a residual connection and layer normalization.
The original model used 6 encoder layers, 6 decoder layers, model width 512, and 8 heads.
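The encoder recipe can be sketched end to end. This is a deliberately stripped-down illustration, assuming a single attention head with identity projections and omitting the learned scale/shift in layer norm; it shows the sublayer wiring, not the paper's full model.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each token vector to zero mean and unit variance
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def self_attention(X):
    # single head, identity Q/K/V projections, for brevity
    d_k = X.shape[-1]
    s = X @ X.T / np.sqrt(d_k)
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return (e / e.sum(axis=-1, keepdims=True)) @ X

def feed_forward(X, W1, W2):
    # position-wise FFN: linear -> ReLU -> linear
    return np.maximum(X @ W1, 0) @ W2

def encoder_layer(X, W1, W2):
    # sublayer 1: self-attention + residual + layer norm
    X = layer_norm(X + self_attention(X))
    # sublayer 2: feed-forward + residual + layer norm
    return layer_norm(X + feed_forward(X, W1, W2))

rng = np.random.default_rng(2)
d_model, d_ff = 8, 32
X = rng.normal(size=(5, d_model))
out = encoder_layer(X, rng.normal(size=(d_model, d_ff)),
                    rng.normal(size=(d_ff, d_model)))
print(out.shape)  # (5, 8) — same shape in and out, so layers stack
```

Because every layer maps (tokens, d_model) to the same shape, the original model simply stacks six of these in the encoder.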

Why It Worked

Property                    RNN     Transformer
Sequential ops per layer    O(n)    O(1)
Maximum path length         O(n)    O(1)
Parallelizable              No      Yes

Two practical consequences mattered:

  • training became much more parallel
  • any two positions could interact in one step instead of many

Results from the Paper

The paper reached state-of-the-art results on the WMT 2014 machine translation benchmarks:

  • English→German: 28.4 BLEU
  • English→French: 41.0 BLEU
  • Training time: about 3.5 days on 8 NVIDIA P100 GPUs

For students, the exact numbers matter less than the message: this was not just a neat idea; it won on real benchmarks.

Legacy

This paper directly led to:

  • BERT for bidirectional language understanding
  • GPT for autoregressive text generation
  • Vision Transformer for images
  • modern LLMs built on scaled Transformer variants

What To Remember

  • The paper introduced the Transformer
  • The key mechanism was self-attention
  • The breakthrough was both conceptual and practical: better long-range modeling and much better parallelism
