Vision Transformer (ViT)

Applying Transformers directly to image patches for visual recognition

Vision Transformer (ViT) demonstrated that pure Transformers, applied directly to sequences of image patches, can match or exceed CNNs on image classification—especially when pre-trained on large datasets.

Core Idea: Images as Sequences

Instead of convolutions, ViT treats an image as a sequence of patches:

  1. Split image into fixed-size patches (e.g., 16×16)
  2. Flatten each patch into a vector
  3. Project through a linear embedding
  4. Add positional embeddings
  5. Process with standard Transformer encoder
$$z_0 = [x_{\text{class}};\; x_1^p E;\; x_2^p E;\; \ldots;\; x_N^p E] + E_{pos}$$

where $E$ is the patch embedding projection and $E_{pos}$ are learnable position embeddings.
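The embedding pipeline above can be sketched in numpy. Shapes follow the ViT-B/16 defaults ($P = 16$, $D = 768$); the random matrices are illustrative stand-ins for learned parameters.

```python
import numpy as np

H, W, C, P, D = 224, 224, 3, 16, 768
rng = np.random.default_rng(0)
image = rng.standard_normal((H, W, C))

# 1-2. Split into P×P patches and flatten each into a P²·C vector.
patches = image.reshape(H // P, P, W // P, P, C)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, P * P * C)  # (196, 768)

# 3. Linear patch embedding E ∈ R^{(P²·C)×D}.
E = rng.standard_normal((P * P * C, D)) * 0.02
tokens = patches @ E                                               # (196, 768)

# 4. Prepend the [CLS] token and add position embeddings E_pos.
cls = rng.standard_normal((1, D)) * 0.02
E_pos = rng.standard_normal((tokens.shape[0] + 1, D)) * 0.02
z0 = np.concatenate([cls, tokens], axis=0) + E_pos                 # (197, 768)
print(z0.shape)  # (197, 768)
```

Step 5 then feeds `z0` through the standard Transformer encoder described below.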

Patch Embedding

For an image of size $H \times W$ with $C$ channels and patch size $P$, the number of patches is:

$$N = \frac{HW}{P^2}$$

Each flattened patch $x_p \in \mathbb{R}^{P^2 \cdot C}$ is projected to dimension $D$:

$$x_p E \in \mathbb{R}^D, \quad E \in \mathbb{R}^{(P^2 \cdot C) \times D}$$
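Plugging in the common configurations as a quick check (the helper name `num_patches` is illustrative):

```python
# N = HW / P² for a square input, assuming image dims divisible by P.
def num_patches(H, W, P):
    assert H % P == 0 and W % P == 0, "image dims must be divisible by P"
    return (H * W) // (P * P)

print(num_patches(224, 224, 16))  # 196 (ViT-B/16, ViT-L/16)
print(num_patches(224, 224, 14))  # 256 (ViT-H/14)
```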

The [CLS] Token

A learnable embedding $x_{\text{class}}$ is prepended to the sequence. After processing through $L$ Transformer layers, its output state serves as the image representation:

$$y = \text{LN}(z_L^0)$$

This is passed to an MLP head for classification.
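A sketch of this head in numpy, assuming `z_L` is the final-layer output of shape $(N{+}1) \times D$; the names `layer_norm` and `W_head` are illustrative, and the head here is a single linear layer rather than a full MLP:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the feature dimension (no learned scale/shift for brevity).
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
D, num_classes = 768, 1000
z_L = rng.standard_normal((197, D))   # stand-in for the last encoder layer's output

y = layer_norm(z_L[0])                # y = LN(z_L^0): the [CLS] token's output
W_head = rng.standard_normal((D, num_classes)) * 0.02
logits = y @ W_head                   # class scores
print(logits.shape)  # (1000,)
```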

Architecture

Standard Transformer encoder with:

  • Multi-head self-attention
  • MLP blocks (GELU activation)
  • Layer normalization (pre-norm variant)
  • Residual connections
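The four ingredients above can be combined into a minimal pre-norm encoder block. This sketch uses single-head attention for brevity (real ViT uses multi-head attention), and all weight matrices are illustrative random stand-ins:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def gelu(x):
    # Tanh approximation of GELU.
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def encoder_block(z, Wq, Wk, Wv, Wo, W1, W2):
    # Pre-norm self-attention with a residual connection.
    h = layer_norm(z)
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    z = z + attn @ v @ Wo
    # Pre-norm MLP (GELU) with a residual connection.
    h = layer_norm(z)
    z = z + gelu(h @ W1) @ W2
    return z

rng = np.random.default_rng(0)
D, D_mlp = 64, 256
Ws = [rng.standard_normal(s) * 0.02 for s in
      [(D, D), (D, D), (D, D), (D, D), (D, D_mlp), (D_mlp, D)]]
z = rng.standard_normal((197, D))
out = encoder_block(z, *Ws)
print(out.shape)  # (197, 64)
```

The full encoder simply stacks $L$ such blocks; the sequence shape is preserved throughout.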
| Model    | Layers | Hidden size | MLP size | Heads | Params |
|----------|--------|-------------|----------|-------|--------|
| ViT-B/16 | 12     | 768         | 3072     | 12    | 86M    |
| ViT-L/16 | 24     | 1024        | 4096     | 16    | 307M   |
| ViT-H/14 | 32     | 1280        | 5120     | 16    | 632M   |
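The table's parameter counts can be sanity-checked with a back-of-the-envelope estimate; this rough count ignores biases, LayerNorm parameters, position embeddings, and the classification head:

```python
# Rough parameter count for ViT-B/16.
L, D, D_mlp, P, C = 12, 768, 3072, 16, 3
attn = 4 * D * D               # Q, K, V, and output projection matrices
mlp = 2 * D * D_mlp            # two MLP weight matrices
patch_embed = (P * P * C) * D  # linear patch embedding E
total = L * (attn + mlp) + patch_embed
print(f"{total / 1e6:.0f}M")   # ≈ 86M, matching the table
```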

Interactive Visualization

See how ViT splits images into patches and applies self-attention:

*(Interactive demo: the image is split into a 4×4 grid of 16 patches plus a [CLS] token; selecting a patch shows its attention weights.)*

Process: Patches → Linear projection → + Position embeddings → Transformer → [CLS] output → Classification

Key Insight: Scale Matters

ViT underperforms comparable CNNs when trained on small datasets, because it lacks convolutional inductive biases such as locality and translation equivariance. With large-scale pre-training (ImageNet-21k, JFT-300M), however, it excels:

“When pre-trained on large amounts of data, the Transformer architecture can match or exceed state-of-the-art CNNs.”

Impact

ViT unified vision and language under the same architecture, enabling:

  • CLIP (vision-language)
  • DALL-E (image generation)
  • Multimodal models (GPT-4V, Gemini)

The patch embedding + Transformer paradigm now dominates computer vision.
