Vision Transformer (ViT)

Applying Transformers directly to image patches for visual recognition

Vision Transformer (ViT) demonstrated that pure Transformers, applied directly to sequences of image patches, can match or exceed CNNs on image classification—especially when pre-trained on large datasets.

Core Idea: Images as Sequences

Instead of convolutions, ViT treats an image as a sequence of patches:

  1. Split image into fixed-size patches (e.g., 16×16)
  2. Flatten each patch into a vector
  3. Project through a linear embedding
  4. Add positional embeddings
  5. Process with standard Transformer encoder
$$z_0 = [x_{\text{class}};\; x_1^p E;\; x_2^p E;\; \ldots;\; x_N^p E] + E_{pos}$$

where $E$ is the patch embedding projection and $E_{pos}$ are learnable position embeddings.
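The embedding pipeline above can be sketched in numpy. Shapes follow the ViT-B/16 defaults ($P = 16$, $D = 768$); the random matrices are illustrative stand-ins for learned parameters.

```python
import numpy as np

H, W, C, P, D = 224, 224, 3, 16, 768
rng = np.random.default_rng(0)
image = rng.standard_normal((H, W, C))

# 1-2. Split into P×P patches and flatten each into a P²·C vector.
patches = image.reshape(H // P, P, W // P, P, C)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, P * P * C)  # (196, 768)

# 3. Linear patch embedding E ∈ R^{(P²·C)×D}.
E = rng.standard_normal((P * P * C, D)) * 0.02
tokens = patches @ E                                               # (196, 768)

# 4. Prepend the [CLS] token and add position embeddings E_pos.
cls = rng.standard_normal((1, D)) * 0.02
E_pos = rng.standard_normal((tokens.shape[0] + 1, D)) * 0.02
z0 = np.concatenate([cls, tokens], axis=0) + E_pos                 # (197, 768)
print(z0.shape)  # (197, 768)
```

Step 5 then feeds `z0` through the standard Transformer encoder described below.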

Patch Embedding

For an image of size $H \times W$ with $C$ channels and patch size $P$, the number of patches is:

$$N = \frac{HW}{P^2}$$

Each flattened patch $x_p \in \mathbb{R}^{P^2 \cdot C}$ is projected to dimension $D$:

$$x_p E \in \mathbb{R}^D, \quad E \in \mathbb{R}^{(P^2 \cdot C) \times D}$$
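Plugging in the common configurations as a quick check (the helper name `num_patches` is illustrative):

```python
# N = HW / P² for a square input, assuming image dims divisible by P.
def num_patches(H, W, P):
    assert H % P == 0 and W % P == 0, "image dims must be divisible by P"
    return (H * W) // (P * P)

print(num_patches(224, 224, 16))  # 196 (ViT-B/16, ViT-L/16)
print(num_patches(224, 224, 14))  # 256 (ViT-H/14)
```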

The [CLS] Token

A learnable embedding $x_{\text{class}}$ is prepended to the sequence. After processing through $L$ Transformer layers, its output state serves as the image representation:

$$y = \text{LN}(z_L^0)$$

This is passed to an MLP head for classification.
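A sketch of this head in numpy, assuming `z_L` is the final-layer output of shape $(N{+}1) \times D$; the names `layer_norm` and `W_head` are illustrative, and the head here is a single linear layer rather than a full MLP:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the feature dimension (no learned scale/shift for brevity).
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
D, num_classes = 768, 1000
z_L = rng.standard_normal((197, D))   # stand-in for the last encoder layer's output

y = layer_norm(z_L[0])                # y = LN(z_L^0): the [CLS] token's output
W_head = rng.standard_normal((D, num_classes)) * 0.02
logits = y @ W_head                   # class scores
print(logits.shape)  # (1000,)
```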

Architecture

Standard Transformer encoder with:

  • Multi-head self-attention
  • MLP blocks (GELU activation)
  • Layer normalization (pre-norm variant)
  • Residual connections
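The four ingredients above can be combined into a minimal pre-norm encoder block. This sketch uses single-head attention for brevity (real ViT uses multi-head attention), and all weight matrices are illustrative random stand-ins:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def gelu(x):
    # Tanh approximation of GELU.
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def encoder_block(z, Wq, Wk, Wv, Wo, W1, W2):
    # Pre-norm self-attention with a residual connection.
    h = layer_norm(z)
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    z = z + attn @ v @ Wo
    # Pre-norm MLP (GELU) with a residual connection.
    h = layer_norm(z)
    z = z + gelu(h @ W1) @ W2
    return z

rng = np.random.default_rng(0)
D, D_mlp = 64, 256
Ws = [rng.standard_normal(s) * 0.02 for s in
      [(D, D), (D, D), (D, D), (D, D), (D, D_mlp), (D_mlp, D)]]
z = rng.standard_normal((197, D))
out = encoder_block(z, *Ws)
print(out.shape)  # (197, 64)
```

The full encoder simply stacks $L$ such blocks; the sequence shape is preserved throughout.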
| Model    | Layers | Hidden size | MLP size | Heads | Params |
|----------|--------|-------------|----------|-------|--------|
| ViT-B/16 | 12     | 768         | 3072     | 12    | 86M    |
| ViT-L/16 | 24     | 1024        | 4096     | 16    | 307M   |
| ViT-H/14 | 32     | 1280        | 5120     | 16    | 632M   |
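The table's parameter counts can be sanity-checked with a back-of-the-envelope estimate; this rough count ignores biases, LayerNorm parameters, position embeddings, and the classification head:

```python
# Rough parameter count for ViT-B/16.
L, D, D_mlp, P, C = 12, 768, 3072, 16, 3
attn = 4 * D * D               # Q, K, V, and output projection matrices
mlp = 2 * D * D_mlp            # two MLP weight matrices
patch_embed = (P * P * C) * D  # linear patch embedding E
total = L * (attn + mlp) + patch_embed
print(f"{total / 1e6:.0f}M")   # ≈ 86M, matching the table
```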

Interactive Visualization

See how ViT splits images into patches and applies self-attention:

*(Interactive demo: the image is split into a 4×4 grid of 16 patches plus a [CLS] token; selecting a patch shows its attention weights.)*

Process: Patches → Linear projection → + Position embeddings → Transformer → [CLS] output → Classification

Key Insight: Scale Matters

ViT underperforms comparable CNNs when trained on small datasets, because it lacks convolutional inductive biases such as locality and translation equivariance. With large-scale pre-training (ImageNet-21k, JFT-300M), however, it excels:

“When pre-trained on large amounts of data, the Transformer architecture can match or exceed state-of-the-art CNNs.”

Impact

ViT unified vision and language under the same architecture, enabling:

  • CLIP (vision-language)
  • DALL-E (image generation)
  • Multimodal models (GPT-4V, Gemini)

The patch embedding + Transformer paradigm now dominates computer vision.
