Applying Transformers directly to image patches for visual recognition
Vision Transformer (ViT) demonstrated that pure Transformers, applied directly to sequences of image patches, can match or exceed CNNs on image classification—especially when pre-trained on large datasets.
## Core Idea: Images as Sequences
Instead of convolutions, ViT treats an image as a sequence of patches:
- Split image into fixed-size patches (e.g., 16×16)
- Flatten each patch into a vector
- Project through a linear embedding
- Add positional embeddings
- Process with standard Transformer encoder
The resulting input sequence is

$$z_0 = [x_{\text{class}};\; x_p^1 E;\; x_p^2 E;\; \dots;\; x_p^N E] + E_{pos}$$

where $E \in \mathbb{R}^{(P^2 \cdot C) \times D}$ is the patch embedding projection and $E_{pos} \in \mathbb{R}^{(N+1) \times D}$ are learnable position embeddings.
## Patch Embedding
For an image of size $H \times W \times C$ with patch size $P$, the number of patches is:

$$N = \frac{HW}{P^2}$$

Each flattened patch $x_p^i \in \mathbb{R}^{P^2 \cdot C}$ is projected to dimension $D$:

$$x_p^i E, \qquad E \in \mathbb{R}^{(P^2 \cdot C) \times D}$$
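The split-flatten-project pipeline can be sketched in a few lines of NumPy. Shapes below assume a 224×224 RGB image with 16×16 patches (the ViT-B/16 setting); the projection matrix is a random stand-in for the learned $E$:

```python
import numpy as np

H, W, C, P, D = 224, 224, 3, 16, 768
N = (H // P) * (W // P)  # number of patches: 196

img = np.random.rand(H, W, C)

# Split into non-overlapping P x P patches, then flatten each to a P*P*C vector.
patches = img.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(N, P * P * C)  # (196, 768)

# Learned linear projection E of shape (P^2*C, D); random stand-in here.
E = np.random.randn(P * P * C, D) / np.sqrt(P * P * C)
tokens = patches @ E  # (196, 768) patch embeddings
```

Note that for ViT-B/16 the flattened patch dimension ($16 \cdot 16 \cdot 3 = 768$) happens to equal the hidden dimension $D$, but the two are independent hyperparameters.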
## The [CLS] Token
A learnable embedding $x_{\text{class}}$ is prepended to the sequence. After processing through $L$ Transformer layers, its output serves as the image representation:

$$y = \mathrm{LN}(z_L^0)$$
This is passed to an MLP head for classification.
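A minimal sketch of how the [CLS] token and position embeddings are assembled, and how the [CLS] output row feeds the classification head. All weights are random stand-ins for learned parameters, and the encoder itself is omitted:

```python
import numpy as np

D, N, num_classes = 768, 196, 1000

tokens = np.random.randn(N, D)             # patch embeddings from the previous step

cls_token = np.zeros((1, D))               # learnable in practice; zeros as stand-in
E_pos = np.random.randn(N + 1, D) * 0.02   # learnable position embeddings

# Prepend [CLS] and add position information: z_0 has N + 1 rows.
z0 = np.concatenate([cls_token, tokens], axis=0) + E_pos   # (197, 768)

# After L encoder layers, only the [CLS] row is used for classification.
zL = z0                                    # encoder layers omitted in this sketch
W_head = np.random.randn(D, num_classes) / np.sqrt(D)
logits = zL[0] @ W_head                    # (1000,) class logits
```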
## Architecture
Standard Transformer encoder with:
- Multi-head self-attention
- MLP blocks (GELU activation)
- Layer normalization (pre-norm variant)
- Residual connections
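The four components above combine into one pre-norm encoder block: $x + \mathrm{MHSA}(\mathrm{LN}(x))$ followed by $x + \mathrm{MLP}(\mathrm{LN}(x))$. A self-contained NumPy sketch (random weights, tanh-approximate GELU; dimensions here are illustrative, not a real ViT config):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def encoder_block(x, params, heads):
    """One pre-norm ViT encoder block: x + MHSA(LN(x)), then x + MLP(LN(x))."""
    n, d = x.shape
    dh = d // heads
    Wq, Wk, Wv, Wo, W1, b1, W2, b2 = params

    # Multi-head self-attention on the pre-normalized input.
    h = layer_norm(x)
    q = (h @ Wq).reshape(n, heads, dh).transpose(1, 0, 2)
    k = (h @ Wk).reshape(n, heads, dh).transpose(1, 0, 2)
    v = (h @ Wv).reshape(n, heads, dh).transpose(1, 0, 2)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))
    out = (attn @ v).transpose(1, 0, 2).reshape(n, d) @ Wo
    x = x + out                               # residual connection

    # MLP block with GELU activation (tanh approximation).
    h = layer_norm(x) @ W1 + b1
    h = 0.5 * h * (1 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))
    return x + h @ W2 + b2                    # residual connection

# Tiny demo: 10 tokens, d=64, 4 heads, random weights.
rng = np.random.default_rng(0)
d, n = 64, 10
params = ([rng.standard_normal((d, d)) * 0.02 for _ in range(4)]
          + [rng.standard_normal((d, 4 * d)) * 0.02, np.zeros(4 * d),
             rng.standard_normal((4 * d, d)) * 0.02, np.zeros(d)])
y = encoder_block(rng.standard_normal((n, d)), params, heads=4)
```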
| Model | Layers | Hidden dim | MLP dim | Heads | Params |
|---|---|---|---|---|---|
| ViT-B/16 | 12 | 768 | 3072 | 12 | 86M |
| ViT-L/16 | 24 | 1024 | 4096 | 16 | 307M |
| ViT-H/14 | 32 | 1280 | 5120 | 16 | 632M |
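The parameter counts in the table follow from the configs: each layer contributes roughly $4D^2$ attention weights plus $2 D \cdot D_{mlp}$ MLP weights. A rough accounting (assuming 224×224 input, 1000 classes, and biases included; not an exact reproduction of the released checkpoints):

```python
def vit_param_count(layers, d, mlp, patch=16, channels=3, img=224, num_classes=1000):
    """Rough ViT parameter count (weights + biases) from a table config."""
    n_patches = (img // patch) ** 2
    per_layer = (4 * d * d + 4 * d        # attention: qkv + output projections
                 + d * mlp + mlp          # MLP up-projection
                 + mlp * d + d            # MLP down-projection
                 + 4 * d)                 # two layer norms
    embed = patch * patch * channels * d + d       # patch projection
    pos = (n_patches + 1) * d + d                  # position + [CLS] embeddings
    head = 2 * d + d * num_classes + num_classes   # final LN + classifier
    return layers * per_layer + embed + pos + head

vit_b16 = vit_param_count(12, 768, 3072)  # ~86M, matching the table row
```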
**Pipeline:** Patches → Linear projection → + Position embeddings → Transformer → [CLS] output → Classification
## Key Insight: Scale Matters
ViT underperforms comparable CNNs when trained on small datasets, since it lacks convolutional inductive biases such as locality and translation equivariance. But with large-scale pre-training (ImageNet-21k, JFT-300M), it excels:
“When pre-trained on large amounts of data, the Transformer architecture can match or exceed state-of-the-art CNNs.”
## Impact
ViT unified vision and language under the same architecture, enabling:
- CLIP (vision-language)
- DALL-E (image generation)
- Multimodal models (GPT-4V, Gemini)
The patch embedding + Transformer paradigm now dominates computer vision.