Learning visual concepts from natural language supervision
CLIP (Contrastive Language-Image Pre-training) learns to connect images and text by training on 400 million image-text pairs from the internet. This simple approach produces remarkably flexible visual representations.
## Core Idea: Learn from Captions
Instead of predicting fixed categories, CLIP learns to match images with their natural language descriptions:

$$\text{sim}(I, T) = \frac{f(I) \cdot g(T)}{\|f(I)\|\,\|g(T)\|}$$

where $f$ is an image encoder (ViT or ResNet) and $g$ is a text encoder (Transformer).
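This matching score can be sketched in a few lines of NumPy; `image_emb` and `text_emb` stand in for the raw encoder outputs:

```python
import numpy as np

def cosine_similarity(image_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Score an image-text pair by the cosine of the angle between embeddings."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_emb = text_emb / np.linalg.norm(text_emb)
    return float(image_emb @ text_emb)
```

Because both embeddings are L2-normalized, the dot product equals the cosine similarity, so scores fall in [-1, 1].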
## Contrastive Learning
Given a batch of $N$ image-text pairs, CLIP maximizes the similarity of the $N$ correct pairs while minimizing the similarity of the $N^2 - N$ incorrect pairings:

$$\mathcal{L} = -\frac{1}{2N} \sum_{i=1}^{N} \left[ \log \frac{\exp(\text{sim}(I_i, T_i)/\tau)}{\sum_{j=1}^{N} \exp(\text{sim}(I_i, T_j)/\tau)} + \log \frac{\exp(\text{sim}(I_i, T_i)/\tau)}{\sum_{j=1}^{N} \exp(\text{sim}(I_j, T_i)/\tau)} \right]$$

where $\text{sim}(I_i, T_j)$ is the cosine similarity between image $i$ and text $j$, and $\tau$ is a learned temperature.
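A minimal NumPy sketch of this symmetric contrastive (InfoNCE) loss, assuming the embeddings are already L2-normalized, with a fixed temperature in place of the learned one:

```python
import numpy as np

def clip_loss(image_embs: np.ndarray, text_embs: np.ndarray,
              temperature: float = 0.07) -> float:
    """Symmetric contrastive loss over a batch of N matched image-text pairs.

    image_embs, text_embs: (N, d) arrays of L2-normalized embeddings,
    where row i of each array forms a matching pair.
    """
    logits = image_embs @ text_embs.T / temperature  # (N, N) similarity matrix

    def cross_entropy(l: np.ndarray) -> float:
        # Correct pairs sit on the diagonal; softmax cross-entropy per row.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -float(np.mean(np.diag(log_probs)))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Matched batches (high diagonal similarity) give a loss near zero; shuffled pairings drive it up, which is exactly the signal training exploits.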
## Zero-Shot Classification
CLIP enables classification without training on the target dataset:
- Create text prompts: “a photo of a {class}”
- Encode all prompts with text encoder
- Encode image with image encoder
- Predict class with highest image-text similarity
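The steps above can be sketched as follows; `encode_text` is a hypothetical stand-in for CLIP's text encoder, and embeddings are assumed L2-normalized:

```python
import numpy as np

def zero_shot_classify(image_emb: np.ndarray, class_names: list,
                       encode_text) -> str:
    """Pick the class whose prompt embedding is most similar to the image.

    encode_text: stand-in for CLIP's text encoder (prompt string -> embedding).
    """
    # Step 1: build a text prompt per class.
    prompts = [f"a photo of a {name}" for name in class_names]
    # Step 2: encode all prompts.
    text_embs = np.stack([encode_text(p) for p in prompts])  # (C, d)
    # Steps 3-4: compare the image embedding to each prompt and take the best.
    sims = text_embs @ image_emb                             # (C,)
    return class_names[int(np.argmax(sims))]
```

No classifier head is trained; swapping in a new label set only requires encoding new prompts.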
## Interactive Visualization
Explore how CLIP matches images to text descriptions:
*[Interactive demo: CLIP image-text matching. A similarity matrix between a set of images and captions; CLIP learns to maximize the diagonal (matching pairs).]*
## Prompt Engineering
Zero-shot classification accuracy depends heavily on how the prompt is phrased:
| Prompt Template | ImageNet Accuracy |
|---|---|
| "{class}" | 63.5% |
| "a photo of a {class}" | 68.3% |
| "a good photo of a {class}" | 69.1% |
| Ensemble of 80 templates | 76.2% |
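Ensembling works by averaging a class's text embeddings across templates in embedding space, then renormalizing. A sketch, with `encode_text` a hypothetical stand-in for the text encoder and a three-template list in place of the full 80:

```python
import numpy as np

# Illustrative subset; the released CLIP prompt set has 80 templates.
TEMPLATES = ["a photo of a {}.", "a good photo of a {}.", "a photo of the {}."]

def ensemble_class_embedding(class_name: str, encode_text) -> np.ndarray:
    """Average a class's embedding over several prompt templates."""
    embs = np.stack([encode_text(t.format(class_name)) for t in TEMPLATES])
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)  # normalize each
    mean = embs.mean(axis=0)                                   # average in embedding space
    return mean / np.linalg.norm(mean)                         # renormalize
```

The resulting unit vector replaces the single-prompt class embedding in the zero-shot procedure above, at no extra inference cost per image.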
## Why CLIP Works
- Scale: 400M image-text pairs provide diverse supervision
- Natural language: Captures nuanced visual concepts
- Contrastive learning: Efficient use of batch information
- Zero-shot transfer: No task-specific training needed
## Applications
CLIP powers:
- Image search: Find images matching text queries
- DALL-E/Stable Diffusion: Guide image generation
- Open-vocabulary detection: Detect objects by name
- Multimodal models: Visual understanding in LLMs
## Limitations
- Struggles with fine-grained recognition
- Biases from internet data
- Text encoder limits complex reasoning
- Counting and spatial relationships remain challenging