AlexNet
The deep CNN that won the ImageNet 2012 challenge (ILSVRC) and sparked the deep learning revolution
CLIP: Contrastive Language-Image Pre-training
Learning visual concepts from natural language supervision
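At its core, CLIP trains image and text encoders so that matching image/caption pairs have high cosine similarity. A minimal NumPy sketch of the symmetric contrastive objective, assuming batch-aligned image/text embeddings (the `temperature` value and helper names are illustrative, not CLIP's actual API):

```python
import numpy as np

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss: matching pairs sit on the diagonal
    of the cosine-similarity matrix and are pulled together."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (N, N) pairwise similarities
    labels = np.arange(len(logits))             # i-th image matches i-th text

    def xent(l):                                # row-wise cross-entropy
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average the image-to-text and text-to-image directions
    return (xent(logits) + xent(logits.T)) / 2

# perfectly aligned embeddings give near-zero loss
aligned = clip_loss(np.eye(4), np.eye(4))
```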
CS231n: CNNs for Visual Recognition
Stanford's foundational course on deep learning for computer vision
Diffusion Models
Generative models that learn to denoise, enabling high-quality image and video synthesis
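The forward (noising) half of a diffusion model has a closed form, which is what makes training tractable: any timestep can be sampled directly from the clean data. A minimal sketch of the DDPM forward process, assuming a linear beta schedule (a learned network would then be trained to reverse this):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)       # linear noise schedule
alphas_bar = np.cumprod(1.0 - betas)     # cumulative signal retention

def q_sample(x0, t, rng=np.random.default_rng(0)):
    """Sample x_t ~ q(x_t | x_0) in closed form: scaled data plus Gaussian noise."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

x0 = np.ones((4, 4))             # toy "image"
x_noisy = q_sample(x0, t=T - 1)  # near-pure Gaussian noise at the final step
```

By the last timestep almost no signal remains, so generation can start from pure noise and denoise step by step.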
Multi-Scale Context Aggregation by Dilated Convolutions
Expanding receptive fields exponentially without losing resolution or adding parameters
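The exponential growth is simple receptive-field arithmetic: each 3x3 layer with dilation d widens the receptive field by 2d, so doubling the dilation per layer (the schedule used in the paper) compounds geometrically while parameters per layer stay fixed. A small sketch of that arithmetic:

```python
def receptive_field(dilations, kernel=3):
    """Receptive field of stacked stride-1 convolutions with the given dilations."""
    rf = 1
    for d in dilations:
        rf += (kernel - 1) * d   # each layer adds (k-1)*dilation
    return rf

print(receptive_field([1]))               # 3
print(receptive_field([1, 2, 4]))         # 15
print(receptive_field([1, 2, 4, 8, 16]))  # 63
```

Five layers reach a 63-pixel receptive field; the same depth with undilated 3x3 convolutions would reach only 11.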
Generative Adversarial Networks
Two neural networks compete to generate realistic data
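The competition is a minimax game: the discriminator D maximizes log D(x) + log(1 - D(G(z))), while the generator minimizes it (in practice via the non-saturating variant, -log D(G(z)), from the original paper). A NumPy sketch of just the two losses, with the networks abstracted away as probability outputs:

```python
import numpy as np

def d_loss(d_real, d_fake):
    """Binary cross-entropy the discriminator minimizes:
    push D(x) toward 1 on real data and D(G(z)) toward 0 on fakes."""
    return -(np.log(d_real) + np.log(1.0 - d_fake)).mean()

def g_loss(d_fake):
    """Non-saturating generator loss: push D(G(z)) toward 1."""
    return -np.log(d_fake).mean()

# a confident, correct discriminator drives d_loss toward 0...
confident_d = d_loss(np.array([0.99]), np.array([0.01]))
# ...while a generator that fools it drives g_loss toward 0
fooled = g_loss(np.array([0.99]))
```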
Latent Diffusion Models
High-resolution image generation by diffusing in learned latent spaces
Identity Mappings in Deep Residual Networks
Pre-activation ResNet design that enables training of 1000+ layer networks
ResNet
Deep residual learning with skip connections that enabled training of 152+ layer networks
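A residual block computes y = F(x) + x rather than y = F(x): the identity shortcut gives gradients an unimpeded path through the network, which is what makes very deep stacks trainable. A toy NumPy sketch with a two-layer F (the shapes and ReLU placement here follow the original post-activation design, simplified):

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def residual_block(x, w1, w2):
    """y = relu(F(x) + x), where F is a toy two-layer transform.
    The '+ x' identity shortcut is the key ResNet ingredient."""
    out = relu(x @ w1)
    out = out @ w2
    return relu(out + x)   # skip connection

x = np.ones(8)
w1 = np.zeros((8, 8))
w2 = np.zeros((8, 8))
y = residual_block(x, w1, w2)  # with F == 0 the block reduces to the identity
```

With the residual branch zeroed out the block passes x through unchanged, so a deep stack can start near the identity and learn only the needed corrections.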
Vision Transformer (ViT)
Applying Transformers directly to image patches for visual recognition
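ViT's first step is to cut the image into non-overlapping patches and flatten each one into a token. A sketch of that patchify step, assuming an image whose sides are divisible by the patch size (16, as in the base ViT config):

```python
import numpy as np

def patchify(img, patch=16):
    """Split an (H, W, C) image into non-overlapping patch tokens,
    each flattened to a vector of length patch * patch * C."""
    H, W, C = img.shape
    gh, gw = H // patch, W // patch
    patches = img.reshape(gh, patch, gw, patch, C)
    patches = patches.transpose(0, 2, 1, 3, 4)   # group by patch grid position
    return patches.reshape(gh * gw, patch * patch * C)

img = np.zeros((224, 224, 3))
tokens = patchify(img)   # 14*14 = 196 tokens, each 16*16*3 = 768-dimensional
print(tokens.shape)      # (196, 768)
```

These tokens (plus a learned class token and position embeddings) are then fed to a standard Transformer encoder.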