Adam Optimizer
Adaptive learning rates with momentum for deep learning
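The Adam update can be sketched in a few lines of pure Python; hyperparameter values below are the commonly used defaults, shown here only for illustration, and the toy objective f(x) = x² is invented for the example:

```python
def adam_minimize(grad_fn, x, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=300):
    # Minimal single-parameter Adam: momentum (m) plus per-parameter
    # adaptive scaling (v), with bias correction for the zero-initialized EMAs.
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g       # first-moment (momentum) estimate
        v = beta2 * v + (1 - beta2) * g * g   # second-moment (adaptive-lr) estimate
        m_hat = m / (1 - beta1 ** t)          # bias correction
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (v_hat ** 0.5 + eps)
    return x

# Minimize f(x) = x**2, whose gradient is 2x; x should end up near 0.
x_final = adam_minimize(lambda x: 2 * x, 5.0)
```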
AlexNet
The deep CNN that won ImageNet 2012 and sparked the deep learning revolution
The Annotated Transformer
Line-by-line PyTorch implementation of the Transformer architecture
Attention Is All You Need
The 2017 paper that introduced the Transformer architecture
Backpropagation
The algorithm that enables neural networks to learn by computing gradients efficiently
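For a single neuron with squared error, the chain rule behind backpropagation can be worked by hand; the numbers in the sketch below are illustrative:

```python
def forward_backward(w, b, x, t):
    # loss = (w*x + b - t)**2; backprop is just the chain rule:
    # dloss/dw = dloss/dy * dy/dw, computed after the forward pass.
    y = w * x + b                 # forward pass
    loss = (y - t) ** 2
    dloss_dy = 2 * (y - t)        # local gradient of the loss w.r.t. the output
    dloss_dw = dloss_dy * x       # chain rule through y = w*x + b
    dloss_db = dloss_dy * 1.0
    return loss, dloss_dw, dloss_db
```

The same gradients can be checked against finite differences, which is how hand-written backprop is usually debugged.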
Neural Machine Translation by Jointly Learning to Align and Translate
The paper that introduced the attention mechanism for sequence-to-sequence models
BERT: Bidirectional Encoder Representations from Transformers

Pre-training deep bidirectional representations for NLP
Batch Normalization
Normalizing layer inputs to accelerate deep network training
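The core operation is standardizing each feature using statistics computed across the batch; a list-based sketch (omitting the learned scale and shift parameters of the full method):

```python
def batch_norm(batch, eps=1e-5):
    # batch: list of examples, each a list of features. Normalize each
    # feature column to zero mean and unit variance across the batch.
    n, d = len(batch), len(batch[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        col = [row[j] for row in batch]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n
        for i in range(n):
            out[i][j] = (batch[i][j] - mean) / (var + eps) ** 0.5
    return out
```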
Chain-of-Thought Prompting
Eliciting step-by-step reasoning in language models for complex problem solving
CLIP: Contrastive Language-Image Pre-training
Learning visual concepts from natural language supervision
CS231n: CNNs for Visual Recognition
Stanford's foundational course on deep learning for computer vision
Deep Speech 2: End-to-End Speech Recognition
Scaling up end-to-end speech recognition with RNNs and CTC
Diffusion Models
Generative models that learn to denoise, enabling high-quality image and video synthesis
Multi-Scale Context Aggregation by Dilated Convolutions
Expanding receptive fields exponentially without losing resolution or adding parameters
Deep Q-Networks (DQN)
Combining Q-learning with deep neural networks for Atari-level game playing
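DQN builds on tabular Q-learning; the toy chain environment below is invented for illustration, and the lookup table here is exactly what DQN replaces with a neural network:

```python
import random

def q_learning_chain(n=5, episodes=500, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    # Tabular Q-learning on a chain of states 0..n-1 with actions
    # {0: left, 1: right}; reward 1 for reaching the right end.
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(n)]
    for _ in range(episodes):
        s = 0
        while s != n - 1:
            # epsilon-greedy action selection
            if rng.random() < eps:
                a = rng.randrange(2)
            else:
                a = 0 if Q[s][0] >= Q[s][1] else 1
            s2 = max(0, s - 1) if a == 0 else min(n - 1, s + 1)
            r = 1.0 if s2 == n - 1 else 0.0
            # Bellman update toward r + gamma * max_a' Q(s', a')
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q
```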
Dropout: Regularization for Neural Networks
Randomly dropping units during training to prevent overfitting
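The standard "inverted dropout" formulation, sketched here with a fixed seed for reproducibility, zeroes units at training time and rescales survivors so no change is needed at inference:

```python
import random

def dropout(xs, p=0.5, training=True, seed=0):
    # Zero each unit with probability p and scale survivors by 1/(1-p),
    # keeping the expected activation unchanged; identity at inference.
    if not training or p == 0.0:
        return list(xs)
    rng = random.Random(seed)
    return [0.0 if rng.random() < p else x / (1.0 - p) for x in xs]
```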
Generative Adversarial Networks
Two neural networks compete to generate realistic data
GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism
Training giant neural networks by pipelining micro-batches across devices
GPT: Generative Pre-Training
Autoregressive language models that learn to predict the next token
In-Context Learning
How large language models learn from examples in the prompt without weight updates
Layer Normalization
Normalizing each example across its own features, independent of batch size
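Unlike batch normalization, the statistics come from a single example's features, so the operation works for any batch size; a sketch without the learned gain and bias:

```python
def layer_norm(x, eps=1e-5):
    # Normalize one example's feature vector to zero mean, unit variance.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / (var + eps) ** 0.5 for v in x]
```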
Latent Diffusion Models
High-resolution image generation by diffusing in learned latent spaces
Mamba: State Space Models
A sequence model that keeps a running state instead of attending to every token pair
Maximum Likelihood Reinforcement Learning (MaxRL)
A recent idea for training models on pass-fail tasks when sampling matters
Keeping Neural Networks Simple by Minimizing the Description Length of the Weights
Hinton's MDL approach to neural network regularization through noisy weights
Neural Message Passing for Quantum Chemistry
A unified framework for graph neural networks applied to molecular property prediction
Neural Turing Machines
Neural networks augmented with external memory and attention-based read/write heads
NODE (Neural Oblivious Decision Ensembles)
Ensembles of differentiable oblivious decision trees for tabular deep learning
Order Matters: Sequence to Sequence for Sets
How input and output ordering affects seq2seq learning on set-structured data
Pointer Networks
Neural architecture that outputs pointers to input positions, enabling variable-size outputs
Policy Gradient Methods
Directly optimizing policies through gradient ascent on expected returns
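REINFORCE, the simplest policy-gradient method, can be shown on a two-armed bandit (the reward scheme below is invented for illustration: arm 1 pays 1, arm 0 pays 0). Each update ascends reward × ∇ log π(action):

```python
import math, random

def reinforce_bandit(steps=2000, lr=0.1, seed=0):
    # Softmax policy over two logits, updated by the REINFORCE rule.
    rng = random.Random(seed)
    logits = [0.0, 0.0]
    for _ in range(steps):
        z = [math.exp(l) for l in logits]
        probs = [v / sum(z) for v in z]
        a = 0 if rng.random() < probs[0] else 1
        reward = 1.0 if a == 1 else 0.0
        for i in range(2):
            # grad of log softmax: 1[i == a] - probs[i]
            grad_logp = (1.0 if i == a else 0.0) - probs[i]
            logits[i] += lr * reward * grad_logp
    z = [math.exp(l) for l in logits]
    return [v / sum(z) for v in z]
```

The learned policy concentrates on the rewarding arm; full policy-gradient methods add baselines and value estimates to reduce variance.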
Proximal Policy Optimization (PPO)
A stable, sample-efficient policy gradient algorithm for reinforcement learning
Pre-training
The stage where a model learns broad patterns from a very large dataset
Reinforcement Learning
Learning by trial and error through rewards
A Simple Neural Network Module for Relational Reasoning
Relation Networks for learning to reason about object relationships
Identity Mappings in Deep Residual Networks
Pre-activation ResNet design that enables training of 1000+ layer networks
Relational Recurrent Neural Networks
RNNs with relational memory that enables reasoning across time
ResNet
Deep residual learning with skip connections that enabled training of 152+ layer networks
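The skip connection amounts to one line: a block outputs x + f(x), so representing the identity only requires the learned part f to go to zero. A toy sketch:

```python
def residual_block(x, f):
    # x: feature vector; f: the block's learned transformation.
    # The skip connection adds the input back elementwise.
    return [xi + fi for xi, fi in zip(x, f(x))]
```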
RLHF: Reinforcement Learning from Human Feedback
Teaching language models to prefer responses that people rank higher
The Unreasonable Effectiveness of Recurrent Neural Networks
Andrej Karpathy's influential blog post demonstrating RNN capabilities through character-level generation
Recurrent Neural Network Regularization
How to apply dropout to LSTMs without disrupting memory dynamics
Sequence to Sequence Learning
Encoder-decoder architecture for mapping sequences to sequences
Scaling Laws for Neural Language Models
Why bigger models, more data, and more compute lead to predictable gains
Transformer
Self-attention models that process sequences in parallel
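The self-attention at the Transformer's core reduces to softmax(QKᵀ/√d)V; a list-based sketch with illustrative shapes, written for clarity rather than speed:

```python
import math

def attention(Q, K, V):
    # Q, K, V: lists of vectors. Each query attends over all keys;
    # the output is a weighted average of the value vectors.
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        m = max(scores)                      # subtract max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]      # softmax over keys
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```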
Understanding LSTM Networks
Christopher Olah's visual guide to Long Short-Term Memory networks
Variational Autoencoder (VAE)
Probabilistic generative model with structured latent space
Vision Transformer (ViT)
Applying Transformers directly to image patches for visual recognition
Variational Lossy Autoencoder
Understanding VAEs as compression systems with a rate-distortion trade-off
Word2Vec: Word Embeddings
Learning dense vector representations of words from text
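The skip-gram variant trains on (center, context) pairs drawn from a sliding window; generating those pairs (window size illustrative) looks like:

```python
def skipgram_pairs(tokens, window=2):
    # For each position, pair the center token with every neighbor
    # within the window; these pairs are the skip-gram training data.
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs
```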