AlexNet
The deep CNN that won ImageNet 2012 and sparked the deep learning revolution
The Annotated Transformer
Line-by-line PyTorch implementation of the Transformer architecture
Neural Machine Translation by Jointly Learning to Align and Translate
The paper that introduced the attention mechanism for sequence-to-sequence models
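A minimal PyTorch sketch of the additive (Bahdanau-style) attention score the paper introduced, score(s, h_j) = vᵀ tanh(W_s s + W_h h_j); the class name and dimensions are illustrative, not from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """Bahdanau-style additive attention: score(s, h_j) = v^T tanh(W_s s + W_h h_j)."""
    def __init__(self, dec_dim, enc_dim, attn_dim):
        super().__init__()
        self.W_s = nn.Linear(dec_dim, attn_dim, bias=False)
        self.W_h = nn.Linear(enc_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, dec_state, enc_outputs):
        # dec_state: (batch, dec_dim); enc_outputs: (batch, src_len, enc_dim)
        scores = self.v(torch.tanh(self.W_s(dec_state).unsqueeze(1) + self.W_h(enc_outputs)))
        weights = F.softmax(scores.squeeze(-1), dim=-1)           # (batch, src_len)
        context = torch.bmm(weights.unsqueeze(1), enc_outputs)    # weighted sum of encoder states
        return context.squeeze(1), weights

attn = AdditiveAttention(dec_dim=128, enc_dim=256, attn_dim=64)
ctx, w = attn(torch.randn(2, 128), torch.randn(2, 7, 256))
```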
CS231n: CNNs for Visual Recognition
Stanford's foundational course on deep learning for computer vision
Deep Speech 2: End-to-End Speech Recognition
Scaling up end-to-end speech recognition with RNNs and CTC
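A hedged sketch of the CTC objective such end-to-end models optimize, using PyTorch's built-in nn.CTCLoss; the time steps, batch size, and 29-character alphabet below are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

T, N, C, S = 50, 4, 29, 12      # time frames, batch, characters (incl. blank), target length
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)   # acoustic model outputs, e.g. from an RNN
targets = torch.randint(1, C, (N, S))                   # label indices; 0 is reserved for blank
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)       # marginalizes over all alignments of the targets to the T frames
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```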
Multi-Scale Context Aggregation by Dilated Convolutions
Expanding the receptive field exponentially with depth without losing resolution or adding parameters
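A small sketch, under assumed channel counts, of a dilated 3x3 convolution stack with dilations 1, 2, 4, 8: spatial resolution is preserved while the receptive field grows roughly exponentially with depth.

```python
import torch
import torch.nn as nn

# 3x3 convolutions with dilation 1, 2, 4, 8: padding = dilation keeps the spatial size
# fixed while each layer widens the receptive field by 2 * dilation.
layers, rf = [], 1
for d in (1, 2, 4, 8):
    layers += [nn.Conv2d(16, 16, kernel_size=3, padding=d, dilation=d), nn.ReLU()]
    rf += 2 * d
context = nn.Sequential(*layers)

x = torch.randn(1, 16, 64, 64)
print(context(x).shape, "receptive field ≈", rf)   # same 64x64 resolution, RF ≈ 31
```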
GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism
Training giant neural networks by pipelining micro-batches across devices
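Not GPipe itself, but a minimal single-device sketch of the micro-batch splitting and gradient accumulation that underlie its pipeline schedule; the cross-device pipelining and activation re-materialization described in the paper are omitted, and the two "stages" are ordinary modules chosen for illustration.

```python
import torch
import torch.nn as nn

stage1 = nn.Sequential(nn.Linear(32, 64), nn.ReLU())   # would live on device 0
stage2 = nn.Linear(64, 10)                              # would live on device 1
opt = torch.optim.SGD(list(stage1.parameters()) + list(stage2.parameters()), lr=0.1)

x, y = torch.randn(16, 32), torch.randint(0, 10, (16,))
opt.zero_grad()
for mx, my in zip(x.chunk(4), y.chunk(4)):              # split the mini-batch into 4 micro-batches
    loss = nn.functional.cross_entropy(stage2(stage1(mx)), my) / 4
    loss.backward()                                      # gradients accumulate across micro-batches
opt.step()                                               # one synchronous update for the whole mini-batch
```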
Keeping Neural Networks Simple by Minimizing the Description Length of the Weights
Hinton's MDL approach to neural network regularization through noisy weights
Neural Message Passing for Quantum Chemistry
A unified framework for graph neural networks applied to molecular property prediction
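A hedged sketch of one message-passing round in the MPNN framework, m_v = Σ_{w∈N(v)} M(h_w), h_v ← U(h_v, m_v); the GRU update mirrors the paper's MPNN instance, but edge features and the readout phase are omitted, and the dense adjacency matrix is an illustrative choice.

```python
import torch
import torch.nn as nn

class MessagePassingStep(nn.Module):
    """One round: aggregate messages from neighbors, then update each node state."""
    def __init__(self, dim):
        super().__init__()
        self.message = nn.Linear(dim, dim)
        self.update = nn.GRUCell(dim, dim)

    def forward(self, h, adj):
        # h: (num_nodes, dim); adj: (num_nodes, num_nodes) 0/1 adjacency matrix
        m = adj @ self.message(h)          # sum of messages from neighboring nodes
        return self.update(m, h)           # GRU-style node update

h = torch.randn(5, 32)
adj = (torch.rand(5, 5) > 0.5).float()
h = MessagePassingStep(32)(h, adj)
```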
Neural Turing Machines
Neural networks augmented with external memory and attention-based read/write heads
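A sketch of the content-based addressing behind an NTM read head: the controller emits a key, similarity to each memory slot gives attention weights, and the read vector is their weighted sum. Shift-based addressing, sharpening, and the write head are omitted; sizes are arbitrary.

```python
import torch
import torch.nn.functional as F

def content_read(memory, key, beta):
    """memory: (N, M) slots; key: (M,) emitted by the controller; beta: key strength."""
    sim = F.cosine_similarity(memory, key.unsqueeze(0), dim=-1)   # (N,) similarity per slot
    w = F.softmax(beta * sim, dim=-1)                             # attention over memory slots
    return w @ memory, w                                          # read vector r = sum_i w_i * M_i

memory = torch.randn(128, 20)
read_vec, weights = content_read(memory, key=torch.randn(20), beta=torch.tensor(5.0))
```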
NODE (Neural Oblivious Decision Ensembles)
Ensembles of differentiable oblivious decision trees for deep learning on tabular data
Order Matters: Sequence to Sequence for Sets
How input and output ordering affects seq2seq learning on set-structured data
Pointer Networks
Neural architecture whose outputs are pointers to input positions, allowing output dictionaries that vary with the input length
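A hedged sketch of the pointer mechanism: attention scores over the encoder states are used directly as the output distribution, so the model "points" at input positions rather than emitting tokens from a fixed vocabulary. Names and dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointerHead(nn.Module):
    """Returns a distribution over input positions instead of over a fixed vocabulary."""
    def __init__(self, dim):
        super().__init__()
        self.W_enc = nn.Linear(dim, dim, bias=False)
        self.W_dec = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, 1, bias=False)

    def forward(self, dec_state, enc_states):
        # dec_state: (batch, dim); enc_states: (batch, n_inputs, dim)
        scores = self.v(torch.tanh(self.W_enc(enc_states) + self.W_dec(dec_state).unsqueeze(1)))
        return F.log_softmax(scores.squeeze(-1), dim=-1)   # (batch, n_inputs): the "pointer"

head = PointerHead(64)
log_p = head(torch.randn(2, 64), torch.randn(2, 9, 64))    # distribution over 9 input positions
```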
Pre-training
The initial phase of training foundation models on vast amounts of data
A Simple Neural Network Module for Relational Reasoning
Relation Networks for learning to reason about object relationships
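A minimal sketch of the Relation Network composite function, RN(O) = f_φ(Σ_{i,j} g_θ(o_i, o_j)); the MLP sizes and object count are assumptions, and the question conditioning used in the paper is left out.

```python
import torch
import torch.nn as nn

class RelationNetwork(nn.Module):
    """RN(O) = f_phi( sum over all object pairs of g_theta(o_i, o_j) )."""
    def __init__(self, obj_dim, hidden, out_dim):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(2 * obj_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.f = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))

    def forward(self, objects):
        # objects: (batch, n, obj_dim) -> all ordered pairs (o_i, o_j)
        b, n, d = objects.shape
        oi = objects.unsqueeze(2).expand(b, n, n, d)
        oj = objects.unsqueeze(1).expand(b, n, n, d)
        pairs = torch.cat([oi, oj], dim=-1).reshape(b, n * n, 2 * d)
        return self.f(self.g(pairs).sum(dim=1))   # sum over pairs, then reason with f

rn = RelationNetwork(obj_dim=32, hidden=128, out_dim=10)
out = rn(torch.randn(4, 6, 32))                    # 6 "objects" per example
```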
Relational Recurrent Neural Networks
RNNs with relational memory that enables reasoning across time
Identity Mappings in Deep Residual Networks
Pre-activation ResNet design that enables training of 1000+ layer networks
ResNet
Deep residual learning with skip connections that enabled training of 152+ layer networks
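A small sketch covering both residual-network entries above: the basic block computes F(x) + x through an identity skip connection, so each block only has to learn a residual. The channel count is illustrative, and a comment notes how the pre-activation variant reorders BN and ReLU.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    """Original (post-activation) residual block: out = relu(F(x) + x)."""
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(ch)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(ch)

    def forward(self, x):
        out = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
        return F.relu(out + x)          # identity skip connection

# The pre-activation variant ("Identity Mappings") instead computes
# x + conv2(relu(bn2(conv1(relu(bn1(x)))))), keeping the skip path completely clean.
block = BasicBlock(64)
y = block(torch.randn(1, 64, 32, 32))   # same shape in and out
```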
The Unreasonable Effectiveness of Recurrent Neural Networks
Andrej Karpathy's influential blog post demonstrating RNN capabilities through character-level generation
Recurrent Neural Network Regularization
How to apply dropout to LSTMs without disrupting memory dynamics
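A sketch of the paper's recipe: dropout is applied only to the non-recurrent, layer-to-layer connections, never to the hidden-to-hidden recurrence, so the LSTM's memory is not perturbed across time. PyTorch's dropout argument on nn.LSTM applies between stacked layers in exactly this spirit; the sizes here are illustrative.

```python
import torch
import torch.nn as nn

# Dropout acts on the connections *between* stacked LSTM layers (and on the final output),
# but not on the recurrent state, so the memory cell carries information undisturbed.
lstm = nn.LSTM(input_size=128, hidden_size=256, num_layers=2, dropout=0.5, batch_first=True)
head_dropout = nn.Dropout(0.5)
head = nn.Linear(256, 10000)

x = torch.randn(8, 35, 128)              # (batch, time, features)
output, (h_n, c_n) = lstm(x)
logits = head(head_dropout(output))      # dropout again before the softmax layer
```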
Scaling Laws for Neural Language Models
Empirical laws governing how language model performance scales with compute, data, and parameters
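A hedged illustration of the paper's parameter scaling law, L(N) ≈ (N_c / N)^{α_N}; the constants are the approximate fits reported by Kaplan et al. and are used here only to show the shape of the curve.

```python
# Approximate parameter-scaling law from Kaplan et al. (2020): L(N) ~ (N_c / N)^alpha_N
N_c, alpha_N = 8.8e13, 0.076       # reported fit constants (approximate)

def loss_from_params(n_params):
    """Predicted test loss (nats/token) when data and compute are not the bottleneck."""
    return (N_c / n_params) ** alpha_N

for n in (1e8, 1e9, 1e10, 1e11):
    print(f"{n:.0e} params -> predicted loss ≈ {loss_from_params(n):.2f}")
```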
Attention Is All You Need
The Transformer architecture that replaced recurrence with self-attention
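A minimal sketch of scaled dot-product attention, the Transformer's core operation: Attention(Q, K, V) = softmax(QKᵀ/√d_k) V. Multi-head projections, positional encodings, and the rest of the architecture are omitted.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, heads, seq_len, d_k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(2, 8, 10, 64)
out = scaled_dot_product_attention(q, k, v)      # (2, 8, 10, 64)
```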
Understanding LSTM Networks
Christopher Olah's visual guide to Long Short-Term Memory networks
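A sketch of the gate equations the post walks through, written as a single LSTM step; PyTorch's nn.LSTMCell implements the same arithmetic, and the weight shapes here follow its convention.

```python
import torch

def lstm_step(x, h, c, W_x, W_h, b):
    """One LSTM step. W_x: (4*hidden, input), W_h: (4*hidden, hidden), b: (4*hidden,)."""
    gates = x @ W_x.T + h @ W_h.T + b
    i, f, g, o = gates.chunk(4, dim=-1)
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)   # input, forget, output gates
    g = torch.tanh(g)                                                # candidate cell update
    c_new = f * c + i * g                                            # cell state: gated memory
    h_new = o * torch.tanh(c_new)                                    # hidden state passed onward
    return h_new, c_new

hid, inp = 32, 16
h = c = torch.zeros(1, hid)
h, c = lstm_step(torch.randn(1, inp), h, c,
                 torch.randn(4 * hid, inp), torch.randn(4 * hid, hid), torch.zeros(4 * hid))
```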
Variational Autoencoder (VAE)
Probabilistic generative model that learns a structured latent space by maximizing a variational lower bound (ELBO)
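A minimal VAE sketch: the encoder outputs the mean and log-variance of q(z|x), the reparameterization trick makes sampling differentiable, and the loss is reconstruction error plus a KL term. The layer sizes and Bernoulli likelihood are illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=20, hidden=400):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(), nn.Linear(hidden, 2 * z_dim))
        self.dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(), nn.Linear(hidden, x_dim))

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)    # reparameterization trick
        return self.dec(z), mu, logvar

def elbo_loss(x, x_logits, mu, logvar):
    recon = F.binary_cross_entropy_with_logits(x_logits, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())   # KL(q(z|x) || N(0, I))
    return recon + kl

x = torch.rand(8, 784)
x_logits, mu, logvar = VAE()(x)
loss = elbo_loss(x, x_logits, mu, logvar)
```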
Variational Lossy Autoencoder
Connecting VAEs to lossy compression and the bits-back coding argument