Adam Optimizer
Adaptive learning rates with momentum for deep learning
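The Adam update can be sketched in a few lines of pure Python; hyperparameter values below are the commonly used defaults, shown here only for illustration, and the toy objective f(x) = x² is invented for the example:

```python
def adam_minimize(grad_fn, x, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=300):
    # Minimal single-parameter Adam: momentum (m) plus per-parameter
    # adaptive scaling (v), with bias correction for the zero-initialized EMAs.
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g       # first-moment (momentum) estimate
        v = beta2 * v + (1 - beta2) * g * g   # second-moment (adaptive-lr) estimate
        m_hat = m / (1 - beta1 ** t)          # bias correction
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (v_hat ** 0.5 + eps)
    return x

# Minimize f(x) = x**2, whose gradient is 2x; x should end up near 0.
x_final = adam_minimize(lambda x: 2 * x, 5.0)
```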
AlexNet
The deep CNN that won ImageNet 2012 and sparked the deep learning revolution
The Annotated Transformer
Line-by-line PyTorch implementation of the Transformer architecture
Attention Is All You Need
The 2017 paper that introduced the Transformer architecture
Backpropagation
The algorithm that enables neural networks to learn by computing gradients efficiently
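For a single neuron with squared error, the chain rule behind backpropagation can be worked by hand; the numbers in the sketch below are illustrative:

```python
def forward_backward(w, b, x, t):
    # loss = (w*x + b - t)**2; backprop is just the chain rule:
    # dloss/dw = dloss/dy * dy/dw, computed after the forward pass.
    y = w * x + b                 # forward pass
    loss = (y - t) ** 2
    dloss_dy = 2 * (y - t)        # local gradient of the loss w.r.t. the output
    dloss_dw = dloss_dy * x       # chain rule through y = w*x + b
    dloss_db = dloss_dy * 1.0
    return loss, dloss_dw, dloss_db
```

The same gradients can be checked against finite differences, which is how hand-written backprop is usually debugged.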
Neural Machine Translation by Jointly Learning to Align and Translate
The paper that introduced the attention mechanism for sequence-to-sequence models
BERT: Bidirectional Encoder Representations from Transformers

Pre-training deep bidirectional representations for NLP
Batch Normalization
Normalizing layer inputs to accelerate deep network training
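The core operation is standardizing each feature using statistics computed across the batch; a list-based sketch (omitting the learned scale and shift parameters of the full method):

```python
def batch_norm(batch, eps=1e-5):
    # batch: list of examples, each a list of features. Normalize each
    # feature column to zero mean and unit variance across the batch.
    n, d = len(batch), len(batch[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        col = [row[j] for row in batch]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n
        for i in range(n):
            out[i][j] = (batch[i][j] - mean) / (var + eps) ** 0.5
    return out
```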
Chain-of-Thought Prompting
Eliciting step-by-step reasoning in language models for complex problem solving
CLIP: Contrastive Language-Image Pre-training
Learning visual concepts from natural language supervision
CS231n: CNNs for Visual Recognition
Stanford's foundational course on deep learning for computer vision
Deep Speech 2: End-to-End Speech Recognition
Scaling up end-to-end speech recognition with RNNs and CTC
Diffusion Models
Generative models that learn to denoise, enabling high-quality image and video synthesis
Multi-Scale Context Aggregation by Dilated Convolutions
Expanding receptive fields exponentially without losing resolution or adding parameters
Deep Q-Networks (DQN)
Combining Q-learning with deep neural networks for Atari-level game playing
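DQN builds on tabular Q-learning; the toy chain environment below is invented for illustration, and the lookup table here is exactly what DQN replaces with a neural network:

```python
import random

def q_learning_chain(n=5, episodes=500, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    # Tabular Q-learning on a chain of states 0..n-1 with actions
    # {0: left, 1: right}; reward 1 for reaching the right end.
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(n)]
    for _ in range(episodes):
        s = 0
        while s != n - 1:
            # epsilon-greedy action selection
            if rng.random() < eps:
                a = rng.randrange(2)
            else:
                a = 0 if Q[s][0] >= Q[s][1] else 1
            s2 = max(0, s - 1) if a == 0 else min(n - 1, s + 1)
            r = 1.0 if s2 == n - 1 else 0.0
            # Bellman update toward r + gamma * max_a' Q(s', a')
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q
```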
Dropout: Regularization for Neural Networks
Randomly dropping units during training to prevent overfitting
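The standard "inverted dropout" formulation, sketched here with a fixed seed for reproducibility, zeroes units at training time and rescales survivors so no change is needed at inference:

```python
import random

def dropout(xs, p=0.5, training=True, seed=0):
    # Zero each unit with probability p and scale survivors by 1/(1-p),
    # keeping the expected activation unchanged; identity at inference.
    if not training or p == 0.0:
        return list(xs)
    rng = random.Random(seed)
    return [0.0 if rng.random() < p else x / (1.0 - p) for x in xs]
```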
Generative Adversarial Networks
Two neural networks compete to generate realistic data
GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism
Training giant neural networks by pipelining micro-batches across devices
GPT: Generative Pre-Training
Autoregressive language models that learn to predict the next token
In-Context Learning
How large language models learn from examples in the prompt without weight updates
Layer Normalization
Normalizing each example across its own features, independent of batch size
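Unlike batch normalization, the statistics come from a single example's features, so the operation works for any batch size; a sketch without the learned gain and bias:

```python
def layer_norm(x, eps=1e-5):
    # Normalize one example's feature vector to zero mean, unit variance.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / (var + eps) ** 0.5 for v in x]
```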
Latent Diffusion Models
High-resolution image generation by diffusing in learned latent spaces
Mamba: State Space Models
A sequence model that keeps a running state instead of attending to every token pair
Maximum Likelihood Reinforcement Learning (MaxRL)
A recent idea for training models on pass-fail tasks when sampling matters
Keeping Neural Networks Simple by Minimizing the Description Length of the Weights
Hinton's MDL approach to neural network regularization through noisy weights
Neural Message Passing for Quantum Chemistry
A unified framework for graph neural networks applied to molecular property prediction
Neural Turing Machines
Neural networks augmented with external memory and attention-based read/write heads
NODE (Neural Oblivious Decision Ensembles)
Ensembles of differentiable oblivious decision trees for tabular deep learning
Order Matters: Sequence to Sequence for Sets
How input and output ordering affects seq2seq learning on set-structured data
Pointer Networks
Neural architecture that outputs pointers to input positions, enabling variable-size outputs
Policy Gradient Methods
Directly optimizing policies through gradient ascent on expected returns
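REINFORCE, the simplest policy-gradient method, can be shown on a two-armed bandit (the reward scheme below is invented for illustration: arm 1 pays 1, arm 0 pays 0). Each update ascends reward × ∇ log π(action):

```python
import math, random

def reinforce_bandit(steps=2000, lr=0.1, seed=0):
    # Softmax policy over two logits, updated by the REINFORCE rule.
    rng = random.Random(seed)
    logits = [0.0, 0.0]
    for _ in range(steps):
        z = [math.exp(l) for l in logits]
        probs = [v / sum(z) for v in z]
        a = 0 if rng.random() < probs[0] else 1
        reward = 1.0 if a == 1 else 0.0
        for i in range(2):
            # grad of log softmax: 1[i == a] - probs[i]
            grad_logp = (1.0 if i == a else 0.0) - probs[i]
            logits[i] += lr * reward * grad_logp
    z = [math.exp(l) for l in logits]
    return [v / sum(z) for v in z]
```

The learned policy concentrates on the rewarding arm; full policy-gradient methods add baselines and value estimates to reduce variance.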
Proximal Policy Optimization (PPO)
A stable, sample-efficient policy gradient algorithm for reinforcement learning
Pre-training
The stage where a model learns broad patterns from a very large dataset
Reinforcement Learning
Learning by trial and error through rewards
A Simple Neural Network Module for Relational Reasoning
Relation Networks for learning to reason about object relationships
Identity Mappings in Deep Residual Networks
Pre-activation ResNet design that enables training of 1000+ layer networks
Relational Recurrent Neural Networks
RNNs with relational memory that enables reasoning across time
ResNet
Deep residual learning with skip connections that enabled training of 152+ layer networks
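The skip connection amounts to one line: a block outputs x + f(x), so representing the identity only requires the learned part f to go to zero. A toy sketch:

```python
def residual_block(x, f):
    # x: feature vector; f: the block's learned transformation.
    # The skip connection adds the input back elementwise.
    return [xi + fi for xi, fi in zip(x, f(x))]
```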
RLHF: Reinforcement Learning from Human Feedback
Teaching language models to prefer responses that people rank higher
The Unreasonable Effectiveness of Recurrent Neural Networks
Andrej Karpathy's influential blog post demonstrating RNN capabilities through character-level generation
Recurrent Neural Network Regularization
How to apply dropout to LSTMs without disrupting memory dynamics
Sequence to Sequence Learning
Encoder-decoder architecture for mapping sequences to sequences
Scaling Laws for Neural Language Models
Why bigger models, more data, and more compute lead to predictable gains
Transformer
Self-attention models that process sequences in parallel
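The self-attention at the Transformer's core reduces to softmax(QKᵀ/√d)V; a list-based sketch with illustrative shapes, written for clarity rather than speed:

```python
import math

def attention(Q, K, V):
    # Q, K, V: lists of vectors. Each query attends over all keys;
    # the output is a weighted average of the value vectors.
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        m = max(scores)                      # subtract max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]      # softmax over keys
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```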
Understanding LSTM Networks
Christopher Olah's visual guide to Long Short-Term Memory networks
Variational Autoencoder (VAE)
Probabilistic generative model with structured latent space
Vision Transformer (ViT)
Applying Transformers directly to image patches for visual recognition
Variational Lossy Autoencoder
Understanding VAEs as compression systems with a rate-distortion trade-off
Word2Vec: Word Embeddings
Learning dense vector representations of words from text
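The skip-gram variant trains on (center, context) pairs drawn from a sliding window; generating those pairs (window size illustrative) looks like:

```python
def skipgram_pairs(tokens, window=2):
    # For each position, pair the center token with every neighbor
    # within the window; these pairs are the skip-gram training data.
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs
```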