AlexNet
The deep CNN that won ImageNet 2012 and sparked the deep learning revolution
The Annotated Transformer
Line-by-line PyTorch implementation of the Transformer architecture
Neural Machine Translation by Jointly Learning to Align and Translate
The paper that introduced the attention mechanism for sequence-to-sequence models
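A minimal numpy sketch of the additive (Bahdanau-style) attention step; the dimensions and randomly initialized weights are illustrative stand-ins for the learned parameters, not taken from the paper.

```python
import numpy as np

def additive_attention(decoder_state, encoder_states, W1, W2, v):
    """Additive attention (illustrative shapes).

    decoder_state:  (d,)    current decoder hidden state
    encoder_states: (T, d)  encoder hidden states for T source positions
    W1, W2:         (d, d)  learned projections (random stand-ins here)
    v:              (d,)    learned scoring vector
    """
    # Score every source position against the current decoder state.
    scores = np.tanh(encoder_states @ W1 + decoder_state @ W2) @ v   # (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over source positions
    context = weights @ encoder_states        # weighted sum of encoder states
    return context, weights

rng = np.random.default_rng(0)
d, T = 8, 5
ctx, w = additive_attention(rng.normal(size=d), rng.normal(size=(T, d)),
                            rng.normal(size=(d, d)), rng.normal(size=(d, d)),
                            rng.normal(size=d))
print(w.round(3), ctx.shape)   # alignment weights and the (d,) context vector
```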
Quantifying the Rise and Fall of Complexity in Closed Systems
The Coffee Automaton paper formalizing how complexity peaks then declines
The First Law of Complexodynamics
Why complexity rises then falls while entropy only increases
CS231n: Convolutional Neural Networks for Visual Recognition
Stanford's foundational course on deep learning for computer vision

Deep Speech 2: End-to-End Speech Recognition
Scaling up end-to-end speech recognition with RNNs and CTC
Multi-Scale Context Aggregation by Dilated Convolutions
Expanding receptive fields exponentially with depth without losing resolution or coverage
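A small PyTorch sketch (channel count and depth are made up) showing how stacked 3-tap convolutions with dilations 1, 2, 4, 8 preserve the sequence length while the receptive field grows.

```python
import torch
import torch.nn as nn

# Stack of 3-tap 1-D convolutions with exponentially increasing dilation.
# padding == dilation keeps the sequence length unchanged at every layer.
dilations = [1, 2, 4, 8]
layers = nn.Sequential(*[
    nn.Conv1d(16, 16, kernel_size=3, padding=d, dilation=d) for d in dilations
])

x = torch.randn(1, 16, 100)      # (batch, channels, length)
print(layers(x).shape)           # torch.Size([1, 16, 100]) -- resolution preserved

# Receptive field of the stack: 1 + sum over layers of dilation * (kernel_size - 1)
rf = 1 + sum(d * 2 for d in dilations)
print("receptive field:", rf)    # 31 positions from only 4 layers
```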
GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism
Training giant neural networks by pipelining micro-batches across devices
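A toy, framework-free simulation of the pipeline schedule (stage and micro-batch counts are arbitrary): splitting the mini-batch into micro-batches lets later stages start working before earlier micro-batches finish, shrinking the idle "bubble".

```python
# Toy GPipe-style forward schedule: S pipeline stages, M micro-batches.
# At time step t, stage s processes micro-batch m = t - s (if it exists),
# so the stages overlap on different micro-batches instead of waiting.
S, M = 4, 8
for t in range(S + M - 1):
    active = [f"stage{s}:mb{t - s}" for s in range(S) if 0 <= t - s < M]
    print(f"t={t:2d}  " + "  ".join(active))

# Fraction of stage-time spent idle while the pipeline fills and drains:
bubble = (S - 1) / (S + M - 1)
print(f"pipeline bubble fraction ~ {bubble:.2f}")   # shrinks as M grows
```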
Kolmogorov Complexity and Algorithmic Randomness
The mathematical foundation for measuring information content and randomness
Machine Super Intelligence
Shane Legg's PhD thesis formalizing a measure of universal intelligence, building on Hutter's AIXI agent
A Tutorial Introduction to the Minimum Description Length Principle
Grünwald's comprehensive guide to MDL for model selection and learning
Keeping Neural Networks Simple by Minimizing the Description Length of the Weights
Hinton's MDL approach to neural network regularization through noisy weights
Neural Message Passing for Quantum Chemistry
A unified framework for graph neural networks applied to molecular property prediction
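A bare-bones numpy sketch of one message-passing step on a toy graph; the linear message and update functions are stand-ins for the learned networks in the framework.

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, d = 4, 3
h = rng.normal(size=(n_nodes, d))            # node features
A = np.array([[0, 1, 1, 0],                  # adjacency of a toy undirected graph
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
W_msg, W_upd = rng.normal(size=(d, d)), rng.normal(size=(d, d))

# One message-passing step:
messages = A @ (h @ W_msg)                   # each node sums messages from its neighbors
h = np.tanh(h @ W_upd + messages)            # update node states with the aggregate

# Graph-level readout, e.g. for predicting a molecular property:
graph_embedding = h.sum(axis=0)
print(graph_embedding)
```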
Neural Turing Machines
Neural networks augmented with external memory and attention-based read/write heads
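A minimal numpy sketch of content-based addressing and a soft read from external memory (memory size, key, and sharpening constant are illustrative; the paper's write heads and location-based addressing are omitted).

```python
import numpy as np

def content_read(memory, key, beta=5.0):
    """Soft read: cosine-similarity addressing followed by a weighted sum.

    memory: (N, M) matrix of N memory slots
    key:    (M,)   query emitted by the controller
    beta:   sharpening strength for the address distribution
    """
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    w = np.exp(beta * sims)
    w /= w.sum()                      # differentiable attention over memory slots
    return w @ memory, w              # read vector and read weights

rng = np.random.default_rng(0)
M = rng.normal(size=(8, 4))           # 8 slots of width 4
read, weights = content_read(M, M[3] + 0.05 * rng.normal(size=4))
print(weights.round(2))               # mass concentrates on the matching slot
```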
Order Matters: Sequence to Sequence for Sets
How input and output ordering affects seq2seq learning on set-structured data
Pointer Networks
Neural architecture that outputs pointers to input positions, enabling variable-size outputs
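A small numpy sketch of the pointing step: the same additive scoring as above, but the softmax is taken over input positions rather than a fixed vocabulary, so the output space grows with the input.

```python
import numpy as np

def point(decoder_state, encoder_states, W1, W2, v):
    """Return a distribution over input positions (the 'pointer')."""
    scores = np.tanh(encoder_states @ W1 + decoder_state @ W2) @ v   # one logit per input
    probs = np.exp(scores - scores.max())
    return probs / probs.sum()

rng = np.random.default_rng(0)
d, n_inputs = 8, 6
p = point(rng.normal(size=d), rng.normal(size=(n_inputs, d)),
          rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=d))
print(p.round(3), "-> points at input", int(p.argmax()))
```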
A Simple Neural Network Module for Relational Reasoning
Relation Networks for learning to reason about object relationships
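A minimal numpy sketch of the Relation Network form RN(O) = f(Σ over pairs of g(o_i, o_j)); the single weight matrices are stand-ins for the MLPs g and f in the paper.

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(0)
n_objects, d, d_rel = 5, 4, 6
objects = rng.normal(size=(n_objects, d))
W_g = rng.normal(size=(2 * d, d_rel))     # stand-in for the relation MLP g
W_f = rng.normal(size=(d_rel, 1))         # stand-in for the readout MLP f

# g is applied to every ordered pair of objects, then the results are summed:
pair_sum = sum(np.tanh(np.concatenate([objects[i], objects[j]]) @ W_g)
               for i, j in permutations(range(n_objects), 2))
answer = pair_sum @ W_f                   # f maps the aggregated relations to an output
print(answer)
```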
Relational Recurrent Neural Networks
RNNs with relational memory that enables reasoning across time
Identity Mappings in Deep Residual Networks
Pre-activation ResNet design that enables training of 1000+ layer networks
ResNet
Deep residual learning with skip connections that enabled training of 152+ layer networks
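A minimal PyTorch sketch of the residual connection (channel counts are arbitrary): the block learns a residual F(x) and the skip connection adds it back to x, giving gradients an identity path through very deep stacks. The "Identity Mappings" entry above moves BatchNorm and ReLU before each convolution so the skip path stays a pure identity.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic post-activation residual block: y = relu(x + F(x))."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        residual = self.bn2(self.conv2(torch.relu(self.bn1(self.conv1(x)))))
        return torch.relu(x + residual)    # identity skip connection

x = torch.randn(2, 16, 32, 32)
print(ResidualBlock(16)(x).shape)          # shape preserved, so blocks stack freely
```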
The Unreasonable Effectiveness of Recurrent Neural Networks
Andrej Karpathy's influential blog post demonstrating RNN capabilities through character-level generation
Recurrent Neural Network Regularization
How to apply dropout to LSTMs without disrupting memory dynamics
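A small PyTorch sketch of the recipe (layer sizes are illustrative): dropout is applied to the connections between stacked LSTM layers, never to the recurrent hidden-to-hidden connections that carry memory through time.

```python
import torch
import torch.nn as nn

layer1 = nn.LSTMCell(32, 64)
layer2 = nn.LSTMCell(64, 64)
drop = nn.Dropout(p=0.5)          # applied only on the layer-to-layer ("vertical") path

x = torch.randn(10, 8, 32)        # (time, batch, features)
h1 = c1 = torch.zeros(8, 64)
h2 = c2 = torch.zeros(8, 64)
for t in range(x.size(0)):
    h1, c1 = layer1(x[t], (h1, c1))        # recurrent state carried over undropped
    h2, c2 = layer2(drop(h1), (h2, c2))    # dropout only on the input to the next layer
print(h2.shape)
```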
Scaling Laws for Neural Language Models
Empirical laws governing how language model performance scales with compute, data, and parameters
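A toy numpy illustration of the power-law form the paper fits for loss versus parameter count, L(N) = (N_c / N)^α_N; the constants below are only of the order of the published fits and are used purely for illustration.

```python
import numpy as np

def loss_vs_params(n_params, n_c=8.8e13, alpha_n=0.076):
    """Power-law fit L(N) = (N_c / N)^alpha_N.
    Constants are roughly the order reported in the paper; treat them as illustrative."""
    return (n_c / n_params) ** alpha_n

for n in [1e6, 1e8, 1e10, 1e12]:
    print(f"N = {n:.0e}  ->  predicted loss ~ {loss_vs_params(n):.2f}")
# Each fixed multiplicative increase in parameters removes a roughly constant
# fraction of the loss, which is what a power law on a log-log plot looks like.
```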
Attention Is All You Need
The Transformer architecture that replaced recurrence with self-attention
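A compact numpy sketch of single-head scaled dot-product self-attention, softmax(QKᵀ/√d_k)V; the random projection matrices are stand-ins for learned weights, and multi-head projection, masking, and the rest of the Transformer block are omitted.

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence x of shape (T, d)."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (T, T) pairwise compatibilities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over key positions
    return weights @ V                               # every position mixes all others

rng = np.random.default_rng(0)
T, d = 6, 8
x = rng.normal(size=(T, d))
out = self_attention(x, *(rng.normal(size=(d, d)) for _ in range(3)))
print(out.shape)   # (6, 8): same length as the input, with no recurrence involved
```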
Understanding LSTM Networks
Christopher Olah's visual guide to Long Short-Term Memory networks
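A minimal numpy version of the gate equations the post walks through; weight shapes and the single fused weight matrix are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    """One LSTM step. x: (d_in,), h and c: (d,), W: (d_in + d, 4*d), b: (4*d,)."""
    z = np.concatenate([x, h]) @ W + b
    i, f, o, g = np.split(z, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    c = f * c + i * g            # forget part of the old memory, write the new candidate
    h = o * np.tanh(c)           # expose a gated view of the cell state
    return h, c

rng = np.random.default_rng(0)
d_in, d = 3, 5
h, c = np.zeros(d), np.zeros(d)
W, b = rng.normal(size=(d_in + d, 4 * d)), np.zeros(4 * d)
for t in range(4):
    h, c = lstm_step(rng.normal(size=d_in), h, c, W, b)
print(h.round(3))
```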
Variational Autoencoder (VAE)
Probabilistic generative model that learns a structured latent space by maximizing a variational lower bound on the data likelihood
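A short numpy sketch of the reparameterization trick and the two terms of the variational lower bound; the fixed encoder outputs stand in for a neural network's predictions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Encoder outputs for one datapoint (stand-ins for a neural net's predictions):
mu = np.array([0.5, -1.0])
log_var = np.array([-0.2, 0.1])

# Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
# so the sampling step stays differentiable with respect to mu and log_var.
eps = rng.normal(size=mu.shape)
z = mu + np.exp(0.5 * log_var) * eps

# KL(q(z|x) || N(0, I)) in closed form for a diagonal Gaussian posterior:
kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

# The lower bound is E_q[log p(x|z)] - KL; the first term is the decoder's
# reconstruction log-likelihood of x given z, estimated with the sampled z above.
print("z =", z.round(3), " KL =", round(float(kl), 3))
```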
Variational Lossy Autoencoder
Connecting VAEs to lossy compression and the bits-back coding argument