The Annotated Transformer
A line-by-line PyTorch implementation of the Transformer architecture
Attention Is All You Need
The 2017 paper that introduced the Transformer architecture
Neural Machine Translation by Jointly Learning to Align and Translate
The paper that introduced the attention mechanism for sequence-to-sequence models
BERT: Bidirectional Transformers
Pre-training deep bidirectional representations for NLP
Chain-of-Thought Prompting
Eliciting step-by-step reasoning in language models for complex problem solving
CLIP: Contrastive Language-Image Pre-training
Learning visual concepts from natural language supervision
GPT: Generative Pre-Training
Autoregressive language models that learn to predict the next token
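Next-token prediction can be illustrated without a neural network at all. The sketch below uses a toy bigram count model (hypothetical corpus and helper names, not from any GPT release) purely to show the autoregressive loop: predict a token, append it, and feed the extended context back in.

```python
# Toy illustration of autoregressive generation (hypothetical corpus,
# not a real GPT): pick the most likely next token, append, repeat.
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()

# Bigram "model": count which token follows each token.
bigrams = Counter(zip(corpus, corpus[1:]))

def next_token(context_token):
    """Greedy next-token prediction from bigram counts."""
    candidates = {b: c for (a, b), c in bigrams.items() if a == context_token}
    return max(candidates, key=candidates.get) if candidates else None

# Generate autoregressively: each output token becomes the next input.
tokens = ["the"]
for _ in range(3):
    nxt = next_token(tokens[-1])
    if nxt is None:
        break
    tokens.append(nxt)
print(" ".join(tokens))
```

A real GPT replaces the bigram table with a Transformer that conditions on the entire context and samples from a softmax over the vocabulary, but the outer generation loop is the same.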
In-Context Learning
How large language models learn from examples in the prompt without weight updates
RLHF: Reinforcement Learning from Human Feedback
Teaching language models to prefer responses that people rank higher
The Unreasonable Effectiveness of Recurrent Neural Networks
Andrej Karpathy's influential blog post demonstrating RNN capabilities through character-level generation
Sequence to Sequence Learning
Encoder-decoder architecture for mapping sequences to sequences
Scaling Laws for Neural Language Models
Why bigger models, more data, and more compute lead to predictable gains
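The "predictable gains" have a compact empirical form. As a sketch in the notation of Kaplan et al. (2020), test loss falls as a power law in parameter count $N$ (and analogously in dataset size and compute), with fitted constants $N_c$ and $\alpha_N$:

$$
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}
$$

On a log-log plot this is a straight line, which is why loss at larger scales can be extrapolated from smaller training runs.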
Transformer
Self-attention models that process sequences in parallel
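The parallelism comes from expressing attention as a few matrix products rather than a step-by-step recurrence. A minimal NumPy sketch of scaled dot-product attention, the core operation from "Attention Is All You Need" (function name and toy shapes are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (seq_q, seq_k) similarity scores
    # Softmax over keys: each query gets a distribution over all positions.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                # weighted sum of value vectors

# Self-attention: queries, keys, and values all come from the same sequence.
# Every position attends to every other in one batched matrix multiply,
# with no sequential recurrence to unroll.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))           # 3 positions, 4-dim embeddings
out = scaled_dot_product_attention(x, x, x)
print(out.shape)                      # (3, 4)
```

Contrast with an RNN, which must process position $t$ before position $t+1$; here the whole sequence is handled at once, trading recurrence for position encodings.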
Understanding LSTM Networks
Christopher Olah's visual guide to Long Short-Term Memory networks
Word2Vec: Word Embeddings
Learning dense vector representations of words from text