The Annotated Transformer
A line-by-line PyTorch implementation of the Transformer architecture
Neural Machine Translation by Jointly Learning to Align and Translate
The paper that introduced the attention mechanism for sequence-to-sequence models
The Unreasonable Effectiveness of Recurrent Neural Networks
Andrej Karpathy's influential blog post demonstrating RNN capabilities through character-level generation
Scaling Laws for Neural Language Models
Empirical laws governing how language model performance scales with compute, data, and parameters
Attention Is All You Need
The paper that introduced the Transformer architecture, replacing recurrence entirely with self-attention (sketched after this list)
Understanding LSTM Networks
Christopher Olah's visual guide to Long Short-Term Memory networks
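Several of these entries revolve around the same core operation: scaled dot-product attention from "Attention Is All You Need". As a quick point of reference, here is a minimal PyTorch sketch of that operation; the function name, shapes, and masking convention are illustrative and not taken from any of the linked implementations.

```python
import math
import torch

def scaled_dot_product_attention(query, key, value, mask=None):
    """Compute softmax(Q K^T / sqrt(d_k)) V for batched inputs."""
    d_k = query.size(-1)
    # Similarity scores between every query position and every key position.
    scores = query @ key.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        # Block attention to masked positions (e.g. padding or future tokens).
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    # Each output vector is a weighted average of the value vectors.
    return weights @ value, weights

# Example usage with self-attention: query, key, and value share one sequence.
q = k = v = torch.randn(2, 10, 64)  # (batch, sequence length, d_k)
out, attn = scaled_dot_product_attention(q, k, v)
```

The Annotated Transformer walks through a full implementation built around this operation, including multi-head projections and masking; the sketch above covers only the single-head core.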