The Annotated Transformer
Line-by-line PyTorch implementation of the Transformer architecture
Attention Is All You Need
The 2017 paper that introduced the Transformer architecture
BERT: Bidirectional Transformers
Pre-training deep bidirectional representations for NLP
GPT: Generative Pre-Training
Autoregressive language models that learn to predict the next token
Transformer
A self-attention architecture that processes all positions of a sequence in parallel, without recurrence
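The operation shared by all of the models above is scaled dot-product attention, softmax(QKᵀ/√d_k)V. A minimal sketch in plain Python (lists instead of PyTorch tensors, purely for illustration; the function and variable names are my own, not from any of the listed works):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V are lists of row vectors (lists of floats).
    """
    d_k = len(K[0])
    out = []
    for q in Q:
        # Dot each query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        # Each output row is a convex combination of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

Because the rows of `Q` are processed independently, every position attends to the whole sequence at once, which is what the "in parallel" claim refers to (a real implementation batches this as matrix multiplies on the GPU).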