Attention Is All You Need
The 2017 paper that introduced the Transformer architecture
BERT: Bidirectional Transformers
Pre-training deep bidirectional representations for NLP
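BERT's bidirectional pre-training rests on the masked-language-model objective: hide a fraction of the input tokens and train the model to recover them from context on both sides. A minimal sketch of that masking step (the `mask_prob` value and `[MASK]` handling here are simplified; the paper also sometimes keeps or randomizes the selected tokens):

```python
import random

def mask_tokens(tokens, mask_prob=0.3, seed=0):
    # BERT-style masked-LM objective: hide some tokens and train the
    # model to recover them using context from BOTH directions.
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append("[MASK]")
            labels.append(tok)       # loss is computed at masked positions
        else:
            inputs.append(tok)
            labels.append(None)      # no loss at unmasked positions
    return inputs, labels

inputs, labels = mask_tokens("the cat sat on the mat".split())
```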
Batch Normalization
Normalizing layer inputs to accelerate deep network training
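The normalization itself is a short computation: standardize each feature over the batch, then apply a learned scale and shift. A minimal inference-free sketch (training-time statistics only; a real layer also tracks running averages for inference):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalize each feature over the batch dimension, then apply a
    # learned scale (gamma) and shift (beta).
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=10.0, size=(32, 4))  # batch of 32, 4 features
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
# Each feature of y now has mean ~0 and std ~1 over the batch.
```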
Generative Adversarial Networks
Two neural networks, a generator and a discriminator, compete until the generator produces realistic data
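The competition is expressed as two opposing losses: the discriminator learns to tell real from generated samples, while the generator learns to fool it. A sketch of the two objectives on raw discriminator logits, using the non-saturating generator loss from the original paper (the toy logit values below are illustrative only):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gan_losses(d_real_logits, d_fake_logits):
    # Discriminator minimizes: -log D(real) - log(1 - D(fake))
    d_loss = (-np.mean(np.log(sigmoid(d_real_logits)))
              - np.mean(np.log(1.0 - sigmoid(d_fake_logits))))
    # Generator minimizes the non-saturating loss: -log D(fake)
    g_loss = -np.mean(np.log(sigmoid(d_fake_logits)))
    return d_loss, g_loss

# At logits of 0 the discriminator is maximally uncertain (D = 0.5).
d_loss, g_loss = gan_losses(np.zeros(8), np.zeros(8))
```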
GPT: Generative Pre-Training
Autoregressive language models that learn to predict the next token
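Autoregressive generation means the model repeatedly predicts the next token and feeds its own output back in. A toy sketch of that loop, using a bigram count table as a stand-in for the neural network (the corpus and greedy decoding here are illustrative assumptions):

```python
from collections import Counter, defaultdict

# Toy autoregressive "model": a bigram table that predicts the next
# token from the current one (a stand-in for a neural LM).
corpus = "the cat sat on the mat the cat ran".split()
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict_next(token):
    # Greedy decoding: pick the most frequent successor.
    return bigrams[token].most_common(1)[0][0]

# Generate by repeatedly feeding the model its own last prediction.
tokens = ["the"]
for _ in range(3):
    tokens.append(predict_next(tokens[-1]))
# → ["the", "cat", "sat", "on"]
```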
Mamba: State Space Models
A sequence model that keeps a running state instead of attending to every token pair
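The running state is a linear recurrence: a constant-size vector is updated once per token, so cost grows linearly with sequence length rather than quadratically as in attention. A minimal sketch of the state-space scan (the 1-state example is an illustrative assumption; Mamba additionally makes A, B, C input-dependent):

```python
import numpy as np

def ssm_scan(A, B, C, xs):
    # Linear state-space recurrence: a constant-size state h is updated
    # once per token, so memory does not grow with sequence length
    # (unlike attention, which compares every token pair).
    h = np.zeros(A.shape[0])
    ys = []
    for x in xs:
        h = A @ h + B * x      # h_t = A h_{t-1} + B x_t
        ys.append(C @ h)       # y_t = C h_t
    return np.array(ys)

# A 1-state system with A = 0.5 behaves like an exponentially
# decaying memory of the input.
ys = ssm_scan(np.array([[0.5]]), np.array([1.0]), np.array([1.0]),
              [1.0, 0.0, 0.0])
# → [1.0, 0.5, 0.25]
```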
Sequence to Sequence Learning
Encoder-decoder architecture for mapping sequences to sequences
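The key idea is the two-stage information flow: an encoder folds a variable-length input into one fixed-size state, and a decoder unrolls from that state to emit the output sequence. A shape-level sketch with untrained random weights (the dimensions and simple tanh RNN cells are illustrative assumptions; this only shows the data flow, not a trained model):

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(xs, Wh, Wx):
    # Encoder RNN: fold the input sequence into one fixed-size state.
    h = np.zeros(Wh.shape[0])
    for x in xs:
        h = np.tanh(Wh @ h + Wx @ x)
    return h

def decode(h, Wh, Wy, steps):
    # Decoder RNN: unroll from the encoder state, emitting one output
    # vector per step.
    ys = []
    for _ in range(steps):
        h = np.tanh(Wh @ h)
        ys.append(Wy @ h)
    return np.array(ys)

d, dx, dy = 8, 3, 3  # state size, input dim, output dim (assumed)
h = encode([rng.normal(size=dx) for _ in range(5)],
           rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, dx)))
ys = decode(h, rng.normal(size=(d, d)) * 0.1, rng.normal(size=(dy, d)), 4)
# Input length (5) and output length (4) are decoupled by the state h.
```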
Transformer
A self-attention model that processes all positions of a sequence in parallel
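Parallel processing comes from the fact that scaled dot-product self-attention is a handful of matrix multiplies over the whole sequence at once. A single-head sketch (the token count and dimensions are illustrative assumptions; a full Transformer adds multiple heads, positional encodings, and feed-forward layers):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # Scaled dot-product self-attention: every position attends to every
    # position in one parallel matrix computation.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Row-wise softmax turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))                      # 5 tokens, model dim 4
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
out, attn = self_attention(X, Wq, Wk, Wv)
# attn is a 5x5 matrix; each row sums to 1.
```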