Attention Is All You Need
The 2017 paper that introduced the Transformer architecture
BERT: Bidirectional Transformers
Pre-training deep bidirectional representations for NLP
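BERT's bidirectional pre-training rests on the masked-language-model objective: hide a fraction of the input tokens and train the model to recover them from context on both sides. A minimal sketch of that masking step (the `mask_prob` value and `[MASK]` handling here are simplified; the paper also sometimes keeps or randomizes the selected tokens):

```python
import random

def mask_tokens(tokens, mask_prob=0.3, seed=0):
    # BERT-style masked-LM objective: hide some tokens and train the
    # model to recover them using context from BOTH directions.
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append("[MASK]")
            labels.append(tok)       # loss is computed at masked positions
        else:
            inputs.append(tok)
            labels.append(None)      # no loss at unmasked positions
    return inputs, labels

inputs, labels = mask_tokens("the cat sat on the mat".split())
```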
Batch Normalization
Normalizing layer inputs to accelerate deep network training
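The normalization itself is a short computation: standardize each feature over the batch, then apply a learned scale and shift. A minimal inference-free sketch (training-time statistics only; a real layer also tracks running averages for inference):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalize each feature over the batch dimension, then apply a
    # learned scale (gamma) and shift (beta).
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=10.0, size=(32, 4))  # batch of 32, 4 features
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
# Each feature of y now has mean ~0 and std ~1 over the batch.
```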
Generative Adversarial Networks
Two neural networks, a generator and a discriminator, compete until the generator produces realistic data
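The competition is expressed as two opposing losses: the discriminator learns to tell real from generated samples, while the generator learns to fool it. A sketch of the two objectives on raw discriminator logits, using the non-saturating generator loss from the original paper (the toy logit values below are illustrative only):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gan_losses(d_real_logits, d_fake_logits):
    # Discriminator minimizes: -log D(real) - log(1 - D(fake))
    d_loss = (-np.mean(np.log(sigmoid(d_real_logits)))
              - np.mean(np.log(1.0 - sigmoid(d_fake_logits))))
    # Generator minimizes the non-saturating loss: -log D(fake)
    g_loss = -np.mean(np.log(sigmoid(d_fake_logits)))
    return d_loss, g_loss

# At logits of 0 the discriminator is maximally uncertain (D = 0.5).
d_loss, g_loss = gan_losses(np.zeros(8), np.zeros(8))
```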
GPT: Generative Pre-Training
Autoregressive language models that learn to predict the next token
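Autoregressive generation means the model repeatedly predicts the next token and feeds its own output back in. A toy sketch of that loop, using a bigram count table as a stand-in for the neural network (the corpus and greedy decoding here are illustrative assumptions):

```python
from collections import Counter, defaultdict

# Toy autoregressive "model": a bigram table that predicts the next
# token from the current one (a stand-in for a neural LM).
corpus = "the cat sat on the mat the cat ran".split()
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict_next(token):
    # Greedy decoding: pick the most frequent successor.
    return bigrams[token].most_common(1)[0][0]

# Generate by repeatedly feeding the model its own last prediction.
tokens = ["the"]
for _ in range(3):
    tokens.append(predict_next(tokens[-1]))
# → ["the", "cat", "sat", "on"]
```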
Mamba: State Space Models
A sequence model that keeps a running state instead of attending to every token pair
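The running state is a linear recurrence: a constant-size vector is updated once per token, so cost grows linearly with sequence length rather than quadratically as in attention. A minimal sketch of the state-space scan (the 1-state example is an illustrative assumption; Mamba additionally makes A, B, C input-dependent):

```python
import numpy as np

def ssm_scan(A, B, C, xs):
    # Linear state-space recurrence: a constant-size state h is updated
    # once per token, so memory does not grow with sequence length
    # (unlike attention, which compares every token pair).
    h = np.zeros(A.shape[0])
    ys = []
    for x in xs:
        h = A @ h + B * x      # h_t = A h_{t-1} + B x_t
        ys.append(C @ h)       # y_t = C h_t
    return np.array(ys)

# A 1-state system with A = 0.5 behaves like an exponentially
# decaying memory of the input.
ys = ssm_scan(np.array([[0.5]]), np.array([1.0]), np.array([1.0]),
              [1.0, 0.0, 0.0])
# → [1.0, 0.5, 0.25]
```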
Sequence to Sequence Learning
Encoder-decoder architecture for mapping sequences to sequences
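The key idea is the two-stage information flow: an encoder folds a variable-length input into one fixed-size state, and a decoder unrolls from that state to emit the output sequence. A shape-level sketch with untrained random weights (the dimensions and simple tanh RNN cells are illustrative assumptions; this only shows the data flow, not a trained model):

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(xs, Wh, Wx):
    # Encoder RNN: fold the input sequence into one fixed-size state.
    h = np.zeros(Wh.shape[0])
    for x in xs:
        h = np.tanh(Wh @ h + Wx @ x)
    return h

def decode(h, Wh, Wy, steps):
    # Decoder RNN: unroll from the encoder state, emitting one output
    # vector per step.
    ys = []
    for _ in range(steps):
        h = np.tanh(Wh @ h)
        ys.append(Wy @ h)
    return np.array(ys)

d, dx, dy = 8, 3, 3  # state size, input dim, output dim (assumed)
h = encode([rng.normal(size=dx) for _ in range(5)],
           rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, dx)))
ys = decode(h, rng.normal(size=(d, d)) * 0.1, rng.normal(size=(dy, d)), 4)
# Input length (5) and output length (4) are decoupled by the state h.
```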
Transformer
A self-attention model that processes all positions of a sequence in parallel
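Parallel processing comes from the fact that scaled dot-product self-attention is a handful of matrix multiplies over the whole sequence at once. A single-head sketch (the token count and dimensions are illustrative assumptions; a full Transformer adds multiple heads, positional encodings, and feed-forward layers):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # Scaled dot-product self-attention: every position attends to every
    # position in one parallel matrix computation.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Row-wise softmax turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))                      # 5 tokens, model dim 4
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
out, attn = self_attention(X, Wq, Wk, Wv)
# attn is a 5x5 matrix; each row sums to 1.
```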