Autoregressive language models that learn to predict the next token
GPT (Generative Pre-trained Transformer) introduced the paradigm of large-scale autoregressive language modeling. Starting with GPT-1 (2018) from OpenAI, the series demonstrated that scaling transformer decoders creates increasingly capable general-purpose models.
Core Idea: Next Token Prediction
GPT models learn to predict the next token given all previous tokens:

$$P(x_1, \dots, x_n) = \prod_{t=1}^{n} P(x_t \mid x_1, \dots, x_{t-1})$$
This simple objective—predicting what comes next—turns out to encode rich understanding of language, facts, and reasoning.
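A minimal sketch of what this objective implies at inference time: generation proceeds one token at a time, each sampled from $P(x_t \mid x_{<t})$. The bigram table below is a hypothetical stand-in for a trained transformer; the loop structure is the point.

```python
import random

# Toy "language model": next-token probabilities from bigram counts.
# (Illustrative stand-in for a transformer; vocabulary is hypothetical.)
BIGRAMS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a":   {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
    "sat": {"</s>": 1.0},
}

def generate(max_len=10, seed=0):
    """Autoregressive generation: each token is sampled conditioned
    on the tokens generated so far (here, just the last one)."""
    rng = random.Random(seed)
    tokens = ["<s>"]
    for _ in range(max_len):
        dist = BIGRAMS[tokens[-1]]
        # Sample the next token from P(x_t | x_{<t})
        next_tok = rng.choices(list(dist), weights=list(dist.values()))[0]
        if next_tok == "</s>":
            break
        tokens.append(next_tok)
    return tokens[1:]
```

A real GPT replaces the lookup table with a transformer forward pass, but the sampling loop is the same.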
The Architecture
GPT uses the Transformer decoder with causal (masked) self-attention:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V$$

where $M$ is a causal mask preventing attention to future tokens: $M_{ij} = -\infty$ for $j > i$ and $0$ otherwise.
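A sketch of causal attention in NumPy (function name and shapes are illustrative, not from any GPT codebase): adding a large negative value above the diagonal makes the softmax assign zero weight to future positions.

```python
import numpy as np

def causal_attention(Q, K, V):
    """Scaled dot-product attention where position i may only
    attend to positions j <= i (single head, no batch, for clarity)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (T, T) attention logits
    T = scores.shape[0]
    # Causal mask: large negative values above the diagonal
    mask = np.triu(np.ones((T, T)), k=1) * -1e9
    scores = scores + mask
    # Row-wise softmax (subtract max for numerical stability)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))
out, w = causal_attention(Q, K, V)
# Every row of w is zero above the diagonal: no attention to the future.
```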
| Model | Year | Parameters | Context | Training Data |
|---|---|---|---|---|
| GPT-1 | 2018 | 117M | 512 | BookCorpus |
| GPT-2 | 2019 | 1.5B | 1024 | WebText (40GB) |
| GPT-3 | 2020 | 175B | 2048 | 570GB filtered |
| GPT-4 | 2023 | ~1.7T* | 8K-128K | Unknown |
*Estimated, not officially disclosed
GPT-2: The Scaling Revelation
GPT-2 demonstrated emergent capabilities from scale:
- Zero-shot task performance without fine-tuning
- Coherent long-form text generation
- Basic reasoning and arithmetic
- Translation and summarization (without being trained for them)
The paper’s key insight: “Language models are unsupervised multitask learners.”
GPT-3: In-Context Learning
GPT-3 introduced few-shot prompting:
```
Translate English to French:
sea otter => loutre de mer
peppermint => menthe poivrée
cheese =>
```
The model learns new tasks from examples in the prompt, without gradient updates. This emerged purely from scale.
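The few-shot pattern above can be assembled programmatically. This helper is a hypothetical illustration of the prompt format, not an OpenAI API call:

```python
def few_shot_prompt(instruction, examples, query):
    """Build a few-shot prompt: a task description, k worked
    examples, then the query left for the model to complete."""
    lines = [instruction]
    for source, target in examples:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")  # the model continues from here
    return "\n".join(lines)

prompt = few_shot_prompt(
    "Translate English to French:",
    [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")],
    "cheese",
)
```

The "learning" happens entirely in the forward pass: the examples condition the model's next-token distribution, with no parameter updates.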
Training Objective
Standard language modeling loss:

$$\mathcal{L} = -\sum_{t} \log P(x_t \mid x_{<t})$$
GPT-3 additionally uses:
- Sparse attention patterns for efficiency
- Model parallelism across many GPUs
- Careful data deduplication
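The loss can be sketched directly from its definition, given the probabilities the model assigned to each correct next token (the values below are hypothetical):

```python
import math

def lm_loss(token_probs):
    """Per-token negative log-likelihood:
    L = -(1/T) * sum_t log P(x_t | x_{<t})."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Probabilities a model assigned to each observed next token
probs = [0.5, 0.25, 0.8]
loss = lm_loss(probs)  # lower when the model is more confident and correct
```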
Why Decoder-Only?
BERT uses encoders (bidirectional), GPT uses decoders (causal). Why?
| Aspect | Encoder (BERT) | Decoder (GPT) |
|---|---|---|
| Training | Masked LM | Next token |
| Generation | Cannot generate | Natural generation |
| Understanding | Both directions | Left context only |
| Use case | Classification, QA | Generation, chat |
Modern LLMs (GPT-4, Claude) use decoder-only architectures because generation is the primary interface.
Scaling Laws
GPT-3 revealed predictable scaling:
Loss decreases as a power law with model size $N$, dataset size $D$, and compute $C$:

$$L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}$$
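As a sketch, here is the parameter-count power law with the constants reported in the 2020 scaling-laws study ($N_c \approx 8.8 \times 10^{13}$, $\alpha_N \approx 0.076$), used purely for illustration:

```python
def scaling_loss(N, N_c=8.8e13, alpha_N=0.076):
    """Power law for loss vs. parameter count: L(N) = (N_c / N)^alpha_N.
    Constants are the published fits from the 2020 study, illustrative only."""
    return (N_c / N) ** alpha_N

# Doubling model size shrinks loss by a constant factor of 2**(-alpha_N),
# regardless of the starting size -- that is what "power law" means here.
ratio = scaling_loss(2e9) / scaling_loss(1e9)
```

This predictability is what justified training GPT-3 at 175B parameters: smaller runs let researchers extrapolate the loss before committing the compute.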
From GPT to ChatGPT
The path to conversational AI:
- GPT-3: Raw language model
- InstructGPT: Fine-tuned to follow instructions (RLHF)
- ChatGPT: Optimized for dialogue
RLHF (Reinforcement Learning from Human Feedback) aligns the model with human preferences.
Historical Impact
GPT established:
- Scaling hypothesis: Bigger models → better capabilities
- Emergent abilities: Capabilities appearing at scale
- Prompt engineering: Programming via natural language
- Foundation models: One model, many tasks
Key Papers
- Improving Language Understanding by Generative Pre-Training – Radford et al., 2018
  https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
- Language Models are Unsupervised Multitask Learners (GPT-2) – Radford et al., 2019
  https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
- Language Models are Few-Shot Learners (GPT-3) – Brown et al., 2020
  https://arxiv.org/abs/2005.14165
- Training language models to follow instructions with human feedback (InstructGPT) – Ouyang et al., 2022
  https://arxiv.org/abs/2203.02155