GPT: Generative Pre-Training

Autoregressive language models that learn to predict the next token

GPT (Generative Pre-trained Transformer) introduced the paradigm of large-scale autoregressive language modeling. Starting with GPT-1 (2018) from OpenAI, the series demonstrated that scaling transformer decoders creates increasingly capable general-purpose models.

Core Idea: Next Token Prediction

GPT models learn to predict the next token given all previous tokens:

$$P(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid x_1, \ldots, x_{i-1})$$

This simple objective—predicting what comes next—turns out to encode rich understanding of language, facts, and reasoning.
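As a concrete illustration of the chain-rule factorization, the probability of a whole sequence is the product of per-step conditional probabilities. The numbers below are made up for illustration; a real model would produce them from its softmax output.

```python
import numpy as np

# Hypothetical conditional probabilities a model might assign to the
# actual tokens of a 4-token sequence (numbers are illustrative only):
step_probs = [0.20, 0.55, 0.70, 0.90]   # P(x1), P(x2|x1), P(x3|x1,x2), ...

sequence_prob = np.prod(step_probs)      # chain rule: product of conditionals

# In practice we sum log-probabilities instead, to avoid underflow
# on long sequences:
log_prob = np.sum(np.log(step_probs))

print(sequence_prob)
print(np.isclose(np.exp(log_prob), sequence_prob))  # True
```

Working in log space is why the training loss below is a sum of log-probabilities rather than a product.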

The Architecture

GPT uses the Transformer decoder with causal (masked) self-attention:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V$$

where $M$ is a causal mask preventing attention to future tokens.
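A minimal NumPy sketch of this masked attention, for a single head and a single sequence (real implementations are batched and multi-headed; the mask convention matches the formula above, with $-\infty$ above the diagonal):

```python
import numpy as np

def causal_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask M (NumPy sketch)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n, n) raw scores
    n = scores.shape[0]
    M = np.triu(np.full((n, n), -np.inf), k=1)      # -inf above the diagonal
    scores = scores + M                             # block attention to the future
    # Row-wise softmax (subtract the row max for numerical stability)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 4, 8))                # 4 tokens, d_k = 8
out, w = causal_attention(Q, K, V)
print(np.allclose(w[0], [1, 0, 0, 0]))              # first token sees only itself
```

Because `exp(-inf)` is exactly zero, every attention weight on a future position vanishes, and each row of `w` still sums to 1.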

| Model | Year | Parameters | Context (tokens) | Training Data |
|-------|------|------------|------------------|---------------|
| GPT-1 | 2018 | 117M | 512 | BookCorpus |
| GPT-2 | 2019 | 1.5B | 1024 | WebText (40GB) |
| GPT-3 | 2020 | 175B | 2048 | 570GB filtered text |
| GPT-4 | 2023 | ~1.7T* | 8K–128K | Unknown |

*Estimated, not officially disclosed

Interactive Demo

Explore autoregressive generation and causal attention:

[Interactive demo: GPT autoregressive generation, from GPT-1 (117M) through GPT-4 (~1.7T)]

How GPT works: given a sequence of tokens, GPT predicts the most likely next token. It can only attend to previous tokens (causal masking), making generation natural: each new token is sampled from the predicted distribution, then appended to the context.
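The predict–sample–append loop just described can be sketched in a few lines. Here `toy_model` is a hypothetical stand-in for a real GPT forward pass (it deterministically favors one token), so only the loop structure is the point:

```python
import numpy as np

vocab = ["The", "quick", "brown", "fox", "jumps", "."]

def toy_model(context):
    """Hypothetical stand-in for a GPT forward pass: it just favors the
    token whose index equals the current context length (mod vocab size)."""
    logits = np.full(len(vocab), -5.0)
    logits[len(context) % len(vocab)] = 5.0
    return np.exp(logits) / np.exp(logits).sum()   # softmax over vocabulary

def generate(prompt, n_steps, rng):
    tokens = list(prompt)
    for _ in range(n_steps):
        probs = toy_model(tokens)                  # P(next token | context)
        next_id = rng.choice(len(vocab), p=probs)  # sample from the distribution
        tokens.append(vocab[next_id])              # append, then repeat
    return tokens

print(" ".join(generate(["The"], 4, np.random.default_rng(0))))
```

Swapping `rng.choice` for `np.argmax(probs)` gives greedy decoding; real systems also use temperature, top-k, or nucleus sampling at this step.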

GPT-2: The Scaling Revelation

GPT-2 demonstrated emergent capabilities from scale:

  • Zero-shot task performance without fine-tuning
  • Coherent long-form text generation
  • Basic reasoning and arithmetic
  • Translation and summarization (without being trained for them)

The paper’s key insight: “Language models are unsupervised multitask learners.”

GPT-3: In-Context Learning

GPT-3 introduced few-shot prompting:

Translate English to French:
sea otter => loutre de mer
peppermint => menthe poivrée
cheese =>

The model learns new tasks from examples in the prompt, without gradient updates. This emerged purely from scale.
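The prompt above is just text, which is the whole point: a small sketch of assembling it programmatically shows that the "task specification" lives entirely in the input string, never in the weights.

```python
# Few-shot prompt assembly: example pairs come from the translation
# demo shown above; the format (one "en => fr" pair per line) is the
# only thing telling the model what task to perform.
examples = [
    ("sea otter", "loutre de mer"),
    ("peppermint", "menthe poivrée"),
]
query = "cheese"

prompt = "Translate English to French:\n"
prompt += "".join(f"{en} => {fr}\n" for en, fr in examples)
prompt += f"{query} =>"   # the model's completion is the answer

print(prompt)
```

The model's continuation of the final line is taken as its answer; no gradient update ever happens.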

Training Objective

Standard language modeling loss:

$$\mathcal{L} = -\sum_{i=1}^{n} \log P(x_i \mid x_1, \ldots, x_{i-1}; \theta)$$

GPT-3 additionally uses:

  • Sparse attention patterns for efficiency
  • Model parallelism across many GPUs
  • Careful data deduplication
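The language modeling loss is just the summed negative log-probability of each actual next token under the model's softmax. A minimal NumPy sketch (real training code uses a framework's fused cross-entropy, but the arithmetic is the same):

```python
import numpy as np

def lm_loss(logits, targets):
    """Summed negative log-likelihood of the target next tokens.

    logits:  (n, vocab_size) unnormalized scores, one row per position
    targets: (n,) index of the actual next token at each position
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)   # stable softmax
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].sum()

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 10))      # 5 positions, vocabulary of 10
targets = rng.integers(0, 10, size=5)  # the "correct" next tokens
loss = lm_loss(logits, targets)
print(loss > 0)  # True: each term is -log of a probability < 1
```

A model that puts probability ~1 on every target drives this loss toward zero, which is exactly what minimizing it encourages.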

Why Decoder-Only?

BERT uses encoders (bidirectional), GPT uses decoders (causal). Why?

| Aspect | Encoder (BERT) | Decoder (GPT) |
|--------|----------------|---------------|
| Training | Masked LM | Next-token prediction |
| Generation | Cannot generate | Natural generation |
| Understanding | Both directions | Left context only |
| Use case | Classification, QA | Generation, chat |

Modern LLMs (GPT-4, Claude) use decoder-only architectures because generation is the primary interface.

Scaling Laws

GPT-3 revealed predictable scaling:

$$L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}$$

Loss decreases as a power law in model size $N$, data size $D$, and compute $C$.
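As a worked example, the model-size law can be evaluated with the constants reported in the Kaplan et al. (2020) scaling-laws paper ($\alpha_N \approx 0.076$, $N_c \approx 8.8 \times 10^{13}$); these constants come from that paper, not from this article:

```python
# L(N) = (N_c / N) ** alpha_N, with constants from Kaplan et al. (2020)
# for the model-size law (external assumption, not stated above).
def scaling_loss(N, N_c=8.8e13, alpha_N=0.076):
    return (N_c / N) ** alpha_N

# Doubling N multiplies the loss by 2 ** -alpha_N ≈ 0.95, independent of N:
for N in [1e8, 1e9, 1e10, 1e11]:
    print(f"N = {N:.0e}: L = {scaling_loss(N):.3f}")
```

The practical upshot is predictability: the improvement from each doubling of $N$ is a constant factor, so loss at untrained scales can be extrapolated before spending the compute.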

From GPT to ChatGPT

The path to conversational AI:

  1. GPT-3: Raw language model
  2. InstructGPT: Fine-tuned to follow instructions (RLHF)
  3. ChatGPT: Optimized for dialogue

RLHF (Reinforcement Learning from Human Feedback) aligns the model with human preferences.

Historical Impact

GPT established:

  • Scaling hypothesis: Bigger models → better capabilities
  • Emergent abilities: Capabilities appearing at scale
  • Prompt engineering: Programming via natural language
  • Foundation models: One model, many tasks

Key Papers

  • GPT-1: “Improving Language Understanding by Generative Pre-Training” (Radford et al., 2018)
  • GPT-2: “Language Models are Unsupervised Multitask Learners” (Radford et al., 2019)
  • GPT-3: “Language Models are Few-Shot Learners” (Brown et al., 2020)
  • InstructGPT: “Training Language Models to Follow Instructions with Human Feedback” (Ouyang et al., 2022)
  • “GPT-4 Technical Report” (OpenAI, 2023)
