Pre-training

The stage where a model learns broad patterns from a very large dataset

Pre-training is the first major stage in building a modern foundation model. During this stage, the model is exposed to a huge amount of data and learns broad patterns in language, code, images, or other modalities before it is adapted to a narrower task.

Good follow-up pages: GPT, BERT, Transformer, and Scaling Laws.

A Simple Example

Suppose you want a model that can answer biology questions. You could train it only on biology examples, but that would be limiting.

Instead, pre-training teaches the model broad skills first:

  • how sentences are structured
  • how facts are expressed in text
  • how code, math, or diagrams tend to look

After that broad stage, the model can be adapted more efficiently to specific tasks.

Why Pre-training Matters

Pre-training does three important things:

  1. It gives the model general-purpose representations
  2. It reduces the amount of labeled task-specific data you need later
  3. It creates a reusable base model for many downstream uses

That is why so much of modern AI development is really about building and adapting pre-trained models.

Common Objectives

Most pre-training uses self-supervised learning, which means the training signal comes from the data itself.

Causal Language Modeling

Predict the next token from the previous ones:

P(x) = \prod_{i=1}^{n} P(x_i \mid x_{<i})

This is the objective used by GPT-style models.
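To make the product above concrete, here is a minimal sketch of how a sequence's probability decomposes under a causal language model. The bigram table stands in for a real neural model and its values are purely illustrative; in practice the context is the full prefix, not just the previous token.

```python
import math

# Toy "model": P(next_token | previous_token), illustrative values only.
BIGRAM = {
    ("<s>", "the"): 0.5,
    ("the", "cat"): 0.2,
    ("cat", "sat"): 0.4,
}

def sequence_log_prob(tokens):
    """log P(x) = sum_i log P(x_i | x_{<i}), here truncated to a bigram context."""
    logp = 0.0
    prev = "<s>"
    for tok in tokens:
        logp += math.log(BIGRAM[(prev, tok)])
        prev = tok
    return logp

# Training minimizes the negative log-likelihood of the data:
nll = -sequence_log_prob(["the", "cat", "sat"])
```

Working in log space instead of multiplying raw probabilities avoids numerical underflow, which matters once sequences are thousands of tokens long.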

Masked Language Modeling

Hide some tokens and ask the model to fill them in from context.

This is the objective used by BERT-style models.
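A minimal sketch of the masking step, assuming a BERT-style 15% mask rate (the rate and the `[MASK]` token follow BERT's convention; real implementations also sometimes substitute random tokens or leave tokens unchanged):

```python
import random

MASK, MASK_RATE = "[MASK]", 0.15

def mask_tokens(tokens, rng):
    """Replace ~15% of tokens with [MASK]; the model must predict the originals."""
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < MASK_RATE:
            inputs.append(MASK)
            targets.append(tok)   # loss is computed only at masked positions
        else:
            inputs.append(tok)
            targets.append(None)  # ignored by the loss
    return inputs, targets

inputs, targets = mask_tokens("the cat sat on the mat".split(), random.Random(1))
```

The model sees `inputs` and is trained to reconstruct the entries of `targets` that are not `None`, using context from both sides of each mask.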

The Pre-training Pipeline

  1. Data curation: collect, filter, deduplicate, and clean large corpora
  2. Tokenization: convert raw text or other inputs into model-ready units
  3. Training: optimize the model across many steps on large compute clusters
  4. Evaluation: track training loss and benchmark performance
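Step 1 can be sketched in miniature. Exact deduplication by content hash is one common part of data curation (real pipelines add fuzzy deduplication, quality filters, and more; the normalization here is a simplification):

```python
import hashlib

def dedupe(documents):
    """Drop exact-duplicate documents by hashing normalized content."""
    seen, kept = set(), []
    for doc in documents:
        # Normalize lightly so trivial variants collide on the same hash.
        h = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(doc)
    return kept

corpus = ["The cat sat.", "the cat sat.", "A different document."]
clean = dedupe(corpus)  # the first two entries collide after normalization
```

Hashing keeps memory proportional to the number of unique documents, which matters when the corpus is internet-scale.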

What Makes Pre-training Hard

  • huge compute requirements
  • messy internet-scale data
  • difficult decisions about data mixture and quality
  • strong interactions between model size, tokens, and compute budget

That is where Scaling Laws become useful.
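The interaction between model size, tokens, and compute can be seen in a common back-of-the-envelope rule: training a dense transformer costs roughly C ≈ 6ND FLOPs, where N is the parameter count and D the number of training tokens. The figures below are illustrative, not a recipe:

```python
def train_flops(n_params, n_tokens):
    """Rough training cost estimate: C ~ 6 * N * D FLOPs for a dense transformer."""
    return 6 * n_params * n_tokens

# e.g. a 7-billion-parameter model trained on 2 trillion tokens:
flops = train_flops(7e9, 2e12)  # ~8.4e22 FLOPs
```

Fixing the compute budget C forces a trade-off between N and D, which is exactly the question scaling laws try to answer.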

Pre-training vs Fine-Tuning

  • Pre-training: learn broad patterns from massive data
  • Fine-tuning: adapt the model to a narrower behavior or task
  • Preference tuning / alignment: make behavior more useful to people

What To Remember

  • Pre-training is the broad learning stage, not the final product
  • It usually uses self-supervised objectives
  • Most modern AI systems start with a strong pre-trained base model