Pre-training

The initial phase of training foundation models on vast amounts of data

Pre-training is the first and most computationally intensive stage in the development of modern foundation models. During this phase, a model is trained on a massive dataset (often terabytes of text, code, or images) to learn general-purpose patterns, representations, and world knowledge.

The resulting “pre-trained” model serves as a base that can be further adapted (via fine-tuning) for specific downstream tasks.

Core Concepts

Self-Supervised Learning

Most modern pre-training relies on self-supervised learning, where the training labels are derived directly from the data itself. Common objectives include:

  • Causal Language Modeling (CLM): Predicting the next token in a sequence (e.g., GPT style), maximizing P(x) = \prod_{i=1}^{n} P(x_i \mid x_{<i}); a minimal training sketch follows this list.
  • Masked Language Modeling (MLM): Predicting masked tokens based on surrounding context (e.g., BERT style).
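
The sketch below shows only how the CLM objective is computed: inputs are the tokens seen so far, targets are the same sequence shifted by one position, and the loss is the average next-token cross-entropy. The toy model (an embedding plus a linear head), PyTorch, and the sizes used are assumptions for illustration, not a real pre-training setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model, batch_size, seq_len = 1000, 64, 4, 16

# Toy stand-in for a Transformer: token embedding + linear projection to the vocabulary.
embed = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size)

# Random token ids stand in for a tokenized text batch.
tokens = torch.randint(0, vocab_size, (batch_size, seq_len))

# The CLM objective: position i predicts token x_{i+1},
# so targets are the inputs shifted left by one.
inputs, targets = tokens[:, :-1], tokens[:, 1:]

logits = lm_head(embed(inputs))          # (batch, seq_len - 1, vocab_size)
loss = F.cross_entropy(                  # average -log P(x_i | x_{<i})
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)
print(f"next-token cross-entropy: {loss.item():.3f}")
```

MLM differs mainly in how targets are built: a random subset of input tokens is replaced with a mask token, and the loss is computed only at the masked positions.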

Scaling Laws

Performance typically improves predictably with increases in:

  1. Compute: Total FLOPs used for training (a back-of-the-envelope estimate follows this list).
  2. Dataset Size: Number of training tokens.
  3. Parameters: Number of weights in the model.
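
As a concrete illustration of how these three quantities interact, the widely used rule of thumb C ≈ 6·N·D estimates training compute C (in FLOPs) from the parameter count N and the number of training tokens D. The sketch below applies it to assumed, illustrative numbers (a 7B-parameter model trained on 2 trillion tokens); it is not a measurement of any particular model.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Rule-of-thumb training compute for a dense model: C ~= 6 * N * D FLOPs."""
    return 6.0 * n_params * n_tokens

n_params = 7e9    # N: 7B parameters (illustrative)
n_tokens = 2e12   # D: 2 trillion training tokens (illustrative)

flops = training_flops(n_params, n_tokens)
print(f"~{flops:.1e} FLOPs")   # ~8.4e+22 FLOPs for these numbers
```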

Recent SOTA Model (01/2026)

Falcon-H1R 7B

Released in January 2026 by the Technology Innovation Institute (TII), Falcon-H1R 7B represents a shift towards efficient hybrid architectures.

  • Architecture: Transformer-Mamba Hybrid. It combines the parallel processing capabilities of Transformers (attention mechanisms) with the linear-time inference of Mamba (state space models).
  • Performance: Despite its relatively small size (7B parameters), it achieves an 88.1% score on the AIME-24 math benchmark, rivaling much larger pure-Transformer models.
  • Significance: The H1R demonstrates that hybridizing recurrence with attention can yield state-of-the-art reasoning capabilities while offering better memory efficiency and faster inference than traditional pure-Transformer architectures.

The Pre-training Pipeline

  1. Data Curation: Collecting, cleaning, and deduplicating massive corpora (CommonCrawl, GitHub, academic papers); a minimal dedup-and-tokenize sketch follows this list.
  2. Tokenization: Converting raw text into numerical tokens.
  3. Training: Running the model on thousands of GPUs for weeks or months.
  4. Evaluation: Monitoring loss and benchmarking on standard test sets (MMLU, HumanEval, etc.).
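
The sketch below illustrates the first two stages at toy scale: exact deduplication by hashing normalized document text, followed by tokenization. The whitespace tokenizer and on-the-fly vocabulary are simplifications standing in for a real subword tokenizer (e.g., BPE), and the documents are invented examples.

```python
import hashlib

docs = [
    "Pre-training teaches general-purpose representations.",
    "Pre-training teaches general-purpose representations.",   # exact duplicate
    "Fine-tuning adapts the base model to downstream tasks.",
]

# 1. Data curation: drop exact duplicates by hashing normalized content.
seen, deduped = set(), []
for doc in docs:
    digest = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
    if digest not in seen:
        seen.add(digest)
        deduped.append(doc)

# 2. Tokenization: map whitespace-separated words to integer ids,
#    growing the vocabulary as new words appear.
vocab = {}
def tokenize(text):
    return [vocab.setdefault(word, len(vocab)) for word in text.lower().split()]

token_ids = [tokenize(doc) for doc in deduped]
print(f"{len(docs)} documents -> {len(deduped)} after deduplication")
print(token_ids[0])   # [0, 1, 2, 3]
```

In practice, near-duplicate detection (e.g., MinHash-based) and learned subword vocabularies replace these simplifications, and the training and evaluation stages run on distributed GPU clusters.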