The stage where a model learns broad patterns from a very large dataset
Pre-training is the first major stage in building a modern foundation model. During this stage, the model is exposed to a huge amount of data and learns broad patterns in language, code, images, or other modalities before it is adapted to a narrower task.
Good follow-up pages: GPT, BERT, Transformer, and Scaling Laws.
A Simple Example
Suppose you want a model that can answer biology questions. You could train it only on biology examples, but a narrow dataset like that would not teach the model general language skills.
Instead, pre-training teaches the model broad skills first:
- how sentences are structured
- how facts are expressed in text
- how code, math, or diagrams tend to look
After that broad stage, the model can be adapted more efficiently to specific tasks.
Why Pre-training Matters
Pre-training does three important things:
- It gives the model general-purpose representations
- It reduces the amount of labeled task-specific data you need later
- It creates a reusable base model for many downstream uses
That is why so much of modern AI development is really about building and adapting pre-trained models.
Common Objectives
Most pre-training uses self-supervised learning, which means the training signal comes from the data itself.
Causal Language Modeling
Predict the next token in a sequence from the tokens that came before it.
This is the objective used by GPT-style models.
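A minimal sketch of the causal setup, with made-up token IDs for illustration: each training pair consists of the sequence so far (the context) and the single token that follows it.

```python
def causal_lm_pairs(token_ids):
    """Turn one token sequence into (context, next_token) training pairs."""
    return [
        (token_ids[: i + 1], token_ids[i + 1])
        for i in range(len(token_ids) - 1)
    ]

# Hypothetical token IDs for a short sequence.
tokens = [5, 12, 7, 9]
for context, target in causal_lm_pairs(tokens):
    print(context, "->", target)
```

In practice the model computes a probability distribution over the whole vocabulary at every position and is trained with cross-entropy against the true next token, but the input/target framing is exactly this.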
Masked Language Modeling
Hide some tokens and ask the model to fill them in from context.
This is the objective used by BERT-style models.
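A toy sketch of the masking step, assuming a hypothetical `MASK_ID` of 0: a fraction of tokens is replaced by the mask ID, and the model's targets are the original tokens at exactly those positions (unmasked positions contribute no loss).

```python
import random

MASK_ID = 0  # hypothetical ID reserved for the [MASK] token

def mask_tokens(token_ids, mask_prob=0.15, seed=0):
    """Replace ~mask_prob of tokens with MASK_ID.

    Returns (inputs, targets); targets is None wherever the token was
    left visible, so those positions are ignored in the loss.
    """
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in token_ids:
        if rng.random() < mask_prob:
            inputs.append(MASK_ID)
            targets.append(tok)   # model must recover the original token
        else:
            inputs.append(tok)
            targets.append(None)  # no loss at unmasked positions
    return inputs, targets
```

Real BERT-style training adds refinements (e.g. sometimes keeping or randomizing the masked token instead of using `[MASK]`), but this is the core fill-in-the-blank objective.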
The Pre-training Pipeline
- Data curation: collect, filter, deduplicate, and clean large corpora
- Tokenization: convert raw text or other inputs into model-ready units
- Training: optimize the model across many steps on large compute clusters
- Evaluation: track training loss and run benchmarks to monitor model behavior
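The first two stages above can be sketched on a toy "corpus": exact-duplicate filtering followed by whitespace tokenization into integer IDs. Real pipelines use fuzzy deduplication and subword tokenizers, but the flow is the same.

```python
def deduplicate(docs):
    """Keep the first copy of each exact-duplicate document."""
    seen, kept = set(), []
    for doc in docs:
        if doc not in seen:
            seen.add(doc)
            kept.append(doc)
    return kept

def build_vocab(docs):
    """Assign each whitespace-separated word an integer ID."""
    vocab = {}
    for doc in docs:
        for word in doc.split():
            vocab.setdefault(word, len(vocab))
    return vocab

def tokenize(doc, vocab):
    return [vocab[w] for w in doc.split()]

corpus = ["the cat sat", "the cat sat", "dogs bark"]
clean = deduplicate(corpus)              # drops the exact duplicate
vocab = build_vocab(clean)
ids = [tokenize(d, vocab) for d in clean]
```

Deduplication comes before tokenization here because duplicated documents would otherwise be over-represented in the training stream.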
What Makes Pre-training Hard
- huge compute requirements
- messy internet-scale data
- difficult decisions about data mixture and quality
- strong interactions between model size, tokens, and compute budget
That is where Scaling Laws become useful: they describe how loss improves as model size, data, and compute grow together.
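As a rough illustration of how these budgets interact, the Chinchilla scaling-law work suggests a compute-optimal rule of thumb of roughly 20 training tokens per model parameter, with training compute approximately 6 × parameters × tokens FLOPs. These are order-of-magnitude heuristics, not exact prescriptions.

```python
def compute_optimal_tokens(n_params, tokens_per_param=20):
    """Chinchilla-style rule of thumb: ~20 tokens per parameter."""
    return n_params * tokens_per_param

def training_flops(n_params, n_tokens):
    """Common approximation: training compute is about 6 * N * D FLOPs."""
    return 6 * n_params * n_tokens

N = 7_000_000_000                 # a hypothetical 7B-parameter model
D = compute_optimal_tokens(N)     # ~140B tokens
C = training_flops(N, D)          # total training FLOPs, roughly
```

The point of the arithmetic is that parameters, tokens, and compute cannot be chosen independently: fixing any two pins down the third.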
Pre-training vs Fine-Tuning
| Stage | Goal |
|---|---|
| Pre-training | Learn broad patterns from massive data |
| Fine-tuning | Adapt the model to a narrower behavior or task |
| Preference tuning / alignment | Shape outputs to match human preferences and intended use |
What To Remember
- Pre-training is the broad learning stage, not the final product
- It usually uses self-supervised objectives
- Most modern AI systems start with a strong pre-trained base model