Why bigger models, more data, and more compute lead to predictable gains
Scaling laws describe a practical pattern in modern ML: when you increase model size, training data, and compute in a sensible way, performance often improves smoothly and predictably rather than randomly.
Read Pre-training and GPT first if you want the surrounding context. This page is about the empirical rule behind “bigger models keep getting better.”
A Simple Mental Model
Suppose you have a fixed training budget. You could spend it on:
- a larger model
- more data
- more training steps
Scaling-law papers ask: if I spend twice as much compute, how much better should I expect the model to get?
Their answer was surprising: the curve is often smooth enough that you can predict it before running the full experiment.
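The "predict before running the full experiment" idea can be sketched as fitting a straight line in log-log space to a handful of small runs and extrapolating. The numbers below are made up for illustration; only the method (a power-law fit) comes from the scaling-law literature.

```python
import numpy as np

# Hypothetical losses measured at four small compute budgets (arbitrary units).
compute = np.array([1.0, 2.0, 4.0, 8.0])
loss = np.array([4.0, 3.83, 3.66, 3.50])  # decays by a near-constant ratio per doubling

# A power law L = a * C^(-alpha) is a straight line in log-log space:
# log L = log a - alpha * log C. Fit that line with ordinary least squares.
slope, log_a = np.polyfit(np.log(compute), np.log(loss), 1)
alpha = -slope

# Extrapolate to a budget 8x larger than anything we actually trained.
predicted = np.exp(log_a) * 64.0 ** (-alpha)
```

If the small runs really do sit on a power law, the extrapolated point (here about 3.1) is a usable forecast before committing the full budget; if they do not, the straight-line fit in log-log space will visibly fail, which is itself useful information.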
The Core Discovery
The original paper found that cross-entropy loss followed power laws in model size $N$, dataset size $D$, and compute $C$:

$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}$$

with fitted exponents around $\alpha_N \approx 0.076$, $\alpha_D \approx 0.095$, and $\alpha_C \approx 0.050$.
The exact exponents are less important than the lesson: performance improves in a regular way as scale increases.
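To see what "regular" means concretely, here is a minimal sketch of the model-size term, using the approximate fitted values reported by Kaplan et al. (2020) ($\alpha_N \approx 0.076$, $N_c \approx 8.8 \times 10^{13}$). The point is not the constants but the structure: every 10x in parameters shrinks the loss by the same multiplicative factor.

```python
# Approximate fitted constants from Kaplan et al. (2020); illustrative only.
N_C = 8.8e13      # parameter count where this term would extrapolate to loss 1.0
ALPHA_N = 0.076   # model-size exponent

def loss_from_params(n_params):
    """Model-size term of the scaling law: L(N) = (N_c / N) ** alpha_N."""
    return (N_C / n_params) ** ALPHA_N

# The ratio of losses across a 10x jump is constant, regardless of where you start.
ratio_100m_to_1b = loss_from_params(1e9) / loss_from_params(1e8)
ratio_1b_to_10b = loss_from_params(1e10) / loss_from_params(1e9)
```

Both ratios come out identical (about 0.84), which is exactly the "smooth, predictable" behavior the paper observed: scale up 10x, expect the same relative improvement each time.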
Why This Was a Big Deal
Before scaling laws, training larger models involved much more guesswork.
After scaling laws, teams could:
- estimate expected gains before full runs
- decide whether more compute was worth the cost
- plan better trade-offs between parameters and data
Interactive Visualization
Explore how loss decreases with scale:

[Interactive widget: Neural Scaling Laws]
Optimal Allocation
Given a compute budget $C$, the original paper suggested allocating roughly

$$N \propto C^{0.73}, \qquad D \propto C^{0.27}$$
That meant: under the original estimate, you should spend more of the extra budget on model size than on data.
A Helpful Interpretation
If a student asks, “Does this mean giant models always win?” the better answer is:
- larger models usually help
- but only when you also scale data and compute sensibly
- poor allocation can waste the budget
Scaling laws are about how to scale well, not just scaling blindly.
The Chinchilla Update
Later work, especially the Chinchilla-style analysis, argued that models trained under the original recommendations were significantly under-trained: for a fixed compute budget, they were too large and saw too few tokens.
That updated view suggests a more balanced rule:

$$N \propto C^{0.5}, \qquad D \propto C^{0.5}$$

which in practice works out to roughly 20 training tokens per parameter.
So a major lesson for students is that scaling laws are not one fixed formula forever. They are an evolving empirical picture.
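The Chinchilla-style rule can be sketched as a small sizing calculation. It combines two common approximations: training FLOPs $C \approx 6ND$, and about 20 tokens per parameter at the compute-optimal point. Both are rules of thumb, not exact laws, and the example budget below is chosen to match the scale of the original Chinchilla run.

```python
# Chinchilla-style rule of thumb: D ~ 20 * N tokens, with training FLOPs C ~ 6 * N * D.
def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Return (n_params, n_tokens) for a compute budget, under the 20-tokens-per-param rule.

    From C = 6 * N * (tokens_per_param * N), solve N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# A ~5.76e23 FLOP budget (roughly Chinchilla's) suggests ~70B params and ~1.4T tokens.
n_params, n_tokens = chinchilla_optimal(5.76e23)
```

The takeaway matches the section above: under this rule, extra compute is split evenly (in exponent terms) between model size and data, rather than going mostly to model size.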
Practical Implications
| Question | Scaling-law answer |
|---|---|
| Should we expect smooth gains from more scale? | Often yes |
| Can we predict those gains ahead of time? | Roughly yes |
| Is bigger always enough by itself? | No |
| Does data allocation matter? | A lot |
What To Remember
- Scaling laws made large-model training more predictable
- The important variables are model size, data, and compute
- Later work refined the original story, especially around data-optimal training
Key Papers
- Scaling Laws for Neural Language Models - Kaplan et al. (2020)
https://arxiv.org/abs/2001.08361
- Training Compute-Optimal Large Language Models - Hoffmann et al. (2022)
https://arxiv.org/abs/2203.15556