Empirical laws governing how language model performance scales with compute, data, and parameters
Scaling Laws for Neural Language Models (Kaplan et al., 2020) showed that language model performance follows remarkably smooth power laws across many orders of magnitude of scale. The paper provided the empirical foundation for training ever-larger models.
The Core Discovery
Cross-entropy loss falls as a power law in each of three quantities, provided the other two are not the bottleneck:

$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}$$

where $N$ = model parameters (non-embedding), $D$ = dataset size in tokens, and $C$ = training compute. The fitted exponents are small but stable: roughly $\alpha_N \approx 0.076$, $\alpha_D \approx 0.095$, and $\alpha_C \approx 0.050$.
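As a quick sense check, plugging a 1-billion-parameter model into the first law with the paper's approximate fit ($N_c \approx 8.8 \times 10^{13}$) predicts

$$L(10^9) \approx \left(\frac{8.8\times10^{13}}{10^{9}}\right)^{0.076} \approx 2.4 \ \text{nats per token},$$

assuming enough data and compute that neither is the bottleneck. The exact constants depend on tokenizer and setup, so treat the number as illustrative.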
The Unified Law
When a compute budget $C$ is allocated optimally between model size and training data, these collapse into a single relationship:

$$L(C) \propto C^{-\alpha_C}, \qquad \alpha_C \approx 0.050$$
This means: double your compute and you get a predictable improvement, as the quick calculation below shows.
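A worked example with the fitted exponent above: doubling an optimally allocated compute budget multiplies the loss by

$$\frac{L(2C)}{L(C)} = 2^{-\alpha_C} \approx 2^{-0.050} \approx 0.966,$$

i.e. each doubling of compute shaves roughly 3–4% off the cross-entropy loss, a small but consistent gain.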
Interactive Visualization
Explore how loss decreases with scale:
[Interactive chart: "Neural Scaling Laws"]
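If the embedded chart does not load, the same curves can be sketched with a few lines of Python. The critical scales and exponents below are the approximate fitted values reported by Kaplan et al. (2020); the axis ranges and styling are arbitrary choices for illustration, not a reproduction of the paper's figures.

```python
# Sketch: static plots of the three Kaplan et al. power laws for loss
# versus model size, dataset size, and optimally allocated compute.
# Critical scales and exponents are the paper's approximate fitted values;
# the x-axis ranges are chosen only for illustration.
import numpy as np
import matplotlib.pyplot as plt

laws = [
    # (title, critical scale, exponent, x range, x-axis label)
    ("L(N)", 8.8e13, 0.076, (1e5, 1e12), "Parameters N (non-embedding)"),
    ("L(D)", 5.4e13, 0.095, (1e6, 1e12), "Dataset size D (tokens)"),
    ("L(C)", 3.1e8,  0.050, (1e-6, 1e4), "Compute C (PF-days, optimal allocation)"),
]

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))
for ax, (title, x_c, alpha, (lo, hi), xlabel) in zip(axes, laws):
    x = np.logspace(np.log10(lo), np.log10(hi), 200)
    ax.loglog(x, (x_c / x) ** alpha)  # L(x) = (x_c / x) ** alpha
    ax.set(title=title, xlabel=xlabel, ylabel="Cross-entropy loss (nats)")
fig.tight_layout()
plt.show()
```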
Optimal Allocation
Given a compute budget $C$, how should it be split between model size and training data? The paper's fit to compute-efficient training runs gives approximately:

$$N_{\text{opt}} \propto C^{0.73}, \qquad D_{\text{opt}} \propto C^{0.27}$$
Key insight: Larger models are more sample-efficient. As compute grows, invest mostly in parameters.
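Here is a minimal sketch of what those exponents imply in practice. It only computes ratios (how much bigger the model and dataset should get when the budget grows), so no absolute constants from the paper are assumed; the function name and printout are my own illustration.

```python
# Sketch: Kaplan-style compute allocation expressed as ratios.
# Assumes the approximate fitted exponents N_opt ∝ C^0.73 and
# D_opt ∝ C^0.27; only relative scaling is computed.

A_PARAMS = 0.73  # exponent: N_opt ∝ C^0.73
A_DATA = 0.27    # exponent: D_opt ∝ C^0.27 (remaining growth goes to data)

def scale_allocation(compute_multiplier: float) -> tuple[float, float]:
    """Return (parameter multiplier, token multiplier) for a given
    multiplicative increase in the compute budget."""
    return compute_multiplier ** A_PARAMS, compute_multiplier ** A_DATA

if __name__ == "__main__":
    for k in (10, 100, 1000):
        n_mult, d_mult = scale_allocation(k)
        print(f"{k:>5}x compute -> ~{n_mult:.0f}x parameters, ~{d_mult:.1f}x tokens")
```

At 1000x more compute this gives roughly a 155x larger model trained on only about 6.5x more tokens, which is exactly the "invest mostly in parameters" prescription.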
What Doesn’t Matter (Much)
The paper found these have minimal impact on scaling:
- Model shape (depth vs width)
- Learning rate schedule details
- Batch size (within reason)
- Context length (once sufficient)
The power law exponents remain remarkably constant.
Practical Implications
| Compute Budget | Optimal Parameters | Training Tokens |
|---|---|---|
| 10^18 FLOPs | ~100M | ~2B |
| 10^21 FLOPs | ~1B | ~20B |
| 10^24 FLOPs | ~10B | ~200B |
Why This Matters
Before this paper, scaling was empirical guesswork. After:
- Predictability: Know performance before training
- Investment justification: More compute reliably buys a calculable drop in loss
- Research direction: Showed the path to GPT-3 and beyond
The Chinchilla Update
Note: Hoffmann et al. (2022) later showed that the original laws underestimated the optimal amount of training data. “Chinchilla scaling” suggests:

$$N_{\text{opt}} \propto C^{0.5}, \qquad D_{\text{opt}} \propto C^{0.5}$$

Equal investment in parameters and data may be better, working out to roughly 20 training tokens per parameter.
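For comparison, here is a minimal sketch of a Chinchilla-style allocation, using the common rules of thumb $C \approx 6ND$ training FLOPs and roughly 20 tokens per parameter. These are rounded approximations rather than Hoffmann et al.'s exact fitted coefficients, and the function is my own illustration.

```python
# Sketch: Chinchilla-style compute-optimal allocation using rounded
# rules of thumb (C ≈ 6*N*D training FLOPs, D ≈ 20*N tokens), not the
# exact fitted coefficients from Hoffmann et al. (2022).
import math

TOKENS_PER_PARAM = 20      # approximate compute-optimal tokens per parameter
FLOPS_PER_PARAM_TOKEN = 6  # C ≈ 6 * N * D rule of thumb

def chinchilla_allocation(compute_flops: float) -> tuple[float, float]:
    """Return (parameters, tokens) with D = 20*N under C = 6*N*D."""
    # C = 6 * N * (20 * N) = 120 * N^2  =>  N = sqrt(C / 120)
    n = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return n, TOKENS_PER_PARAM * n

if __name__ == "__main__":
    for c in (1e18, 1e21, 1e24):
        n, d = chinchilla_allocation(c)
        print(f"C={c:.0e} FLOPs -> N≈{n:.1e} params, D≈{d:.1e} tokens")
```

Note that both $N$ and $D$ now grow as $C^{0.5}$, i.e. in equal proportion, in contrast to the roughly 0.73/0.27 split above.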
Key Paper
- Scaling Laws for Neural Language Models — Kaplan et al. (2020)
https://arxiv.org/abs/2001.08361