Scaling Laws for Neural Language Models

Empirical laws governing how language model performance scales with compute, data, and parameters

Scaling Laws for Neural Language Models (Kaplan et al., 2020) showed that language model performance follows remarkably smooth power laws across many orders of magnitude. The paper provided the empirical foundation for training ever-larger models.

The Core Discovery

Cross-entropy loss L follows power laws in:

L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad \alpha_N \approx 0.076

L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad \alpha_D \approx 0.095

L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}, \quad \alpha_C \approx 0.050

where N = parameters, D = dataset size, and C = compute.
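
As a quick sanity check, here is a minimal Python sketch of these fits. The constants (N_c ≈ 8.8e13 non-embedding parameters, D_c ≈ 5.4e13 tokens, C_c ≈ 3.1e8 PF-days) are the approximate values reported in the paper and should be treated as illustrative rather than exact.

```python
# Approximate power-law fits from Kaplan et al. (2020); constants are rough.
ALPHA_N, N_C = 0.076, 8.8e13   # N_c in non-embedding parameters
ALPHA_D, D_C = 0.095, 5.4e13   # D_c in tokens
ALPHA_C, C_C = 0.050, 3.1e8    # C_c in PF-days

def loss_vs_params(n: float) -> float:
    """Loss when model size N is the only bottleneck."""
    return (N_C / n) ** ALPHA_N

def loss_vs_data(d: float) -> float:
    """Loss when dataset size D is the only bottleneck."""
    return (D_C / d) ** ALPHA_D

def loss_vs_compute(c_pf_days: float) -> float:
    """Loss for an optimally allocated compute budget C (in PF-days)."""
    return (C_C / c_pf_days) ** ALPHA_C

print(f"{loss_vs_params(1e9):.2f}")  # ~2.38 nats for a 1B-parameter model
```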

The Unified Law

When optimally allocating a compute budget C:

L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}

This means: double your compute, get predictable improvement.
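
To make "predictable improvement" concrete, take the ratio of losses at budgets 2C and C using the exponent α_C ≈ 0.050 from the fit above:

\frac{L(2C)}{L(C)} = \left(\frac{C_c}{2C}\right)^{\alpha_C} \Big/ \left(\frac{C_c}{C}\right)^{\alpha_C} = 2^{-\alpha_C} \approx 0.966

so each doubling of compute shaves roughly 3–4% off the loss, at every point along the curve.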

Interactive Visualization

[Figure: three log-log panels showing Loss vs Parameters (≈ N^-0.076), Loss vs Compute (≈ C^-0.050), and Loss vs Data (≈ D^-0.095).]

Key Insight: Performance follows smooth power laws across many orders of magnitude. Given a compute budget, you can predict optimal model size and training data.

Optimal Allocation

Given a compute budget C, how should it be split between model size and training data?

N^* \propto C^{0.73}, \quad D^* \propto C^{0.27}

Key insight: Larger models are more sample-efficient. As compute grows, invest mostly in parameters.
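
The sketch below illustrates what these exponents imply in practice. It assumes you already have a reference run (c_ref FLOPs, n_ref parameters, d_ref tokens) that is roughly compute-optimal and scales it to a new budget; the function name and reference values are hypothetical, not from the paper.

```python
def kaplan_split(c_new: float, c_ref: float, n_ref: float, d_ref: float):
    """Scale a reference compute-optimal run (c_ref FLOPs, n_ref parameters,
    d_ref tokens) to a new budget c_new using N* ∝ C^0.73, D* ∝ C^0.27."""
    r = c_new / c_ref
    return n_ref * r ** 0.73, d_ref * r ** 0.27

# With 1000x more compute, parameters should grow ~155x but data only ~6.5x.
n_factor, d_factor = kaplan_split(1e24, c_ref=1e21, n_ref=1.0, d_ref=1.0)
print(f"params x{n_factor:.0f}, tokens x{d_factor:.1f}")  # params x155, tokens x6.5
```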

What Doesn’t Matter (Much)

The paper found these have minimal impact on scaling:

  • Model shape (depth vs width)
  • Learning rate schedule details
  • Batch size (within reason)
  • Context length (once sufficient)

The power-law exponents remain remarkably constant across these variations.

Practical Implications

Compute Budget | Optimal Parameters | Tokens
10^18 FLOPs    | ~100M              | ~2B
10^21 FLOPs    | ~1B                | ~20B
10^24 FLOPs    | ~10B               | ~200B

Why This Matters

Before this paper, scaling was empirical guesswork. After:

  1. Predictability: Know performance before training
  2. Investment justification: More compute → predictable gains
  3. Research direction: Showed the path to GPT-3 and beyond

The Chinchilla Update

Note: Hoffmann et al. (2022) later showed the original laws underestimated optimal data. “Chinchilla scaling” suggests:

N^* \propto C^{0.5}, \quad D^* \propto C^{0.5}

Equal investment in parameters and data may be better.
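
For comparison, here is a minimal sketch of a Chinchilla-style allocation. It assumes the common rule of thumb that training compute is roughly C ≈ 6·N·D FLOPs and a target of about 20 tokens per parameter; both figures come from the Hoffmann et al. analysis and its standard approximations, not from this article.

```python
import math

def chinchilla_split(compute_flops: float, tokens_per_param: float = 20.0):
    """Compute-optimal split under N* ∝ C^0.5, D* ∝ C^0.5.

    Assumes C ≈ 6·N·D training FLOPs and a fixed tokens-per-parameter
    ratio (~20 per Hoffmann et al., 2022); both are rules of thumb.
    """
    n_opt = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    return n_opt, tokens_per_param * n_opt

n, d = chinchilla_split(1e21)
print(f"N* ~ {n:.1e} params, D* ~ {d:.1e} tokens")  # ~2.9e9 params, ~5.8e10 tokens
```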

Key Paper

Kaplan, J., McCandlish, S., et al. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361.
