Scaling Laws for Neural Language Models

Why bigger models, more data, and more compute lead to predictable gains

Scaling laws describe a practical pattern in modern ML: when you increase model size, training data, and compute in a sensible way, performance often improves smoothly and predictably rather than randomly.

Read Pre-training and GPT first if you want the surrounding context. This page is about the empirical rule behind “bigger models keep getting better.”

A Simple Mental Model

Suppose you have a fixed training budget. You could spend it on:

  • a larger model
  • more data
  • more training steps

Scaling-law papers ask: if I spend twice as much compute, how much better should I expect the model to get?

Their answer was surprising: the curve is often smooth enough that you can predict it before running the full experiment.
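For intuition, here is a minimal sketch of that kind of extrapolation: fit a power law to losses from small-scale runs in log-log space, then predict the loss at a larger budget before spending it. All numbers below are made up for illustration.

```python
import numpy as np

# Hypothetical losses measured at four small compute budgets (FLOPs).
compute = np.array([1e17, 1e18, 1e19, 1e20])
loss = np.array([4.2, 3.7, 3.3, 2.9])

# A power law L = a * C^(-alpha) is a straight line in log-log space,
# so an ordinary degree-1 least-squares fit recovers the exponent.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
alpha = -slope

# Extrapolate to a 10x larger budget without running it.
predicted = np.exp(intercept) * (1e21) ** slope
print(f"fitted exponent alpha ~ {alpha:.3f}")
print(f"predicted loss at 1e21 FLOPs ~ {predicted:.2f}")
```

The point is not the specific numbers but the workflow: smooth curves make small pilot runs informative about much larger ones.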

The Core Discovery

The original paper found that cross-entropy loss followed power laws in model size N, dataset size D, and compute C:

L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad \alpha_N \approx 0.076

L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad \alpha_D \approx 0.095

L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}, \quad \alpha_C \approx 0.050

The exact exponents are less important than the lesson: performance improves in a regular way as scale increases.
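One consequence of these fits is easy to check numerically: under L(N) = (N_c/N)^(alpha_N) with alpha_N ≈ 0.076, doubling the parameter count multiplies the loss by 2^(-0.076) ≈ 0.95, a roughly 5% reduction per doubling, independent of the constant. A small sketch (the N_c value here is a placeholder, not a fitted constant from the paper):

```python
# Sketch of the L(N) power law using the exponent quoted in the text.
ALPHA_N = 0.076

def loss_from_params(n_params: float, n_c: float = 1e14) -> float:
    """Predicted loss for n_params parameters: L = (N_c / N)^alpha_N.
    n_c is a placeholder constant, chosen only for illustration."""
    return (n_c / n_params) ** ALPHA_N

# Doubling model size multiplies loss by 2^(-alpha_N), whatever n_c is.
ratio = loss_from_params(2e9) / loss_from_params(1e9)
print(f"loss ratio from doubling N: {ratio:.3f}")
```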

Why This Was a Big Deal

Before scaling laws, training larger models involved much more guesswork.

After scaling laws, teams could:

  • estimate expected gains before full runs
  • decide whether more compute was worth the cost
  • plan better trade-offs between parameters and data

Interactive Visualization

Explore how loss decreases with scale:

[Interactive chart: three log-log panels, Loss vs Parameters (slope N^-0.076), Loss vs Compute (slope C^-0.050), and Loss vs Data (slope D^-0.095).]

Key insight: performance follows smooth power laws across many orders of magnitude. Given a compute budget, you can predict optimal model size and training data.

Optimal Allocation

Given a compute budget C, the original paper suggested:

N^* \propto C^{0.73}, \quad D^* \propto C^{0.27}

That meant: under the original estimate, you should spend more of the extra budget on model size than on data.
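In concrete terms, these exponents say how each doubling of compute should be split. A quick arithmetic sketch, assuming (as is roughly true for transformers) that training compute scales like the product of parameters and training tokens:

```python
# Under the original (Kaplan-style) fit, N* ~ C^0.73 and D* ~ C^0.27.
# Doubling compute then scales each factor by:
n_growth = 2 ** 0.73   # ~1.66x more parameters
d_growth = 2 ** 0.27   # ~1.21x more data
print(f"per compute doubling: N x{n_growth:.2f}, D x{d_growth:.2f}")

# Sanity check: if compute ~ N * D, the two growth factors
# should multiply back to ~2.
print(f"product: {n_growth * d_growth:.2f}")
```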

A Helpful Interpretation

If a student asks, “Does this mean giant models always win?” the better answer is:

  • larger models usually help
  • but only when you also scale data and compute sensibly
  • poor allocation can waste the budget

Scaling laws are about how to scale well, not just scaling blindly.

The Chinchilla Update

Later work, especially Chinchilla-style analysis, argued that the original recipe under-trained large models: for a given compute budget, they were fed too little data relative to their parameter count.

That updated view suggests a more balanced rule:

N^* \propto C^{0.5}, \quad D^* \propto C^{0.5}

So a major lesson for students is that scaling laws are not one fixed formula forever. They are an evolving empirical picture.

Practical Implications

  • Should we expect smooth gains from more scale? Often yes.
  • Can we predict those gains ahead of time? Roughly yes.
  • Is bigger always enough by itself? No.
  • Does data allocation matter? A lot.

What To Remember

  • Scaling laws made large-model training more predictable
  • The important variables are model size, data, and compute
  • Later work refined the original story, especially around data-optimal training

Key Paper

Scaling Laws for Neural Language Models (Kaplan et al., 2020), the paper that introduced the power-law fits discussed above.
