AI Scaling Laws Explained: Why Bigger Models Get Smarter

Neural scaling laws describe how a model’s error falls predictably as you increase parameters, training data, and compute.
The 2020 Kaplan et al. paper showed test loss follows a smooth power law across seven orders of magnitude of compute.
The 2022 Chinchilla paper corrected the recipe: for compute-optimal training, model size and training tokens should scale roughly equally — most big models were badly undertrained.
Chinchilla’s rule of thumb: about 20 training tokens per parameter; doubling the model means doubling the data.
The open debate is whether high-quality training data is running out (the “data wall”) and whether returns are diminishing relative to soaring costs.

What scaling laws are

A neural scaling law is an empirical formula showing that a model’s prediction error decreases smoothly and predictably as you grow three inputs: the number of parameters, the amount of training data, and the compute spent. The relationship is a power law, meaning each doubling of a resource cuts loss by the same fraction rather than by the same absolute amount.

What makes this powerful is predictability. Because the curve is smooth across many orders of magnitude, researchers can train a few small models, fit the curve, and forecast how a far larger model will perform before spending millions of dollars to build it. Scaling laws turned model development from guesswork into something closer to engineering, and they are a major reason the foundation-model era took the shape it did.
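The fit-then-forecast workflow can be sketched in a few lines. This is a minimal illustration, not any lab's actual procedure: the data points are synthetic, generated from an assumed power law with exponent 0.05, and the fit ignores the irreducible-loss term that real scaling fits usually include.

```python
import numpy as np

# Synthetic "small model" results, generated from an assumed law
# L = 12 * C**(-0.05). Real runs would supply measured losses instead.
compute = np.logspace(17, 20, 4)       # training FLOPs: 1e17 ... 1e20
loss = 12.0 * compute ** -0.05         # illustrative test losses

# A power law L = a * C**(-alpha) is a straight line in log-log space:
# log L = log a - alpha * log C, so a degree-1 fit recovers both constants.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
alpha = -slope
a = np.exp(intercept)


def predict_loss(c: float) -> float:
    """Extrapolate the fitted power law to a compute budget c (FLOPs)."""
    return a * c ** (-alpha)


forecast = predict_loss(1e23)          # 1000x beyond the largest run
print(f"fitted exponent alpha = {alpha:.3f}")
print(f"forecast loss at 1e23 FLOPs = {forecast:.3f}")
```

Fitting in log space is a deliberate choice: it turns a nonlinear curve fit into ordinary linear regression, which is also why the trend looks like a straight line on log-log axes.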

Why it’s a power law, not a straight line

On ordinary axes the improvement curve bends and flattens; plotted on log-log axes it becomes a near-straight line. That is the signature of a power law: loss falls in proportion to a resource raised to a negative exponent. The practical consequence is brutal. Early gains are cheap, but every further cut of the same fraction multiplies the required resources by the same factor, so the absolute cost of each step grows geometrically.
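To make the cost of a power law concrete, the sketch below solves a pure power law L = a * X^(-alpha) for the resource multiplier needed to achieve a given fractional cut in loss. The function name and the exponent value are illustrative assumptions, not figures from any paper.

```python
def resource_multiplier(loss_reduction: float, alpha: float) -> float:
    """Factor by which a resource must grow to cut loss by `loss_reduction`.

    Assumes a pure power law L = a * X**(-alpha). Setting
    L_new / L_old = 1 - loss_reduction and solving for X_new / X_old
    gives (1 - loss_reduction) ** (-1 / alpha).
    """
    if not 0 < loss_reduction < 1:
        raise ValueError("loss_reduction must be between 0 and 1")
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return (1.0 - loss_reduction) ** (-1.0 / alpha)


alpha = 0.05  # illustrative exponent, not a measured value
once = resource_multiplier(0.10, alpha)
print(f"10% lower loss needs about {once:.1f}x the resource")
print(f"a second 10% cut brings the total to about {once ** 2:.1f}x")
```

The smaller the exponent, the steeper the bill: with a shallow exponent, even a modest percentage improvement demands a large multiple of the original budget.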

Kaplan 2020: the original scaling laws

The foundational result came from OpenAI’s January 2020 paper “Scaling Laws for Neural Language Models” by Jared Kaplan, Sam McCandlish, and colleagues. Training over 200 transformer models, they found that test loss scales as a power law with model size, dataset size, and compute, with trends holding across more than seven orders of magnitude — a remarkably clean empirical regularity.

A key Kaplan finding shaped the next two years of the industry: given a fixed compute budget, performance improved most by making models bigger rather than training them on much more data. This conclusion encouraged a race toward ever-larger parameter counts — GPT-3’s 175 billion parameters in 2020, and the half-trillion-plus models that followed — while training-token counts grew comparatively slowly.

The consequence: a parameter race

Following Kaplan’s recipe, labs poured budget into parameters. Models like GPT-3, Gopher (280B), and Megatron-Turing NLG (530B) pushed size aggressively while being trained on, by later standards, surprisingly little data. The assumption that “bigger is the priority” went largely unchallenged until a single paper from DeepMind upended it.

Chinchilla 2022: the compute-optimal correction

In March 2022, DeepMind’s “Training Compute-Optimal Large Language Models” — universally called the Chinchilla paper — showed that the prevailing recipe was wrong. By training over 400 models from 70 million to 16 billion parameters on 5 billion to 500 billion tokens, Hoffmann et al. found that model size and training data should be scaled in equal proportion. As the paper states, “for every doubling of model size the number of training tokens should also be doubled.”

The implication was startling: most large models of the era, including the biggest, were significantly undertrained. To prove it, DeepMind trained Chinchilla — a 70-billion-parameter model — on 1.4 trillion tokens, four times more data than the 280-billion-parameter Gopher. Despite being four times smaller, Chinchilla outperformed Gopher across a wide range of benchmarks while being cheaper to run at inference time.
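A rough back-of-envelope check shows why this comparison is fair. The sketch below uses the common approximation that training compute is about 6 × parameters × tokens. That approximation, and Gopher's 300 billion training tokens (the figure reported in the Gopher paper), are assumptions brought in for illustration and are not stated in this article.

```python
def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute with the common C ~ 6 * N * D rule."""
    return 6.0 * params * tokens


# Parameter counts and Chinchilla's token count come from the text above.
# Gopher's 300e9 tokens is an outside figure (assumption for this sketch).
gopher = training_flops(params=280e9, tokens=300e9)
chinchilla = training_flops(params=70e9, tokens=1.4e12)
ratio = chinchilla / gopher

print(f"Gopher:     {gopher:.2e} FLOPs")
print(f"Chinchilla: {chinchilla:.2e} FLOPs")
print(f"Chinchilla / Gopher compute ratio: {ratio:.2f}")
```

Under these assumptions the two training runs land in the same compute ballpark, so the difference in results comes from how the budget was split between parameters and tokens rather than from a larger budget.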

The 20-tokens-per-parameter rule

Chinchilla’s result distills into a memorable heuristic: to be compute-optimal, train on roughly 20 tokens for every parameter. A 7-billion-parameter model “wants” around 140 billion tokens of training data. This reframed the whole field: data, not just parameters, became a first-class lever, and it explains why later open models such as Llama were trained on far more tokens than their size alone would suggest. Crucially, training a smaller-but-well-fed model also makes inference cheaper, which connects directly to the economics behind small language models.
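The heuristic is easy to turn into arithmetic. The sketch below combines the 20-tokens-per-parameter rule with the common approximation that training compute is about 6 × parameters × tokens. The 6ND approximation and the function names are assumptions for illustration, and the rule itself is a rough guide rather than an exact constant.

```python
import math

TOKENS_PER_PARAM = 20.0        # Chinchilla rule of thumb from the text
FLOPS_PER_PARAM_TOKEN = 6.0    # assumed C ~ 6 * N * D approximation


def optimal_tokens(params: float) -> float:
    """Training tokens suggested by the 20-tokens-per-parameter heuristic."""
    return TOKENS_PER_PARAM * params


def optimal_allocation(compute_flops: float) -> tuple[float, float]:
    """Split a compute budget into (params, tokens) under both assumptions.

    With C = 6 * N * D and D = 20 * N, C = 120 * N**2, so N = sqrt(C / 120).
    """
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return params, optimal_tokens(params)


print(f"7e9 params -> {optimal_tokens(7e9):.3g} tokens")
n, d = optimal_allocation(5.88e23)
print(f"5.88e23 FLOPs -> {n:.3g} params, {d:.3g} tokens")
```

Because compute grows with the product of parameters and tokens, quadrupling the budget under this rule doubles both the model and the dataset, which is the "scale both equally" result in numeric form.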

The data bottleneck and the “data wall”

Chinchilla’s insight, that data matters as much as size, created a new worry: high-quality text may be a finite resource. A widely cited Epoch AI analysis, first published in 2022 and revised in 2024, estimated that the stock of high-quality public text could be largely exhausted by models sometime between 2026 and 2032, depending on how fast training appetites grow. This looming constraint is often called the “data wall.”

Researchers are pursuing several escapes. One is synthetic data — using strong models to generate fresh training material, with mixed and debated results on quality. Another is multimodal data: images, audio, and video vastly expand the available signal beyond text. A third is squeezing more learning from existing data through better curation and repeated passes. Whether these fully substitute for fresh, human-written, high-quality text remains an open empirical question.

Do scaling laws still hold — and is it worth it?

The empirical scaling curves themselves have proven robust, but the economic and strategic case for naive scaling is increasingly debated. Loss keeps falling as predicted, yet each increment now costs dramatically more, and falling loss does not always translate cleanly into the capabilities users care about.

This has pushed the frontier in new directions rather than toward pure size. Sparse architectures like mixture of experts grow total parameters while keeping per-token compute modest, changing the cost equation. Meanwhile, “test-time compute”, meaning more computation spent during inference so a model can reason longer, has opened a second scaling axis distinct from training-time scaling. Some researchers argue the cheapest pretraining gains are now behind us; others contend the curves still have a long way to run. The data is consistent with the laws holding; the disagreement is about cost, ceilings, and which axis to scale next.
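The mixture-of-experts trade-off can be illustrated with simple parameter accounting. The configuration below is entirely hypothetical: the class, field names, and sizes are invented for illustration, and the sketch ignores router overhead and other implementation details.

```python
from dataclasses import dataclass


@dataclass
class MoEConfig:
    """Hypothetical mixture-of-experts sizing; all numbers are illustrative."""
    shared_params: float       # always-on weights (attention, embeddings, ...)
    params_per_expert: float   # weights in one expert, summed over layers
    num_experts: int           # experts available to the router
    experts_per_token: int     # experts activated for each token

    def total_params(self) -> float:
        """Weights that must be stored."""
        return self.shared_params + self.num_experts * self.params_per_expert

    def active_params(self) -> float:
        """Weights actually used to process one token."""
        return self.shared_params + self.experts_per_token * self.params_per_expert


cfg = MoEConfig(shared_params=2e9, params_per_expert=5e9,
                num_experts=8, experts_per_token=2)
print(f"total parameters:  {cfg.total_params():.3g}")
print(f"active per token:  {cfg.active_params():.3g}")
print(f"active fraction:   {cfg.active_params() / cfg.total_params():.2f}")
```

In this toy configuration the model stores far more parameters than it uses for any single token, which is the sense in which sparse models decouple capacity from per-token compute.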

Frequently asked questions

What is the difference between Kaplan and Chinchilla scaling laws?
Both describe power-law improvements in loss, but they disagree on how to allocate a compute budget. The 2020 Kaplan paper concluded that, for a fixed budget, you should prioritise making the model bigger over adding much more data. The 2022 Chinchilla paper showed this was suboptimal: model size and training tokens should grow in equal proportion, roughly 20 tokens per parameter. Chinchilla revealed that most large models of that era were badly undertrained.

What does “compute-optimal” mean?
Compute-optimal training means allocating a fixed amount of compute so as to achieve the lowest possible loss, balancing model size against the number of training tokens. The Chinchilla paper found the best balance is to scale both equally. A smaller model trained on more data can beat a larger model trained on less, for the same total compute, and it also costs less to run at inference time, which matters enormously at deployment scale.

Are we running out of training data?
Possibly, for high-quality public text. An Epoch AI analysis, first published in 2022 and revised in 2024, estimated the usable stock of such text could be largely consumed by training runs sometime between roughly 2026 and 2032, a constraint nicknamed the “data wall.” Researchers are responding with synthetic data, multimodal sources such as images and audio, and better curation of existing data. Whether these fully replace fresh human-written text at scale is still an open and actively studied question.

Do scaling laws mean bigger is always better?
Not straightforwardly. The curves predict that loss keeps falling with more parameters, data, and compute, but each gain costs disproportionately more, and lower loss does not always map onto the capabilities users want. The field has shifted toward sparse architectures like mixture of experts and toward spending compute at inference time for reasoning, rather than relying on raw model size alone. Bigger still helps, but it is no longer the only — or always the smartest — lever.