In 2020, OpenAI researchers noticed something strange.
They plotted how wrong their language models were (loss) against how much compute they used. Then against how much data. Then against how many parameters. Each time: a straight line on a log-log graph.
This wasn’t obvious. Machine learning could have had diminishing returns—twice the compute for 10% improvement. Or chaotic returns—sometimes better, sometimes worse. Instead: power laws. Double the compute, get predictable improvement. Every time.
This discovery changed AI from alchemy into engineering.
## What Scaling Laws Actually Say
The core finding: loss scales as a power law with compute, data, and parameters.
Loss is how wrong the model is—technically, how surprised it is by the next word. Lower loss means better predictions.
The relationship (simplified):
L ∝ C^(-α)
Where L is loss, C is compute, and α is around 0.05-0.1. Similar relationships hold for data and parameters.
What this means in practice:
| Compute | Relative loss (α ≈ 0.1) |
|---|---|
| 1x | 1.00 |
| 10x | ~0.80 |
| 100x | ~0.65 |
| 1000x | ~0.50 |
The returns diminish in absolute terms but remain predictable. Spend 10x more, get a known improvement. Spend 100x more, get a known (larger) improvement.
This predictability is the breakthrough. Not the specific numbers—the fact that there are specific numbers.
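To make the arithmetic concrete, here is a minimal Python sketch that roughly reproduces the table from the power law. It assumes α = 0.1, the upper end of the range quoted above; the exact exponent depends on the model family and setup.

```python
# Relative loss under the power law L ∝ C^(-alpha).
# alpha = 0.1 is an assumption (upper end of the 0.05-0.1 range above);
# real values depend on the model family and training setup.

def relative_loss(compute_multiplier: float, alpha: float = 0.1) -> float:
    """Loss relative to the 1x-compute baseline."""
    return compute_multiplier ** (-alpha)

for c in (1, 10, 100, 1000):
    print(f"{c:>4}x compute -> relative loss {relative_loss(c):.2f}")
# 1x -> 1.00, 10x -> 0.79, 100x -> 0.63, 1000x -> 0.50
```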
## The Three Curves
Scaling laws describe three separate relationships:
Compute scaling: More FLOPs (floating point operations) → lower loss. This is the total math done during training.
Data scaling: More tokens (words/subwords) → lower loss. This is the size of the training set.
Parameter scaling: More weights in the model → lower loss. This is the model size.
```
Loss
 ↑
 |╲
 | ╲
 |  ╲
 |   ╲___________
 +――――――――――――――→ log(Compute / Data / Parameters)
```
All three are log-linear. All three are predictable. But they interact—you can’t just scale one and ignore the others.
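Because each curve is a straight line in log-log space, fitting and extrapolating it is ordinary linear regression. Here is a sketch using made-up (compute, loss) points, purely illustrative, that roughly follow an exponent of 0.07:

```python
import numpy as np

# Hypothetical (compute, loss) measurements from small training runs.
# These numbers are illustrative only, roughly following L ∝ C^(-0.07).
compute = np.array([1e18, 1e19, 1e20, 1e21])   # training FLOPs
loss    = np.array([2.98, 2.57, 2.16, 1.85])

# A power law L = a * C^(-alpha) is a straight line in log-log space:
#   log L = log a - alpha * log C
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
alpha = -slope                                   # fitted exponent, ~0.07 here

# Extrapolate two orders of magnitude beyond the largest measured run.
predicted = np.exp(intercept) * (1e23) ** slope  # ~1.34 for these numbers
print(f"alpha ≈ {alpha:.3f}, predicted loss at 1e23 FLOPs ≈ {predicted:.2f}")
```

This is the basic move behind "prototype cheap and predict expensive": measure small runs, fit the line, read off where a much larger run should land.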
## Chinchilla: The Plot Twist
In 2022, DeepMind published the Chinchilla paper and upended the conventional wisdom.
The prevailing approach: make the model as big as possible, train it on whatever data you have. GPT-3 had 175 billion parameters. Surely bigger was better.
Chinchilla showed this was wrong. GPT-3 was undertrained—too many parameters for the amount of data it saw. It was like building a massive factory and only running it for a day.
The Chinchilla finding: for a fixed compute budget, you should scale parameters and data roughly equally. Double the parameters, double the data. The ratio matters.
| Model | Parameters | Training tokens | Tokens per parameter |
|---|---|---|---|
| GPT-3 | 175B | 300B | ~1.7 |
| Chinchilla | 70B | 1.4T | 20 |
Chinchilla, with fewer parameters but more training data, outperformed GPT-3.
This reframed the game. It’s not just about building bigger models. It’s about compute-optimal training—balancing model size and data to get the most out of your compute budget.
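As a rough sketch of what "compute-optimal" means in numbers: using the common approximation that training compute is about 6 × parameters × tokens (an assumption, not something from the table itself) and the ~20-tokens-per-parameter ratio above, you can back out a model size and dataset size from a compute budget.

```python
import math

def compute_optimal_split(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a training-compute budget into parameters (N) and tokens (D).

    Assumes C ≈ 6 * N * D (a standard rule of thumb for transformer
    training FLOPs) and D ≈ tokens_per_param * N (the Chinchilla-style
    ratio from the table above). Both are approximations.
    """
    # C = 6 * N * (ratio * N)  =>  N = sqrt(C / (6 * ratio))
    n_params = math.sqrt(compute_flops / (6 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# A roughly Chinchilla-sized budget (~6e23 FLOPs) recovers ~70B params, ~1.4T tokens.
n, d = compute_optimal_split(5.8e23)
print(f"params ≈ {n / 1e9:.0f}B, tokens ≈ {d / 1e12:.1f}T")
```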
## Why This Changed Everything
Before scaling laws, AI research looked like alchemy. Try an architecture. Try a training trick. See if it works. Intuition and luck mattered as much as science.
After scaling laws, AI research looks more like engineering. You have a budget. You can calculate the expected performance. You can plan.
This has massive implications:
Investment becomes rational. Venture capitalists hate uncertainty. Scaling laws provide something rare in tech: predictable returns. Spend $100M on compute, get a model of approximately known capability. This is why billions now flow into AI.
Planning becomes possible. OpenAI reportedly planned GPT-4’s capabilities before training it—using scaling laws to extrapolate from smaller runs. You can prototype cheap and predict expensive.
Competition becomes capital-intensive. If performance scales predictably with compute, whoever spends most wins (mostly). This favors deep pockets: Microsoft, Google, Meta, nation-states.
## The Implications
Scaling laws reshape who can compete in AI.
It’s a capital game. Training frontier models costs hundreds of millions of dollars. This isn’t a garage startup game anymore. You need datacenters, power contracts, custom hardware.
Data becomes the bottleneck. We’re running out of high-quality internet text. Common Crawl has been scraped to exhaustion. The next frontier is synthetic data, proprietary data, multimodal data. Whoever has data others don’t has an advantage.
Incumbents have the edge. Google has data (Search, YouTube, Gmail). Microsoft has compute (Azure) and an OpenAI partnership. Meta has data (Facebook, Instagram) and open-sources to commoditize competitors. Startups can compete at the application layer, but the foundation model layer is consolidating.
But scale isn’t everything. Architectural improvements (attention, mixture of experts) shift the curves. RLHF and fine-tuning add capabilities that raw pretraining doesn’t capture. Inference efficiency matters for deployment. Scale is necessary, not sufficient.
## The Limits
Scaling laws predict loss—how well the model compresses text. They don’t predict capabilities—what the model can actually do.
This is where things get strange. Loss decreases smoothly. But capabilities emerge suddenly. A model goes from “can’t do arithmetic” to “can do arithmetic” somewhere between 10B and 100B parameters. The scaling curves don’t tell you where.
There’s also the question of ceilings. Do scaling laws continue forever? Or do they hit diminishing returns at some point? We don’t know yet. Current models are still on the predictable part of the curve.
For more on emergent capabilities, phase transitions, and what scaling laws miss, see Emergent Phenomena: From Ants to Transformers.
## Closing
Scaling laws are the closest thing AI has to physics. Predictable relationships between inputs (compute, data, parameters) and outputs (loss).
They tell you that spending more will make models better. They tell you roughly how much better. They let you plan, budget, and extrapolate.
What they don’t tell you: what “better” will mean in practice. Lower loss is a proxy. Whether that proxy translates into usefulness, intelligence, or something stranger—that’s still being discovered.
For now, the formula is simple: more compute, better models. The question is what happens when “better” becomes “qualitatively different.”