Scaling Laws, Carefully

AI
Machine Learning
Infrastructure
Economics

Lilian Weng’s post is a technical overview of scaling laws in machine learning, with a focus on the now-famous result that training loss often follows simple power-law trends as model size, dataset size, and compute grow. The big claim is not that gains are infinite. It is that gains have been smooth and forecastable over huge ranges, which is why labs could justify enormous spending before the latest systems existed. That framing shaped most of the reaction. People who have worked on early scaling-law papers called the result genuinely shocking because deep learning looked too messy to be captured by a compact empirical equation. Several commenters treated that simplicity as the key fact of the last decade in AI.

The sharper discussion was about scope. Scaling laws hold for a fixed error metric and data distribution, not as a magic law across every model generation. That matters because newer smaller models can beat older larger ones through better data curation, better training recipes, and transfer. The thread also pushed back on a common misread of “diminishing returns.” Power laws do imply diminishing marginal gains, but the practical lesson from Kaplan and later Chinchilla-style results was that the curve stayed smooth instead of hitting a visible wall, and that optimal compute and data tradeoffs were better than some early readings implied.

A side debate focused on the entropy floor of language and whether next-token loss is nearing a hard cap that would limit capability gains. The more convincing reading was narrower: irreducible uncertainty in language is real, but being close to the entropy floor does not tell you how much capability headroom remains on rare, hard predictions that matter for reasoning and useful work.

If you build or fund AI products, treat scaling curves as a serious planning tool rather than hype. But do not confuse a reliable within-regime trend with a guarantee that the same recipe, metric, or data distribution will keep delivering the next generation of capability.

July 1, 2026
lilianweng.github.io

Key insights

Why early scaling laws felt absurd

What made the original result striking was not that more data helps. People already knew that. The surprise was that speech and language systems, with messy datasets and complicated optimization, still looked governed by a simple three-term empirical law. That reframes scaling laws as more than an engineering heuristic. They are evidence that parts of modern AI are far more regular and manufacturable than many researchers expected.

Do not dismiss clean empirical laws just because the system underneath looks chaotic. If you see stable scaling in your own models, treat it as a lever for planning and resource allocation, not just a curiosity.
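As a rough illustration of checking for stable scaling in your own runs, the sketch below fits a pure power law, loss = a * size^(-alpha), by least squares in log-log space and extrapolates one step further. Everything here is an assumption for illustration: the function name `fit_power_law`, the constants, and the synthetic noise-free data are hypothetical, and the model ignores any irreducible-loss floor, which real fits usually need to account for.

```python
import math

def fit_power_law(sizes, losses):
    """Least-squares fit of log(loss) = log(a) - alpha * log(size).

    Returns (a, alpha) for the model loss = a * size ** (-alpha).
    Assumes a pure power law with no irreducible-loss floor.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Hypothetical, noise-free data generated from loss = 10 * N ** -0.07.
TRUE_A, TRUE_ALPHA = 10.0, 0.07
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [TRUE_A * n ** (-TRUE_ALPHA) for n in sizes]

a, alpha = fit_power_law(sizes, losses)

# Extrapolate one order of magnitude beyond the largest measured size.
predicted_loss_at_10b = a * 1e10 ** (-alpha)
print(f"a={a:.3f} alpha={alpha:.4f} predicted={predicted_loss_at_10b:.3f}")
```

With real measurements the points will not sit exactly on a line, so the residuals of the fit are worth inspecting before trusting any extrapolation.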

Attribution:

gdiamos #1

Scaling laws are local to setup

The useful boundary is that these laws assume a particular loss metric and data distribution. Once you change the data, move to transfer, or improve curation, the old curve no longer tells the whole story. That is why a 31B model from a later generation can beat GPT-3 despite having far fewer parameters. Better data and training move you onto a better curve, not just farther along the old one.

When you compare model generations, do not read parameter count alone as progress. Track data quality, objective, and transfer regime, because those can dominate raw size.

Attribution:

gdiamos #1
nok22kon #1

Diminishing returns did not mean stagnation

The important fact was never that power laws have diminishing marginal returns. That is mathematically obvious. The important fact was that the decline stayed smooth over seven orders of magnitude instead of flattening into a wall. The Chinchilla result also changed the economics by showing a much better compute-data tradeoff than some early interpretations suggested, which helped justify continued scaling rather than ending it.

If you model frontier AI economics, use the latest compute-optimal results rather than quoting early scaling exponents in isolation. Small changes in the exponent radically change whether a roadmap looks dead or investable.
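To make the exponent sensitivity concrete, here is a minimal sketch that assumes reducible loss is proportional to compute^(-alpha), so halving it requires multiplying compute by 2^(1/alpha). The function name and the three exponents are hypothetical values chosen only to show the effect, not figures taken from any paper.

```python
def compute_multiplier_to_halve(alpha):
    """Factor by which compute must grow to halve the reducible loss,
    assuming reducible_loss is proportional to compute ** (-alpha)."""
    return 2.0 ** (1.0 / alpha)

# Hypothetical exponents, chosen only to show the sensitivity.
for alpha in (0.05, 0.10, 0.20):
    multiplier = compute_multiplier_to_halve(alpha)
    print(f"alpha={alpha:.2f}: about {multiplier:,.0f}x compute")
```

Under this assumption, moving the exponent from 0.05 to 0.10 takes the required compute increase from roughly a million-fold (2^20) to roughly a thousand-fold (2^10), which is the difference between a roadmap that looks dead and one that looks fundable.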

Attribution:

aspenmartin #1
an0malous #1

Language entropy is not a capability ceiling

Irreducible uncertainty in next-token prediction is real, so cross-entropy has a floor. But that floor does not map cleanly to useful capability. A model can sit close to the entropy limit and still have a lot of room to improve on rare, difficult tokens that carry most of the value in coding, reasoning, and long-horizon tasks. Treating the entropy constant as a near-term explanation for stalled progress overstates what the metric tells you.

Be careful using pretraining loss alone as a proxy for product ceiling. Inspect task mix and tail behavior, because the business value may live in a small slice of predictions that loss averages hide.
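A small arithmetic sketch shows how an average can hide the tail. The split below is entirely hypothetical: it assumes 99% of tokens are easy and already sit at a per-token loss of 1.5, while 1% are hard tokens at 6.0, and then halves the loss on the hard tokens only.

```python
def mean_loss(groups):
    """Average per-token loss over (fraction_of_tokens, loss_per_token) groups."""
    total_fraction = sum(fraction for fraction, _ in groups)
    return sum(fraction * loss for fraction, loss in groups) / total_fraction

# Hypothetical token mix: 99% easy tokens, 1% rare hard tokens.
before = mean_loss([(0.99, 1.5), (0.01, 6.0)])

# Same mix after halving the loss on the hard tokens only.
after = mean_loss([(0.99, 1.5), (0.01, 3.0)])

relative_drop = (before - after) / before
print(f"before={before:.3f} after={after:.3f} relative drop={relative_drop:.1%}")
```

In this toy setup the model becomes far better on the hard slice, yet the average loss moves by less than 2%, which is why per-slice evaluation matters more than the headline number.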

Attribution:

aspenmartin #1
FromTheFirstIn #1 #2

The pattern predated transformers

Proto scaling-law behavior showed up well before modern LLMs. A 2007 Jeff Dean paper on n-gram language models already found translation quality improving steadily with larger language models at the biggest scales they tested. That history weakens the idea that scaling laws are a transformer-era accident. The transformer boom amplified an older empirical pattern with better architectures and bigger budgets.

Look for recurring regularities across generations of methods. If a phenomenon survives architectural shifts, it is more likely to be a durable planning assumption.

Attribution:

ekelsen #1
beyonddream #1

Against the grain

Entropy may dominate sooner than believers expect

The skeptical case is that language has a hard irreducible uncertainty floor, and current LLM progress may already be running into it for broad next-token prediction. That would explain why coding keeps looking easier than open-ended language and why post-GPT-4 gains feel more like tooling and productization than a clean capability jump. It does not disprove scaling laws, but it does narrow what raw pretraining can buy.

If your roadmap assumes another step-change from just more pretraining, build a fallback plan around domain constraints, tools, and workflow integration. Raw model scaling may no longer be the only lever that matters.

Attribution:

FromTheFirstIn #1 #2

Behavioral imitation is not enough

Matching the distribution of intelligent text does not automatically settle the deeper question of whether the system is intelligent in the stronger sense people care about. That objection lands because many product claims quietly slide from “indistinguishable output” to “equivalent cognition.” For buyers and builders, those are not the same promise.

Write product requirements around reliability, reasoning depth, and task completion, not around vague claims of human-like output. Distributional similarity is useful, but it is not the same thing as dependable agency.

Attribution:

FromTheFirstIn #1

In plain English

Chinchilla

A DeepMind scaling-law result showing that many large models had been undertrained on data and that better compute-data balance improves efficiency.

GPT-3

Generative Pre-trained Transformer 3, a 2020 large language model from OpenAI with about 175 billion parameters.

LLM

Large Language Model, an AI model trained on large amounts of text and used for chatbots, coding tools, and agents.

transfer

The ability of a model trained in one setting or on one dataset to perform well on related tasks or data.

Reference links

Scaling law papers

Scaling Laws for Transfer
Cited to show that scaling behavior changes in structured ways when you move from a fixed training setup to transfer.

Historical language model scaling

Very Large Scale Language Modeling for Statistical Machine Translation
Offered as an early pre-transformer example of scaling curves improving translation quality with larger language models.
