Scaling Laws for Language Models

Scaling laws describe how model performance changes as we increase compute, parameter count, dataset size, and training tokens. They matter because large language models are expensive to train. Before spending millions of GPU-hours, we want a principled estimate of what a given training run is likely to achieve.

A scaling law usually relates training resources to loss. For language models, the main measured quantity is often cross-entropy loss on held-out text. Lower loss means the model assigns higher probability to the validation data.

The basic observation is simple: when model size, data size, and compute increase together, language modeling loss tends to decrease in a predictable way.

The Scaling Variables

Four quantities appear throughout: three resources we control and the loss we measure.

| Symbol | Meaning |
| --- | --- |
| $N$ | Number of model parameters |
| $D$ | Number of training tokens |
| $C$ | Training compute |
| $L$ | Validation loss |

A transformer training run has a rough compute cost proportional to

$$ C \propto N D. $$

This approximation ignores constants, architecture details, sequence length effects, optimizer overhead, activation recomputation, and hardware efficiency. Still, it captures the dominant tradeoff: a larger model trained on the same number of tokens costs more, and the same model trained for more tokens also costs more.
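As a rough worked example, the proportionality can be turned into a number by assuming a constant. A widely used rule of thumb for dense transformers is about 6 FLOPs per parameter per training token (roughly 2 for the forward pass and 4 for the backward pass). That constant is an assumption added here, not part of the relation above, and it ignores the effects just listed. The function name is illustrative.

```python
def estimate_training_flops(num_params: float, num_tokens: float,
                            flops_per_param_token: float = 6.0) -> float:
    """Rough training compute C ~ k * N * D for a dense transformer.

    k = 6 is a common rule of thumb (an assumption here), not an exact constant.
    """
    return flops_per_param_token * num_params * num_tokens


# Hypothetical run: 1B parameters trained on 20B tokens.
base = estimate_training_flops(1e9, 20e9)
# Doubling either the model size or the token count doubles the estimate.
double_n = estimate_training_flops(2e9, 20e9)
double_d = estimate_training_flops(1e9, 40e9)

print(f"base:      {base:.2e} FLOPs")
print(f"double N:  {double_n:.2e} FLOPs")
print(f"double D:  {double_d:.2e} FLOPs")
```

The estimate is only as good as the assumed constant, so it is best used for comparing candidate runs rather than for predicting absolute cost.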

A scaling law asks questions such as:

| Question | Meaning |
| --- | --- |
| If we double parameters, how much does loss improve? | Model-size scaling |
| If we double training tokens, how much does loss improve? | Data scaling |
| If we double compute, how should we split it between $N$ and $D$? | Compute-optimal scaling |
| At fixed compute, should we train a smaller model longer or a larger model shorter? | Allocation problem |

The last question is the most important in practice.

Power-Law Behavior

Empirically, language model loss often follows an approximate power law. A simple form is

$$ L(N) \approx L_\infty + aN^{-\alpha}, $$

where $L_\infty$ is an irreducible loss term, $a$ is a constant, and $\alpha$ controls how quickly loss improves as the model grows.

A similar relationship can be written for dataset size:

$$ L(D) \approx L_\infty + bD^{-\beta}. $$

These equations say that improvement continues with scale, but each additional unit of scale gives a smaller gain than the previous one. This is diminishing returns.

For example, increasing a model from 100 million to 1 billion parameters may reduce loss substantially. Increasing from 100 billion to 1 trillion parameters may still help, but the improvement per additional parameter is smaller.
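The diminishing returns can be made concrete with a small sketch. The constants below ($L_\infty = 1.7$, $a = 400$, $\alpha = 0.34$) are placeholder values chosen only for illustration, not fitted to any data here. The code evaluates $L(N)$ at successive tenfold increases in $N$ and prints the loss reduction from each step.

```python
def power_law_loss(n_params: float, l_inf: float = 1.7,
                   a: float = 400.0, alpha: float = 0.34) -> float:
    """L(N) = L_inf + a * N^(-alpha); the constants are illustrative placeholders."""
    return l_inf + a * n_params ** (-alpha)


sizes = [1e8, 1e9, 1e10, 1e11, 1e12]
losses = [power_law_loss(n) for n in sizes]
# Loss reduction gained by each tenfold increase in parameters.
gains = [prev - cur for prev, cur in zip(losses, losses[1:])]

for n, loss in zip(sizes, losses):
    print(f"N={n:.0e}  loss={loss:.3f}")
for lo, hi, gain in zip(sizes, sizes[1:], gains):
    print(f"{lo:.0e} -> {hi:.0e}: loss drops by {gain:.3f}")
```

Under this form, every tenfold increase in $N$ multiplies the reducible part of the loss, $L - L_\infty$, by the same factor $10^{-\alpha}$, so the absolute gain shrinks at each step.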

The practical lesson is that scale helps, but it must be allocated carefully.

Model-Limited and Data-Limited Regimes

A training run can be limited by model size or by data.

In a model-limited regime, the dataset is large enough, but the model is too small to absorb the available structure. Adding parameters helps.

In a data-limited regime, the model is large enough, but it has seen too little data. Training a larger model may waste compute because the model cannot generalize well from insufficient tokens. Adding data or training for more tokens helps.

A small model trained on massive data may underfit because it lacks capacity. A huge model trained on too little data may overfit or fail to use its capacity efficiently.

Compute-optimal training balances these two regimes.

Compute-Optimal Training

Suppose we have a fixed compute budget $C$. Since compute is roughly proportional to $ND$, we cannot increase both without limit. We must choose a model size $N$ and token budget $D$.

A compute-optimal training rule tells us how $N$ and $D$ should grow as $C$ grows.

Earlier large language model practice often favored very large models trained for relatively few tokens. Later empirical work showed that many large models were undertrained relative to their size. In compute-optimal training, a smaller model trained on more tokens can outperform a larger model trained on fewer tokens at the same compute budget.

The tradeoff can be summarized as:

| Choice | Effect |
| --- | --- |
| Larger $N$, smaller $D$ | More capacity, less practice |
| Smaller $N$, larger $D$ | Less capacity, more practice |
| Balanced $N$ and $D$ | Better compute efficiency |

This matters for deployment too. A smaller model trained longer may have lower inference cost than a larger undertrained model with similar quality.
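The allocation problem can be sketched numerically. The example below assumes an additive loss, $L(N, D) = L_\infty + aN^{-\alpha} + bD^{-\beta}$, which combines the two single-variable forms used earlier, and a compute model $C = 6ND$. Both the additive form and every constant are assumptions made for illustration. The code sweeps model sizes at a fixed budget and keeps the size with the lowest predicted loss.

```python
def combined_loss(n_params, n_tokens, l_inf=1.7,
                  a=400.0, alpha=0.34, b=400.0, beta=0.28):
    """Assumed additive form: L = L_inf + a*N^-alpha + b*D^-beta."""
    return l_inf + a * n_params ** (-alpha) + b * n_tokens ** (-beta)


def best_allocation(compute_flops, flops_per_param_token=6.0, num_points=401):
    """Sweep log-spaced model sizes at fixed compute; return the best (loss, N, D)."""
    best = None
    for i in range(num_points):
        n_params = 10 ** (7 + 5 * i / (num_points - 1))   # 1e7 .. 1e12
        # The token budget is whatever the remaining compute allows.
        n_tokens = compute_flops / (flops_per_param_token * n_params)
        candidate = (combined_loss(n_params, n_tokens), n_params, n_tokens)
        if best is None or candidate[0] < best[0]:
            best = candidate
    return best


for budget in (1e21, 1e23):
    loss, n_params, n_tokens = best_allocation(budget)
    print(f"C={budget:.0e}: N={n_params:.2e}, D={n_tokens:.2e}, loss={loss:.3f}")
```

With these placeholder constants the best model size sits in the interior of the sweep, and both $N$ and $D$ grow as the budget grows. The exact split depends entirely on the assumed exponents.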

Tokens per Parameter

A common heuristic is to compare the number of training tokens to the number of model parameters.

$$ \text{tokens per parameter} = \frac{D}{N}. $$

This ratio gives a rough sense of whether the model has been trained long enough for its size.

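If the ratio is too low, the model is likely undertrained: it has capacity that it never saw enough data to use. If the ratio is very high, the model has been trained past its compute-optimal point. Loss still improves, but each additional token buys less, so the extra training pays off mainly when a small, cheap-to-serve model is the goal. The best ratio depends on the architecture, data quality, tokenizer, objective, compute budget, and target use case.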
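A target ratio also determines the allocation at a given budget. The sketch below assumes $C = kND$ with $k = 6$ (a common rule of thumb, not stated above) and solves for $N$ and $D$ given a chosen ratio $r = D/N$, which gives $N = \sqrt{C / (k r)}$ and $D = rN$. The budget and the ratio of 20 are hypothetical inputs.

```python
import math


def tokens_per_parameter(num_tokens: float, num_params: float) -> float:
    return num_tokens / num_params


def allocate_for_ratio(compute_flops: float, ratio: float,
                       flops_per_param_token: float = 6.0):
    """Solve C = k * N * D with D = ratio * N for N and D."""
    num_params = math.sqrt(compute_flops / (flops_per_param_token * ratio))
    return num_params, ratio * num_params


# Hypothetical budget of 6e21 FLOPs and a target of 20 tokens per parameter.
n_params, n_tokens = allocate_for_ratio(6e21, ratio=20.0)
print(f"N={n_params:.2e}, D={n_tokens:.2e}, "
      f"D/N={tokens_per_parameter(n_tokens, n_params):.1f}")
```

Because $N$ scales with the square root of $C/r$, quadrupling the budget at a fixed ratio doubles both the model size and the token count.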

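For many modern dense transformer models, compute-optimal training uses substantially more tokens per parameter than early GPT-style scaling practice. A frequently cited rule of thumb from compute-optimal scaling studies is roughly 20 tokens per parameter, whereas some early large models were trained on only a few tokens per parameter. This is why recent training runs often pair small or medium-size models with very large token counts.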

Scaling Data Quality

Scaling laws are often written in terms of token count, but not all tokens have equal value.

A trillion low-quality tokens may train a worse model than a smaller, cleaner, more diverse corpus. Data quality affects loss, downstream performance, factuality, toxicity, code ability, multilingual ability, and reasoning behavior.

Important data properties include:

| Data property | Effect |
| --- | --- |
| Deduplication | Reduces memorization and benchmark leakage |
| Filtering | Removes low-quality or harmful documents |
| Diversity | Improves domain coverage |
| Code mixture | Improves programming and formal reasoning |
| Math mixture | Improves symbolic and quantitative reasoning |
| Multilingual balance | Improves non-English performance |
| Recency | Improves knowledge of recent facts |
| Document structure | Improves long-context behavior |

Scaling token count without controlling quality can produce misleading results. A model trained on more tokens may improve validation loss while becoming worse for specific tasks.

Scaling Architecture

Scaling laws are usually measured within an architecture family. A law estimated for one transformer design may not transfer exactly to another design.

Architectural choices affect scaling efficiency:

| Choice | Scaling effect |
| --- | --- |
| Depth | More sequential computation, stronger hierarchical processing |
| Width | More parallel capacity per layer |
| Attention heads | More subspace interactions |
| Context length | More long-range conditioning, higher attention cost |
| Feedforward size | More token-wise transformation capacity |
| Normalization placement | Affects stability |
| Activation function | Affects optimization and expressivity |
| Mixture-of-experts | Increases parameters without activating all of them per token |

A dense transformer activates all parameters for each token. A mixture-of-experts model activates only part of the network per token. This changes the relation between parameter count, compute, and performance. For this reason, total parameters and active parameters should be reported separately.
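The gap between total and active parameters can be sketched with simple arithmetic. The example below counts only the expert feedforward weights and assumes two weight matrices per expert with no biases. It ignores attention, embeddings, and the router, and the configuration values are hypothetical.

```python
def moe_ffn_params(d_model: int, d_ff: int, num_layers: int,
                   num_experts: int, experts_per_token: int):
    """Return (total, active) expert feedforward parameters.

    Assumes two weight matrices per expert (d_model x d_ff and d_ff x d_model),
    no biases, and ignores attention, embeddings, and the router.
    """
    per_expert = 2 * d_model * d_ff
    total = num_layers * num_experts * per_expert
    active = num_layers * experts_per_token * per_expert
    return total, active


# Hypothetical configuration: 8 experts per layer, 2 routed per token.
total, active = moe_ffn_params(d_model=4096, d_ff=14336, num_layers=32,
                               num_experts=8, experts_per_token=2)
print(f"total expert params:  {total / 1e9:.1f}B")
print(f"active expert params: {active / 1e9:.1f}B per token")
```

In this sketch, per-token compute tracks the active count while memory footprint tracks the total, which is why the two numbers answer different questions.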

Scaling Context Length

Context length is another axis of scale. A model with a longer context window can condition on more tokens at inference time.

However, standard attention has quadratic cost in sequence length:

$$ \text{attention cost} \propto T^2. $$

Doubling the context length roughly quadruples the cost of computing attention, which is especially expensive during training.
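To see when the quadratic term matters, it helps to compare it with the parts of a layer that scale linearly in $T$. The sketch below uses rough forward-pass FLOP counts for one layer: about $4T^2 d$ for the two attention matrix products, and 2 FLOPs per weight per token for the projections and a two-matrix feedforward block. These counts and the layer sizes are simplifying assumptions; softmax, normalization, and causal-masking savings are ignored.

```python
def attention_matmul_flops(seq_len: int, d_model: int) -> int:
    """Forward FLOPs for QK^T and the attention-weighted sum over V, one layer.

    Each product is about 2 * T^2 * d_model FLOPs, summed over heads.
    """
    return 4 * seq_len ** 2 * d_model


def linear_flops(seq_len: int, d_model: int, d_ff: int) -> int:
    """Forward FLOPs for the Q, K, V, output projections and a two-matrix FFN."""
    weights = 4 * d_model ** 2 + 2 * d_model * d_ff
    return 2 * seq_len * weights


d_model, d_ff = 4096, 16384   # hypothetical layer sizes
for seq_len in (2048, 4096, 8192, 16384, 32768):
    attn = attention_matmul_flops(seq_len, d_model)
    lin = linear_flops(seq_len, d_model, d_ff)
    print(f"T={seq_len:>6}: attention share of layer FLOPs = {attn / (attn + lin):.1%}")
```

At short sequence lengths the linear terms dominate, so the quadratic cost only becomes the main expense once $T$ is large relative to the model width.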

Long-context scaling introduces several questions:

| Question | Why it matters |
| --- | --- |
| Can the model use the full context? | Long windows are useless if attention ignores distant tokens |
| Was the model trained on long sequences? | Extrapolating beyond training length is unreliable |
| Does retrieval work better? | External retrieval may beat brute-force context expansion |
| How is position represented? | Positional encoding affects length generalization |
| What is the inference memory cost? | KV cache grows with context length |

A longer context window increases capacity for document-level tasks, codebases, multi-turn dialogue, and retrieval-augmented generation. It also increases serving cost.
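The serving cost of long contexts shows up most directly in KV cache memory. The sketch below assumes a standard cache that stores one key vector and one value vector per token, per layer, and per key-value head. The model shape and the 2-byte cache precision are hypothetical.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1,
                   bytes_per_value: int = 2) -> int:
    """KV cache size: one key and one value vector per token, layer, and KV head."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical model: 32 layers, 32 KV heads of dimension 128, 16-bit cache.
for seq_len in (4096, 32768, 131072):
    size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                          seq_len=seq_len)
    print(f"T={seq_len:>7}: {size / 2**30:.1f} GiB per sequence")
```

The cache grows linearly with context length and with batch size, so it often limits how many long sequences can be served concurrently.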

Scaling Inference

Training scaling and inference scaling differ.

Training cost is paid once. Inference cost is paid every time the model is used. A model that is cheap to train but expensive to serve may be unattractive in production.

Inference cost depends on:

| Factor | Effect |
| --- | --- |
| Parameter count | Larger models require more memory bandwidth and compute |
| Generated length | More output tokens increase cost |
| Context length | Longer prompts increase prefill cost |
| Batch size | Larger batches improve throughput but may increase latency |
| KV cache size | Long contexts require more memory |
| Quantization | Reduces memory and can improve throughput |
| Speculative decoding | Reduces latency by drafting tokens with a smaller model |

For many applications, the best model is not the largest model. It is the model that gives sufficient quality at acceptable latency and cost.
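The training-versus-serving tradeoff can be sketched with two rules of thumb: about $6N$ FLOPs per training token and about $2N$ FLOPs per generated or processed token at inference. Both constants are assumptions, as are the two model configurations and the premise that they reach similar quality.

```python
def lifetime_flops(num_params: float, train_tokens: float, served_tokens: float,
                   train_flops_per_param_token: float = 6.0,
                   infer_flops_per_param_token: float = 2.0) -> float:
    """Training compute plus cumulative inference compute (rough estimate)."""
    train = train_flops_per_param_token * num_params * train_tokens
    serve = infer_flops_per_param_token * num_params * served_tokens
    return train + serve


# Two hypothetical models, assumed (not shown) to reach similar quality.
small = dict(num_params=7e9, train_tokens=2e12)     # smaller, trained longer
large = dict(num_params=30e9, train_tokens=4e11)    # larger, trained less

for served in (0.0, 1e11, 1e12, 1e13):
    small_total = lifetime_flops(served_tokens=served, **small)
    large_total = lifetime_flops(served_tokens=served, **large)
    print(f"served={served:.0e}: small={small_total:.2e}, large={large_total:.2e}")
```

In this hypothetical, the smaller model costs more to train but becomes cheaper overall once enough tokens have been served.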

Emergent Abilities and Smooth Scaling

Some abilities appear to emerge suddenly as models become larger. Examples often include multi-step reasoning, in-context learning, instruction following, tool use, and code synthesis.

However, apparent emergence can sometimes result from the metric. If a task is graded with a strict threshold, smooth improvement in underlying probability may look like sudden improvement in accuracy. For example, a model may gradually assign more probability to correct answers, but accuracy only rises once the correct answer becomes the top choice.

This means we should be careful with claims of emergence. Some abilities may reflect genuinely new internal organization at scale. Others may reflect measurement artifacts, prompting changes, or benchmark saturation.
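The metric effect described above can be reproduced in a few lines. The sketch below grades a single multiple-choice item by whether the correct option is the top choice. It assumes four options with the remaining probability mass split evenly across the wrong ones, and the probabilities at each scale are invented for illustration.

```python
def top_choice_correct(p_correct: float, num_options: int = 4) -> bool:
    """Greedy grading: correct only if the right option has the highest probability.

    Assumes the remaining mass is split evenly across the wrong options.
    """
    p_each_wrong = (1.0 - p_correct) / (num_options - 1)
    return p_correct > p_each_wrong


# Hypothetical probability on the correct answer as model scale increases.
p_by_scale = [0.05, 0.10, 0.15, 0.20, 0.30, 0.45, 0.60]
graded = [top_choice_correct(p) for p in p_by_scale]

for p, ok in zip(p_by_scale, graded):
    print(f"p(correct)={p:.2f}  graded_correct={ok}")
```

The underlying probability rises steadily at every step, yet the graded outcome flips only once, which is the pattern that can be mistaken for a sudden new ability.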

Scaling Laws and Evaluation

Validation loss is useful because it is stable, cheap to measure, and directly tied to the pretraining objective. But it is incomplete.

A lower language modeling loss may correlate with better downstream performance, but the relationship varies by task. For example, loss may improve while factual calibration, safety, or instruction following remain weak.

A practical evaluation suite should include:

| Evaluation type | Example |
| --- | --- |
| Language modeling | Held-out perplexity |
| Knowledge | Question answering |
| Reasoning | Math and logic tasks |
| Code | Unit-tested programming problems |
| Instruction following | Human or model-graded tasks |
| Robustness | Distribution-shift tests |
| Safety | Harmful request refusal and jailbreak resistance |
| Calibration | Confidence versus correctness |
| Efficiency | Latency, throughput, memory, cost |

Scaling laws help forecast loss. They do not replace evaluation.
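Forecasting loss usually means fitting a law on small runs and extrapolating. The sketch below fits $L(N) \approx L_\infty + aN^{-\alpha}$ by linear regression in log space. It assumes $L_\infty$ is already known, which is rarely true in practice, and it uses synthetic, noise-free measurements generated from made-up constants.

```python
import math


def fit_power_law(sizes, losses, l_inf):
    """Least-squares fit of log(L - L_inf) = log(a) - alpha * log(N)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(loss - l_inf) for loss in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    a = math.exp(mean_y - slope * mean_x)
    return a, -slope


# Synthetic, noise-free "small runs" generated from a made-up law.
L_INF, TRUE_A, TRUE_ALPHA = 1.7, 400.0, 0.34
sizes = [1e7, 3e7, 1e8, 3e8, 1e9]
losses = [L_INF + TRUE_A * n ** (-TRUE_ALPHA) for n in sizes]

a, alpha = fit_power_law(sizes, losses, L_INF)
forecast = L_INF + a * 1e10 ** (-alpha)
print(f"a={a:.1f}, alpha={alpha:.3f}, forecast loss at N=1e10: {forecast:.3f}")
```

With real measurements the fit is noisy and $L_\infty$ must be estimated jointly, so forecasts far beyond the largest measured run should be treated with caution.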

Budgeting a Training Run

A training plan should specify:

| Item | Description |
| --- | --- |
| Target model size | Parameters, layers, width, heads |
| Token budget | Number of tokens and data mixture |
| Compute budget | GPU type, count, duration, utilization |
| Sequence length | Training context window |
| Batch size | Global tokens per optimizer step |
| Optimizer | Usually AdamW or related variants |
| Learning rate schedule | Warmup, decay, final learning rate |
| Precision | fp32, fp16, bf16, or mixed |
| Checkpoint policy | Frequency and retention |
| Evaluation suite | Loss and downstream tasks |
| Stop criteria | Compute budget, loss target, or overfitting signal |

Scaling estimates should be conservative. Hardware failures, data pipeline bottlenecks, optimizer instability, and poor utilization can dominate the practical cost.
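A first-pass duration estimate follows from the compute budget and the sustained throughput of the cluster. The sketch below assumes $C = 6ND$ and treats utilization as the fraction of peak FLOP/s actually achieved. The model size, token count, GPU count, and peak throughput are hypothetical numbers, not specifications of any real hardware.

```python
def estimate_training_days(num_params: float, num_tokens: float,
                           num_gpus: int, peak_flops_per_gpu: float,
                           utilization: float,
                           flops_per_param_token: float = 6.0) -> float:
    """Ideal wall-clock days, ignoring failures, restarts, and evaluation time."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    total_flops = flops_per_param_token * num_params * num_tokens
    sustained_flops_per_second = num_gpus * peak_flops_per_gpu * utilization
    return total_flops / sustained_flops_per_second / 86_400


# Hypothetical plan: 7B parameters, 140B tokens, 256 GPUs at 3e14 peak FLOP/s.
plan = dict(num_params=7e9, num_tokens=1.4e11,
            num_gpus=256, peak_flops_per_gpu=3e14)
for utilization in (0.5, 0.4, 0.25):
    days = estimate_training_days(utilization=utilization, **plan)
    print(f"utilization={utilization:.0%}: about {days:.1f} days")
```

Running the estimate at several utilization levels gives a range rather than a single number, which is a simple way to keep the plan conservative.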

PyTorch View: Counting Tokens and Parameters

In PyTorch, parameter count can be computed directly:

def count_parameters(model):
    return sum(p.numel() for p in model.parameters())

Trainable parameter count excludes frozen parameters:

def count_trainable_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

Token count is usually tracked by the data pipeline. If each batch has shape [B, T], then each step processes approximately

tokens_per_step = B * T

For data-parallel training with gradient accumulation across world_size workers:

global_tokens_per_step = micro_batch_size * sequence_length * grad_accum_steps * world_size

Total tokens after num_steps optimizer steps:

total_tokens = global_tokens_per_step * num_steps
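The same accounting can be inverted to find how many optimizer steps a token budget requires. The sketch below assumes pure data parallelism, so that world_size counts data-parallel workers only. The batch configuration and the 100B-token budget are hypothetical.

```python
import math


def global_tokens_per_step(micro_batch_size: int, sequence_length: int,
                           grad_accum_steps: int, world_size: int) -> int:
    """Tokens consumed per optimizer step across all data-parallel workers."""
    return micro_batch_size * sequence_length * grad_accum_steps * world_size


def steps_for_token_budget(token_budget: int, tokens_per_step: int) -> int:
    """Optimizer steps needed to see at least token_budget tokens."""
    return math.ceil(token_budget / tokens_per_step)


# Hypothetical configuration and a 100B-token budget.
per_step = global_tokens_per_step(micro_batch_size=4, sequence_length=2048,
                                  grad_accum_steps=8, world_size=64)
num_steps = steps_for_token_budget(100_000_000_000, per_step)
print(f"{per_step:,} tokens per step, {num_steps:,} optimizer steps")
```

With tensor or pipeline parallelism, several workers share one data-parallel replica, so world_size in this formula would need to be replaced by the number of replicas.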

A simple training logger should record both parameter count and cumulative token count:

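import torch.nn.functional as F

params = count_parameters(model)
tokens_seen = 0  # cumulative tokens processed by this worker

for step, batch in enumerate(loader):
    input_ids = batch["input_ids"]          # [B, T]
    tokens_seen += input_ids.numel()

    # Next-token prediction: inputs are positions 0..T-2, targets are 1..T-1.
    logits = model(input_ids[:, :-1])       # [B, T-1, vocab_size]
    targets = input_ids[:, 1:]              # [B, T-1]
    # cross_entropy expects [num_examples, num_classes] logits and flat targets.
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    if step % 100 == 0:
        print({
            "step": step,
            "params": params,
            "tokens_seen": tokens_seen,
            "loss": float(loss.item()),
        })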

This is not enough for a production training run, but it captures the basic accounting needed to reason about scale.

Practical Lessons

Scaling laws give several practical rules.

First, train on enough tokens. A large model trained on too little data often wastes compute.

Second, report both model size and token count. Parameter count alone says little about training quality.

Third, treat data quality as a scaling variable. More data helps only when the additional data improves the training distribution.

Fourth, evaluate beyond loss. Pretraining loss is important, but downstream behavior depends on adaptation, prompting, decoding, retrieval, and safety work.

Fifth, optimize for the full lifecycle. Training cost matters, but inference cost often dominates over time.

Summary

Scaling laws describe predictable relationships between loss, model size, data size, and compute. They help answer a central engineering question: for a fixed budget, how large should the model be, and how long should it be trained?

The key variables are parameters $N$, training tokens $D$, compute $C$, and validation loss $L$. Since compute is roughly proportional to $ND$, scaling requires a tradeoff between model capacity and training duration.

Modern language model training favors compute-balanced choices: enough parameters to learn rich structure, enough tokens to train those parameters well, and enough evaluation to detect failures that loss alone cannot show.