Bias and Variance

Bias and variance describe two different sources of prediction error. They are useful because they separate errors caused by an overly simple model from errors caused by an overly sensitive model.

A model with high bias makes strong simplifying assumptions. It tends to underfit. A model with high variance changes too much when the training data changes. It tends to overfit.

The practical goal is to find a model that is flexible enough to learn the true pattern but stable enough to generalize to unseen data.

Prediction Error

Assume there is an unknown relationship between input $x$ and target $y$. A common regression model is

$$ y = f(x) + \epsilon, $$

where $f(x)$ is the true signal and $\epsilon$ is zero-mean noise with variance $\sigma^2$.

The learning algorithm sees a finite training set and produces an estimated function

$$ \hat{f}(x). $$

The expected prediction error at a point $x$, averaged over training sets and noise, is

$$ \mathbb{E}\left[(y-\hat{f}(x))^2\right]. $$

This error has three conceptual sources:

| Source | Meaning |
| --- | --- |
| Bias | Error from wrong assumptions |
| Variance | Error from sensitivity to the training set |
| Irreducible noise | Randomness in the data itself |

The first two can be affected by model design and training. The third cannot be removed by a better model if it is truly random.

Bias

Bias measures how far the average learned model is from the true function.

If we trained many models on many different training sets from the same distribution, each model would learn a slightly different function. The average of those learned functions may still be far from the true function. That gap is bias.
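This thought experiment can be approximated numerically. The sketch below is only an illustration: it assumes a hypothetical true function `true_f(x) = sin(2*pi*x)`, Gaussian noise with standard deviation 0.3, and a deliberately rigid model that always predicts the mean training target. None of these choices come from the text.

```python
import math
import random

def true_f(x):
    # Hypothetical true signal, chosen only for illustration.
    return math.sin(2 * math.pi * x)

def sample_training_set(rng, n=30, noise_std=0.3):
    # Draw n points with x uniform on [0, 1] and y = f(x) + noise.
    xs = [rng.random() for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0.0, noise_std) for x in xs]
    return xs, ys

def fit_constant(xs, ys):
    # A rigid model: ignore x and always predict the mean training target.
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def estimate_bias(x0, n_sets=2000, seed=0):
    # Average the prediction at x0 over many training sets,
    # then subtract the true value f(x0).
    rng = random.Random(seed)
    preds = []
    for _ in range(n_sets):
        xs, ys = sample_training_set(rng)
        preds.append(fit_constant(xs, ys)(x0))
    return sum(preds) / len(preds) - true_f(x0)

bias_at_quarter = estimate_bias(0.25)
print(f"estimated bias at x=0.25: {bias_at_quarter:.3f}")
```

Under these assumptions the estimate lands close to -1: the average prediction is near 0 while the true value at x = 0.25 is 1. Training on more sets only sharpens the estimate; it does not shrink the gap.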

A high-bias model is too rigid. It cannot represent the true relationship well.

Examples:

| Situation | Why bias is high |
| --- | --- |
| Linear model for nonlinear data | Model class is too simple |
| Very small neural network | Insufficient capacity |
| Excessive regularization | Model forced to be too smooth |
| Too few training epochs | Optimization stops too early |
| Poor feature representation | Important signal absent |

High bias usually causes high training error and high validation error.

Variance

Variance measures how much the learned model changes when the training set changes.

A high-variance model is too sensitive to details of the training data. It may fit real patterns, but it may also fit noise, rare examples, and accidental correlations.
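Sensitivity to the training set can be measured directly by retraining on fresh samples and watching one prediction move. The sketch below assumes hypothetical data of the form y = sin(2*pi*x) plus Gaussian noise, and compares a 1-nearest-neighbor predictor (flexible) with a mean predictor (rigid). Both models and all constants are illustrative assumptions.

```python
import math
import random
import statistics

def sample_training_set(rng, n=30, noise_std=0.3):
    # Hypothetical data: y = sin(2*pi*x) + Gaussian noise.
    xs = [rng.random() for _ in range(n)]
    ys = [math.sin(2 * math.pi * x) + rng.gauss(0.0, noise_std) for x in xs]
    return xs, ys

def predict_1nn(xs, ys, x0):
    # Flexible model: copy the target of the closest training input.
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x0))
    return ys[nearest]

def predict_mean(xs, ys, x0):
    # Rigid model: always predict the mean training target.
    return sum(ys) / len(ys)

def prediction_variance(predict, x0, n_sets=2000, seed=0):
    # Variance of the prediction at x0 across independent training sets.
    rng = random.Random(seed)
    preds = []
    for _ in range(n_sets):
        xs, ys = sample_training_set(rng)
        preds.append(predict(xs, ys, x0))
    return statistics.pvariance(preds)

var_1nn = prediction_variance(predict_1nn, 0.25)
var_mean = prediction_variance(predict_mean, 0.25)
print(f"1-NN variance: {var_1nn:.3f}  mean-model variance: {var_mean:.3f}")
```

The 1-nearest-neighbor prediction inherits the full noise of a single label, so its variance is several times that of the mean predictor, which averages the noise over all 30 points.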

Examples:

| Situation | Why variance is high |
| --- | --- |
| Very large model on small data | Too many degrees of freedom |
| Weak regularization | Model can fit noise |
| Training too long | Memorization increases |
| Noisy labels | Model learns label errors |
| Data leakage in validation | Selection becomes unstable |

High variance usually causes low training error and high validation error.

Bias-Variance Decomposition

For squared error regression, prediction error can be decomposed into bias, variance, and noise.

Let $\hat{f}(x)$ be the function learned from a random training set. Then:

$$ \mathbb{E}\left[(y-\hat{f}(x))^2\right] = \text{Bias}[\hat{f}(x)]^2 + \text{Var}[\hat{f}(x)] + \sigma^2. $$

Here $\sigma^2$ is the variance of the noise $\epsilon$, the irreducible part of the error.

The bias term is

$$ \text{Bias}[\hat{f}(x)] = \mathbb{E}[\hat{f}(x)] - f(x). $$

The variance term is

$$ \text{Var}[\hat{f}(x)] = \mathbb{E} \left[ \left( \hat{f}(x) - \mathbb{E}[\hat{f}(x)] \right)^2 \right]. $$

This decomposition is exact for squared error when the noise has zero mean and is independent of the training set. For classification and deep neural networks, the same intuition remains useful, but the algebra is less direct.
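For the squared-error case, the decomposition can be checked by expanding the square. The sketch below assumes $y = f(x) + \epsilon$ with $\mathbb{E}[\epsilon] = 0$ and $\text{Var}[\epsilon] = \sigma^2$, and that the noise at the test point is independent of the training set. The shorthand $\bar{f}(x) = \mathbb{E}[\hat{f}(x)]$ is introduced here only to keep the lines short.

$$
\begin{aligned}
\mathbb{E}\left[(y-\hat{f}(x))^2\right]
&= \mathbb{E}\left[(f(x)-\hat{f}(x)+\epsilon)^2\right] \\
&= \mathbb{E}\left[(f(x)-\hat{f}(x))^2\right] + 2\,\mathbb{E}[\epsilon]\,\mathbb{E}\left[f(x)-\hat{f}(x)\right] + \mathbb{E}[\epsilon^2] \\
&= \mathbb{E}\left[(f(x)-\hat{f}(x))^2\right] + \sigma^2.
\end{aligned}
$$

Adding and subtracting $\bar{f}(x)$ inside the remaining term gives

$$
\begin{aligned}
\mathbb{E}\left[(f(x)-\hat{f}(x))^2\right]
&= (f(x)-\bar{f}(x))^2 + 2\,(f(x)-\bar{f}(x))\,\mathbb{E}\left[\bar{f}(x)-\hat{f}(x)\right] + \mathbb{E}\left[(\hat{f}(x)-\bar{f}(x))^2\right] \\
&= \text{Bias}[\hat{f}(x)]^2 + \text{Var}[\hat{f}(x)].
\end{aligned}
$$

The cross term vanishes because $\mathbb{E}[\bar{f}(x)-\hat{f}(x)] = 0$ by the definition of $\bar{f}(x)$.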

Underfitting and Overfitting

Underfitting occurs when the model cannot fit the training data well. It usually indicates high bias.

Overfitting occurs when the model fits training data much better than validation data. It usually indicates high variance.

| Pattern | Training loss | Validation loss | Likely problem |
| --- | --- | --- | --- |
| Underfitting | High | High | High bias |
| Good fit | Low | Low | Balanced |
| Overfitting | Very low | High | High variance |
| Data or split problem | Low | Unstable or misleading | Leakage, shift, noise |

In practice, training and validation curves are the first diagnostic tool.

A high training loss means the model has not learned the training set. A large gap between training and validation loss means the model has fit details specific to the training set rather than the underlying pattern.
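The first three rows of the table can be turned into a rough rule of thumb. The sketch below is one possible encoding, not a standard API: `diagnose_fit`, `target_loss`, and `gap_tolerance` are hypothetical names, and both thresholds depend on the task and must be chosen by the user.

```python
def diagnose_fit(train_loss, val_loss, target_loss, gap_tolerance):
    """Label a (train, validation) loss pair using the patterns in the table.

    target_loss and gap_tolerance are hypothetical, task-specific thresholds
    that the caller must choose; they are not universal constants.
    """
    if train_loss > target_loss:
        # The model has not fit even the training set.
        return "underfitting (likely high bias)"
    if val_loss - train_loss > gap_tolerance:
        # The training fit does not carry over to held-out data.
        return "overfitting (likely high variance)"
    return "good fit"

label_a = diagnose_fit(1.55, 1.58, target_loss=0.5, gap_tolerance=0.2)
label_b = diagnose_fit(0.25, 2.10, target_loss=0.5, gap_tolerance=0.2)
label_c = diagnose_fit(0.30, 0.35, target_loss=0.5, gap_tolerance=0.2)
print(label_a, label_b, label_c, sep="\n")
```

A single pair of numbers cannot reveal the fourth row. Leakage, distribution shift, and unstable splits only show up when the split itself is inspected or the run is repeated.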

Model Capacity

Model capacity is the ability of a model class to fit a wide range of functions.

A linear model has limited capacity. A deep neural network with many layers and parameters has much greater capacity.

Increasing capacity usually decreases bias and increases variance.

| Change | Bias | Variance |
| --- | --- | --- |
| Larger model | Decreases | Increases |
| More layers | Decreases | May increase |
| Stronger regularization | Increases | Decreases |
| More training data | Similar | Decreases |
| Better features | Decreases | May decrease |
| Early stopping | Increases | Decreases |

The classical bias-variance tradeoff holds that reducing one tends to increase the other. Modern deep learning complicates this picture because very large models can sometimes generalize well, especially with enough data, regularization, and suitable optimization.
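The classical picture is easy to reproduce on a toy problem. The sketch below uses k-nearest-neighbor regression as a stand-in, since k gives a single capacity knob: smaller k means a more flexible model. The data (a sine curve plus Gaussian noise), the sample sizes, and the values of k are all assumptions made for illustration.

```python
import math
import random

def make_data(rng, n, noise_std=0.3):
    # Hypothetical data: y = sin(2*pi*x) + Gaussian noise.
    xs = [rng.random() for _ in range(n)]
    ys = [math.sin(2 * math.pi * x) + rng.gauss(0.0, noise_std) for x in xs]
    return xs, ys

def knn_predict(train_xs, train_ys, x0, k):
    # Average the targets of the k nearest training inputs.
    order = sorted(range(len(train_xs)), key=lambda i: abs(train_xs[i] - x0))
    return sum(train_ys[i] for i in order[:k]) / k

def mse(train_xs, train_ys, xs, ys, k):
    # Mean squared error of the k-NN predictor on the points (xs, ys).
    errors = [(y - knn_predict(train_xs, train_ys, x, k)) ** 2
              for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)

rng = random.Random(0)
train_xs, train_ys = make_data(rng, 100)
val_xs, val_ys = make_data(rng, 500)

results = {}
for k in (1, 10, 100):  # smaller k means higher capacity
    train_mse = mse(train_xs, train_ys, train_xs, train_ys, k)
    val_mse = mse(train_xs, train_ys, val_xs, val_ys, k)
    results[k] = (train_mse, val_mse)
    print(f"k={k:3d}  train_mse={train_mse:.3f}  val_mse={val_mse:.3f}")
```

With k = 1 the model memorizes the training set (zero training error) but pays for it on validation data. With k = 100 it predicts the global mean and underfits both sets. The intermediate setting gives the lowest validation error of the three.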

Bias and Variance in Deep Learning

Classical theory predicts that test error rises once capacity passes a certain point, because variance grows. Deep learning often behaves differently.

Large neural networks can have enough parameters to fit the training data perfectly and still generalize well. This happens in many overparameterized regimes.

Several factors help explain this behavior:

| Factor | Effect |
| --- | --- |
| Large datasets | Reduce variance |
| Stochastic gradient descent | Introduces implicit regularization |
| Architecture design | Encodes useful structure |
| Data augmentation | Expands effective training data |
| Weight decay | Limits parameter growth |
| Normalization | Stabilizes optimization |
| Early stopping | Prevents excessive fitting |

Even so, the bias-variance language remains useful. When training loss is too high, increase capacity or improve optimization. When validation loss is too high relative to training loss, improve regularization, data quality, or splitting.

Diagnosing High Bias

A model likely has high bias when:

| Symptom | Interpretation |
| --- | --- |
| Training loss remains high | Model cannot fit data |
| Validation loss remains high | Poor generalization follows poor fit |
| Both curves plateau early | Capacity or optimization limit |
| More training data does not help much | Model class may be wrong |
| Predictions are overly smooth | Model cannot represent detail |

Ways to reduce bias:

  1. Use a larger model.
  2. Train longer.
  3. Reduce excessive regularization.
  4. Use a better architecture.
  5. Improve input features or preprocessing.
  6. Use a more suitable loss function.
  7. Tune the learning rate and optimizer.

In PyTorch, a high-bias model may be a small MLP:

import torch

# Small MLP: a single 32-unit hidden layer limits capacity.
model = torch.nn.Sequential(
    torch.nn.Linear(784, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 10),
)

Increasing hidden width may reduce bias:

import torch

# Wider and deeper MLP: two 512-unit hidden layers.
model = torch.nn.Sequential(
    torch.nn.Linear(784, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 10),
)

The second model has more capacity. It can represent more complex functions.
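Parameter count is only a rough proxy for capacity, but it makes the difference between the two models concrete. The helper below is a hypothetical convenience function written in plain Python. It counts weights and biases for a stack of fully connected layers, given the layer widths used above.

```python
def mlp_parameter_count(layer_sizes):
    # Each fully connected layer has (inputs * outputs) weights
    # plus one bias per output unit.
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )

small = mlp_parameter_count([784, 32, 10])
large = mlp_parameter_count([784, 512, 512, 10])
print(small, large)  # 25450 669706
```

The wider model has roughly 26 times as many trainable parameters as the small one.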

Diagnosing High Variance

A model likely has high variance when:

| Symptom | Interpretation |
| --- | --- |
| Training loss is very low | Model fits training data |
| Validation loss is much higher | Generalization gap |
| Validation metric fluctuates strongly | Model or dataset instability |
| Performance depends heavily on random seed | Training is sensitive |
| Small data changes alter results | Model depends on sample noise |

Ways to reduce variance:

  1. Add more training data.
  2. Use data augmentation.
  3. Increase weight decay.
  4. Add dropout.
  5. Use early stopping.
  6. Reduce model capacity.
  7. Improve label quality.
  8. Use ensembling.
  9. Use a better train-validation split.

Example with dropout and weight decay:

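import torch

# Same layer sizes as the wider MLP, with dropout after each hidden activation.
model = torch.nn.Sequential(
    torch.nn.Linear(784, 512),
    torch.nn.ReLU(),
    torch.nn.Dropout(p=0.3),  # zeroes each activation with probability 0.3 in training mode
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Dropout(p=0.3),
    torch.nn.Linear(512, 10),
)

# AdamW applies weight decay to the parameters directly (decoupled from the gradient step).
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    weight_decay=0.01,
)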

Dropout injects noise into hidden activations during training. Weight decay discourages large parameter values. Both can reduce overfitting.
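Early stopping, item 5 in the list above, needs no change to the model. One common approach is to stop once validation loss has failed to improve for a fixed number of epochs. The sketch below is framework-independent. The class name `EarlyStopper` and its `patience` and `min_delta` parameters are illustrative choices, not part of PyTorch, and the loss values are made up to stand in for a real training loop.

```python
class EarlyStopper:
    """Signal a stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        # Returns True when training should stop.
        if val_loss < self.best_loss - self.min_delta:
            # New best: remember it and reset the counter.
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Made-up validation losses standing in for a real training loop.
val_losses = [1.90, 1.60, 1.45, 1.50, 1.48, 1.55, 1.70]
stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, val_loss in enumerate(val_losses, start=1):
    if stopper.step(epoch, val_loss):
        stopped_at = epoch
        break

print(f"stopped at epoch {stopped_at}, best epoch {stopper.best_epoch}")
```

In a real loop one would also save the model weights whenever a new best is recorded and restore them after stopping. Here that would mean returning to the epoch 3 checkpoint after stopping at epoch 6.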

Learning Curves

Learning curves plot performance as the amount of training data increases.

They help distinguish bias from variance.

A high-bias model usually has training and validation errors close together, both at poor values. Adding more data often gives little improvement.

A high-variance model usually has low training error and much higher validation error. Adding more data often helps.

| Learning curve pattern | Diagnosis |
| --- | --- |
| High train error, high validation error | High bias |
| Low train error, high validation error | High variance |
| Both improve with more data | Data helps |
| Validation error stops improving | Model, data, or objective limit |

Learning curves are useful because they show whether more data is likely to help. If the model already underfits, collecting more data may be less useful than improving the model.
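A learning curve for a high-bias model can be simulated in a few lines. The sketch below fits a straight line by ordinary least squares to hypothetical nonlinear data (a sine curve plus Gaussian noise with variance 0.09) at increasing training set sizes. The data-generating process and the sizes are assumptions chosen for illustration.

```python
import math
import random

def make_data(rng, n, noise_std=0.3):
    # Hypothetical data: y = sin(2*pi*x) + Gaussian noise.
    xs = [rng.random() for _ in range(n)]
    ys = [math.sin(2 * math.pi * x) + rng.gauss(0.0, noise_std) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    # Ordinary least squares for y = a * x + b in closed form.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

def mse(a, b, xs, ys):
    # Mean squared error of the line y = a * x + b on (xs, ys).
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)
val_xs, val_ys = make_data(rng, 2000)

curve = []
for n in (10, 30, 100, 300, 1000):
    train_xs, train_ys = make_data(rng, n)
    a, b = fit_line(train_xs, train_ys)
    curve.append((n, mse(a, b, train_xs, train_ys), mse(a, b, val_xs, val_ys)))

for n, train_mse, val_mse in curve:
    print(f"n={n:5d}  train_mse={train_mse:.3f}  val_mse={val_mse:.3f}")
```

At the largest size the two errors are close to each other yet remain well above the noise variance of 0.09. That is the high-bias signature: more data cannot fix a model class that cannot represent the curve.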

Validation Gap

The validation gap is the difference between validation error and training error.

For losses:

$$ \text{gap} = L_{\text{val}} - L_{\text{train}}. $$

For accuracy:

$$ \text{gap} = \text{Acc}_{\text{train}} - \text{Acc}_{\text{val}}. $$

A small gap with poor performance suggests underfitting. A large gap suggests overfitting.

In PyTorch-style training logs:

epoch 1: train_loss=1.95 val_loss=1.92
epoch 2: train_loss=1.70 val_loss=1.69
epoch 3: train_loss=1.55 val_loss=1.58

This looks like reasonable learning: both losses fall together and the gap stays small. Compare it with a run where the losses diverge:

epoch 1: train_loss=1.30 val_loss=1.65
epoch 2: train_loss=0.72 val_loss=1.80
epoch 3: train_loss=0.25 val_loss=2.10

This suggests overfitting: training loss keeps falling while validation loss rises.
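The gap can be computed directly from logs like these. The sketch below assumes the exact line format shown above (`epoch N: train_loss=... val_loss=...`). The regular expression and the function name `validation_gaps` are illustrative and would need adapting to any other log format.

```python
import re

# Matches lines such as "epoch 2: train_loss=0.72 val_loss=1.80".
LOG_PATTERN = re.compile(r"epoch (\d+): train_loss=([\d.]+) val_loss=([\d.]+)")

def validation_gaps(log_text):
    # Return (epoch, val_loss - train_loss) for each matching log line.
    gaps = []
    for match in LOG_PATTERN.finditer(log_text):
        epoch = int(match.group(1))
        train_loss = float(match.group(2))
        val_loss = float(match.group(3))
        gaps.append((epoch, val_loss - train_loss))
    return gaps

log_text = """
epoch 1: train_loss=1.30 val_loss=1.65
epoch 2: train_loss=0.72 val_loss=1.80
epoch 3: train_loss=0.25 val_loss=2.10
"""

gaps = validation_gaps(log_text)
for epoch, gap in gaps:
    print(f"epoch {epoch}: gap={gap:.2f}")
```

For this run the gap grows from 0.35 to 1.08 to 1.85 over three epochs, which is the widening pattern associated with overfitting.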

Irreducible Error

Some error remains even with the best possible model.

Sources include:

| Source | Example |
| --- | --- |
| Measurement noise | Sensor error |
| Label ambiguity | Multiple valid labels |
| Hidden variables | Missing causal factors |
| Randomness | Truly stochastic outcomes |
| Human disagreement | Different annotators choose different labels |

If two expert annotators disagree on a medical image, the model may have no single target that is always correct.

Irreducible error sets a ceiling on performance. The right response may be better data collection, better labels, or probabilistic prediction rather than a larger model.
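The ceiling can be made concrete with a simulation in which the true function is known. The sketch below assumes hypothetical data y = sin(2*pi*x) plus Gaussian noise with standard deviation 0.3, and evaluates an "oracle" that predicts with the true function itself. In real problems the true function is unknown, so this is purely illustrative.

```python
import math
import random

def true_f(x):
    # Hypothetical true signal, known here only because the data is simulated.
    return math.sin(2 * math.pi * x)

rng = random.Random(0)
noise_std = 0.3
n = 20000

squared_errors = []
for _ in range(n):
    x = rng.random()
    y = true_f(x) + rng.gauss(0.0, noise_std)
    # Predict with the true function itself, the best possible model here.
    squared_errors.append((y - true_f(x)) ** 2)

oracle_mse = sum(squared_errors) / n
print(f"oracle MSE: {oracle_mse:.4f}  noise variance: {noise_std ** 2:.4f}")
```

Even the oracle's mean squared error stays near the noise variance of 0.09. No model trained on this data can be expected to do better on average.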

Practical Checklist

When performance is poor, inspect the training and validation metrics.

| Observation | Likely action |
| --- | --- |
| Training loss high | Increase capacity, train longer, tune optimizer |
| Training and validation both poor | Reduce bias |
| Training good, validation poor | Reduce variance |
| Validation unstable | Use more data, stratified split, repeated runs |
| Test worse than validation | Check test shift or validation overuse |
| All metrics poor despite large model | Inspect labels, preprocessing, objective |

Bias and variance are diagnostic tools. They do not replace error analysis, but they give a useful first map of the problem.

Summary

Bias is error from overly restrictive assumptions. Variance is error from excessive sensitivity to the training set.

High bias leads to underfitting. High variance leads to overfitting. Model capacity, data size, regularization, optimization, and architecture all affect the balance.

In deep learning, the classical tradeoff is modified by overparameterization, large datasets, and implicit regularization. The practical method remains the same: compare training and validation behavior, identify the dominant failure mode, and change the model or data accordingly.