Overfitting and Underfitting

Overfitting and underfitting describe two common ways a model can fail. A model underfits when it learns too little from the training data. A model overfits when it learns the training data too specifically and performs poorly on new data.

The goal is to find a model that captures the stable patterns in the data without memorizing accidental details.

The Central Problem

During training, a model minimizes loss on the training set:

$$ L_{\text{train}}(\theta). $$

But the real objective is good performance on unseen data:

$$ L_{\text{test}}(\theta). $$

The training loss is optimized directly. The loss on unseen data can only be estimated, using held-out validation and test sets.

This creates the central tension. A model can reduce training loss by learning true structure, but it can also reduce training loss by memorizing noise, outliers, duplicate examples, or dataset artifacts.
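One way to make this tension measurable is the generalization gap. The sketch below assumes a held-out validation loss, written $L_{\text{val}}(\theta)$ (a symbol introduced here for illustration), stands in for the test loss:

$$ \text{gap}(\theta) = L_{\text{val}}(\theta) - L_{\text{train}}(\theta). $$

A small gap with a high training loss points toward underfitting. A large positive gap points toward overfitting.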

Underfitting

Underfitting occurs when the model is too simple, poorly optimized, or incorrectly specified. It cannot represent the relationship between inputs and targets.

Common symptoms:

| Symptom | Meaning |
|---|---|
| Training loss is high | Model cannot fit training data |
| Validation loss is high | Poor fit carries over to unseen data |
| Training and validation curves are close | Model fails similarly on both |
| Predictions are too simple | Model misses important variation |

Example: fitting a straight line to strongly nonlinear data. The model may find the best possible line, but the line still cannot express the true pattern.
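The straight-line example can be reproduced in a few lines of plain Python. This is a minimal sketch under stated assumptions: the toy data (y = x^2 on a symmetric grid) and the helper names `fit_line` and `mse` are invented for illustration, and plain Python is used instead of PyTorch to keep the example self-contained.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b with a single input feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b


def mse(xs, ys, a, b):
    """Mean squared error of the line y = a * x + b."""
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data (an assumption for illustration): y = x^2 on a symmetric grid.
xs = [i / 10 for i in range(-20, 21)]
ys = [x * x for x in xs]

# A straight line in x: the best fit is essentially flat and the error stays high.
a_lin, b_lin = fit_line(xs, ys)
linear_mse = mse(xs, ys, a_lin, b_lin)

# The same least-squares fit on the feature x^2 can express the pattern exactly.
squared = [x * x for x in xs]
a_sq, b_sq = fit_line(squared, ys)
quadratic_mse = mse(squared, ys, a_sq, b_sq)

print(f"line on x:   slope={a_lin:.3f}, train MSE={linear_mse:.3f}")
print(f"line on x^2: slope={a_sq:.3f}, train MSE={quadratic_mse:.3f}")
```

The optimizer is not at fault in the first fit: the line is the best one available, yet the training error stays high because the model class cannot bend.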

In deep learning, underfitting may occur when a network has too few layers, too few hidden units, excessive regularization, poor features, a bad learning rate, or insufficient training time.

Overfitting

Overfitting occurs when the model fits the training data too closely. It learns patterns that do not generalize.

Common symptoms:

| Symptom | Meaning |
|---|---|
| Training loss is very low | Model fits training examples |
| Validation loss is much higher | Generalization gap |
| Validation loss starts increasing | Model begins fitting noise |
| Performance depends heavily on split or seed | Model is unstable |

A high-capacity model can memorize labels, rare examples, and spurious correlations. For example, an image classifier may learn background patterns rather than object shape. A medical model may learn scanner-specific artifacts rather than disease features.
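Memorization can be seen on a toy problem. The sketch below compares a 1-nearest-neighbor classifier, which stores every training label, with a fixed threshold rule. The data, the corrupted label indices, and all function names are assumptions made for illustration, and the threshold is given rather than learned.

```python
def nearest_neighbor_predict(train_x, train_y, x):
    """1-nearest-neighbor: return the label of the closest training point."""
    closest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[closest]


def threshold_predict(x, threshold=0.5):
    """A low-capacity rule: predict class 1 when x is at or above the threshold."""
    return int(x >= threshold)


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Toy data (assumed): the true rule is "label = 1 when x >= 0.5".
train_x = [i / 20 for i in range(20)]
clean_y = [int(x >= 0.5) for x in train_x]

# Corrupt four training labels to simulate annotation noise.
noisy_indices = {3, 8, 14, 17}
train_y = [1 - y if i in noisy_indices else y for i, y in enumerate(clean_y)]

# Validation points lie between the training points and have clean labels.
val_x = [(i + 0.4) / 20 for i in range(20)]
val_y = [int(x >= 0.5) for x in val_x]

memorizer_train_acc = accuracy(
    [nearest_neighbor_predict(train_x, train_y, x) for x in train_x], train_y)
memorizer_val_acc = accuracy(
    [nearest_neighbor_predict(train_x, train_y, x) for x in val_x], val_y)
simple_train_acc = accuracy([threshold_predict(x) for x in train_x], train_y)
simple_val_acc = accuracy([threshold_predict(x) for x in val_x], val_y)

print("1-NN      train/val:", memorizer_train_acc, memorizer_val_acc)
print("threshold train/val:", simple_train_acc, simple_val_acc)
```

The memorizer scores perfectly on the noisy training labels and carries each label mistake over to the neighboring validation point. The threshold rule disagrees with the corrupted training labels but matches every clean validation label.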

Training and Validation Curves

Training and validation curves are the simplest diagnostic tool.

A healthy training run often shows both training and validation loss decreasing at first. Later, validation loss may stop improving while training loss continues to fall.

Typical patterns:

| Pattern | Diagnosis |
|---|---|
| High train loss, high validation loss | Underfitting |
| Low train loss, high validation loss | Overfitting |
| Low train loss, low validation loss | Good fit |
| Noisy validation loss | Small validation set, unstable training, or high variance |

For classification, the same idea applies to accuracy:

| Pattern | Diagnosis |
|---|---|
| Low train accuracy, low validation accuracy | Underfitting |
| High train accuracy, low validation accuracy | Overfitting |
| High train accuracy, high validation accuracy | Good fit |
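The loss patterns above can be turned into a rough rule of thumb. The function below is a sketch, not a standard API: the name `diagnose_fit` and the thresholds `acceptable_loss` and `max_gap` are hypothetical, and sensible values depend entirely on the task and loss scale.

```python
def diagnose_fit(train_loss, val_loss, acceptable_loss, max_gap):
    """Map a pair of final losses to a rough diagnosis.

    acceptable_loss: training loss above this counts as "high" (task-specific).
    max_gap: validation-minus-training gap above this counts as "large".
    Both thresholds are assumptions the caller must choose.
    """
    if train_loss > acceptable_loss:
        # The model cannot even fit the data it was trained on.
        return "underfitting"
    if val_loss - train_loss > max_gap:
        # Training fit is fine, but it does not carry over to held-out data.
        return "overfitting"
    return "good fit"


print(diagnose_fit(train_loss=1.9, val_loss=2.0, acceptable_loss=0.5, max_gap=0.3))
print(diagnose_fit(train_loss=0.1, val_loss=1.2, acceptable_loss=0.5, max_gap=0.3))
print(diagnose_fit(train_loss=0.2, val_loss=0.3, acceptable_loss=0.5, max_gap=0.3))
```

A rule like this is only a first filter. It cannot tell a noisy validation curve from a stable one, so it complements plotting the curves and does not replace it.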

Causes of Underfitting

Underfitting usually means the model lacks capacity, the training process fails to use the capacity the model has, or the input representation hides the useful signal.

Common causes:

| Cause | Example |
|---|---|
| Model too small | Tiny MLP for image recognition |
| Training too short | Too few epochs |
| Learning rate too low | Optimization barely moves |
| Learning rate too high | Optimization fails to settle |
| Excessive regularization | Too much dropout or weight decay |
| Poor features | Important input signal missing |
| Wrong architecture | Linear model for structured sequence data |
| Poor loss function | Objective mismatched to task |

A model can underfit even if it has many parameters. Bad optimization, broken preprocessing, or incorrect targets can keep a large model from learning.

Causes of Overfitting

Overfitting occurs when the model has enough flexibility to fit unstable details in the training set.

Common causes:

| Cause | Example |
|---|---|
| Too little data | Large model trained on small dataset |
| Model too large | Excess capacity for the task |
| Weak regularization | No dropout or weight decay |
| Training too long | Memorization after useful learning |
| Noisy labels | Model learns label mistakes |
| Data leakage | Validation score becomes misleading |
| Duplicate examples | Same sample appears across splits |
| Spurious correlations | Background predicts class in training data |

Overfitting is often a data problem as much as a model problem. More data, cleaner labels, better splits, and stronger augmentation can matter more than changing the architecture.
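Checking for duplicates across splits is one of the cheaper data audits. The following sketch assumes text examples and treats two examples as duplicates when they match after lowercasing and whitespace normalization. The function names and the normalization rule are illustrative choices, and near-duplicates (paraphrases, resized images) would need a different similarity test.

```python
import hashlib


def fingerprint(example):
    """Hash a text example after light normalization (lowercase, collapsed whitespace)."""
    normalized = " ".join(example.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_cross_split_duplicates(train_examples, val_examples):
    """Return the validation examples whose fingerprint also occurs in the training split."""
    train_hashes = {fingerprint(e) for e in train_examples}
    return [e for e in val_examples if fingerprint(e) in train_hashes]


# Hypothetical splits for illustration.
train_split = ["The cat sat on the mat.", "Dogs bark loudly.", "Rain is expected today."]
val_split = ["the cat  sat on the mat.", "Birds sing at dawn."]

leaked = find_cross_split_duplicates(train_split, val_split)
print(f"{len(leaked)} of {len(val_split)} validation examples also appear in training")
```

Any validation example found this way inflates the validation score, because the model is being graded on something it has already seen.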

Reducing Underfitting

To reduce underfitting, make the learning problem easier for the model or make the model more capable.

Useful interventions:

| Intervention | Effect |
|---|---|
| Increase model capacity | Allows richer functions |
| Train longer | Gives optimizer more time |
| Tune learning rate | Improves optimization |
| Reduce regularization | Allows closer fit |
| Improve preprocessing | Exposes useful signal |
| Use a better architecture | Adds the right inductive bias |
| Use pretrained models | Starts from useful representations |

PyTorch example: increasing capacity.

import torch.nn as nn

small_model = nn.Sequential(
    nn.Linear(784, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
)

larger_model = nn.Sequential(
    nn.Linear(784, 512),
    nn.ReLU(),
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, 10),
)

The larger model can represent more complex decision boundaries. This may reduce underfitting, but it may also increase overfitting if data is limited.
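To put a number on the difference in capacity, the parameter counts of the two networks can be worked out by hand. The helper below is a small sketch in plain Python (the name `count_mlp_parameters` is made up here); it counts one weight matrix and one bias vector per linear layer, which matches the fully connected layers used above.

```python
def count_mlp_parameters(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the width of every layer, input first, e.g. [784, 64, 10].
    Each linear layer contributes n_in * n_out weights plus n_out biases.
    """
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


small_params = count_mlp_parameters([784, 64, 10])
larger_params = count_mlp_parameters([784, 512, 512, 10])

print(small_params)   # 50890
print(larger_params)  # 669706
```

The larger model has roughly 13 times as many parameters. Parameter count is only a crude proxy for capacity, but it gives a quick sense of scale when comparing architectures.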

Reducing Overfitting

To reduce overfitting, make the model less sensitive to accidental training-set details.

Useful interventions:

| Intervention | Effect |
|---|---|
| Add more data | Reduces variance |
| Use data augmentation | Expands effective data |
| Add weight decay | Penalizes large weights |
| Add dropout | Reduces co-adaptation |
| Use early stopping | Stops before memorization dominates |
| Reduce model capacity | Limits memorization |
| Improve split quality | Prevents leakage |
| Clean labels | Removes misleading targets |
| Use ensembling | Averages unstable models |

PyTorch example: dropout and weight decay.

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 512),
    nn.ReLU(),
    nn.Dropout(p=0.3),
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Dropout(p=0.3),
    nn.Linear(512, 10),
)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    weight_decay=1e-2,
)

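Dropout randomly zeroes activations during training and is disabled at evaluation time. Weight decay shrinks parameters toward zero at every update; in AdamW the shrinkage is applied directly to the weights, decoupled from the gradient-based step. Both can improve generalization.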
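The mechanics of dropout are easy to see without a framework. The sketch below implements inverted dropout on a plain list of activations: surviving values are scaled by 1 / (1 - p) during training so the expected activation is unchanged, and nothing happens at evaluation time. The function name and the list-based representation are assumptions for illustration and are not how `nn.Dropout` is implemented internally.

```python
import random


def dropout(values, p, training, rng):
    """Inverted dropout on a list of activations.

    p: probability of zeroing each activation (must be below 1).
    training: when False, the input is returned unchanged.
    rng: a random.Random instance, passed in so results are reproducible.
    """
    if not training or p == 0.0:
        return list(values)
    keep_scale = 1.0 / (1.0 - p)
    # Each activation is dropped independently; survivors are rescaled.
    return [0.0 if rng.random() < p else v * keep_scale for v in values]


rng = random.Random(0)
activations = [1.0, 2.0, 3.0, 4.0]

train_out = dropout(activations, p=0.5, training=True, rng=rng)
eval_out = dropout(activations, p=0.5, training=False, rng=rng)

print("training:", train_out)
print("eval:    ", eval_out)
```

Because training and evaluation behave differently, the mode a model is in directly affects the metrics it reports.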

Early Stopping

Early stopping is a simple and effective way to reduce overfitting.

The training procedure tracks validation loss after each epoch. If validation loss fails to improve for a set number of consecutive epochs (the patience), training stops. The checkpoint that is kept is usually the one with the lowest validation loss.

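best_val_loss = float("inf")
best_state = None  # stays None if validation loss never improves (for example, NaN losses)
patience = 5       # epochs to wait without improvement before stopping
bad_epochs = 0

for epoch in range(num_epochs):
    # train_one_epoch is assumed to put the model in training mode
    # and return the mean training loss for the epoch.
    train_loss = train_one_epoch(model, train_loader)
    # evaluate returns (avg_loss, accuracy); see the PyTorch evaluation pattern in this section.
    val_loss, val_acc = evaluate(model, val_loader, loss_fn, device)

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        bad_epochs = 0
        # Copy the weights to CPU so later updates do not overwrite the snapshot.
        best_state = {
            k: v.detach().cpu().clone()
            for k, v in model.state_dict().items()
        }
    else:
        bad_epochs += 1

    if bad_epochs >= patience:
        break

# Restore the best checkpoint, if one was recorded.
if best_state is not None:
    model.load_state_dict(best_state)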

Early stopping uses the validation set for model selection. The test set should remain untouched until the final evaluation.
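It can help to trace the patience rule on a concrete sequence of validation losses. The sketch below replays the same counting logic on a list of numbers with no model involved. The function name `simulate_early_stopping` and the loss values are made up for illustration.

```python
def simulate_early_stopping(val_losses, patience):
    """Replay the patience rule on a recorded list of validation losses.

    Returns (best_epoch, last_epoch_run). Epochs are zero-indexed.
    """
    best_loss = float("inf")
    best_epoch = None
    bad_epochs = 0

    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1

        if bad_epochs >= patience:
            return best_epoch, epoch  # stopped early

    return best_epoch, len(val_losses) - 1  # ran out of epochs


# Hypothetical validation curve: improves, then drifts upward.
losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.75, 0.9]

print(simulate_early_stopping(losses, patience=3))  # (2, 5)
print(simulate_early_stopping(losses, patience=5))  # (2, 6)
```

With a patience of 3, training stops at epoch 5 and the checkpoint from epoch 2 is restored. A larger patience tolerates longer plateaus at the cost of extra training time.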

Regularization

Regularization refers to methods that improve generalization by restricting or stabilizing the learned function.

Common regularizers include:

| Method | Mechanism |
|---|---|
| Weight decay | Penalizes large weights |
| Dropout | Randomly removes activations |
| Data augmentation | Trains on transformed examples |
| Label smoothing | Softens hard labels |
| Mixup | Blends examples and labels |
| Stochastic depth | Randomly drops layers |
| Noise injection | Adds noise to inputs or activations |

Regularization can reduce overfitting, but too much regularization can cause underfitting. For example, very high dropout may prevent the model from fitting even the training set.
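As one concrete instance, label smoothing can be written out directly. The sketch below uses a common formulation in which the one-hot target is mixed with a uniform distribution over classes. The function name and the default `epsilon=0.1` are illustrative assumptions, not values prescribed by this text.

```python
def smooth_labels(class_index, num_classes, epsilon=0.1):
    """Mix a one-hot target with a uniform distribution over all classes.

    Every class receives epsilon / num_classes, and the true class
    additionally receives 1 - epsilon, so the target still sums to 1.
    """
    off_value = epsilon / num_classes
    target = [off_value] * num_classes
    target[class_index] += 1.0 - epsilon
    return target


smoothed = smooth_labels(class_index=2, num_classes=4, epsilon=0.1)
print(smoothed)  # approximately [0.025, 0.025, 0.925, 0.025]
```

The softened target discourages the model from pushing its predicted probability for the labeled class all the way to 1, which is one way overconfident fitting of noisy labels shows up.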

Data Augmentation

Data augmentation creates modified versions of training examples while preserving the label.

For images, common augmentations include:

| Augmentation | Meaning |
|---|---|
| Random crop | Changes framing |
| Horizontal flip | Mirrors image |
| Color jitter | Changes brightness or contrast |
| Rotation | Changes orientation |
| Cutout | Masks image regions |
| Mixup | Blends two images and their labels |

For text, augmentation is more delicate because small changes can alter meaning. For audio, augmentation may include noise injection, speed changes, and time masking.

Data augmentation teaches the model invariance. If a cat image is still a cat after cropping or color changes, the model should produce the same label.
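A label-preserving augmentation can be sketched with nested lists standing in for image tensors. The example below applies a random horizontal flip and returns the label untouched. The function names, the list-of-rows image format, and the flip probability are assumptions for illustration; a real pipeline would typically use a library transform.

```python
import random


def horizontal_flip(image):
    """Mirror an image stored as a list of pixel rows."""
    return [row[::-1] for row in image]


def random_horizontal_flip(image, p, rng):
    """Flip with probability p; otherwise return an unmodified copy."""
    if rng.random() < p:
        return horizontal_flip(image)
    return [row[:] for row in image]


def augment(image, label, rng, p=0.5):
    """Return an augmented image together with its unchanged label."""
    return random_horizontal_flip(image, p, rng), label


rng = random.Random(0)
image = [[1, 2, 3],
         [4, 5, 6]]

flipped = horizontal_flip(image)
augmented_image, augmented_label = augment(image, "cat", rng)

print(flipped)  # [[3, 2, 1], [6, 5, 4]]
print(augmented_label)
```

The flip is only valid when mirroring does not change the class. It would be a poor choice for tasks such as reading text or distinguishing left from right.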

Capacity Control

Capacity control means adjusting how flexible the model is.

A small model may underfit. A large model may overfit. The right capacity depends on data size, noise level, task difficulty, and regularization.

In classical machine learning, capacity is often controlled by choosing a model class. In deep learning, capacity is controlled by architecture and training choices:

| Control | Example |
|---|---|
| Width | Number of hidden units |
| Depth | Number of layers |
| Parameter sharing | Convolutions, recurrent weights |
| Sparsity | Mixture-of-experts routing |
| Regularization | Dropout, weight decay |
| Training duration | Early stopping |

Parameter sharing is especially important. Convolutional layers can generalize better than fully connected layers on images because they reuse the same small filter at every position, which builds in translation equivariance and uses far fewer parameters.
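The effect of parameter sharing on model size can be checked with simple arithmetic. The sketch below compares a fully connected layer and a 3x3 convolution that both map a 32x32 RGB image to 16 feature maps of the same spatial size. The image size, channel counts, and function names are assumptions chosen for illustration.

```python
def dense_parameters(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


def conv2d_parameters(in_channels, out_channels, kernel_size):
    """Weights plus biases of a 2D convolution with a square kernel.

    The same filters are reused at every spatial position, so the count
    does not depend on the image height or width.
    """
    return out_channels * (in_channels * kernel_size * kernel_size + 1)


height, width = 32, 32
in_channels, out_channels = 3, 16

dense_count = dense_parameters(
    in_channels * height * width, out_channels * height * width)
conv_count = conv2d_parameters(in_channels, out_channels, kernel_size=3)

print(dense_count)  # 50348032
print(conv_count)   # 448
```

Both layers produce outputs of the same shape, but the convolution does so with several orders of magnitude fewer parameters, which leaves much less room for memorizing position-specific details.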

Double Descent

Classical bias-variance theory suggests that test error decreases at first, then increases as model capacity becomes too large. Modern deep learning often shows a different pattern called double descent.

As model capacity increases:

  1. Test error decreases.
  2. Test error increases near the interpolation threshold.
  3. Test error decreases again for highly overparameterized models.

The interpolation threshold is the point where the model can fit the training data nearly perfectly.

Double descent helps explain why very large neural networks can generalize well despite having enough parameters to memorize the training set. It does not mean larger models always perform better. Data quality, optimization, regularization, and architecture still matter.

Practical Diagnosis

A practical workflow:

| Observation | Diagnosis | Possible action |
|---|---|---|
| Train loss high, val loss high | Underfitting | Larger model, train longer, tune optimizer |
| Train loss low, val loss high | Overfitting | More data, regularization, augmentation |
| Train loss decreasing, val loss rising | Overfitting during training | Early stopping |
| Train and val both unstable | Optimization or data issue | Lower learning rate, inspect data |
| Test much worse than val | Validation overuse or distribution shift | Rebuild split, audit leakage |
| Train loss fails to decrease from the first steps | Implementation bug | Check labels, shapes, loss, dtype |

The first response to poor performance should be measurement, not guessing. Plot curves. Inspect examples. Compare train and validation metrics. Check that the split matches the deployment setting.

PyTorch Evaluation Pattern

A reliable evaluation function avoids training-mode behavior and disables gradient tracking.

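import torch

def evaluate(model, dataloader, loss_fn, device):
    """Return (average loss, accuracy) over every example in dataloader."""
    model.eval()  # switch dropout and batch norm to evaluation behavior

    total_loss = 0.0
    total_examples = 0
    correct = 0

    with torch.no_grad():  # gradients are not needed for evaluation
        for x, y in dataloader:
            x = x.to(device)
            y = y.to(device)

            logits = model(x)
            loss = loss_fn(logits, y)

            # Assumes loss_fn returns the batch mean; weighting by batch size
            # keeps a smaller final batch from being over-counted.
            batch_size = x.size(0)
            total_loss += loss.item() * batch_size
            total_examples += batch_size

            pred = logits.argmax(dim=-1)
            correct += (pred == y).sum().item()

    avg_loss = total_loss / total_examples
    accuracy = correct / total_examples

    return avg_loss, accuracy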

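The call to model.eval() matters. Dropout and batch normalization behave differently during training and evaluation, so forgetting this call can produce misleading validation metrics. The function leaves the model in evaluation mode, so call model.train() before resuming training.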

Summary

Underfitting means the model learns too little. Training and validation performance are both poor. Overfitting means the model learns the training set too specifically. Training performance is strong, but validation performance is weak.

Reducing underfitting usually requires more capacity, better optimization, better features, or less regularization. Reducing overfitting usually requires more data, stronger regularization, better augmentation, early stopping, cleaner labels, or improved data splits.

Training and validation curves give the clearest first diagnosis.