Overfitting and Underfitting

Overfitting and underfitting describe two common ways a model can fail. A model underfits when it learns too little from the training data. A model overfits when it learns the training data too specifically and performs poorly on new data.

The goal is to find a model that captures the stable patterns in the data without memorizing accidental details.

The Central Problem

During training, a model minimizes loss on the training set:

$$ L_{\text{train}}(\theta). $$

But the real objective is good performance on unseen data:

$$ L_{\text{test}}(\theta). $$

The training loss is optimized directly. The loss on unseen data cannot be optimized directly; it can only be estimated on held-out validation and test sets.
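
One common way to write these two quantities, assuming a per-example loss $\ell$, a model $f_\theta$, a training set of $n$ pairs $(x_i, y_i)$, and an underlying data distribution $\mathcal{D}$ (all of these symbols are notation introduced here for illustration):

$$ L_{\text{train}}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(f_\theta(x_i), y_i\big), \qquad L_{\text{test}}(\theta) \approx \mathbb{E}_{(x, y) \sim \mathcal{D}} \big[ \ell\big(f_\theta(x), y\big) \big]. $$

The difference $L_{\text{test}}(\theta) - L_{\text{train}}(\theta)$ is often called the generalization gap.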

This creates the central tension. A model can reduce training loss by learning true structure, but it can also reduce training loss by memorizing noise, outliers, duplicate examples, or dataset artifacts.
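
A minimal sketch of this tension, using a toy lookup-table model. The class name LookupModel and the synthetic data are illustrative assumptions. The labels are coin flips, so there is no true structure to learn, yet training accuracy is perfect because every training example is memorized.

```python
import random
from collections import Counter

class LookupModel:
    """Toy model that memorizes every training pair (hypothetical, for illustration)."""

    def fit(self, xs, ys):
        self.table = dict(zip(xs, ys))
        # Fallback for inputs never seen in training: the most common label.
        self.default = Counter(ys).most_common(1)[0][0]
        return self

    def predict(self, x):
        return self.table.get(x, self.default)

def accuracy(model, xs, ys):
    return sum(model.predict(x) == y for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)
# Inputs are random numbers and labels are coin flips: nothing generalizable exists.
train_x = [rng.random() for _ in range(1000)]
train_y = [rng.randint(0, 1) for _ in range(1000)]
test_x = [rng.random() for _ in range(1000)]
test_y = [rng.randint(0, 1) for _ in range(1000)]

model = LookupModel().fit(train_x, train_y)
train_acc = accuracy(model, train_x, train_y)
test_acc = accuracy(model, test_x, test_y)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

Training accuracy is 1.0, while test accuracy stays near chance. Low training loss alone says nothing about whether true structure was learned.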

Underfitting

Underfitting occurs when the model is too simple, poorly optimized, or incorrectly specified. In each case it fails to capture the relationship between inputs and targets, even on the training data.

Common symptoms:

Symptom	Meaning
Training loss is high	Model cannot fit training data
Validation loss is high	Poor fit carries over to unseen data
Training and validation curves are close	Model fails similarly on both
Predictions are too simple	Model misses important variation

Example: fitting a straight line to strongly nonlinear data. The model may find the best possible line, but the line still cannot express the true pattern.
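
The straight-line case can be sketched with NumPy as follows. The quadratic signal, noise level, and helper name train_mse are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 50)
y = x**2 + rng.normal(0.0, 0.1, size=x.shape)  # quadratic signal plus small noise

def train_mse(degree):
    """Fit a polynomial of the given degree and return its MSE on the same data."""
    coeffs = np.polyfit(x, y, deg=degree)
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

mse_line = train_mse(1)   # straight line: cannot bend, so the error stays high
mse_quad = train_mse(2)   # matches the shape of the data
print(f"line MSE: {mse_line:.3f}, quadratic MSE: {mse_quad:.3f}")
```

The line is the best possible line, yet its error on the training data itself remains far above the noise level. That is the signature of underfitting.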

In deep learning, underfitting may occur when a network has too few layers, too few hidden units, excessive regularization, poor features, a bad learning rate, or insufficient training time.

Overfitting

Overfitting occurs when the model fits the training data too closely. It learns patterns that do not generalize.

Common symptoms:

Symptom	Meaning
Training loss is very low	Model fits training examples
Validation loss is much higher	Generalization gap
Validation loss starts increasing	Model begins fitting noise
Performance depends heavily on split or seed	Model is unstable

A high-capacity model can memorize labels, rare examples, and spurious correlations. For example, an image classifier may learn background patterns rather than object shape. A medical model may learn scanner-specific artifacts rather than disease features.
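
A small sketch of memorization, assuming a synthetic sine-shaped signal with noisy labels (the data, the degrees, and the helper names are illustrative). A degree-9 polynomial has enough coefficients to pass through all ten noisy training points.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    return np.sin(np.pi * x)

# Ten noisy training points and a larger held-out set from the same process.
x_train = np.linspace(-1.0, 1.0, 10)
y_train = true_fn(x_train) + rng.normal(0.0, 0.3, size=x_train.shape)
x_val = np.linspace(-0.95, 0.95, 200)
y_val = true_fn(x_val) + rng.normal(0.0, 0.3, size=x_val.shape)

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}
for degree in (3, 9):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_val, y_val))
    print(f"degree {degree}: train MSE {results[degree][0]:.4f}, "
          f"val MSE {results[degree][1]:.4f}")
```

The degree-9 fit drives training error to essentially zero by fitting the noise. Its validation error cannot follow, because the noise in the held-out points is different.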

Training and Validation Curves

Training and validation curves are the simplest diagnostic tool.

A healthy training run often shows both training and validation loss decreasing at first. Later, validation loss may stop improving while training loss continues to fall.
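
One way to read such curves programmatically is sketched below, assuming per-epoch losses are stored in plain Python lists. The function name summarize_curves and the loss values are made up for illustration.

```python
def summarize_curves(train_losses, val_losses):
    """Return the best validation epoch and simple gap statistics."""
    if len(train_losses) != len(val_losses) or not train_losses:
        raise ValueError("need two non-empty loss lists of equal length")
    # Epoch (0-indexed) with the lowest validation loss.
    best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__)
    return {
        "best_epoch": best_epoch,
        "best_val_loss": val_losses[best_epoch],
        # Validation minus training loss at the end of the run.
        "final_gap": val_losses[-1] - train_losses[-1],
        # How far validation loss has risen since its best value.
        "val_rise_since_best": val_losses[-1] - val_losses[best_epoch],
    }

# Made-up curves: training loss keeps falling, validation loss turns upward.
train = [1.00, 0.70, 0.50, 0.35, 0.25, 0.18]
val = [1.05, 0.80, 0.65, 0.60, 0.66, 0.75]
summary = summarize_curves(train, val)
print(summary)
```

A large final gap together with a positive rise since the best epoch suggests that later epochs mostly fit training-set details.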

Typical patterns:

Pattern	Diagnosis
High train loss, high validation loss	Underfitting
Low train loss, high validation loss	Overfitting
Low train loss, low validation loss	Good fit
Noisy validation loss	Small validation set, unstable training, or high variance

For classification, the same idea applies to accuracy:

Pattern	Diagnosis
Low train accuracy, low validation accuracy	Underfitting
High train accuracy, low validation accuracy	Overfitting
High train accuracy, high validation accuracy	Good fit

Causes of Underfitting

Underfitting usually means the model lacks capacity, the training process fails to use the capacity it has, or the input representation hides the signal.

Common causes:

Cause	Example
Model too small	Tiny MLP for image recognition
Training too short	Too few epochs
Learning rate too low	Optimization barely moves
Learning rate too high	Optimization fails to settle
Excessive regularization	Too much dropout or weight decay
Poor features	Important input signal missing
Wrong architecture	Linear model for structured sequence data
Poor loss function	Objective mismatched to task

A model can underfit even if it has many parameters. Bad optimization, broken preprocessing, or incorrect targets can keep a large model from learning.
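
The two learning-rate rows in the table can be illustrated on a one-parameter toy problem. This is a sketch, assuming plain gradient descent on f(w) = w**2; the function name and the specific learning rates are chosen for illustration only.

```python
def gradient_descent(lr, steps=50, w0=1.0):
    """Minimize f(w) = w**2 with plain gradient descent and return the final w."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w          # derivative of w**2
        w = w - lr * grad
    return w

for lr in (1e-4, 0.1, 1.1):
    w = gradient_descent(lr)
    print(f"lr={lr}: final w={w:.4g}, final loss={w**2:.4g}")
```

With the smallest rate the parameter barely moves from its starting point, and with the largest rate each step overshoots and the iterate grows. In both cases the loss stays high regardless of how expressive the model is.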

Causes of Overfitting

Overfitting occurs when the model has enough flexibility to fit unstable details in the training set.

Common causes:

Cause	Example
Too little data	Large model trained on small dataset
Model too large	Excess capacity for the task
Weak regularization	No dropout or weight decay
Training too long	Memorization after useful learning
Noisy labels	Model learns label mistakes
Data leakage	Inflated validation score hides overfitting
Duplicate examples	Same sample appears in both training and validation splits
Spurious correlations	Background predicts class in training data

Overfitting is often a data problem as much as a model problem. More data, cleaner labels, better splits, and stronger augmentation can matter more than changing the architecture.
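
A simple split audit can catch duplicates across splits. The sketch below assumes text examples and treats two examples as duplicates when they match after lowercasing and whitespace normalization. The function names and the normalization rule are assumptions; real datasets may need near-duplicate detection instead.

```python
import hashlib

def fingerprint(example: str) -> str:
    """Hash a normalized copy of the example so trivial variants collide."""
    normalized = " ".join(example.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def find_split_overlap(train_examples, val_examples):
    """Return validation examples whose fingerprint also occurs in the training set."""
    train_hashes = {fingerprint(ex) for ex in train_examples}
    return [ex for ex in val_examples if fingerprint(ex) in train_hashes]

train = ["The cat sat.", "Dogs bark loudly.", "Rain is wet."]
val = ["the  cat sat.", "Birds sing."]
leaked = find_split_overlap(train, val)
print(leaked)  # ['the  cat sat.']
```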

Reducing Underfitting

To reduce underfitting, make the learning problem easier for the model or make the model more capable.

Useful interventions:

Intervention	Effect
Increase model capacity	Allows richer functions
Train longer	Gives optimizer more time
Tune learning rate	Improves optimization
Reduce regularization	Allows closer fit
Improve preprocessing	Exposes useful signal
Use a better architecture	Adds the right inductive bias
Use pretrained models	Starts from useful representations

PyTorch example: increasing capacity.

import torch.nn as nn

small_model = nn.Sequential(
    nn.Linear(784, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
)

larger_model = nn.Sequential(
    nn.Linear(784, 512),
    nn.ReLU(),
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, 10),
)

The larger model can represent more complex decision boundaries. This may reduce underfitting, but it may also increase overfitting if data is limited.
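
To make the capacity difference concrete, the parameter counts of the two networks can be computed by hand. This is a sketch; the helper mlp_param_count is a hypothetical name and counts one weight matrix plus one bias vector per linear layer.

```python
def mlp_param_count(layer_sizes):
    """Count weights and biases of a fully connected network with the given layer sizes."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )

small = mlp_param_count([784, 64, 10])
larger = mlp_param_count([784, 512, 512, 10])
print(small, larger, round(larger / small, 1))  # 50890 669706 13.2
```

The larger model has roughly thirteen times as many parameters, which is why it needs more data or more regularization to avoid overfitting.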

Reducing Overfitting

To reduce overfitting, make the model less sensitive to accidental training-set details.

Useful interventions:

Intervention	Effect
Add more data	Reduces variance
Use data augmentation	Expands effective data
Add weight decay	Penalizes large weights
Add dropout	Reduces co-adaptation
Use early stopping	Stops before memorization dominates
Reduce model capacity	Limits memorization
Improve split quality	Prevents leakage
Clean labels	Removes misleading targets
Use ensembling	Averages unstable models

PyTorch example: dropout and weight decay.

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 512),
    nn.ReLU(),
    nn.Dropout(p=0.3),
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Dropout(p=0.3),
    nn.Linear(512, 10),
)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    weight_decay=1e-2,
)

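Dropout randomly zeroes activations during training, which discourages units from relying on one another. Weight decay shrinks parameter values toward zero at every step; in AdamW this shrinkage is applied separately from the gradient-based update. Both can improve generalization.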
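
The dropout mechanism can be sketched in NumPy. This is an illustrative reimplementation of inverted dropout, not the PyTorch source: each activation is zeroed with probability p during training and the survivors are scaled by 1 / (1 - p), so the expected activation is unchanged.

```python
import numpy as np

def dropout(x, p, training, rng):
    """Inverted dropout for 0 <= p < 1: zero elements with probability p, rescale the rest."""
    if not training or p == 0.0:
        return x                              # evaluation mode: identity
    keep_mask = rng.random(x.shape) >= p      # True where the activation survives
    return x * keep_mask / (1.0 - p)

rng = np.random.default_rng(0)
activations = np.ones(10_000)

train_out = dropout(activations, p=0.3, training=True, rng=rng)
eval_out = dropout(activations, p=0.3, training=False, rng=rng)

print("fraction zeroed:", float(np.mean(train_out == 0.0)))
print("mean in training:", float(train_out.mean()))
print("mean in evaluation:", float(eval_out.mean()))
```

About 30% of the activations are zeroed in training mode, while the mean stays close to 1 because of the rescaling. In evaluation mode the input passes through unchanged.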

Early Stopping

Early stopping is a simple and effective way to reduce overfitting.

The training procedure tracks validation loss after each epoch. If validation loss has not improved for a set number of epochs (the patience), training stops. The checkpoint with the lowest validation loss is then restored.

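# train_one_epoch is assumed to be defined elsewhere and to call model.train().
# evaluate is the function from the PyTorch Evaluation Pattern section;
# it returns (avg_loss, accuracy).
best_val_loss = float("inf")
best_state = None
patience = 5        # epochs to wait without improvement before stopping
bad_epochs = 0

for epoch in range(num_epochs):
    train_loss = train_one_epoch(model, train_loader)
    val_loss, val_acc = evaluate(model, val_loader, loss_fn, device)

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        bad_epochs = 0
        # Copy the weights so later updates do not overwrite the snapshot.
        best_state = {
            k: v.detach().cpu().clone()
            for k, v in model.state_dict().items()
        }
    else:
        bad_epochs += 1

    if bad_epochs >= patience:
        break

# Restore the best checkpoint, if any epoch produced one.
if best_state is not None:
    model.load_state_dict(best_state)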

Early stopping uses the validation set for model selection. The test set should remain untouched until the final evaluation.
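
The patience logic can also be packaged as a small framework-independent helper. The class below is a hypothetical sketch, not a PyTorch API. The min_delta parameter is an added assumption that sets the smallest decrease counted as an improvement.

```python
class EarlyStopping:
    """Track the best validation loss and signal when patience runs out."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta      # smallest decrease that counts as improvement
        self.best_loss = float("inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Made-up validation losses standing in for a real training loop.
val_losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.59, 0.63, 0.64, 0.65, 0.66]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(epoch, loss):
        stopped_at = epoch
        break

print(stopper.best_epoch, stopper.best_loss, stopped_at)  # 5 0.59 8
```

With a patience of 3, the brief plateau at epochs 3 and 4 does not end training, so the better value at epoch 5 is still found.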

Regularization

Regularization refers to methods that improve generalization by restricting or stabilizing the learned function.

Common regularizers include:

Method	Mechanism
Weight decay	Penalizes large weights
Dropout	Randomly removes activations
Data augmentation	Trains on transformed examples
Label smoothing	Softens hard labels
Mixup	Blends examples and labels
Stochastic depth	Randomly drops layers
Noise injection	Adds noise to inputs or activations

Regularization can reduce overfitting, but too much regularization can cause underfitting. For example, very high dropout may prevent the model from fitting even the training set.
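
As one concrete regularizer from the table, label smoothing can be sketched in a few lines of NumPy. The sketch assumes the common formulation in which a fraction epsilon of the probability mass is spread uniformly over all classes; the function name is illustrative.

```python
import numpy as np

def smooth_labels(labels, num_classes, epsilon=0.1):
    """Turn integer class labels into smoothed target distributions."""
    one_hot = np.eye(num_classes)[labels]
    # Keep (1 - epsilon) on the true class and spread epsilon over all classes.
    return one_hot * (1.0 - epsilon) + epsilon / num_classes

labels = np.array([0, 2])
targets = smooth_labels(labels, num_classes=4, epsilon=0.1)
print(targets)
```

Each row still sums to 1 and the true class keeps the largest probability, but the target no longer asks the model for fully confident predictions.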

Data Augmentation

Data augmentation creates modified versions of training examples. Most augmentations preserve the label; a few, such as Mixup, transform the label along with the input.

For images, common augmentations include:

Augmentation	Meaning
Random crop	Changes framing
Horizontal flip	Mirrors image
Color jitter	Changes brightness or contrast
Rotation	Changes orientation
Cutout	Masks image regions
Mixup	Blends two images and their labels

For text, augmentation is more delicate because small changes can alter meaning. For audio, augmentation may include noise injection, speed changes, and time masking.

Data augmentation teaches the model invariance. If a cat image is still a cat after cropping or color changes, the model should produce the same label.
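
A minimal NumPy sketch of two label-preserving image augmentations, random horizontal flip followed by random crop. The function name, the height x width x channels layout, and the sizes are assumptions made for illustration.

```python
import numpy as np

def random_flip_and_crop(image, crop_size, rng):
    """Randomly mirror an H x W x C image, then cut out a random square window."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]            # horizontal flip: reverse the width axis
    height, width, _ = image.shape
    top = rng.integers(0, height - crop_size + 1)
    left = rng.integers(0, width - crop_size + 1)
    return image[top:top + crop_size, left:left + crop_size, :]

rng = np.random.default_rng(0)
image = np.arange(32 * 32 * 3, dtype=np.float32).reshape(32, 32, 3)
label = "cat"   # the label is unchanged by either transformation

augmented = [random_flip_and_crop(image, crop_size=28, rng=rng) for _ in range(4)]
print([a.shape for a in augmented], label)
```

Every call produces a slightly different view of the same image with the same label, which is what pushes the model toward invariance.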

Capacity Control

Capacity control means adjusting how flexible the model is.

A small model may underfit. A large model may overfit. The right capacity depends on data size, noise level, task difficulty, and regularization.

In classical machine learning, capacity is often controlled by choosing a model class. In deep learning, capacity is controlled by architecture and training choices:

Control	Example
Width	Number of hidden units
Depth	Number of layers
Parameter sharing	Convolutions, recurrent weights
Sparsity	Mixture-of-experts routing
Regularization	Dropout, weight decay
Training duration	Early stopping

Parameter sharing is especially important. Convolutional layers often generalize better than fully connected layers on images because they reuse the same small filter at every position, which builds in locality and translation equivariance.
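
The effect of parameter sharing on model size can be estimated with simple arithmetic. The sketch below assumes a 32 x 32 RGB input mapped to 16 feature maps of the same spatial size; the helper names and sizes are illustrative.

```python
def conv2d_params(in_channels, out_channels, kernel_size):
    """Weights plus biases of a 2D convolution with a square kernel."""
    return in_channels * out_channels * kernel_size * kernel_size + out_channels

def dense_params(n_inputs, n_outputs):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_outputs + n_outputs

height, width = 32, 32
conv = conv2d_params(in_channels=3, out_channels=16, kernel_size=3)
dense = dense_params(height * width * 3, height * width * 16)
print(conv, dense)  # 448 50348032
```

The convolution produces an output of the same shape with 448 parameters, where a fully connected layer would need about 50 million.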

Double Descent

Classical bias-variance theory suggests that test error decreases at first, then increases as model capacity becomes too large. Modern deep learning often shows a different pattern called double descent.

As model capacity increases:

1. Test error decreases.
2. Test error increases near the interpolation threshold.
3. Test error decreases again for highly overparameterized models.

The interpolation threshold is the smallest capacity at which the model can fit the training data nearly perfectly.

Double descent helps explain why very large neural networks can generalize well despite having enough parameters to memorize the training set. It does not mean larger models always perform better. Data quality, optimization, regularization, and architecture still matter.

Practical Diagnosis

A practical workflow:

Observation	Diagnosis	Possible action
Train loss high, val loss high	Underfitting	Larger model, train longer, tune optimizer
Train loss low, val loss high	Overfitting	More data, regularization, augmentation
Train loss decreasing, val loss rising	Overfitting during training	Early stopping
Train and val both unstable	Optimization or data issue	Lower learning rate, inspect data
Test much worse than val	Validation overuse or distribution shift	Rebuild split, audit leakage
Train loss is NaN or flat from the first steps	Implementation bug	Check labels, shapes, loss, dtype

The first response to poor performance should be measurement, not guessing. Plot curves. Inspect examples. Compare train and validation metrics. Check that the split matches the deployment setting.
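
The first two rows of the table can be turned into a rough rule of thumb. The function below is a sketch only: target_loss and gap_tolerance are hypothetical, problem-specific thresholds, and a real diagnosis should also look at the full curves and the data.

```python
def diagnose(train_loss, val_loss, target_loss, gap_tolerance=0.1):
    """Map final train/validation losses to a rough diagnosis.

    target_loss is the loss considered acceptable for the task and
    gap_tolerance is the largest acceptable validation-minus-train gap;
    both are assumptions that must be set per problem.
    """
    if train_loss > target_loss:
        return "underfitting"           # cannot even fit the training data
    if val_loss - train_loss > gap_tolerance:
        return "overfitting"            # fits training data, fails to generalize
    return "good fit"

print(diagnose(train_loss=0.90, val_loss=0.95, target_loss=0.30))  # underfitting
print(diagnose(train_loss=0.05, val_loss=0.60, target_loss=0.30))  # overfitting
print(diagnose(train_loss=0.20, val_loss=0.25, target_loss=0.30))  # good fit
```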

PyTorch Evaluation Pattern

A reliable evaluation function avoids training-mode behavior and disables gradient tracking.

import torch

def evaluate(model, dataloader, loss_fn, device):
    model.eval()

    total_loss = 0.0
    total_examples = 0
    correct = 0

    with torch.no_grad():
        for x, y in dataloader:
            x = x.to(device)
            y = y.to(device)

            logits = model(x)
            loss = loss_fn(logits, y)

            batch_size = x.size(0)
            total_loss += loss.item() * batch_size
            total_examples += batch_size

            pred = logits.argmax(dim=-1)
            correct += (pred == y).sum().item()

    avg_loss = total_loss / total_examples
    accuracy = correct / total_examples

    return avg_loss, accuracy

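The call model.eval() matters. Dropout and batch normalization behave differently during training and evaluation: in evaluation mode dropout is disabled and batch normalization uses its running statistics. Forgetting this call can produce misleading validation metrics. Because evaluate leaves the model in evaluation mode, call model.train() before resuming training.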
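
The mode switch can be observed directly on a tiny model. This is a sketch, assuming a made-up two-layer model with dropout; the sizes and the seed are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 8), nn.Dropout(p=0.5))
x = torch.ones(4, 8)

model.train()                       # dropout active: each call samples a new mask
with torch.no_grad():
    train_a, train_b = model(x), model(x)

model.eval()                        # dropout disabled: the forward pass is deterministic
with torch.no_grad():
    eval_a, eval_b = model(x), model(x)

print("train outputs equal:", torch.equal(train_a, train_b))
print("eval outputs equal:", torch.equal(eval_a, eval_b))

model.train()                       # switch back before further training
```

In training mode, two passes over the same input disagree because each samples a different dropout mask. In evaluation mode they match, which is what a validation metric needs.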

Summary

Underfitting means the model learns too little. Training and validation performance are both poor. Overfitting means the model learns the training set too specifically. Training performance is strong, but validation performance is weak.

Reducing underfitting usually requires more capacity, better optimization, better features, or less regularization. Reducing overfitting usually requires more data, stronger regularization, better augmentation, early stopping, cleaner labels, or improved data splits.

Training and validation curves give the clearest first diagnosis.