Data Leakage and Experimental Design

Data leakage occurs when information that should be unavailable during training or evaluation enters the modeling process. It causes performance estimates to look better than they really are.

A model with leakage may appear accurate in experiments and fail after deployment. This is one of the most common reasons machine learning systems disappoint in production.

What Data Leakage Means

A clean experiment separates information by role.

Training data is used to fit parameters. Validation data is used to choose models and hyperparameters. Test data is used only for final evaluation.

Leakage breaks this separation.

For example, suppose we normalize a dataset using the mean and standard deviation of all examples before splitting into train, validation, and test sets. The test set has influenced preprocessing. The model has received information about the test distribution before evaluation.

Correct procedure:

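# Compute normalization statistics on the training split only.
mean = train_data.mean()
std = train_data.std()

# Apply the same training statistics to every split.
train_data = (train_data - mean) / std
val_data = (val_data - mean) / std
test_data = (test_data - mean) / std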

The validation and test sets use statistics computed from the training set only.

Common Sources of Leakage

Leakage can be obvious or subtle.

| Leakage type | Example |
| --- | --- |
| Duplicate leakage | Same example appears in train and test |
| Preprocessing leakage | Statistics computed on full dataset |
| Label leakage | Input feature directly encodes the target |
| Temporal leakage | Model uses future information |
| Group leakage | Same user, patient, document, or video appears in multiple splits |
| Hyperparameter leakage | Test set used repeatedly for model selection |
| Augmentation leakage | Augmented versions of same sample split across train and test |

Duplicate leakage is especially common in web-scale datasets. A model may appear to generalize when it is partly recalling examples it already saw during training.
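One way to look for exact duplicates across splits is to hash a normalized form of each example and check which test hashes also occur in the training set. The following is a minimal sketch, assuming text examples; the function names and the toy sentences are illustrative, and the check only catches matches that become identical after lowercasing and whitespace normalization, not near-duplicates.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def fingerprint(text: str) -> str:
    """Return a stable SHA-256 hex digest of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def find_cross_split_duplicates(train, test):
    """Return indices of test examples whose fingerprint also occurs in train."""
    train_hashes = {fingerprint(t) for t in train}
    return [i for i, t in enumerate(test) if fingerprint(t) in train_hashes]


# Toy data, made up for illustration.
train_texts = ["The cat sat on the mat.", "Deep learning is fun."]
test_texts = ["the  cat sat on the mat.", "A completely new sentence."]

leaked = find_cross_split_duplicates(train_texts, test_texts)
leaked_set = set(leaked)
clean_test = [t for i, t in enumerate(test_texts) if i not in leaked_set]
```

Here the first test sentence differs from a training sentence only in case and spacing, so it is flagged and dropped from the cleaned test set.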

Label Leakage

Label leakage happens when the input contains information derived from the target.

Suppose we predict whether a patient will be readmitted to a hospital. If the feature table includes “readmission billing code,” the model can solve the task using information that would only exist after the event.

Another example: predicting whether a user will cancel a subscription while including a feature called cancellation_date.

The model may achieve high validation accuracy, but the result is meaningless because the feature would not be available at prediction time.

A good experimental design asks:

At the moment of prediction, would this information actually be known?

If the answer is no, the feature should be removed.
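The question above can be turned into a mechanical filter if each feature is annotated with whether it is known at prediction time. The sketch below assumes a hand-maintained availability table; the feature names, the values, and the helper `drop_unavailable_features` are hypothetical and only illustrate the idea.

```python
# Hypothetical availability table: for each feature, whether its value is
# known at the moment the prediction is made. In practice this has to be
# filled in by someone who understands how the data is produced.
AVAILABLE_AT_PREDICTION = {
    "age": True,
    "num_prior_admissions": True,
    "discharge_diagnosis": True,
    "readmission_billing_code": False,  # only exists after the readmission
}


def drop_unavailable_features(row: dict, availability: dict) -> dict:
    """Keep only features explicitly marked as known at prediction time.

    Features missing from the table are dropped as well, so an unreviewed
    column cannot slip into the model by default.
    """
    return {k: v for k, v in row.items() if availability.get(k, False)}


# Made-up example row.
row = {
    "age": 71,
    "num_prior_admissions": 2,
    "discharge_diagnosis": "heart_failure",
    "readmission_billing_code": "example-code",
    "unreviewed_column": 1,
}
safe_row = drop_unavailable_features(row, AVAILABLE_AT_PREDICTION)
```

Defaulting unknown columns to "unavailable" is a deliberate choice in this sketch: it forces every new feature to be reviewed before it reaches the model.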

Temporal Leakage

Temporal leakage occurs when training uses information from the future.

This is common in forecasting, recommendation systems, finance, logs, and user behavior modeling.

For example, suppose we train a recommender system using all user interactions from January to December, then evaluate predictions for March. The model has already seen behavior from April through December, which would not have existed in March.

A temporal split avoids this:

| Split | Time period |
| --- | --- |
| Training | January to August |
| Validation | September |
| Test | October |

For deployment-like evaluation, the model should train on past data and predict future data.
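A temporal split like the one in the table can be sketched with two cutoff dates. The code below is a minimal illustration using only the standard library; the record layout, the year, and the function name `temporal_split` are assumptions, and in this sketch everything after the validation cutoff goes to the test set.

```python
from datetime import date


def temporal_split(records, train_end, val_end):
    """Split records by timestamp into train / validation / test.

    train: date <= train_end
    val:   train_end < date <= val_end
    test:  date > val_end
    """
    train, val, test = [], [], []
    for rec in records:
        if rec["date"] <= train_end:
            train.append(rec)
        elif rec["date"] <= val_end:
            val.append(rec)
        else:
            test.append(rec)
    return train, val, test


# Toy interaction log; the year and the records are made up for illustration.
records = [
    {"user": "u1", "date": date(2023, 3, 14)},
    {"user": "u2", "date": date(2023, 8, 31)},
    {"user": "u1", "date": date(2023, 9, 10)},
    {"user": "u3", "date": date(2023, 10, 2)},
]

train, val, test = temporal_split(
    records, train_end=date(2023, 8, 31), val_end=date(2023, 9, 30)
)
```

Because the split depends only on the timestamp, every training record is strictly older than every validation record, and every validation record is older than every test record.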

Group Leakage

Group leakage occurs when related examples are split across training and evaluation sets.

Examples:

| Domain | Group identifier |
| --- | --- |
| Medical imaging | Patient ID |
| Speech recognition | Speaker ID |
| Recommendation | User ID |
| Documents | Source document |
| Video classification | Video ID |
| Web classification | Domain or website |

If images from the same patient appear in both training and test sets, the model may learn patient-specific artifacts. The resulting test score does not measure generalization to new patients.

Use group-based splitting when the deployment task requires generalization to new groups.

In Python, group splitting can be done with scikit-learn:

from sklearn.model_selection import GroupShuffleSplit

splitter = GroupShuffleSplit(
    n_splits=1,
    test_size=0.2,
    random_state=42,
)

train_idx, test_idx = next(
    splitter.split(X, y, groups=patient_ids)
)
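An alternative to a random group shuffle, shown here only as a sketch, is to assign each group to a split by hashing its identifier. This is a different technique from `GroupShuffleSplit`: the assignment is stable when new rows are added, but the realized test fraction is only approximately `test_fraction`. The helper names and the toy patient IDs are hypothetical.

```python
import hashlib


def assign_split(group_id: str, test_fraction: float = 0.2) -> str:
    """Deterministically map a group ID to 'train' or 'test'.

    The ID is hashed to a number in [0, 1); every example that shares the
    ID therefore lands in the same split, regardless of row order.
    """
    digest = hashlib.sha256(group_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 16**8
    return "test" if bucket < test_fraction else "train"


def groups_overlap(train_groups, test_groups) -> bool:
    """Return True if any group ID appears in both splits."""
    return bool(set(train_groups) & set(test_groups))


# Toy group IDs, one per example; several examples share a patient.
patient_ids = ["p01", "p01", "p02", "p03", "p03", "p03", "p04"]
splits = [assign_split(pid) for pid in patient_ids]

train_groups = [p for p, s in zip(patient_ids, splits) if s == "train"]
test_groups = [p for p, s in zip(patient_ids, splits) if s == "test"]
```

Whichever splitting method is used, a check like `groups_overlap` is a cheap way to confirm that no group ended up on both sides.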

Preprocessing Leakage

Preprocessing must be fit only on the training set.

This applies to:

| Preprocessing step | Fit using |
| --- | --- |
| Mean and standard deviation | Training set only |
| Vocabulary construction | Training set only |
| Feature selection | Training set only |
| Imputation values | Training set only |
| PCA components | Training set only |
| Tokenizer adaptation | Training set only |
| Class weights | Training set only |

For example, if missing values are filled using the median of the whole dataset, then test information leaks into training.

Correct pattern:

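from sklearn.impute import SimpleImputer

# Example imputer; any transformer with fit/transform follows the same pattern.
imputer = SimpleImputer(strategy="median")

# Learn the fill values from the training split only.
imputer.fit(train_features)

train_features = imputer.transform(train_features)
val_features = imputer.transform(val_features)
test_features = imputer.transform(test_features)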

The operation is fit on training data and applied to the other splits.
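To make the fit/transform separation concrete, here is a minimal standard-library sketch of a median imputer. The class `MedianImputer` and the toy columns are illustrative assumptions, not a replacement for a library implementation; missing values are represented as `None`.

```python
from statistics import median


class MedianImputer:
    """Fill missing values (None) with a median learned from training data."""

    def __init__(self):
        self.fill_value = None

    def fit(self, values):
        """Learn the fill value from the observed (non-missing) entries."""
        observed = [v for v in values if v is not None]
        self.fill_value = median(observed)
        return self

    def transform(self, values):
        """Replace missing entries with the learned fill value."""
        if self.fill_value is None:
            raise RuntimeError("call fit() on training data first")
        return [self.fill_value if v is None else v for v in values]


train_col = [1.0, None, 3.0, 5.0]
test_col = [None, 100.0]

imputer = MedianImputer().fit(train_col)  # statistics come from train only
train_filled = imputer.transform(train_col)
test_filled = imputer.transform(test_col)

# The leaky alternative: the median is computed over train and test together.
leaky_fill = median(v for v in train_col + test_col if v is not None)
```

In this toy case the training-only median is 3.0, while the leaky median is 4.0 because the test value 100.0 shifts it. The difference is exactly the test-set information that would otherwise enter training.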

Leakage Through Model Selection

The test set should not guide model choice.

If we evaluate ten models on the test set and choose the one with the best test score, the test set has become a validation set. The selected model’s test score is biased upward.

Correct workflow:

  1. Train candidate models on the training set.
  2. Compare candidates on the validation set.
  3. Select one final model.
  4. Evaluate once on the test set.

If the test set is used repeatedly, create a new held-out test set or report that the original test score is no longer a clean final estimate.
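The upward bias from selecting on the test set can be illustrated with a small simulation. The sketch below assumes several models that all have the same true accuracy, so any difference in their measured test scores is pure sampling noise; the function name, the default parameters, and the seed are arbitrary choices for illustration.

```python
import random


def simulate_selection_bias(n_models=10, n_test=200, true_acc=0.70, seed=0):
    """Simulate picking the best of several equally good models by test score.

    Every model has the same true accuracy, so differences in measured
    test accuracy are noise. Returns (best_score, mean_score).
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_models):
        # Each test example is answered correctly with probability true_acc.
        correct = sum(rng.random() < true_acc for _ in range(n_test))
        scores.append(correct / n_test)
    return max(scores), sum(scores) / len(scores)


best, mean = simulate_selection_bias()
```

The best score is never below the average score, and the gap between them is what reporting the selected model's test score would add to the estimate, even though no model is actually better than the others.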

Experimental Design

Experimental design defines how evidence is produced. A good experiment answers a clear question under controlled conditions.

For deep learning, an experiment should specify:

| Component | Example |
| --- | --- |
| Dataset | Source, size, filters, split rule |
| Task | Classification, regression, retrieval |
| Inputs | Available features at prediction time |
| Target | Label definition and time horizon |
| Model | Architecture and parameter count |
| Loss | Training objective |
| Metrics | Primary and diagnostic metrics |
| Baselines | Simple and strong comparisons |
| Random seeds | Repeated runs when needed |
| Compute budget | Training steps, hardware, precision |
| Selection rule | How the final model is chosen |

Without this information, a reported score is hard to interpret.
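One way to keep these components together is to write them down as a single immutable record that is saved next to the results. The sketch below uses a frozen dataclass; the class name `ExperimentSpec`, the field names, and all example values are hypothetical placeholders rather than a prescribed schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ExperimentSpec:
    """A minimal, immutable description of one experiment."""

    dataset: str
    split_rule: str
    task: str
    inputs: tuple
    target: str
    model: str
    loss: str
    primary_metric: str
    baselines: tuple
    seeds: tuple
    compute_budget: str
    selection_rule: str


# Placeholder values for illustration only.
spec = ExperimentSpec(
    dataset="example-dataset-v1",
    split_rule="group split by user ID, 80/10/10",
    task="classification",
    inputs=("feature_a", "feature_b"),
    target="label within 30 days",
    model="small MLP",
    loss="cross-entropy",
    primary_metric="accuracy",
    baselines=("majority class", "logistic regression"),
    seeds=(0, 1, 2),
    compute_budget="10k steps, 1 GPU, float32",
    selection_rule="best validation accuracy",
)

# Serialize so the spec can be stored alongside checkpoints and logs.
spec_json = json.dumps(asdict(spec), indent=2, sort_keys=True)
```

Freezing the dataclass means the specification cannot be silently modified after the run starts, which keeps the stored record consistent with what was actually executed.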

Baselines

A baseline is a simpler system used for comparison.

A deep model should be compared against reasonable baselines. Otherwise, improvement is difficult to judge.

Examples:

| Task | Baseline |
| --- | --- |
| Classification | Majority class, logistic regression |
| Regression | Predict mean, linear regression |
| Image classification | Small CNN, pretrained ResNet |
| Text classification | Bag-of-words linear model |
| Retrieval | BM25 |
| Forecasting | Last-value predictor |
| Recommendation | Popularity ranking |

A baseline prevents false progress. If a large neural network barely beats a simple model, the added complexity may not be justified.
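The simplest classification baseline from the table, the majority class, takes only a few lines. This sketch assumes labels are given as plain Python lists; the function name and the toy label counts are made up for illustration.

```python
from collections import Counter


def majority_class_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for y in test_labels if y == majority_label)
    return hits / len(test_labels)


# Toy, heavily imbalanced labels.
train_labels = ["neg"] * 90 + ["pos"] * 10
test_labels = ["neg"] * 18 + ["pos"] * 2

baseline_acc = majority_class_accuracy(train_labels, test_labels)
```

With these toy labels the baseline already reaches 90% accuracy, so a model reporting 91% on the same data would represent a much smaller gain than the raw number suggests. Note that the majority label is chosen from the training labels, not from the test labels.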

Ablation Studies

An ablation study removes or changes one component at a time to measure its contribution.

For example, suppose a model uses:

  1. Data augmentation
  2. Dropout
  3. Weight decay
  4. Pretraining

An ablation study might train variants without each component:

| Variant | Purpose |
| --- | --- |
| Full model | Reference system |
| Without augmentation | Measure augmentation contribution |
| Without dropout | Measure dropout contribution |
| Without weight decay | Measure weight decay contribution |
| Without pretraining | Measure pretraining contribution |

Ablations help separate real improvements from accidental effects.
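The variants in an ablation study can be generated from the full configuration rather than written by hand, which reduces the risk of accidentally changing two things at once. The sketch below assumes each component is controlled by a boolean flag; the flag names and the helper `ablation_variants` are illustrative.

```python
# Hypothetical full configuration: every component switched on.
FULL_CONFIG = {
    "augmentation": True,
    "dropout": True,
    "weight_decay": True,
    "pretraining": True,
}


def ablation_variants(full_config):
    """Return the full config plus one variant per disabled component."""
    variants = {"full": dict(full_config)}
    for name in full_config:
        variant = dict(full_config)  # copy so variants stay independent
        variant[name] = False        # switch off exactly one component
        variants[f"without_{name}"] = variant
    return variants


variants = ablation_variants(FULL_CONFIG)
```

Each variant differs from the full configuration in exactly one flag, so a change in the metric can be attributed to that component, subject to the usual run-to-run variation.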

Reproducibility

Reproducibility means that another run, or another researcher, can obtain the same result within expected variation.

Deep learning experiments are affected by random initialization, data order, augmentation, nondeterministic kernels, and hardware differences.

A basic reproducibility setup:

import random
import numpy as np
import torch

seed = 42

random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)

For stronger determinism:

torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

Full determinism can slow training, and some operations have no deterministic implementation. Even so, recording seeds and environment details is essential.
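Recording environment details can be as simple as writing a small metadata record at the start of each run. The sketch below uses only the standard library; the function name `collect_environment` and the chosen fields are assumptions, and a real project would likely also record library versions, driver versions, and the code revision.

```python
import json
import platform
import sys


def collect_environment(seed: int) -> dict:
    """Gather basic run metadata using only the standard library."""
    return {
        "seed": seed,
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "argv": list(sys.argv),
    }


env = collect_environment(seed=42)

# Serialize so the record can be written next to the run's logs.
env_json = json.dumps(env, indent=2, sort_keys=True)
```

Storing this record with every run makes it possible to tell later whether two differing results came from different seeds, different software, or different hardware.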

Reporting Results

A result should include enough detail to be checked.

A minimal report includes:

| Field | Example |
| --- | --- |
| Dataset version | imagenet-1k-2012 |
| Split rule | Stratified 80/10/10 |
| Model | ResNet-50 |
| Optimizer | AdamW |
| Learning rate | \(3 \times 10^{-4}\) |
| Batch size | 256 |
| Epochs or steps | 90 epochs |
| Primary metric | Top-1 accuracy |
| Random seeds | 3 runs |
| Mean and variation | \(76.2 \pm 0.2\)% |
| Hardware | 8 A100 GPUs |
| Precision | bfloat16 mixed precision |

Single-run scores can be misleading, especially on small datasets. Repeated runs make results more credible.
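A "mean and variation" entry can be computed from the per-seed scores as follows. This is a minimal sketch: the scores are hypothetical, not real measurements, and the variation reported here is the sample standard deviation, which is one common choice among several (standard error and min/max range are others). A report should say which one it uses.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Return (mean, sample standard deviation) over repeated runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variation")
    return mean(scores), stdev(scores)


# Hypothetical top-1 accuracies from three seeds; not real measurements.
scores = [76.0, 76.2, 76.4]
avg, spread = summarize_runs(scores)
report = f"{avg:.1f} ± {spread:.1f}% over {len(scores)} seeds"
```

The function refuses to summarize a single run, since one score carries no information about run-to-run variation.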

Evaluation by Slice

Aggregate scores can hide failures.

A model may perform well overall while failing on rare classes, long inputs, specific languages, new users, certain devices, or recent data.

Slice evaluation computes metrics on meaningful subsets.

| Slice type | Example |
| --- | --- |
| Class | Per-class accuracy |
| Length | Short versus long sequences |
| Source | Website, sensor, hospital |
| Time | Old versus recent examples |
| Geography | Region or country |
| Difficulty | Easy versus hard examples |

Slice evaluation is especially important when the model will be used in high-stakes or heterogeneous environments.
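Per-slice metrics need little more than grouping predictions by a slice key. The sketch below assumes each evaluated example is a dictionary with `label`, `prediction`, and a slice attribute; the field names, the helper `accuracy_by_slice`, and the toy examples are illustrative.

```python
from collections import defaultdict


def accuracy_by_slice(examples, slice_key):
    """Compute accuracy separately for each value of `slice_key`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        s = ex[slice_key]
        total[s] += 1
        correct[s] += int(ex["prediction"] == ex["label"])
    return {s: correct[s] / total[s] for s in total}


# Toy evaluation results, made up for illustration.
examples = [
    {"label": 1, "prediction": 1, "length": "short"},
    {"label": 0, "prediction": 0, "length": "short"},
    {"label": 1, "prediction": 1, "length": "short"},
    {"label": 1, "prediction": 0, "length": "long"},
]

overall = sum(e["prediction"] == e["label"] for e in examples) / len(examples)
per_slice = accuracy_by_slice(examples, "length")
```

In this toy case the overall accuracy is 0.75, while the "long" slice has accuracy 0.0. Slices with very few examples give noisy estimates, so it helps to report the slice size alongside the metric.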

Checklist for Clean Experiments

Before trusting a result, check:

| Question | Why it matters |
| --- | --- |
| Was the split created before preprocessing? | Prevents preprocessing leakage |
| Are duplicates removed across splits? | Prevents memorization |
| Are related examples grouped correctly? | Prevents group leakage |
| Does the split respect time? | Prevents future information leakage |
| Are test scores used only once? | Prevents model-selection bias |
| Is the baseline strong enough? | Prevents exaggerated claims |
| Are metrics aligned with costs? | Prevents optimizing the wrong behavior |
| Are random seeds recorded? | Supports reproducibility |
| Are failure slices inspected? | Reveals hidden weaknesses |

Summary

Data leakage gives models information they should not have. It can enter through duplicates, preprocessing, labels, time, groups, augmentation, or repeated test-set use.

Good experimental design prevents leakage, defines clear splits, uses proper baselines, reports reproducible details, and evaluates meaningful slices.

A deep learning result is only as trustworthy as the experiment that produced it.