Data Leakage and Experimental Design

Data leakage occurs when information that should be unavailable during training or evaluation enters the modeling process. It causes performance estimates to look better than they really are.

A model with leakage may appear accurate in experiments and then fail after deployment. Leakage is a common reason machine learning systems disappoint in production.

What Data Leakage Means

A clean experiment separates information by role.

Training data is used to fit parameters. Validation data is used to choose models and hyperparameters. Test data is used only for final evaluation.

Leakage breaks this separation.

For example, suppose we normalize a dataset using the mean and standard deviation of all examples before splitting into train, validation, and test sets. The test set has influenced preprocessing. The model has received information about the test distribution before evaluation.

Correct procedure:

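# Compute per-feature statistics on the training split only.
mean = train_data.mean(axis=0)
std = train_data.std(axis=0)

# Apply the same training statistics to every split.
train_data = (train_data - mean) / std
val_data = (val_data - mean) / std
test_data = (test_data - mean) / std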

The validation and test sets use statistics computed from the training set only.

Common Sources of Leakage

Leakage can be obvious or subtle.

Leakage type	Example
Duplicate leakage	Same example appears in train and test
Preprocessing leakage	Statistics computed on full dataset
Label leakage	Input feature directly encodes the target
Temporal leakage	Model uses future information
Group leakage	Same user, patient, document, or video appears in multiple splits
Hyperparameter leakage	Test set used repeatedly for model selection
Augmentation leakage	Augmented versions of same sample split across train and test

Duplicate leakage is especially common in web-scale datasets. A model may appear to generalize when it is partly recalling examples it has already seen during training.
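One way to check for exact duplicates across splits is to hash a normalized form of each example and intersect the hash sets. The sketch below assumes text examples; the helper names `fingerprint` and `find_cross_split_duplicates` and the lowercase-and-collapse-whitespace normalization are illustrative choices, and near-duplicates would need a fuzzier method.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash a lightly normalized copy of the text (lowercased, whitespace collapsed)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_cross_split_duplicates(train_texts, test_texts):
    """Return indices of test examples whose fingerprint also occurs in train."""
    train_hashes = {fingerprint(t) for t in train_texts}
    return [i for i, t in enumerate(test_texts) if fingerprint(t) in train_hashes]


# Toy data for illustration only.
train_texts = ["The cat sat on the mat.", "Dogs bark loudly."]
test_texts = ["the cat  sat on the mat.", "Birds sing at dawn."]

duplicate_idx = find_cross_split_duplicates(train_texts, test_texts)

# Drop the leaked examples from the test set before evaluating.
leaked = set(duplicate_idx)
clean_test = [t for i, t in enumerate(test_texts) if i not in leaked]
```

Removing duplicates from the test side keeps the training set unchanged; removing them from the training side instead is also defensible, as long as the choice is reported.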

Label Leakage

Label leakage happens when the input contains information derived from the target.

Suppose we predict whether a patient will be readmitted to a hospital. If the feature table includes “readmission billing code,” the model can solve the task using information that would only exist after the event.

Another example: predicting whether a user will cancel a subscription while including a feature called `cancellation_date`.

The model may achieve high validation accuracy, but the result is meaningless. The feature would not be available at prediction time.

A good experimental design asks:

At the moment of prediction, would this information actually be known?

If the answer is no, the feature should be removed.
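The question can be turned into a simple gate on the feature list. The following is a minimal sketch, assuming a hand-maintained availability table; the feature names and the helper `select_safe_features` are hypothetical, and the flags themselves have to come from domain knowledge.

```python
# Hypothetical metadata: is each feature's value known when the prediction
# is made (here, at discharge)? These flags are assumptions for illustration.
feature_available_at_prediction = {
    "age": True,
    "prior_admissions": True,
    "length_of_stay": True,
    "readmission_billing_code": False,  # only exists after a readmission
}


def select_safe_features(columns, availability):
    """Keep only features marked as known at prediction time.

    Features missing from the metadata are rejected, so a new column
    cannot enter the model without being reviewed.
    """
    return [c for c in columns if availability.get(c, False)]


columns = [
    "age",
    "prior_admissions",
    "length_of_stay",
    "readmission_billing_code",
    "unreviewed_feature",
]
safe_columns = select_safe_features(columns, feature_available_at_prediction)
```

Rejecting unknown columns by default is a deliberate choice here: it makes leakage review a precondition for adding a feature.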

Temporal Leakage

Temporal leakage occurs when training uses information from the future.

This is common in forecasting, recommendation systems, finance, logs, and user behavior modeling.

For example, suppose we train a recommender system using all user interactions from January to December, then evaluate predictions for March. The model has already seen behavior from April through December, which would not have existed in March.

A temporal split avoids this:

Split	Time period
Training	January to August
Validation	September
Test	October

For deployment-like evaluation, the model should train on past data and predict future data.
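The split in the table can be sketched by filtering on timestamps rather than shuffling. The example assumes each record carries a date; the year 2023, the field name `day`, and the helper `temporal_split` are illustrative.

```python
from datetime import date

# Toy interaction log: one event in the middle of each month (illustrative).
events = [{"user_id": m, "day": date(2023, m, 15)} for m in range(1, 13)]

VAL_START = date(2023, 9, 1)    # validation covers September
TEST_START = date(2023, 10, 1)  # test covers October
TEST_END = date(2023, 11, 1)    # later events are left unused


def temporal_split(rows):
    """Split by time so that every training event precedes every evaluation event."""
    train = [r for r in rows if r["day"] < VAL_START]
    val = [r for r in rows if VAL_START <= r["day"] < TEST_START]
    test = [r for r in rows if TEST_START <= r["day"] < TEST_END]
    return train, val, test


train, val, test = temporal_split(events)
```

Features derived from history, such as per-user averages, need the same cutoff: they should be computed only from events before the prediction time.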

Group Leakage

Group leakage occurs when related examples are split across training and evaluation sets.

Examples:

Domain	Group identifier
Medical imaging	Patient ID
Speech recognition	Speaker ID
Recommendation	User ID
Documents	Source document
Video classification	Video ID
Web classification	Domain or website

If images from the same patient appear in both training and test sets, the model may learn patient-specific artifacts. The resulting test score then does not measure generalization to new patients.

Use group-based splitting when the deployment task requires generalization to new groups.

In Python, group splitting can be done with scikit-learn:

from sklearn.model_selection import GroupShuffleSplit

splitter = GroupShuffleSplit(
    n_splits=1,
    test_size=0.2,
    random_state=42,
)

train_idx, test_idx = next(
    splitter.split(X, y, groups=patient_ids)
)
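It is worth verifying that the resulting split really is disjoint by group. The sketch below repeats the split on toy data (10 hypothetical patients with 3 images each) and checks that no patient ID appears on both sides.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy data for illustration: 10 patients, 3 examples per patient.
patient_ids = np.repeat(np.arange(10), 3)
X = np.random.default_rng(0).normal(size=(30, 4))
y = np.tile([0, 1, 0], 10)

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=patient_ids))

# No patient should contribute examples to both splits.
train_groups = set(patient_ids[train_idx])
test_groups = set(patient_ids[test_idx])
shared_groups = train_groups & test_groups
assert not shared_groups, f"Groups in both splits: {shared_groups}"
```

Note that `test_size` here is a fraction of the groups, not of the examples, so the number of test examples depends on how large the sampled groups are.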

Preprocessing Leakage

Preprocessing must be fit only on the training set.

This applies to:

Preprocessing step	Fit using
Mean and standard deviation	Training set only
Vocabulary construction	Training set only
Feature selection	Training set only
Imputation values	Training set only
PCA components	Training set only
Tokenizer adaptation	Training set only
Class weights	Training set only

For example, if missing values are filled using the median of the whole dataset, then test information leaks into training.

Correct pattern:

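from sklearn.impute import SimpleImputer

# Learn the fill values (per-feature medians here) from the training split only.
imputer = SimpleImputer(strategy="median")
imputer.fit(train_features)

train_features = imputer.transform(train_features)
val_features = imputer.transform(val_features)
test_features = imputer.transform(test_features)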

The operation is fit on training data and applied to the other splits.
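With cross-validation, the same rule applies inside every fold. One common way to enforce it, sketched here on synthetic data, is to wrap the preprocessing and the model in a scikit-learn pipeline so that each fold refits the preprocessing on its own training portion. The data and the choice of logistic regression are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration; about 10% of the entries are set to missing.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan

# The imputer and scaler are refit on the training part of each fold,
# so held-out rows never influence the preprocessing statistics.
pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(),
)
scores = cross_val_score(pipeline, X, y, cv=5)
```

Imputing or scaling `X` once before calling `cross_val_score` would reintroduce the leak, because every fold's held-out rows would have contributed to the statistics.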

Leakage Through Model Selection

The test set should not guide model choice.

If we evaluate ten models on the test set and choose the one with the best test score, the test set has become a validation set. The selected model’s test score is biased upward.

Correct workflow:

Train candidate models on the training set.
Compare candidates on the validation set.
Select one final model.
Evaluate once on the test set.

If the test set is used repeatedly, create a new held-out test set or report that the original test score is no longer a clean final estimate.
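The workflow above can be sketched as follows, using synthetic data and a small set of regularization strengths as the candidate models. The 60/20/20 split and the candidate values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

# 60% train, 20% validation, 20% test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0
)

# Steps 1 and 2: train each candidate and compare on the validation set.
candidates = {C: LogisticRegression(C=C) for C in (0.01, 0.1, 1.0, 10.0)}
val_scores = {}
for C, model in candidates.items():
    model.fit(X_train, y_train)
    val_scores[C] = model.score(X_val, y_val)

# Step 3: select one final model using validation scores only.
best_C = max(val_scores, key=val_scores.get)
final_model = candidates[best_C]

# Step 4: touch the test set exactly once, after selection is finished.
test_score = final_model.score(X_test, y_test)
```

The test data never appears inside the selection loop, which is what keeps `test_score` an unbiased estimate for the chosen model.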

Experimental Design

Experimental design defines how evidence is produced. A good experiment answers a clear question under controlled conditions.

For deep learning, an experiment should specify:

Component	Example
Dataset	Source, size, filters, split rule
Task	Classification, regression, retrieval
Inputs	Available features at prediction time
Target	Label definition and time horizon
Model	Architecture and parameter count
Loss	Training objective
Metrics	Primary and diagnostic metrics
Baselines	Simple and strong comparisons
Random seeds	Repeated runs when needed
Compute budget	Training steps, hardware, precision
Selection rule	How the final model is chosen

Without this information, a reported score is hard to interpret.
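One lightweight way to keep these components together is to record them in a single structured object that is saved next to the results. The sketch below uses a frozen dataclass; the class name `ExperimentConfig`, its field names, and all values are hypothetical placeholders.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    """Everything needed to interpret a reported score (illustrative fields)."""

    dataset: str
    split_rule: str
    task: str
    model: str
    loss: str
    primary_metric: str
    baselines: tuple
    seeds: tuple
    max_steps: int
    precision: str
    selection_rule: str


# Placeholder values for illustration only.
config = ExperimentConfig(
    dataset="example-dataset-v1",
    split_rule="group split by user_id, 80/10/10",
    task="classification",
    model="small-cnn",
    loss="cross_entropy",
    primary_metric="accuracy",
    baselines=("majority_class", "logistic_regression"),
    seeds=(0, 1, 2),
    max_steps=10_000,
    precision="float32",
    selection_rule="best validation accuracy",
)

# Serialize the configuration so it can be stored alongside the results.
config_json = json.dumps(asdict(config), indent=2, sort_keys=True)
```

Freezing the dataclass prevents the configuration from being modified after the run has started, which keeps the saved record consistent with what was executed.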

Baselines

A baseline is a simpler system used for comparison.

A deep model should be compared against reasonable baselines. Otherwise, improvement is difficult to judge.

Examples:

Task	Baseline
Classification	Majority class, logistic regression
Regression	Predict mean, linear regression
Image classification	Small CNN, pretrained ResNet
Text classification	Bag-of-words linear model
Retrieval	BM25
Forecasting	Last-value predictor
Recommendation	Popularity ranking

A baseline prevents false progress. If a large neural network barely beats a simple model, the added complexity may not be justified.
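For classification, the two baselines in the table take only a few lines with scikit-learn. The sketch below uses synthetic, imbalanced data (roughly 20% positives) to show why the majority-class score matters: on such data, high accuracy alone says little.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data for illustration: about 20% positive labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] > 0.84).astype(int)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Baseline 1: always predict the most frequent training class.
majority = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
# Baseline 2: a simple linear model.
linear = LogisticRegression().fit(X_train, y_train)

majority_acc = majority.score(X_val, y_val)
linear_acc = linear.score(X_val, y_val)
```

Any deep model evaluated on the same split should be reported next to both numbers, so that its margin over the simple systems is visible.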

Ablation Studies

An ablation study removes or changes one component at a time to measure its contribution.

For example, suppose a model uses:

Data augmentation
Dropout
Weight decay
Pretraining

An ablation study might train variants without each component:

Variant	Purpose
Full model	Reference system
Without augmentation	Measure augmentation contribution
Without dropout	Measure dropout contribution
Without weight decay	Measure weight decay contribution
Without pretraining	Measure pretraining contribution

Ablations help separate real improvements from accidental effects.
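The variants in the table can be generated mechanically from the full configuration, which avoids accidentally changing two components at once. This is a minimal sketch; the flag names and the helper `ablation_variants` are hypothetical, and the training call itself is left out.

```python
# Hypothetical configuration flags for the full model.
FULL_CONFIG = {
    "augmentation": True,
    "dropout": True,
    "weight_decay": True,
    "pretraining": True,
}


def ablation_variants(full_config):
    """Yield (name, config) pairs: the full model, then one variant per disabled component."""
    yield "full", dict(full_config)
    for component in full_config:
        variant = dict(full_config)
        variant[component] = False  # disable exactly one component
        yield f"without_{component}", variant


variants = dict(ablation_variants(FULL_CONFIG))
```

Each variant would then be trained with the same data, seeds, and compute budget as the full model, so that differences in the metric can be attributed to the removed component.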

Reproducibility

Reproducibility means that another run, or another researcher, can obtain the same result within expected variation.

Deep learning experiments are affected by random initialization, data order, augmentation, nondeterministic kernels, and hardware differences.

A basic reproducibility setup:

import random
import numpy as np
import torch

seed = 42

random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)

For stronger determinism:

torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

Full determinism can reduce performance and may not be possible for every operation. Still, recording seeds and environment details is essential.
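Recording the environment can be as simple as writing a small dictionary next to the results. The sketch below uses only the standard library; the helper name `describe_run` and the chosen fields are assumptions, and a real setup would typically also record library versions (for example `torch.__version__`) and the code revision.

```python
import json
import platform
import sys


def describe_run(seed: int) -> dict:
    """Collect basic details needed to rerun or audit an experiment."""
    return {
        "seed": seed,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
    }


run_info = describe_run(seed=42)

# Store this string alongside the metrics and checkpoints of the run.
run_info_json = json.dumps(run_info, indent=2)
```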

Reporting Results

A result should include enough detail to be checked.

A minimal report includes:

Field	Example
Dataset version	`imagenet-1k-2012`
Split rule	Stratified 80/10/10
Model	ResNet-50
Optimizer	AdamW
Learning rate	$3 \times 10^{-4}$
Batch size	256
Epochs or steps	90 epochs
Primary metric	Top-1 accuracy
Random seeds	3 runs
Mean and variation	$76.2 \pm 0.2$%
Hardware	8 A100 GPUs
Precision	bfloat16 mixed precision

Single-run scores can be misleading, especially on small datasets. Repeated runs make results more credible.
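The "mean and variation" entry can be computed directly from the per-seed scores. The three accuracies below are hypothetical values chosen to match the example row in the table; the variation reported here is the sample standard deviation.

```python
import statistics

# Hypothetical top-1 accuracies (%) from three runs with different seeds.
scores = [76.0, 76.2, 76.4]

mean = statistics.mean(scores)
std = statistics.stdev(scores)  # sample standard deviation (divides by n - 1)

summary = f"{mean:.1f} +/- {std:.1f}% over {len(scores)} runs"
```

With only a handful of runs the standard deviation is itself a rough estimate, so it helps to state the number of runs next to it, as the summary string does.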

Evaluation by Slice

Aggregate scores can hide failures.

A model may perform well overall while failing on rare classes, long inputs, specific languages, new users, certain devices, or recent data.

Slice evaluation computes metrics on meaningful subsets.

Slice type	Example
Class	Per-class accuracy
Length	Short versus long sequences
Source	Website, sensor, hospital
Time	Old versus recent examples
Geography	Region or country
Difficulty	Easy versus hard examples

Slice evaluation is especially important when the model will be used in high-stakes or heterogeneous environments.
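A per-slice metric needs only a slice label for each example. The sketch below computes accuracy by input length on toy predictions; the helper `accuracy_by_slice` and the data are illustrative.

```python
from collections import defaultdict


def accuracy_by_slice(y_true, y_pred, slice_labels):
    """Return a dict mapping each slice label to the accuracy on that slice."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, label in zip(y_true, y_pred, slice_labels):
        total[label] += 1
        correct[label] += int(truth == pred)
    return {label: correct[label] / total[label] for label in total}


# Toy labels and predictions for illustration only.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
lengths = ["short", "short", "short", "short", "short", "long", "long", "long"]

overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
per_slice = accuracy_by_slice(y_true, y_pred, lengths)
```

In this toy case the overall accuracy is 0.75, while the model is perfect on short inputs and correct on only one of three long inputs, which the aggregate number alone would not reveal.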

Checklist for Clean Experiments

Before trusting a result, check:

Question	Why it matters
Was the split created before preprocessing?	Prevents preprocessing leakage
Are duplicates removed across splits?	Prevents memorization
Are related examples grouped correctly?	Prevents group leakage
Does the split respect time?	Prevents future information leakage
Are test scores used only once?	Prevents model-selection bias
Is the baseline strong enough?	Prevents exaggerated claims
Are metrics aligned with costs?	Prevents optimizing the wrong behavior
Are random seeds recorded?	Supports reproducibility
Are failure slices inspected?	Reveals hidden weaknesses

Summary

Data leakage gives models information they should not have. It can enter through duplicates, preprocessing, labels, time, groups, augmentation, or repeated test-set use.

Good experimental design prevents leakage, defines clear splits, uses proper baselines, reports reproducible details, and evaluates meaningful slices.

A deep learning result is only as trustworthy as the experiment that produced it.