Automated Machine Learning

Automated machine learning, or AutoML, refers to systems that automate parts of the model development process. Hyperparameter optimization and neural architecture search are both parts of AutoML, but AutoML is broader. It may include data preprocessing, feature construction, model selection, training recipe selection, ensembling, compression, deployment, and monitoring.

In deep learning, AutoML usually means a system that searches over training configurations and model structures under a compute budget. The goal is not to remove human judgment. The goal is to make repeated experimental decisions systematic, logged, and reproducible.

What AutoML Optimizes

A deep learning project contains many choices. Some are numerical. Some are categorical. Some are structural. Some are operational.

Area	Examples
Data processing	Normalization, augmentation, tokenization, filtering
Model architecture	Depth, width, block type, attention type
Optimization	Optimizer, learning rate, weight decay, schedule
Regularization	Dropout, label smoothing, stochastic depth
Training system	Batch size, precision, gradient accumulation
Evaluation	Metric, validation split, threshold
Deployment	Quantization, pruning, latency target

An AutoML system defines a search space over these choices and evaluates candidate pipelines.

A complete configuration might include:

config = {
    "model": {
        "family": "transformer",
        "num_layers": 12,
        "hidden_dim": 768,
        "num_heads": 12,
        "mlp_ratio": 4,
    },
    "optimizer": {
        "name": "AdamW",
        "learning_rate": 3e-4,
        "weight_decay": 0.01,
    },
    "training": {
        "batch_size": 128,
        "epochs": 20,
        "mixed_precision": True,
    },
    "regularization": {
        "dropout": 0.1,
        "label_smoothing": 0.05,
    },
}

The configuration is then passed to a training pipeline.
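A search space over configurations like this one can be expressed as a function that samples each field. The sketch below is one possible shape, not a fixed API: `log_uniform`, `sample_config`, and all ranges and choice lists are illustrative assumptions. Learning rate and weight decay are drawn log-uniformly because they usually vary over orders of magnitude.

```python
import math
import random


def log_uniform(rng, low, high):
    """Sample uniformly in log space between low and high (both > 0)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_config(rng):
    """Draw one configuration; every range below is an illustrative assumption."""
    hidden_dim = rng.choice([256, 512, 768])
    return {
        "model": {
            "family": "transformer",
            "num_layers": rng.choice([6, 12, 24]),
            "hidden_dim": hidden_dim,
            # Fix the head size at 64 so hidden_dim stays divisible by num_heads.
            "num_heads": hidden_dim // 64,
            "mlp_ratio": 4,
        },
        "optimizer": {
            "name": "AdamW",
            "learning_rate": log_uniform(rng, 1e-5, 1e-3),
            "weight_decay": log_uniform(rng, 1e-4, 1e-1),
        },
        "training": {
            "batch_size": rng.choice([32, 64, 128]),
            "epochs": 20,
            "mixed_precision": True,
        },
        "regularization": {
            "dropout": rng.uniform(0.0, 0.3),
            "label_smoothing": rng.choice([0.0, 0.05, 0.1]),
        },
    }


rng = random.Random(0)  # fixed seed so the sampled trials are reproducible
candidates = [sample_config(rng) for _ in range(4)]
```

Tying `num_heads` to `hidden_dim` is an example of a dependent choice: sampling the two independently would produce invalid combinations that waste trials.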

Pipeline Search

AutoML searches over pipelines, not only isolated hyperparameters.

A pipeline is a sequence of decisions:

$$ \text{data} \rightarrow \text{preprocessing} \rightarrow \text{model} \rightarrow \text{training} \rightarrow \text{evaluation} \rightarrow \text{deployment}. $$

Each stage may have its own search space.

For image classification, the pipeline may include:

Stage	Search choices
Input	Image resolution
Augmentation	Crop scale, color jitter, mixup, cutmix
Model	CNN, ViT, hybrid model
Optimizer	SGD, AdamW
Schedule	Cosine decay, step decay, warmup
Regularization	Dropout, stochastic depth
Inference	Batch size, quantization

For NLP, the pipeline may include:

Stage	Search choices
Tokenization	Vocabulary size, subword method
Sequence handling	Max length, truncation, packing
Model	Encoder, decoder, encoder-decoder
Objective	Masked LM, causal LM, contrastive
Fine-tuning	Full fine-tune, LoRA, adapters
Retrieval	Chunk size, top-k documents
Inference	Decoding strategy, temperature

This broader view distinguishes AutoML from ordinary hyperparameter search.

AutoML as Nested Optimization

AutoML can be described as nested optimization.

The inner optimization trains model parameters:

$$ \theta^\ast(c) = \arg\min_{\theta} \mathcal{L}_{\text{train}}(\theta;c), $$

where $c$ is a full pipeline configuration.

The outer optimization selects the configuration:

$$ c^\ast = \arg\min_{c\in\mathcal{C}} \mathcal{L}_{\text{val}}(\theta^\ast(c);c). $$

Here $\mathcal{C}$ is the AutoML search space.

The outer loop is expensive because every candidate configuration may require training. This is why AutoML systems rely on early stopping, pruning, surrogate models, low-fidelity evaluations, and parallel execution.
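The two levels can be seen in a toy problem small enough to solve exactly. In this sketch the configuration $c$ is a single weight-decay value, the inner problem is a one-parameter ridge regression with a closed-form solution, and the outer problem is a search over four candidates. The data, the candidate values, and the function names are made up for illustration.

```python
def fit_ridge_1d(xs, ys, weight_decay):
    """Inner problem: minimize sum((theta*x - y)^2) + weight_decay * theta^2 in closed form."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + weight_decay)


def mse(theta, xs, ys):
    """Mean squared error of the one-parameter model y = theta * x."""
    return sum((theta * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Tiny made-up data set, split into training and validation parts.
train_x, train_y = [1.0, 2.0, 3.0], [1.2, 1.9, 3.2]
val_x, val_y = [1.5, 2.5], [1.4, 2.6]

search_space = [0.0, 0.1, 1.0, 10.0]  # candidate values of c (weight decay only)


def validation_loss(c):
    theta_star = fit_ridge_1d(train_x, train_y, c)  # inner optimization
    return mse(theta_star, val_x, val_y)            # outer objective


# Outer problem: pick the c whose trained parameters give the lowest validation loss.
best_c = min(search_space, key=validation_loss)
best_theta = fit_ridge_1d(train_x, train_y, best_c)
```

In deep learning the inner step is a full training run rather than a formula, which is where the cost comes from.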

Search Strategies

AutoML systems combine search strategies from earlier sections.

Strategy	Role in AutoML
Grid search	Small fixed spaces
Random search	Strong baseline for broad search
Bayesian optimization	Sample-efficient search
Population-based training	Dynamic hyperparameter schedules
Neural architecture search	Structural model choices
Multi-fidelity search	Cheap approximations before full training
Evolutionary algorithms	Irregular and discrete spaces

A mature AutoML system may use several strategies at once. For example, it may use random search for initial trials, Bayesian optimization after enough observations, pruning for poor configurations, and final retraining for the top candidates.

Multi-Fidelity Optimization

Full training is expensive. Multi-fidelity optimization evaluates many configurations cheaply, then spends more compute on promising ones.

Lower-fidelity evaluations include:

Lower fidelity method	Approximation
Fewer epochs	Train briefly
Smaller dataset	Train on a subset
Lower resolution	Use smaller images
Shorter sequence length	Use fewer tokens
Smaller model	Use proxy architecture
Fewer diffusion steps	Approximate generation quality

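The assumption is that cheap evaluations preserve the ranking of configurations: a configuration that does well at low fidelity should also tend to do well after full training. When this correlation is weak, for example when slow-starting configurations overtake fast starters late in training, low-fidelity search can discard good candidates.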

Successive halving is a common multi-fidelity strategy. It begins with many configurations trained for a small budget. It keeps the best fraction and increases their budgets.

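Hyperband extends this idea by running successive halving several times, each time with a different trade-off between the number of starting configurations and the budget each one initially receives.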

A simplified successive halving loop:

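# sample_many and train_and_evaluate are assumed helpers; a higher score is better.
configs = sample_many(search_space, n=64)
budget = 1  # epochs per configuration in the first round

while len(configs) > 1:
    results = []

    for config in configs:
        score = train_and_evaluate(config, epochs=budget)
        results.append((score, config))

    # Sort by score only; comparing the config dicts on a tie would raise TypeError.
    results.sort(key=lambda pair: pair[0], reverse=True)
    configs = [config for _, config in results[:len(results) // 2]]
    budget *= 2  # survivors get twice the budget in the next round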

This avoids fully training configurations that perform poorly early.
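Hyperband's outer loop can be sketched as a schedule of successive-halving brackets. The function below follows one common formulation of the bracket sizes. The names `hyperband_brackets`, `max_budget`, and `eta` are assumptions for this sketch, and real implementations differ in rounding details.

```python
def hyperband_brackets(max_budget, eta=3):
    """Return a list of brackets; each bracket is a list of (num_configs, budget) rungs."""
    # s_max is the largest s with eta**s <= max_budget.
    # An integer loop avoids floating-point errors from computing a logarithm.
    s_max = 0
    while eta ** (s_max + 1) <= max_budget:
        s_max += 1

    brackets = []
    for s in range(s_max, -1, -1):
        # Starting number of configurations: ceil((s_max + 1) * eta**s / (s + 1)).
        total = (s_max + 1) * eta ** s
        n = -(-total // (s + 1))
        r = max_budget / eta ** s  # starting budget per configuration
        # Each rung keeps roughly 1/eta of the configurations and gives them eta times the budget.
        rungs = [(n // eta ** i, r * eta ** i) for i in range(s + 1)]
        brackets.append(rungs)
    return brackets


schedule = hyperband_brackets(max_budget=81, eta=3)
for rungs in schedule:
    print(rungs)
```

The first bracket starts many configurations at the smallest budget and prunes aggressively. The last bracket trains a few configurations at the full budget with no pruning, which protects against cases where early scores are misleading.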

AutoML with PyTorch

A practical PyTorch AutoML system usually has four layers.

Layer	Responsibility
Configuration schema	Defines all tunable choices
Model factory	Builds models from configs
Training engine	Runs training and evaluation
Search controller	Chooses configs and records results

A simple model factory:

def build_model(config):
    family = config["model"]["family"]

    if family == "mlp":
        return MLP(
            input_dim=config["data"]["input_dim"],
            hidden_dim=config["model"]["hidden_dim"],
            output_dim=config["data"]["num_classes"],
            num_layers=config["model"]["num_layers"],
            dropout=config["regularization"]["dropout"],
        )

    if family == "cnn":
        return CNN(
            num_classes=config["data"]["num_classes"],
            channels=config["model"]["channels"],
            blocks=config["model"]["blocks"],
        )

    raise ValueError(f"unknown model family: {family}")

A training function:

def run_trial(config):
    model = build_model(config)
    train_loader, val_loader = build_loaders(config)

    optimizer = build_optimizer(model, config)
    scheduler = build_scheduler(optimizer, config)

    for epoch in range(config["training"]["epochs"]):
        train_one_epoch(model, train_loader, optimizer)
        scheduler.step()

    return evaluate(model, val_loader)

The search controller can call run_trial(config) using random search, Bayesian optimization, or another strategy.
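A minimal random-search controller might look like the following. This is a sketch under stated assumptions: `run_trial` here is a synthetic stand-in for the training function above so that the example runs on its own, and `random_search` and `sample_lr_only` are hypothetical names.

```python
import random


def run_trial(config):
    """Stand-in for the real run_trial: returns a synthetic validation score."""
    lr = config["optimizer"]["learning_rate"]
    return 1.0 - abs(lr - 3e-4) * 100  # made-up score that peaks at lr = 3e-4


def random_search(sample_fn, num_trials, seed=0):
    """Run num_trials sampled configurations; return (best_score, best_config, history)."""
    rng = random.Random(seed)
    history = []
    for trial_id in range(num_trials):
        config = sample_fn(rng)
        try:
            score = run_trial(config)
            status = "completed"
        except Exception as exc:  # one failed trial should not stop the search
            score, status = float("-inf"), f"failed: {exc}"
        history.append(
            {"trial_id": trial_id, "config": config, "score": score, "status": status}
        )
    best = max(history, key=lambda rec: rec["score"])
    return best["score"], best["config"], history


def sample_lr_only(rng):
    """Sample a config with only a learning rate, log-uniform in [1e-5, 1e-2]."""
    return {"optimizer": {"learning_rate": 10 ** rng.uniform(-5, -2)}}


best_score, best_config, history = random_search(sample_lr_only, num_trials=20)
```

Keeping the controller separate from `run_trial` means the sampling rule can later be swapped for a different strategy without touching the training code.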

Configuration Schemas

AutoML systems need explicit configuration schemas. The schema defines valid fields, types, defaults, and constraints.

A loose dictionary is easy to start with, but a schema becomes important as experiments grow.

Example schema style:

from dataclasses import dataclass

@dataclass
class ModelConfig:
    family: str
    hidden_dim: int
    num_layers: int
    dropout: float

@dataclass
class OptimizerConfig:
    name: str
    learning_rate: float
    weight_decay: float

@dataclass
class TrainingConfig:
    batch_size: int
    epochs: int
    mixed_precision: bool

@dataclass
class Config:
    model: ModelConfig
    optimizer: OptimizerConfig
    training: TrainingConfig

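A schema catches accidental errors early. With dataclasses, a misspelled or missing field raises a TypeError when the config object is constructed, instead of failing deep inside a training run. Type annotations on a plain dataclass are not checked at runtime, so invalid types and out-of-range values still need explicit validation, for example in __post_init__ or through a validation library. This matters because AutoML may run hundreds of experiments without human inspection.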
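Because a plain `@dataclass` records annotations without enforcing them, value checks have to be written explicitly. One way to do this is in `__post_init__`, as sketched below. The set of allowed families and the valid ranges are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    family: str
    hidden_dim: int
    num_layers: int
    dropout: float

    def __post_init__(self):
        # Dataclasses do not enforce annotations, so check types and ranges here.
        if self.family not in {"mlp", "cnn"}:  # assumed set of supported families
            raise ValueError(f"unknown model family: {self.family!r}")
        for name in ("hidden_dim", "num_layers"):
            value = getattr(self, name)
            # bool is a subclass of int, so exclude it explicitly.
            if not isinstance(value, int) or isinstance(value, bool) or value <= 0:
                raise ValueError(f"{name} must be a positive int, got {value!r}")
        if not 0.0 <= self.dropout < 1.0:
            raise ValueError(f"dropout must be in [0, 1), got {self.dropout!r}")


valid = ModelConfig(family="mlp", hidden_dim=256, num_layers=4, dropout=0.1)
```

Constructing `ModelConfig` with a misspelled keyword such as `dropuot` raises `TypeError`, and an out-of-range value such as `dropout=1.5` raises `ValueError`, both before any training starts.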

Experiment Tracking

AutoML produces many trials. Each trial should be logged completely.

At minimum, log:

Item	Purpose
Full configuration	Reproduce the run
Validation metrics	Compare candidates
Training metrics	Diagnose learning behavior
Random seed	Reproduce stochastic choices
Code version	Match implementation
Dataset version	Detect data changes between runs
Hardware information	Interpret speed and memory
Failure reason	Debug invalid regions
Checkpoint path	Reload top models

Without tracking, AutoML results become difficult to trust. The best trial may be impossible to reproduce.

A simple log record:

record = {
    "trial_id": trial_id,
    "config": config,
    "val_accuracy": val_accuracy,
    "val_loss": val_loss,
    "seed": seed,
    "status": "completed",
}

For production work, these records are usually stored in a database or experiment tracking system.
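Before adopting a full tracking system, an append-only JSON Lines file is often enough. The sketch below assumes one JSON object per line. The helper names, the file location, and the metric values are placeholders, and `failure_reason` is an added field for failed trials.

```python
import json
import os
import tempfile


def append_record(path, record):
    """Append one trial record as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")


def load_records(path):
    """Read all trial records back, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


log_path = os.path.join(tempfile.mkdtemp(), "trials.jsonl")  # placeholder location

append_record(log_path, {
    "trial_id": 0, "config": {"learning_rate": 3e-4},
    "val_accuracy": 0.91, "val_loss": 0.31, "seed": 0, "status": "completed",
})
append_record(log_path, {
    "trial_id": 1, "config": {"learning_rate": 1e-1},
    "val_accuracy": None, "val_loss": None, "seed": 0,
    "status": "failed", "failure_reason": "loss became NaN",
})

records = load_records(log_path)
completed = [r for r in records if r["status"] == "completed"]
best = max(completed, key=lambda r: r["val_accuracy"])
```

Failed trials are logged as well, since the regions of the search space where training breaks are useful information for later searches.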

Constraints and Deployment Objectives

AutoML often optimizes under constraints.

For example:

$$ \text{maximize accuracy} $$

subject to

$$ \text{latency} \le 20\text{ ms}, $$

$$ \text{memory} \le 512\text{ MB}. $$

The system can either reject configurations that violate hard constraints or fold the violations into the objective as penalties.

A scalar objective may be:

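$$ J(c) = \text{accuracy}(c) - \alpha \cdot \max(0,\text{latency}(c)-20) - \beta \cdot \max(0,\text{memory}(c)-512). $$

Here $\alpha$ and $\beta$ are positive penalty weights, latency is measured in milliseconds, and memory in megabytes, matching the constraints above. A configuration within both limits is scored by its accuracy alone.

Hard constraints are easier to interpret. Soft penalties are often easier to optimize, because a configuration that slightly exceeds a limit still returns an informative score instead of being discarded. In deployment-sensitive deep learning, both are common.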
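Both treatments can be written in a few lines. In this sketch the limits of 20 ms and 512 MB come from the example above, while the function names and the penalty weights `alpha` and `beta` are arbitrary assumptions that would need tuning for a real project.

```python
def penalized_objective(accuracy, latency_ms, memory_mb,
                        alpha=0.01, beta=0.001,
                        max_latency_ms=20.0, max_memory_mb=512.0):
    """Soft-constraint score: accuracy minus penalties for exceeding each limit."""
    latency_excess = max(0.0, latency_ms - max_latency_ms)
    memory_excess = max(0.0, memory_mb - max_memory_mb)
    return accuracy - alpha * latency_excess - beta * memory_excess


def satisfies_hard_constraints(latency_ms, memory_mb,
                               max_latency_ms=20.0, max_memory_mb=512.0):
    """Hard-constraint check: True only if both limits are met."""
    return latency_ms <= max_latency_ms and memory_mb <= max_memory_mb


# Two made-up candidates: (accuracy, latency in ms, memory in MB).
small_model = (0.90, 15.0, 400.0)
large_model = (0.92, 30.0, 400.0)

small_score = penalized_objective(*small_model)  # within limits, no penalty
large_score = penalized_objective(*large_model)  # 10 ms over the latency limit
```

With these weights the larger model loses 0.1 for its 10 ms of excess latency, so the smaller model scores higher despite its lower accuracy. The hard-constraint check would reject the larger model outright.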

Ensembling and Model Selection

Some AutoML systems produce not one model but a set of models. An ensemble combines predictions from multiple trained models.

For classification, an ensemble may average probabilities:

$$ p(y\mid x) = \frac{1}{M} \sum_{m=1}^{M} p_m(y\mid x). $$

Ensembles often improve accuracy and calibration, but they increase inference cost, roughly in proportion to the number of models $M$ when every model is evaluated for each input. They are useful when validation performance matters more than latency.
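The averaging rule is simple to implement. The sketch below uses plain Python lists and made-up probabilities from three models for a single input. `average_probabilities` is an illustrative name.

```python
def average_probabilities(model_probs):
    """Average class probabilities from M models for one input.

    model_probs is a list of M lists, each holding one probability per class.
    """
    num_models = len(model_probs)
    num_classes = len(model_probs[0])
    return [
        sum(probs[k] for probs in model_probs) / num_models
        for k in range(num_classes)
    ]


# Made-up outputs of three models over three classes.
probs = [
    [0.7, 0.2, 0.1],
    [0.5, 0.4, 0.1],
    [0.6, 0.1, 0.3],
]

ensemble = average_probabilities(probs)
predicted_class = max(range(len(ensemble)), key=ensemble.__getitem__)
```

Because each input row sums to one, the averaged vector also sums to one and can be used directly as a probability distribution.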

For deployment, the selected model may be the best single model under a cost budget. For competitions, the selected model may be an ensemble of top trials.

AutoML for Foundation Models

For foundation models, full AutoML over architecture and pretraining is usually too expensive. Instead, automation is applied to smaller decisions.

Common targets include:

Area	Automated choices
Fine-tuning	Learning rate, LoRA rank, batch size, epochs
Retrieval	Chunk size, embedding model, top-k, reranker
Prompting	Prompt templates, demonstrations, decoding
Alignment	Reward model settings, preference data mixture
Inference	Temperature, top-p, max tokens, caching
Compression	Quantization level, distillation target

For example, retrieval-augmented generation may search over:

rag_space = {
    "chunk_size": [256, 512, 1024],
    "chunk_overlap": [0, 64, 128],
    "top_k": [3, 5, 10, 20],
    "rerank": [True, False],
}

The objective may combine answer accuracy, citation quality, latency, and cost, either as a single weighted score or as one metric optimized under constraints on the others.
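A space this small can be enumerated exhaustively. The sketch below expands the grid with `itertools.product` and filters out invalid combinations. The helper name `enumerate_grid` and the rule that overlap must be smaller than chunk size are assumptions for illustration.

```python
import itertools

rag_space = {
    "chunk_size": [256, 512, 1024],
    "chunk_overlap": [0, 64, 128],
    "top_k": [3, 5, 10, 20],
    "rerank": [True, False],
}


def enumerate_grid(space):
    """Yield every combination of values as a dict keyed like the search space."""
    keys = list(space)
    for values in itertools.product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))


# Keep only combinations where the overlap is smaller than the chunk itself.
rag_configs = [
    config for config in enumerate_grid(rag_space)
    if config["chunk_overlap"] < config["chunk_size"]
]
```

The grid contains 3 × 3 × 4 × 2 = 72 combinations, all of which pass the filter here. Each one still requires building an index and running an evaluation set, so even a grid this size can be costly.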

Human Judgment in AutoML

AutoML does not make model development automatic in the strong sense. It automates search within a space chosen by humans.

Human judgment remains necessary for:

Decision	Why it matters
Problem formulation	Defines what should be optimized
Data quality	Often dominates model performance
Search space design	Determines what can be found
Metric choice	Controls what the system prefers
Constraint design	Reflects deployment reality
Result interpretation	Detects spurious wins
Final validation	Confirms results on data the search never used

A system can optimize the wrong objective very efficiently. This is a common failure mode.

Common Failure Modes

AutoML can fail in predictable ways.

First, it can overfit the validation set. Testing hundreds or thousands of configurations increases the chance that one looks good by noise.

Second, it can exploit metric weaknesses. If the metric is incomplete, the search may find configurations that improve the metric while harming real performance.

Third, it can use unfair comparisons. Some trials may receive more compute, better preprocessing, or different data.

Fourth, it can ignore deployment constraints. A model with high validation accuracy may be too large, slow, or expensive.

Fifth, it can produce irreproducible results when random seeds, code versions, or dataset versions are missing.
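The first failure mode can be illustrated with a small simulation. It assumes every configuration has the same true accuracy and that each validation measurement adds independent Gaussian noise. The accuracy of 0.80 and the noise level of 0.01 are made-up values.

```python
import random


def best_of_n_noisy_trials(true_accuracy, noise_std, num_trials, seed=0):
    """Return the best observed score among trials that are all equally good in truth."""
    rng = random.Random(seed)
    return max(
        true_accuracy + rng.gauss(0.0, noise_std)
        for _ in range(num_trials)
    )


for n in (1, 10, 100, 1000):
    best_observed = best_of_n_noisy_trials(0.80, 0.01, n)
    print(f"{n:5d} trials: best observed accuracy = {best_observed:.4f}")
```

Since no configuration here is actually better than another, any gap between the best observed score and 0.80 is pure selection noise. The gap grows as more trials are run, which is why the winning validation score tends to overstate performance on fresh data.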

Practical Guidelines

Start with a strong manual baseline. AutoML should improve on a known reference, not replace basic modeling discipline.

Keep the first search space small. Search only high-impact choices such as learning rate, weight decay, batch size, dropout, and one or two architecture dimensions.

Use multi-fidelity methods when full training is expensive. Promote promising configurations and stop weak ones early.

Retrain the best configurations from scratch. This checks whether the result is stable.

Reserve a final test set. Use it only once the search process is complete.

Log everything needed to reproduce the result.
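The retraining guideline can be made concrete by repeating the best configuration with several seeds and reporting the spread. In this sketch `train_and_evaluate` is a synthetic stand-in that returns a made-up accuracy with seed-dependent noise. A real version would run a full training job.

```python
import random
import statistics


def train_and_evaluate(config, seed):
    """Stand-in for a full training run: made-up accuracy plus seed-dependent noise."""
    rng = random.Random(seed)
    return 0.90 + rng.gauss(0.0, 0.005)


def retrain_with_seeds(config, seeds):
    """Retrain one configuration once per seed; return (mean, standard deviation, scores)."""
    scores = [train_and_evaluate(config, seed) for seed in seeds]
    return statistics.mean(scores), statistics.stdev(scores), scores


best_config = {"learning_rate": 3e-4, "weight_decay": 0.01}  # placeholder values
mean_score, std_score, scores = retrain_with_seeds(best_config, seeds=[0, 1, 2, 3, 4])
```

If the gap between the top two configurations is smaller than this standard deviation, the search has not really separated them, and the cheaper or simpler one is a reasonable choice.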

Summary

Automated machine learning searches over training pipelines, model choices, hyperparameters, and deployment settings. In deep learning, it combines hyperparameter optimization, neural architecture search, early pruning of weak trials, multi-fidelity evaluation, experiment tracking, and final model selection.

AutoML is useful when the search space is well designed, metrics are reliable, and experiments are logged carefully. It is most effective as a disciplined engineering system around model development, rather than a substitute for understanding data, objectives, and deployment constraints.