Automated Machine Learning

Automated machine learning, or AutoML, refers to systems that automate parts of the model development process. Hyperparameter optimization and neural architecture search are both parts of AutoML, but AutoML is broader. It may include data preprocessing, feature construction, model selection, training recipe selection, ensembling, compression, deployment, and monitoring.

In deep learning, AutoML usually means a system that searches over training configurations and model structures under a compute budget. The goal is not to remove human judgment. The goal is to make repeated experimental decisions systematic, logged, and reproducible.

What AutoML Optimizes

A deep learning project contains many choices. Some are numerical, some categorical, some structural, and some operational.

| Area | Examples |
| --- | --- |
| Data processing | Normalization, augmentation, tokenization, filtering |
| Model architecture | Depth, width, block type, attention type |
| Optimization | Optimizer, learning rate, weight decay, schedule |
| Regularization | Dropout, label smoothing, stochastic depth |
| Training system | Batch size, precision, gradient accumulation |
| Evaluation | Metric, validation split, threshold |
| Deployment | Quantization, pruning, latency target |

An AutoML system defines a search space over these choices and evaluates candidate pipelines.

A complete configuration might include:

config = {
    "model": {
        "family": "transformer",
        "num_layers": 12,
        "hidden_dim": 768,
        "num_heads": 12,
        "mlp_ratio": 4,
    },
    "optimizer": {
        "name": "AdamW",
        "learning_rate": 3e-4,
        "weight_decay": 0.01,
    },
    "training": {
        "batch_size": 128,
        "epochs": 20,
        "mixed_precision": True,
    },
    "regularization": {
        "dropout": 0.1,
        "label_smoothing": 0.05,
    },
}

The configuration is then passed to a training pipeline.
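One way to picture the search space behind such a configuration is a nested dictionary whose leaves describe how each value is drawn. The sketch below is only an illustration: the `SEARCH_SPACE` contents, the `("choice", ...)` and `("loguniform", ...)` tuple encoding, and the `sample_config` helper are all assumptions made for this example, not part of any particular library.

```python
import math
import random

# Hypothetical search space. Each leaf is ("choice", options),
# ("loguniform", low, high), or a fixed value that is not searched.
SEARCH_SPACE = {
    "model": {
        "family": "transformer",
        "num_layers": ("choice", [6, 12, 24]),
        "hidden_dim": ("choice", [512, 768, 1024]),
    },
    "optimizer": {
        "name": ("choice", ["AdamW", "SGD"]),
        "learning_rate": ("loguniform", 1e-5, 1e-2),
        "weight_decay": ("loguniform", 1e-4, 1e-1),
    },
    "regularization": {
        "dropout": ("choice", [0.0, 0.1, 0.2]),
    },
}


def sample_config(space, rng):
    """Recursively draw one concrete configuration from the space."""
    config = {}
    for key, spec in space.items():
        if isinstance(spec, dict):
            config[key] = sample_config(spec, rng)
        elif isinstance(spec, tuple) and spec[0] == "choice":
            config[key] = rng.choice(spec[1])
        elif isinstance(spec, tuple) and spec[0] == "loguniform":
            low, high = spec[1], spec[2]
            # Sample uniformly in log space, then map back.
            config[key] = math.exp(rng.uniform(math.log(low), math.log(high)))
        else:
            config[key] = spec  # fixed value
    return config


rng = random.Random(0)
sampled = sample_config(SEARCH_SPACE, rng)
```

Sampling the learning rate and weight decay on a log scale is a common choice because their useful values span several orders of magnitude.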

AutoML searches over pipelines, not only isolated hyperparameters.

A pipeline is a sequence of decisions:

$$ \text{data} \rightarrow \text{preprocessing} \rightarrow \text{model} \rightarrow \text{training} \rightarrow \text{evaluation} \rightarrow \text{deployment}. $$

Each stage may have its own search space.

For image classification, the pipeline may include:

| Stage | Search choices |
| --- | --- |
| Input | Image resolution |
| Augmentation | Crop scale, color jitter, mixup, cutmix |
| Model | CNN, ViT, hybrid model |
| Optimizer | SGD, AdamW |
| Schedule | Cosine decay, step decay, warmup |
| Regularization | Dropout, stochastic depth |
| Inference | Batch size, quantization |

For NLP, the pipeline may include:

| Stage | Search choices |
| --- | --- |
| Tokenization | Vocabulary size, subword method |
| Sequence handling | Max length, truncation, packing |
| Model | Encoder, decoder, encoder-decoder |
| Objective | Masked LM, causal LM, contrastive |
| Fine-tuning | Full fine-tune, LoRA, adapters |
| Retrieval | Chunk size, top-k documents |
| Inference | Decoding strategy, temperature |

This broader view distinguishes AutoML from ordinary hyperparameter search.

AutoML as Nested Optimization

AutoML can be described as nested optimization.

The inner optimization trains model parameters:

$$ \theta^\ast(c) = \arg\min_{\theta} \mathcal{L}_{\text{train}}(\theta;c), $$

where $c$ is a full pipeline configuration.

The outer optimization selects the configuration:

$$ c^\ast = \arg\min_{c\in\mathcal{C}} \mathcal{L}_{\text{val}}(\theta^\ast(c);c). $$

Here $\mathcal{C}$ is the AutoML search space.

The outer loop is expensive because every candidate configuration may require training. This is why AutoML systems rely on early stopping, pruning, surrogate models, low-fidelity evaluations, and parallel execution.
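The two levels can be made concrete with a toy problem. In the sketch below, which is an illustration under stated assumptions rather than a real training setup, the "model" is a single weight fitted to noisy linear data, the configuration $c$ is just a weight-decay value, and the helper names (`make_data`, `inner_train`, `outer_search`) are hypothetical.

```python
import random


def make_data(n, rng):
    """Noisy samples of y = 2x (synthetic data for illustration)."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0.0, 0.3) for x in xs]
    return xs, ys


def mse(w, xs, ys):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def inner_train(weight_decay, xs, ys, steps=200, lr=0.1):
    """Inner loop: approximate theta*(c) by gradient descent on the training loss."""
    w = 0.0
    for _ in range(steps):
        grad = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        grad += 2.0 * weight_decay * w  # gradient of the L2 penalty
        w -= lr * grad
    return w


def outer_search(candidates, train, val):
    """Outer loop: pick the candidate c with the lowest validation loss."""
    best = None
    for c in candidates:
        w = inner_train(c, *train)   # one full inner optimization per candidate
        val_loss = mse(w, *val)
        if best is None or val_loss < best[1]:
            best = (c, val_loss, w)
    return best


rng = random.Random(0)
train, val = make_data(50, rng), make_data(50, rng)
candidates = [0.0, 0.01, 0.1, 1.0]
best_c, best_val_loss, best_w = outer_search(candidates, train, val)
```

Even in this toy, every outer-loop candidate costs one complete inner optimization, which is the source of the expense described above.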

Search Strategies

AutoML systems combine search strategies from earlier sections.

| Strategy | Role in AutoML |
| --- | --- |
| Grid search | Small fixed spaces |
| Random search | Strong baseline for broad search |
| Bayesian optimization | Sample-efficient search |
| Population-based training | Dynamic hyperparameter schedules |
| Neural architecture search | Structural model choices |
| Multi-fidelity search | Cheap approximations before full training |
| Evolutionary algorithms | Irregular and discrete spaces |

A mature AutoML system may use several strategies at once. For example, it may use random search for initial trials, Bayesian optimization after enough observations, pruning for poor configurations, and final retraining for the top candidates.

Multi-Fidelity Optimization

Full training is expensive. Multi-fidelity optimization evaluates many configurations cheaply, then spends more compute on promising ones.

Lower-fidelity evaluations include:

| Lower-fidelity method | Approximation |
| --- | --- |
| Fewer epochs | Train briefly |
| Smaller dataset | Train on a subset |
| Lower resolution | Use smaller images |
| Shorter sequence length | Use fewer tokens |
| Smaller model | Use proxy architecture |
| Fewer diffusion steps | Approximate generation quality |

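The underlying assumption is that cheap evaluations are correlated with full training results, at least in how they rank configurations. When that correlation is weak, early pruning can discard configurations that would have won with the full budget.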
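This assumption can be checked on a handful of configurations before committing to a multi-fidelity search. A minimal sketch, assuming a small pilot set of configurations has been evaluated at both fidelities: the accuracy values below are made up, and `ranks` and `spearman` are hypothetical helpers that compute the Spearman rank correlation for lists without ties.

```python
def ranks(values):
    """Rank positions (0 = smallest). Assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


def spearman(a, b):
    """Spearman rank correlation for two equal-length lists without ties."""
    n = len(a)
    ra, rb = ranks(a), ranks(b)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Made-up validation accuracies for five pilot configurations.
low_fidelity = [0.61, 0.55, 0.70, 0.48, 0.66]   # e.g. after a short run
full_fidelity = [0.84, 0.80, 0.88, 0.71, 0.82]  # e.g. after full training

rho = spearman(low_fidelity, full_fidelity)
```

A value of `rho` close to 1 suggests the cheap evaluation preserves the ranking well. A low value suggests the low-fidelity budget should be raised before it is used for pruning.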

Successive halving is a common multi-fidelity strategy. It begins with many configurations trained for a small budget. It keeps the best fraction and increases their budgets.

Hyperband extends this idea by trying multiple budget schedules.

A simplified successive halving loop:

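configs = sample_many(search_space, n=64)
budget = 1  # epochs per configuration in the current round

while len(configs) > 1:
    results = []

    for config in configs:
        # Assumes a higher score is better (e.g. validation accuracy).
        score = train_and_evaluate(config, epochs=budget)
        results.append((score, config))

    # Sort on the score only; comparing dict configs on tied scores would raise TypeError.
    results.sort(key=lambda item: item[0], reverse=True)
    configs = [config for score, config in results[:len(results) // 2]]
    budget *= 2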

This avoids fully training configurations that perform poorly early.

AutoML with PyTorch

A practical PyTorch AutoML system usually has four layers.

| Layer | Responsibility |
| --- | --- |
| Configuration schema | Defines all tunable choices |
| Model factory | Builds models from configs |
| Training engine | Runs training and evaluation |
| Search controller | Chooses configs and records results |

A simple model factory:

def build_model(config):
    family = config["model"]["family"]

    if family == "mlp":
        return MLP(
            input_dim=config["data"]["input_dim"],
            hidden_dim=config["model"]["hidden_dim"],
            output_dim=config["data"]["num_classes"],
            num_layers=config["model"]["num_layers"],
            dropout=config["regularization"]["dropout"],
        )

    if family == "cnn":
        return CNN(
            num_classes=config["data"]["num_classes"],
            channels=config["model"]["channels"],
            blocks=config["model"]["blocks"],
        )

    raise ValueError(f"unknown model family: {family}")

A training function:

def run_trial(config):
    model = build_model(config)
    train_loader, val_loader = build_loaders(config)

    optimizer = build_optimizer(model, config)
    scheduler = build_scheduler(optimizer, config)

    for epoch in range(config["training"]["epochs"]):
        train_one_epoch(model, train_loader, optimizer)
        scheduler.step()

    return evaluate(model, val_loader)

The search controller can call run_trial(config) using random search, Bayesian optimization, or another strategy.
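A minimal random-search controller might look like the sketch below. It assumes `run_trial` returns a score where higher is better and may raise on failure. `random_search`, `toy_sample_config`, and `toy_run_trial` are hypothetical names, and the two toy functions are stand-ins so the example runs without a real training pipeline.

```python
import math
import random


def random_search(run_trial, sample_config, n_trials, seed=0):
    """Run n_trials sampled configs; return (completed best-first, all records)."""
    rng = random.Random(seed)
    records = []
    for trial_id in range(n_trials):
        config = sample_config(rng)
        try:
            score = run_trial(config)
            status = "completed"
        except Exception as exc:  # e.g. out-of-memory or divergence
            score = None
            status = f"failed: {exc}"
        records.append({"trial_id": trial_id, "config": config,
                        "score": score, "status": status})
    completed = [r for r in records if r["status"] == "completed"]
    completed.sort(key=lambda r: r["score"], reverse=True)
    return completed, records


def toy_sample_config(rng):
    """Stand-in sampler: learning rate drawn log-uniformly from [1e-5, 1e-1]."""
    return {"optimizer": {"learning_rate": 10 ** rng.uniform(-5, -1)}}


def toy_run_trial(config):
    """Stand-in for training: the score peaks at lr = 1e-3, large lr 'diverges'."""
    lr = config["optimizer"]["learning_rate"]
    if lr > 3e-2:
        raise RuntimeError("loss diverged")
    return 1.0 - abs(math.log10(lr) + 3.0) / 4.0


completed, records = random_search(toy_run_trial, toy_sample_config, n_trials=20)
```

Failed trials are kept in `records` rather than dropped, so invalid regions of the search space remain visible afterwards.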

Configuration Schemas

AutoML systems need explicit configuration schemas. The schema defines valid fields, types, defaults, and constraints.

A loose dictionary is easy to start with, but a schema becomes important as experiments grow.

Example schema style:

from dataclasses import dataclass

@dataclass
class ModelConfig:
    family: str
    hidden_dim: int
    num_layers: int
    dropout: float

@dataclass
class OptimizerConfig:
    name: str
    learning_rate: float
    weight_decay: float

@dataclass
class TrainingConfig:
    batch_size: int
    epochs: int
    mixed_precision: bool

@dataclass
class Config:
    model: ModelConfig
    optimizer: OptimizerConfig
    training: TrainingConfig

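A schema guards against accidental errors such as misspelled keys, missing fields, or invalid types. Standard Python dataclasses catch misspelled and missing fields at construction time but do not check field types at runtime, so type and range checks need explicit validation. This matters because AutoML may run hundreds of experiments without human inspection.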
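Runtime checks can be attached to a dataclass through `__post_init__`. The sketch below is a minimal illustration. `ValidatedOptimizerConfig`, `optimizer_config_from_dict`, and the set of allowed optimizer names are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidatedOptimizerConfig:
    name: str
    learning_rate: float
    weight_decay: float = 0.0

    def __post_init__(self):
        # Annotations are not enforced at runtime, so validate explicitly.
        if self.name not in {"AdamW", "SGD"}:
            raise ValueError(f"unknown optimizer: {self.name!r}")
        # Strict check: an int such as 1 is rejected as well as a string.
        if not isinstance(self.learning_rate, float) or self.learning_rate <= 0:
            raise ValueError("learning_rate must be a positive float")
        if self.weight_decay < 0:
            raise ValueError("weight_decay must be non-negative")


def optimizer_config_from_dict(raw):
    """Build the config from a plain dict; unknown or missing keys raise TypeError."""
    return ValidatedOptimizerConfig(**raw)


ok = optimizer_config_from_dict({"name": "AdamW", "learning_rate": 3e-4})
```

With this in place, a misspelled key such as `learning_rat` fails when the config is constructed instead of silently falling back to a default deep inside a training run.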

Experiment Tracking

AutoML produces many trials. Each trial should be logged completely.

At minimum, log:

| Item | Purpose |
| --- | --- |
| Full configuration | Reproduce the run |
| Validation metrics | Compare candidates |
| Training metrics | Diagnose learning behavior |
| Random seed | Reproduce stochastic choices |
| Code version | Match implementation |
| Dataset version | Avoid data drift |
| Hardware information | Interpret speed and memory |
| Failure reason | Debug invalid regions |
| Checkpoint path | Reload top models |

Without tracking, AutoML results become difficult to trust. The best trial may be impossible to reproduce.

A simple log record:

record = {
    "trial_id": trial_id,
    "config": config,
    "val_accuracy": val_accuracy,
    "val_loss": val_loss,
    "seed": seed,
    "status": "completed",
}

For production work, these records are usually stored in a database or experiment tracking system.
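For small projects, an append-only JSON Lines file is a simple stand-in for such a system. The following sketch assumes records are JSON-serializable dictionaries. `append_record`, `load_records`, the file name, and the example values are hypothetical.

```python
import json
import os
import tempfile


def append_record(path, record):
    """Append one trial record as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")


def load_records(path):
    """Read all trial records back, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# A temporary directory keeps the example self-contained.
log_path = os.path.join(tempfile.mkdtemp(), "trials.jsonl")

append_record(log_path, {"trial_id": 0, "config": {"lr": 3e-4},
                         "val_accuracy": 0.81, "seed": 0, "status": "completed"})
append_record(log_path, {"trial_id": 1, "config": {"lr": 1e-3},
                         "val_accuracy": 0.84, "seed": 0, "status": "completed"})

logged = load_records(log_path)
best = max(logged, key=lambda r: r["val_accuracy"])
```

Appending one line per trial means a crashed search still leaves every finished trial on disk.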

Constraints and Deployment Objectives

AutoML often optimizes under constraints.

For example:

$$ \text{maximize accuracy} $$

subject to

$$ \text{latency} \le 20\text{ ms}, $$

$$ \text{memory} \le 512\text{ MB}. $$

The system can reject configurations that violate hard constraints or include penalties in the objective.

A scalar objective may be:

$$ J(c) = \text{accuracy}(c) - \alpha \cdot \max(0,\text{latency}(c)-20) - \beta \cdot \max(0,\text{memory}(c)-512). $$

Hard constraints are easier to interpret. Soft penalties are easier to optimize. In deployment-sensitive deep learning, both are common.
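Both treatments are short to express in code. The sketch below mirrors the scalar objective $J(c)$ above. The penalty weights `alpha` and `beta` and the example measurements are arbitrary values chosen for illustration, and the function names are hypothetical.

```python
def is_feasible(latency_ms, memory_mb, max_latency_ms=20.0, max_memory_mb=512.0):
    """Hard-constraint check: reject any configuration over either limit."""
    return latency_ms <= max_latency_ms and memory_mb <= max_memory_mb


def penalized_objective(accuracy, latency_ms, memory_mb,
                        alpha=0.01, beta=0.001,
                        max_latency_ms=20.0, max_memory_mb=512.0):
    """Soft-penalty objective J(c): accuracy minus weighted constraint excess."""
    latency_excess = max(0.0, latency_ms - max_latency_ms)
    memory_excess = max(0.0, memory_mb - max_memory_mb)
    return accuracy - alpha * latency_excess - beta * memory_excess


# Two made-up candidates: a small model within limits and a larger one over them.
small = penalized_objective(accuracy=0.90, latency_ms=18.0, memory_mb=400.0)
large = penalized_objective(accuracy=0.93, latency_ms=30.0, memory_mb=600.0)
```

With these weights the smaller model scores higher despite its lower accuracy. The choice of `alpha` and `beta` fixes how much accuracy is traded for each millisecond or megabyte over the limit, so the weights need the same care as the constraints themselves.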

Ensembling and Model Selection

Some AutoML systems produce not one model but a set of models. An ensemble combines predictions from multiple trained models.

For classification, an ensemble may average probabilities:

$$ p(y\mid x) = \frac{1}{M} \sum_{m=1}^{M} p_m(y\mid x). $$

Ensembles often improve accuracy and calibration, but inference cost grows roughly in proportion to the number of models $M$. They are useful when predictive performance matters more than latency.
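The averaging rule is simple to write down for a single input. A minimal sketch, assuming each model returns a probability vector over the same classes. The probability values are made up and `average_probabilities` is a hypothetical helper.

```python
def average_probabilities(model_probs):
    """Average class probabilities from M models for one input."""
    num_models = len(model_probs)
    num_classes = len(model_probs[0])
    return [sum(p[k] for p in model_probs) / num_models
            for k in range(num_classes)]


# Made-up outputs of three models for one input with three classes.
model_probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.5, 0.4, 0.1],
]

ensemble = average_probabilities(model_probs)
predicted = max(range(len(ensemble)), key=lambda k: ensemble[k])
```

Here the second model alone would predict class 1, while the averaged distribution favors class 0. Because each input is a valid distribution, the average also sums to one.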

For deployment, the selected model may be the best single model under a cost budget. For competitions, the selected model may be an ensemble of top trials.

AutoML for Foundation Models

For foundation models, full AutoML over architecture and pretraining is usually too expensive. Instead, automation is applied to smaller decisions.

Common targets include:

| Area | Automated choices |
| --- | --- |
| Fine-tuning | Learning rate, LoRA rank, batch size, epochs |
| Retrieval | Chunk size, embedding model, top-k, reranker |
| Prompting | Prompt templates, demonstrations, decoding |
| Alignment | Reward model settings, preference data mixture |
| Inference | Temperature, top-p, max tokens, caching |
| Compression | Quantization level, distillation target |

For example, retrieval-augmented generation may search over:

rag_space = {
    "chunk_size": [256, 512, 1024],
    "chunk_overlap": [0, 64, 128],
    "top_k": [3, 5, 10, 20],
    "rerank": [True, False],
}

The objective may combine answer accuracy, citation quality, latency, and cost.
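Because this space is small and discrete, it can be enumerated exhaustively. The sketch below expands it into candidate configurations with `itertools.product`. The `enumerate_grid` helper and the overlap-smaller-than-chunk validity rule are assumptions added for illustration.

```python
import itertools

rag_space = {
    "chunk_size": [256, 512, 1024],
    "chunk_overlap": [0, 64, 128],
    "top_k": [3, 5, 10, 20],
    "rerank": [True, False],
}


def enumerate_grid(space):
    """Yield every combination of values as a configuration dict."""
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


# Assumed validity rule: the overlap must be smaller than the chunk itself.
rag_candidates = [c for c in enumerate_grid(rag_space)
                  if c["chunk_overlap"] < c["chunk_size"]]
```

This yields 3 × 3 × 4 × 2 = 72 candidates, all of which pass the validity rule here. Each one still requires a full evaluation run over the question set, so even a grid this small can be costly.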

Human Judgment in AutoML

AutoML does not make model development automatic in the strong sense. It automates search within a space chosen by humans.

Human judgment remains necessary for:

| Decision | Why it matters |
| --- | --- |
| Problem formulation | Defines what should be optimized |
| Data quality | Often dominates model performance |
| Search space design | Determines what can be found |
| Metric choice | Controls what the system prefers |
| Constraint design | Reflects deployment reality |
| Result interpretation | Detects spurious wins |
| Final validation | Prevents test-set leakage |

A system can optimize the wrong objective very efficiently. This is a common failure mode.

Common Failure Modes

AutoML can fail in predictable ways.

First, it can overfit the validation set. Testing hundreds or thousands of configurations increases the chance that one looks good by noise.

Second, it can exploit metric weaknesses. If the metric is incomplete, the search may find configurations that improve the metric while harming real performance.

Third, it can use unfair comparisons. Some trials may receive more compute, better preprocessing, or different data.

Fourth, it can ignore deployment constraints. A model with high validation accuracy may be too large, slow, or expensive.

Fifth, it can produce irreproducible results when random seeds, code versions, or dataset versions are missing.
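The first failure mode can be illustrated with a small simulation. The sketch below assumes every configuration has exactly the same true accuracy and that validation scores differ only by Gaussian noise. The accuracy of 0.80, the noise level of 0.01, and the helper names are arbitrary choices for the illustration.

```python
import random

TRUE_ACCURACY = 0.80   # assumed identical for every configuration
NOISE_STD = 0.01       # assumed evaluation noise on the validation set


def best_observed(n_configs, rng):
    """Best noisy validation score among n equally good configurations."""
    return max(TRUE_ACCURACY + rng.gauss(0.0, NOISE_STD)
               for _ in range(n_configs))


def average_best(n_configs, repeats=200, seed=0):
    """Average the best observed score over repeated simulated searches."""
    rng = random.Random(seed)
    total = sum(best_observed(n_configs, rng) for _ in range(repeats))
    return total / repeats


# How far the selected score exceeds the true accuracy, by search size.
inflation = {n: average_best(n) - TRUE_ACCURACY for n in (1, 10, 100, 1000)}
```

The gap between the selected score and the true accuracy grows with the number of trials even though no configuration is actually better. This is why the winner of a large search should be confirmed on held-out data.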

Practical Guidelines

Start with a strong manual baseline. AutoML should improve on a known reference, not replace basic modeling discipline.

Keep the first search space small. Search only high-impact choices such as learning rate, weight decay, batch size, dropout, and one or two architecture dimensions.

Use multi-fidelity methods when full training is expensive. Promote promising configurations and stop weak ones early.

Retrain the best configurations from scratch. This checks whether the result is stable.

Reserve a final test set. Use it only after the search process is complete, and only once.

Log everything needed to reproduce the result.

Summary

Automated machine learning searches over training pipelines, model choices, hyperparameters, and deployment settings. In deep learning, it combines hyperparameter optimization, neural architecture search, pruning, multi-fidelity evaluation, experiment tracking, and final model selection.

AutoML is useful when the search space is well designed, metrics are reliable, and experiments are logged carefully. It is most effective as a disciplined engineering system around model development, rather than a substitute for understanding data, objectives, and deployment constraints.