Neural Architecture Search

Neural architecture search, or NAS, is the process of automatically searching for model architectures. Ordinary hyperparameter optimization usually tunes values such as learning rate, batch size, dropout, or weight decay. NAS searches the structure of the network itself.

Architecture choices include the number of layers, hidden width, convolution kernel sizes, attention heads, skip connections, normalization placement, activation functions, and block types. In large models, architecture search may also include mixture-of-experts routing, context length, embedding dimension, MLP expansion ratio, and parameter sharing.

The Architecture Search Problem

A neural architecture defines a function class. Once the architecture is fixed, training chooses parameters inside that class.

Let $a$ denote an architecture and $\theta$ denote its trainable parameters. Training solves

$$ \theta^\ast(a) = \arg\min_{\theta} \mathcal{L}_{\text{train}}(\theta; a). $$

Architecture search chooses the architecture that performs best on validation data:

$$ a^\ast = \arg\min_{a\in\mathcal{A}} \mathcal{L}_{\text{val}}(\theta^\ast(a); a). $$

Here $\mathcal{A}$ is the architecture search space.

This is harder than ordinary hyperparameter optimization because architectures may have different tensor shapes, parameter counts, memory costs, and training dynamics.
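The two nested optimization problems above can be sketched on a toy problem. The example below is only an illustration under stated assumptions: the "architecture" is the number of bins of a piecewise-constant regressor, the inner training problem has a closed-form solution (the mean target per bin), and the data, the search space, and the function names `train`, `loss`, and `make_data` are all invented for this sketch.

```python
import math
import random

def train(num_bins, data):
    """Inner problem: for a piecewise-constant model under squared error,
    the optimal parameter of each bin is the mean target in that bin."""
    sums = [0.0] * num_bins
    counts = [0] * num_bins
    for x, y in data:
        b = min(int(x * num_bins), num_bins - 1)
        sums[b] += y
        counts[b] += 1
    # Empty bins fall back to predicting 0.0.
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

def loss(theta, data):
    """Mean squared error of the binned model with parameters theta."""
    num_bins = len(theta)
    total = 0.0
    for x, y in data:
        b = min(int(x * num_bins), num_bins - 1)
        total += (theta[b] - y) ** 2
    return total / len(data)

def make_data(n, rng):
    """Synthetic regression data: a noisy sine wave on [0, 1)."""
    data = []
    for _ in range(n):
        x = rng.random()
        data.append((x, math.sin(2 * math.pi * x) + rng.gauss(0.0, 0.1)))
    return data

rng = random.Random(0)
train_data = make_data(40, rng)
val_data = make_data(40, rng)

# Outer problem: choose the architecture with the lowest validation loss.
search_space = [1, 2, 4, 8, 16, 64]
val_losses = {a: loss(train(a, train_data), val_data) for a in search_space}
best_arch = min(val_losses, key=val_losses.get)
```

Every entry of `val_losses` requires one complete run of `train`. In real NAS that inner run is a full neural network training job, which is where most of the cost comes from.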

Architecture Search Spaces

A search space defines which architectures may be considered. A poorly chosen search space can exclude good models, or waste the search budget on invalid ones.

For an MLP, a simple search space might include:

Choice	Possible values
Number of layers	2, 3, 4, 6, 8
Hidden dimension	128, 256, 512, 1024
Activation	ReLU, GELU, SiLU
Normalization	None, BatchNorm, LayerNorm
Dropout	0.0, 0.1, 0.2, 0.3

For a CNN, the space may include:

Choice	Possible values
Number of blocks	3 to 8
Channels per block	32, 64, 128, 256
Kernel size	3, 5, 7
Stride	1, 2
Residual connection	True, False
Squeeze-excitation	True, False

For a transformer, common architecture choices include:

Choice	Possible values
Number of layers	6, 12, 24, 32
Model dimension	512, 768, 1024, 2048
Attention heads	8, 12, 16, 32
MLP ratio	2, 4, 8
Normalization placement	Pre-norm, post-norm
Positional encoding	Learned, sinusoidal, rotary
Attention type	Full, local, sparse, linear
Experts	Dense, MoE

Architecture search spaces are often constrained: not every combination of values is valid. For example, in a transformer that splits the model dimension evenly across heads,

$$ d_{\text{model}} \bmod h = 0, $$

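where $h$ is the number of attention heads. Among the values in the table above, $d_{\text{model}} = 512$ with $h = 12$ violates this constraint. Each head then has dimension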

$$ d_{\text{head}} = \frac{d_{\text{model}}}{h}. $$
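The divisibility constraint can be enforced by filtering the search space before any model is built. The sketch below uses the model dimensions and head counts from the transformer table; the function name `valid_attention_configs` is chosen for this example only.

```python
from itertools import product

MODEL_DIMS = [512, 768, 1024, 2048]
NUM_HEADS = [8, 12, 16, 32]

def valid_attention_configs(model_dims, num_heads):
    """Yield (d_model, h, d_head) for every pair with d_model mod h == 0."""
    for d_model, h in product(model_dims, num_heads):
        if d_model % h == 0:
            yield d_model, h, d_model // h

configs = list(valid_attention_configs(MODEL_DIMS, NUM_HEADS))
```

With these values, 13 of the 16 combinations remain. Twelve heads is compatible only with a model dimension of 768, which gives a head dimension of 64.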

Manual Search Versus Automated Search

Most successful deep learning architectures have involved substantial human design. NAS does not remove design work. It changes where design work happens.

Manual design chooses the architecture directly. NAS chooses the search space, objective, budget, and search algorithm.

Approach	Human decides	Algorithm decides
Manual design	Architecture	Nothing
Hyperparameter search	Search space and ranges	Best configuration
NAS	Search space and objective	Architecture within space

The search space usually encodes expert assumptions. For example, a CNN search space assumes locality and translation structure. A transformer search space assumes attention, residual connections, and token embeddings.

Search Algorithms

Several algorithms can be used for NAS.

Method	Basic idea
Random search	Sample architectures randomly
Bayesian optimization	Model performance as a function of architecture choices
Evolutionary search	Mutate and select architectures
Reinforcement learning	Train a controller to propose architectures
Differentiable NAS	Relax discrete architecture choices into continuous weights
Weight-sharing NAS	Train a supernet containing many subnetworks

Simple random search is often a strong baseline. More complex methods are useful when each architecture is expensive to train and the search space has exploitable structure.

Evolutionary Architecture Search

Evolutionary NAS maintains a population of architectures. Each architecture is trained and evaluated. Better architectures are selected as parents. New architectures are created by mutation or crossover.

Typical mutations include:

Mutation	Example
Add a layer	6 layers to 7 layers
Change width	512 hidden units to 768
Change kernel size	3x3 to 5x5
Add skip connection	Insert residual path
Change activation	ReLU to GELU

A simple evolutionary loop is:

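# initialize_architectures, evaluate_population, select_best, and mutate
# are placeholders for problem-specific functions.
population = initialize_architectures()

for generation in range(num_generations):
    # Train and score every architecture in the current population.
    scores = evaluate_population(population)

    # Keep the highest-scoring architectures as parents.
    parents = select_best(population, scores)

    # Create one mutated child per parent.
    children = []
    for parent in parents:
        child = mutate(parent)
        children.append(child)

    # Parents survive alongside their children.
    population = parents + children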

Evolutionary methods are flexible. They handle discrete, conditional, and irregular architecture spaces well. Their main cost is the need to train many candidate architectures.
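A runnable version of this loop needs concrete choices for the placeholder functions. The sketch below fills them in under loud assumptions: `proxy_score` is a synthetic formula standing in for "train the architecture and measure validation accuracy", and the search space, population size, and mutation rule are arbitrary illustrative values.

```python
import random

SPACE = {
    "num_layers": [2, 3, 4, 6, 8],
    "width": [128, 256, 512, 1024],
    "activation": ["relu", "gelu", "silu"],
}

def random_architecture(rng):
    """Sample every field uniformly from the search space."""
    return {name: rng.choice(values) for name, values in SPACE.items()}

def mutate(arch, rng):
    """Copy arch and resample one randomly chosen field (possibly to the same value)."""
    child = dict(arch)
    field = rng.choice(list(SPACE))
    child[field] = rng.choice(SPACE[field])
    return child

def proxy_score(arch):
    """Synthetic stand-in for training plus validation; NOT a real measurement."""
    score = 1.0
    score -= 0.05 * abs(arch["num_layers"] - 4)
    score -= 0.0002 * abs(arch["width"] - 512)
    score += 0.02 if arch["activation"] == "gelu" else 0.0
    return score

def evolve(num_generations=20, population_size=8, num_parents=4, seed=0):
    rng = random.Random(seed)
    population = [random_architecture(rng) for _ in range(population_size)]
    best_per_generation = []
    for _ in range(num_generations):
        # Rank the population and keep the top architectures as parents.
        ranked = sorted(population, key=proxy_score, reverse=True)
        parents = ranked[:num_parents]
        best_per_generation.append(proxy_score(parents[0]))
        # Refill the population with mutated copies of randomly chosen parents.
        children = [
            mutate(rng.choice(parents), rng)
            for _ in range(population_size - num_parents)
        ]
        population = parents + children
    return max(population, key=proxy_score), best_per_generation

best_arch, best_per_generation = evolve()
```

Because parents are carried over unchanged and the stand-in score is deterministic, the best score in the population never decreases from one generation to the next. With a noisy real evaluation, that guarantee would hold only for the recorded scores, not for the true quality.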

Reinforcement Learning NAS

In reinforcement learning NAS, a controller generates architecture descriptions. The generated architecture is trained and evaluated. The validation score is used as a reward to update the controller.

The controller may output a sequence such as:

$$ (\text{conv }3\times3,\ \text{conv }5\times5,\ \text{skip},\ \text{maxpool}). $$

The reward may be validation accuracy, or a deployment-aware score with nonnegative trade-off weights $\alpha$ and $\beta$:

$$ R = \text{accuracy} - \alpha \cdot \text{latency} - \beta \cdot \text{memory}. $$

RL-based NAS can discover useful structures, but it is expensive and difficult to reproduce. For many projects, simpler methods give most of the benefit with less complexity.
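The controller update can be illustrated with a deliberately small example. The sketch below assumes a controller that makes a single decision (one operation) rather than emitting a sequence, uses the REINFORCE policy-gradient rule with a moving-average baseline, and replaces real training with a lookup table of made-up accuracy, latency, and memory values. All numbers and names are hypothetical.

```python
import math
import random

OPERATIONS = ["conv3x3", "conv5x5", "skip", "maxpool"]

# Hypothetical (accuracy, latency, memory) per operation; invented for illustration.
MEASUREMENTS = {
    "conv3x3": (0.90, 0.20, 0.10),
    "conv5x5": (0.92, 0.50, 0.30),
    "skip": (0.40, 0.01, 0.01),
    "maxpool": (0.45, 0.05, 0.02),
}

def reward(op, alpha=0.5, beta=0.5):
    """Deployment-aware reward: accuracy - alpha * latency - beta * memory."""
    accuracy, latency, memory = MEASUREMENTS[op]
    return accuracy - alpha * latency - beta * memory

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_controller(num_steps=3000, lr=0.1, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * len(OPERATIONS)
    baseline = 0.0
    for _ in range(num_steps):
        probs = softmax(logits)
        # The controller proposes an architecture by sampling from its policy.
        action = rng.choices(range(len(OPERATIONS)), weights=probs)[0]
        r = reward(OPERATIONS[action])
        advantage = r - baseline
        baseline = 0.9 * baseline + 0.1 * r  # moving-average baseline
        # REINFORCE: d log pi(action) / d logit_k = 1[k == action] - probs[k].
        for k in range(len(logits)):
            grad_log_prob = (1.0 if k == action else 0.0) - probs[k]
            logits[k] += lr * advantage * grad_log_prob
    return softmax(logits)

final_probs = train_controller()
```

In this toy setting the policy should concentrate on `conv3x3`, whose reward of 0.75 is the highest in the table. In a real system each call to `reward` would be a full training run, which is the main source of the cost noted above.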

Differentiable NAS

Differentiable NAS makes architecture choices continuous.

Suppose a layer may choose between several operations:

$$ o_1, o_2, \ldots, o_K. $$

Instead of choosing one operation directly, differentiable NAS computes a weighted mixture:

$$ \bar{o}(x) = \sum_{k=1}^{K} \alpha_k o_k(x), $$

where the $\alpha_k$ are learnable architecture weights.

The weights are usually obtained by applying a softmax to unconstrained parameters $w_k$:

$$ \alpha_k = \frac{\exp(w_k)} {\sum_j \exp(w_j)}. $$

Training then optimizes both the model parameters and the architecture parameters $w_k$ by gradient descent. After training, the operations with the largest weights are kept to form a discrete architecture.

Differentiable NAS can be more efficient than training many architectures from scratch, but the continuous relaxation may introduce bias. The best relaxed architecture may not correspond to the best discrete architecture.
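The mixture, its softmax weights, and the final discretization can be sketched without any deep learning library. The example below is a simplified illustration: the candidate operations are parameter-free scalar functions, so only the architecture parameters $w_k$ are trained, and the gradient is written out by hand using $\partial \bar{o} / \partial w_k = \alpha_k (o_k - \bar{o})$. The operation names and the training data are invented for this sketch.

```python
import math

# Parameter-free scalar operations standing in for conv, pooling, skip, and so on.
OPS = {
    "identity": lambda x: x,
    "double": lambda x: 2.0 * x,
    "zero": lambda x: 0.0,
}
NAMES = list(OPS)

def softmax(w):
    m = max(w)
    exps = [math.exp(v - m) for v in w]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_op(x, w):
    """Weighted mixture of all candidate operations, with alpha = softmax(w)."""
    alphas = softmax(w)
    return sum(a * OPS[name](x) for a, name in zip(alphas, NAMES))

def train_architecture_weights(data, num_steps=200, lr=0.5):
    """Gradient descent on w for the squared error of the mixed output."""
    w = [0.0] * len(NAMES)
    for _ in range(num_steps):
        for x, y in data:
            alphas = softmax(w)
            outputs = [OPS[name](x) for name in NAMES]
            mixed = sum(a * o for a, o in zip(alphas, outputs))
            dloss_dmixed = 2.0 * (mixed - y)
            # Chain rule through the softmax: d mixed / d w_k = alpha_k * (o_k - mixed).
            grads = [dloss_dmixed * a * (o - mixed) for a, o in zip(alphas, outputs)]
            w = [wk - lr * g for wk, g in zip(w, grads)]
    return w

def discretize(w):
    """Keep the operation with the largest architecture parameter."""
    return NAMES[max(range(len(w)), key=lambda k: w[k])]

data = [(1.0, 2.0), (-2.0, -4.0), (0.5, 1.0)]  # targets follow y = 2x
w = train_architecture_weights(data)
chosen = discretize(w)
```

Here the target is exactly one of the candidates, so the relaxed and discrete solutions agree and `double` is selected. When no single operation matches the mixture's behavior, the discretized architecture can perform worse than the relaxed one, which is the bias described above.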

Weight-Sharing NAS

Weight-sharing NAS trains one large supernet that contains many possible subnetworks. Each candidate architecture is a path through the supernet.

Instead of training every architecture separately, we train the supernet and estimate candidate performance using inherited weights.

This reduces cost, but introduces approximation error. A subnetwork may look strong inside the supernet but perform differently when trained alone.

Weight sharing is common in efficient NAS systems because full training of every architecture is usually too expensive.
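A very small example makes the idea of inherited weights concrete. The sketch below assumes a linear "supernet" with four shared weights, where a subnetwork of width $k$ uses only the first $k$ of them. Training samples one width per step and updates only the weights on that path; candidates are then ranked using the shared weights without retraining. The data, the target function, and all names are invented for illustration.

```python
import random

NUM_FEATURES = 4
TRUE_WEIGHTS = [2.0, 1.0, 0.0, 0.0]  # only the first two features matter

def make_data(n, rng):
    data = []
    for _ in range(n):
        x = [rng.uniform(-1.0, 1.0) for _ in range(NUM_FEATURES)]
        y = sum(w * xi for w, xi in zip(TRUE_WEIGHTS, x))
        data.append((x, y))
    return data

def predict(shared, width, x):
    """A subnetwork of the given width uses only the first `width` shared weights."""
    return sum(shared[i] * x[i] for i in range(width))

def train_supernet(train_data, num_steps=4000, lr=0.05, seed=0):
    rng = random.Random(seed)
    shared = [0.0] * NUM_FEATURES
    for _ in range(num_steps):
        width = rng.randint(1, NUM_FEATURES)  # sample one subnetwork
        x, y = rng.choice(train_data)
        error = predict(shared, width, x) - y
        # SGD step on squared error, touching only the sampled path's weights.
        for i in range(width):
            shared[i] -= lr * 2.0 * error * x[i]
    return shared

def val_loss(shared, width, data):
    return sum((predict(shared, width, x) - y) ** 2 for x, y in data) / len(data)

rng = random.Random(1)
train_data = make_data(200, rng)
val_data = make_data(200, rng)

shared = train_supernet(train_data)
# Estimate every candidate's quality from inherited weights, with no retraining.
inherited_loss = {k: val_loss(shared, k, val_data) for k in range(1, NUM_FEATURES + 1)}
```

Width 1 cannot represent the target, so its inherited loss should be far above that of width 2. The smaller differences among widths 2 to 4 partly reflect interference between subnetworks that update the same weights, a toy version of the approximation error described above.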

PyTorch Example: Configurable MLP

A simple architecture search can be implemented by building models from configuration dictionaries.

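import torch
from torch import nn

class SearchMLP(nn.Module):
    """MLP whose depth, width, activation, normalization, and dropout come from a config dict."""

    def __init__(self, input_dim, output_dim, config):
        super().__init__()

        # Resolve the activation class from its configuration name.
        activation_name = config["activation"]
        if activation_name == "relu":
            activation = nn.ReLU
        elif activation_name == "gelu":
            activation = nn.GELU
        elif activation_name == "silu":
            activation = nn.SiLU
        else:
            raise ValueError(f"unknown activation: {activation_name}")

        # Reject unsupported normalization names instead of silently skipping them.
        normalization = config["normalization"]
        if normalization not in ("none", "layernorm"):
            raise ValueError(f"unknown normalization: {normalization}")

        layers = []
        dim = input_dim

        # Each hidden layer is Linear -> optional LayerNorm -> activation -> optional Dropout.
        for hidden_dim in config["hidden_dims"]:
            layers.append(nn.Linear(dim, hidden_dim))

            if normalization == "layernorm":
                layers.append(nn.LayerNorm(hidden_dim))

            layers.append(activation())

            if config["dropout"] > 0:
                layers.append(nn.Dropout(config["dropout"]))

            dim = hidden_dim

        # Final projection to the output dimension.
        layers.append(nn.Linear(dim, output_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)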

A sampled architecture might be:

config = {
    "hidden_dims": [512, 512, 256],
    "activation": "gelu",
    "normalization": "layernorm",
    "dropout": 0.1,
}

The training code can treat this like any other PyTorch model.
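Because the model is fully determined by its configuration, some costs can be computed before the model is built. The sketch below counts trainable parameters analytically for the `SearchMLP` layout, assuming each linear layer has a bias and each LayerNorm has a learnable scale and shift (the PyTorch defaults). The function name `count_mlp_parameters` is introduced here for illustration.

```python
def count_mlp_parameters(input_dim, output_dim, config):
    """Analytic parameter count for a Linear/LayerNorm stack described by config."""
    total = 0
    dim = input_dim
    for hidden_dim in config["hidden_dims"]:
        total += dim * hidden_dim + hidden_dim  # Linear weight + bias
        if config["normalization"] == "layernorm":
            total += 2 * hidden_dim             # LayerNorm scale + shift
        dim = hidden_dim
    total += dim * output_dim + output_dim      # output Linear weight + bias
    return total

config = {
    "hidden_dims": [512, 512, 256],
    "activation": "gelu",
    "normalization": "layernorm",
    "dropout": 0.1,
}

num_params = count_mlp_parameters(input_dim=784, output_dim=10, config=config)
```

For the sampled configuration with 784 inputs and 10 outputs this gives 801,034 parameters. Activation and dropout layers contribute none.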

PyTorch Example: Sampling Architectures

A basic random architecture sampler:

import random

def sample_architecture():
    num_layers = random.choice([2, 3, 4, 6])
    width = random.choice([128, 256, 512, 1024])

    return {
        "hidden_dims": [width] * num_layers,
        "activation": random.choice(["relu", "gelu", "silu"]),
        "normalization": random.choice(["none", "layernorm"]),
        "dropout": random.choice([0.0, 0.1, 0.2, 0.3]),
    }

A search loop:

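# train_and_evaluate is a user-supplied function that trains the model under
# a fixed budget and returns a validation score (higher is better).
best_score = float("-inf")
best_config = None

for trial in range(50):
    # Sample a configuration and build the corresponding model.
    config = sample_architecture()

    model = SearchMLP(
        input_dim=784,
        output_dim=10,
        config=config,
    )

    score = train_and_evaluate(model)

    # Keep the best configuration seen so far.
    if score > best_score:
        best_score = score
        best_config = config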

This is NAS in its simplest form. It searches over architecture configurations by repeatedly building, training, and evaluating candidate models.

Cost-Aware Architecture Search

Accuracy alone is not enough to judge an architecture. A large model may be too slow or too expensive to deploy.

Useful architecture objectives include:

Objective	Meaning
Validation accuracy	Predictive quality
Validation loss	Calibration and optimization quality
Parameter count	Model size
FLOPs	Approximate compute cost
Latency	Real inference speed
Memory use	Deployment feasibility
Energy use	Operational cost

A cost-aware objective may be:

$$ J(a) = \text{accuracy}(a) - \alpha \log(\text{params}(a)) - \beta \log(\text{latency}(a)). $$

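Here $\alpha, \beta \ge 0$ control how strongly model size and latency are penalized. This objective prefers architectures that are both accurate and efficient.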
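The effect of the penalty terms is easiest to see on a few candidates. In the sketch below, the three candidates and their accuracy, parameter, and latency figures are made up for illustration, and the weights $\alpha = \beta = 0.01$ are arbitrary; only the scoring formula comes from the objective above.

```python
import math

def cost_aware_score(accuracy, num_params, latency_ms, alpha=0.01, beta=0.01):
    """J(a) = accuracy - alpha * log(params) - beta * log(latency)."""
    return accuracy - alpha * math.log(num_params) - beta * math.log(latency_ms)

# Hypothetical candidates; none of these numbers are measurements.
candidates = [
    {"name": "small", "accuracy": 0.970, "num_params": 200_000, "latency_ms": 0.5},
    {"name": "medium", "accuracy": 0.980, "num_params": 800_000, "latency_ms": 1.5},
    {"name": "large", "accuracy": 0.982, "num_params": 5_000_000, "latency_ms": 8.0},
]

def score_of(candidate):
    return cost_aware_score(
        candidate["accuracy"], candidate["num_params"], candidate["latency_ms"]
    )

best_by_accuracy = max(candidates, key=lambda c: c["accuracy"])["name"]
best_by_cost_aware = max(candidates, key=score_of)["name"]
```

With these numbers the most accurate candidate is `large`, while the cost-aware objective selects `small`. Changing $\alpha$ and $\beta$ moves the selection along the accuracy-cost trade-off, so the weights should reflect the actual deployment budget.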

Latency should be measured on the target hardware when possible. FLOPs do not always predict real speed because memory access, kernel fusion, batching, and hardware utilization matter.
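A basic latency measurement takes only a few lines. The sketch below times an arbitrary callable with warm-up runs and reports the median, which is less sensitive to outliers than the mean. The `dummy_inference` function is a placeholder for a real forward pass, and the warm-up and run counts are arbitrary. On a GPU, asynchronous execution would also require synchronizing the device before reading the clock.

```python
import statistics
import time

def measure_latency_ms(fn, num_warmup=5, num_runs=30):
    """Median wall-clock time of fn() in milliseconds, after warm-up runs."""
    for _ in range(num_warmup):
        fn()  # warm caches and lazy initialization before timing
    timings = []
    for _ in range(num_runs):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)

def dummy_inference():
    """Placeholder workload standing in for a model's forward pass."""
    return sum(i * i for i in range(10_000))

latency_ms = measure_latency_ms(dummy_inference)
```

The resulting number is specific to the machine, batch size, and software stack used, so it should be recorded together with that context.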

Common Failure Modes

NAS can fail in several ways.

One failure mode is a badly chosen search space. If the space excludes strong architectures, the search algorithm cannot find them.

Another failure mode is unfair evaluation. If one architecture trains longer or uses stronger augmentation, the comparison becomes confounded.

A third failure mode is search overfitting. The algorithm may exploit noise in the validation set after testing many architectures.
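Search overfitting can be reproduced with a short simulation. The sketch below assumes an extreme case in which every architecture has the same true accuracy of 0.90 and the validation measurement adds Gaussian noise with standard deviation 0.01. Both numbers are invented; the point is only how the best observed score grows with the number of candidates tried.

```python
import random

TRUE_ACCURACY = 0.90

def best_of_n_validation(num_architectures, noise_std=0.01, seed=0):
    """Best observed validation score among equally good architectures."""
    rng = random.Random(seed)
    observed = [
        TRUE_ACCURACY + rng.gauss(0.0, noise_std)
        for _ in range(num_architectures)
    ]
    return max(observed)

# Optimistic bias of the selected score, as a function of search size.
gaps = {n: best_of_n_validation(n) - TRUE_ACCURACY for n in (1, 10, 100, 1000)}
```

The selected score overstates the true accuracy more as the search grows, even though no architecture is actually better than another. A held-out test set that is never used for selection gives an unbiased estimate of the winner's quality.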

A fourth failure mode is proxy mismatch. An architecture that performs well on a small proxy dataset may perform poorly on the full dataset.

A fifth failure mode is hardware mismatch. An architecture that is efficient on one device may be slow on another.

Practical Guidelines

Use NAS only after strong baselines are established. A simple manually designed architecture with careful training often beats an automatically searched architecture with weak training.

Keep the search space small at first. Search over a few high-impact choices, such as width, depth, activation, normalization, and dropout.

Use fair training budgets. Each architecture should receive the same number of training steps or the same compute budget.

Log architecture, training configuration, score, parameter count, latency, and random seed. Without complete logs, NAS results are difficult to interpret.
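One simple way to keep such logs is to write one JSON record per trial. The sketch below is an assumption about format, not a standard: the field names and the `log_trial` helper are invented here, the values in the example record are placeholders, and an in-memory buffer stands in for a log file opened in append mode.

```python
import io
import json

def log_trial(stream, trial_id, config, train_steps, seed, score, num_params, latency_ms):
    """Append one self-contained JSON record describing a single trial."""
    record = {
        "trial_id": trial_id,
        "architecture": config,
        "train_steps": train_steps,
        "seed": seed,
        "score": score,
        "num_params": num_params,
        "latency_ms": latency_ms,
    }
    stream.write(json.dumps(record, sort_keys=True) + "\n")

# In practice the stream would be a file opened with mode "a".
log_stream = io.StringIO()
log_trial(
    log_stream,
    trial_id=0,
    config={"hidden_dims": [256, 256], "activation": "relu",
            "normalization": "none", "dropout": 0.0},
    train_steps=1000,     # placeholder value
    seed=0,
    score=0.95,           # placeholder value, not a real result
    num_params=269_322,
    latency_ms=0.4,       # placeholder value
)

records = [json.loads(line) for line in log_stream.getvalue().splitlines()]
```

Keeping each record self-contained makes it possible to re-rank trials later under a different objective without rerunning the search.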

Retain the top architectures and retrain them from scratch. This checks whether their performance came from the architecture itself rather than from lucky inherited weights, favorable random seeds, or validation noise.

When NAS Is Useful

NAS is useful when architecture choices strongly affect the result and many training runs are affordable. It is especially relevant when deployment constraints matter.

Examples include mobile vision models, efficient transformers, speech models, recommendation systems, and specialized scientific models.

NAS is less useful when the dominant performance gains come from data quality, training recipe, pretrained weights, or scale. In many modern foundation model settings, data, compute, and optimization recipe matter more than small architecture changes.

Summary

Neural architecture search automates the exploration of model structures. It searches over layers, widths, operations, connections, normalization, activation functions, and other architectural choices.

NAS can use random search, evolutionary algorithms, reinforcement learning, Bayesian optimization, differentiable relaxation, or weight-sharing supernets. The main challenge is cost: each architecture may require expensive training and careful evaluation.

A practical NAS workflow starts with strong baselines, defines a constrained search space, uses fair evaluation, includes deployment costs, and retrains top candidates from scratch.