Neural Architecture Search

Neural architecture search, or NAS, is the process of automatically searching for model architectures. Ordinary hyperparameter optimization usually tunes values such as learning rate, batch size, dropout, or weight decay. NAS searches the structure of the network itself.

Architecture choices include the number of layers, hidden width, convolution kernel sizes, attention heads, skip connections, normalization placement, activation functions, and block types. In large models, architecture search may also include mixture-of-experts routing, context length, embedding dimension, MLP expansion ratio, and parameter sharing.

The Architecture Search Problem

A neural architecture defines a function class. Once the architecture is fixed, training chooses parameters inside that class.

Let $a$ denote an architecture and $\theta$ denote its trainable parameters. Training solves

$$ \theta^\ast(a) = \arg\min_{\theta} \mathcal{L}_{\text{train}}(\theta; a). $$

Architecture search chooses the architecture that performs best on validation data:

$$ a^\ast = \arg\min_{a\in\mathcal{A}} \mathcal{L}_{\text{val}}(\theta^\ast(a); a). $$

Here $\mathcal{A}$ is the architecture search space.

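This is a bilevel problem: scoring a single candidate $a$ requires solving the inner training problem first. It is also harder than ordinary hyperparameter optimization because architectures may differ in tensor shapes, parameter counts, memory costs, and training dynamics.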
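The two nested problems can be made concrete with a toy model. The sketch below is an illustration only, not a neural network: the "architecture" $a$ is the number of bins of a piecewise-constant regressor, and the inner training problem has a closed-form solution (the per-bin mean of the training targets). The data generator, the function names, and the search space are all assumptions made for this example.

```python
import math
import random

def make_data(n, seed):
    """Noisy samples of sin(2*pi*x) on [0, 1); a stand-in for a real dataset."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    ys = [math.sin(2 * math.pi * x) + rng.gauss(0.0, 0.3) for x in xs]
    return xs, ys

def bin_index(x, num_bins):
    return min(int(x * num_bins), num_bins - 1)

def train(num_bins, xs, ys):
    """Inner problem: theta*(a) for a piecewise-constant model with num_bins bins.

    Under squared loss the optimal value in each bin is the mean of the
    training targets that fall into it (0.0 is used for empty bins).
    """
    sums = [0.0] * num_bins
    counts = [0] * num_bins
    for x, y in zip(xs, ys):
        b = bin_index(x, num_bins)
        sums[b] += y
        counts[b] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

def mse(theta, xs, ys):
    """Mean squared error of the fitted model on (xs, ys)."""
    num_bins = len(theta)
    return sum((theta[bin_index(x, num_bins)] - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

train_x, train_y = make_data(200, seed=0)
val_x, val_y = make_data(200, seed=1)

search_space = [1, 2, 4, 8, 16, 32, 64, 128]  # the set A

# Outer problem: pick the architecture with the lowest validation loss.
val_losses = {}
for a in search_space:
    theta_star = train(a, train_x, train_y)        # inner argmin over theta
    val_losses[a] = mse(theta_star, val_x, val_y)  # L_val(theta*(a); a)

best_a = min(val_losses, key=val_losses.get)
print(best_a, val_losses[best_a])
```

With a real network the inner step is a full training run rather than a closed-form fit, which is where most of the cost of NAS comes from.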

Architecture Search Spaces

A search space defines which architectures may be considered. A poor search space can exclude good models or include too many invalid models.

For an MLP, a simple search space might include:

| Choice | Possible values |
| --- | --- |
| Number of layers | 2, 3, 4, 6, 8 |
| Hidden dimension | 128, 256, 512, 1024 |
| Activation | ReLU, GELU, SiLU |
| Normalization | None, BatchNorm, LayerNorm |
| Dropout | 0.0, 0.1, 0.2, 0.3 |
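Even this small MLP space grows quickly. As a rough sketch, the grid can be enumerated with `itertools.product`; the dictionary keys and lowercase value names below are assumptions chosen for illustration.

```python
from itertools import product

mlp_space = {
    "num_layers": [2, 3, 4, 6, 8],
    "hidden_dim": [128, 256, 512, 1024],
    "activation": ["relu", "gelu", "silu"],
    "normalization": ["none", "batchnorm", "layernorm"],
    "dropout": [0.0, 0.1, 0.2, 0.3],
}

# Every combination of one value per choice is one candidate architecture.
names = list(mlp_space)
candidates = [dict(zip(names, values)) for values in product(*mlp_space.values())]

print(len(candidates))  # 5 * 4 * 3 * 3 * 4 = 720
```

Training all 720 candidates is already a substantial budget, which is why sampling or smarter search is usually preferred over exhaustive enumeration.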

For a CNN, the space may include:

| Choice | Possible values |
| --- | --- |
| Number of blocks | 3 to 8 |
| Channels per block | 32, 64, 128, 256 |
| Kernel size | 3, 5, 7 |
| Stride | 1, 2 |
| Residual connection | True, False |
| Squeeze-excitation | True, False |

For a transformer, common architecture choices include:

| Choice | Possible values |
| --- | --- |
| Number of layers | 6, 12, 24, 32 |
| Model dimension | 512, 768, 1024, 2048 |
| Attention heads | 8, 12, 16, 32 |
| MLP ratio | 2, 4, 8 |
| Normalization placement | Pre-norm, post-norm |
| Positional encoding | Learned, sinusoidal, rotary |
| Attention type | Full, local, sparse, linear |
| Experts | Dense, MoE |

Architecture search spaces are often constrained. For example, in a transformer,

$$ d_{\text{model}} \bmod h = 0, $$

where $h$ is the number of attention heads. Each head then has dimension

$$ d_{\text{head}} = \frac{d_{\text{model}}}{h}. $$
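The divisibility constraint removes some combinations from the transformer table above. A minimal sketch that filters the listed model dimensions and head counts:

```python
d_models = [512, 768, 1024, 2048]
head_counts = [8, 12, 16, 32]

# Keep only (d_model, h) pairs where the head count divides the model dimension.
valid = {
    (d_model, h): d_model // h
    for d_model in d_models
    for h in head_counts
    if d_model % h == 0
}

for (d_model, h), d_head in valid.items():
    print(f"d_model={d_model:4d}  heads={h:2d}  d_head={d_head}")

print(len(valid))  # 13 of the 16 combinations are valid
```

Only $d_{\text{model}} = 768$ is compatible with 12 heads here, so a sampler that draws the two values independently would produce invalid architectures unless it rejects or repairs them.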

Manual Design Versus NAS

Most successful deep learning architectures have involved substantial human design. NAS does not remove design work. It changes where design work happens.

In manual design, a person chooses the architecture directly. With NAS, a person chooses the search space, objective, budget, and search algorithm, and the algorithm chooses the architecture.

| Approach | Human decides | Algorithm decides |
| --- | --- | --- |
| Manual design | Architecture | Nothing |
| Hyperparameter search | Search space and ranges | Best configuration |
| NAS | Search space and objective | Architecture within space |

The search space usually encodes expert assumptions. For example, a CNN search space assumes locality and translation structure. A transformer search space assumes attention, residual connections, and token embeddings.

Search Algorithms

Several algorithms can be used for NAS.

| Method | Basic idea |
| --- | --- |
| Random search | Sample architectures randomly |
| Bayesian optimization | Model performance as a function of architecture choices |
| Evolutionary search | Mutate and select architectures |
| Reinforcement learning | Train a controller to propose architectures |
| Differentiable NAS | Relax discrete architecture choices into continuous weights |
| Weight-sharing NAS | Train a supernet containing many subnetworks |

Simple random search is often a strong baseline. More complex methods are useful when each architecture is expensive to train and the search space has exploitable structure.

Evolutionary NAS

Evolutionary NAS maintains a population of architectures. Each architecture is trained and evaluated. Better architectures are selected as parents. New architectures are created by mutation or crossover.

Typical mutations include:

| Mutation | Example |
| --- | --- |
| Add a layer | 6 layers to 7 layers |
| Change width | 512 hidden units to 768 |
| Change kernel size | 3x3 to 5x5 |
| Add skip connection | Insert residual path |
| Change activation | ReLU to GELU |

A simple evolutionary loop is:

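```python
# Skeleton only: initialize_architectures, evaluate_population, select_best,
# mutate, and num_generations must be supplied by the caller.
population = initialize_architectures()

for generation in range(num_generations):
    # Train and score every candidate in the current population.
    scores = evaluate_population(population)

    # Keep the highest-scoring architectures as parents.
    parents = select_best(population, scores)

    # Create one mutated child per parent.
    children = [mutate(parent) for parent in parents]

    # Parents survive alongside their children.
    population = parents + children
```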

Evolutionary methods are flexible. They handle discrete, conditional, and irregular architecture spaces well. Their main cost is the need to train many candidate architectures.
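A runnable version of this loop is sketched below. It is an illustration under strong assumptions: `toy_fitness` is a made-up function that stands in for training and validating a model, and the search space, truncation selection, and one-field mutation are choices made for this example rather than a prescribed algorithm.

```python
import random

DEPTHS = [2, 3, 4, 6, 8]
WIDTHS = [128, 256, 512, 1024]
ACTIVATIONS = ["relu", "gelu", "silu"]
OPTIONS = {"depth": DEPTHS, "width": WIDTHS, "activation": ACTIVATIONS}

def random_architecture(rng):
    return {field: rng.choice(values) for field, values in OPTIONS.items()}

def mutate(arch, rng):
    """Return a copy of arch with exactly one field changed to a different value."""
    child = dict(arch)
    field = rng.choice(list(OPTIONS))
    child[field] = rng.choice([v for v in OPTIONS[field] if v != arch[field]])
    return child

def toy_fitness(arch):
    """Hypothetical stand-in for 'train the model and return validation accuracy'.

    Peaks at depth 4, width 512, GELU. A real search would train each candidate.
    """
    score = 1.0
    score -= 0.05 * abs(arch["depth"] - 4)
    score -= 0.0002 * abs(arch["width"] - 512)
    score -= 0.0 if arch["activation"] == "gelu" else 0.03
    return score

def evolve(population_size=8, num_generations=20, seed=0):
    rng = random.Random(seed)
    population = [random_architecture(rng) for _ in range(population_size)]
    history = []  # best fitness per generation
    for _ in range(num_generations):
        ranked = sorted(population, key=toy_fitness, reverse=True)
        parents = ranked[: population_size // 2]      # truncation selection
        children = [mutate(p, rng) for p in parents]  # one child per parent
        population = parents + children               # parents survive
        history.append(toy_fitness(parents[0]))
    return max(population, key=toy_fitness), history

best, history = evolve()
print(best, history[-1])
```

Because the parents are carried over unchanged, the best fitness in `history` never decreases. With a noisy, trained fitness this guarantee no longer holds exactly, and re-evaluating survivors becomes a design decision.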

Reinforcement Learning NAS

In reinforcement learning NAS, a controller generates architecture descriptions. The generated architecture is trained and evaluated. The validation score is used as a reward to update the controller.

The controller may output a sequence such as:

$$ (\text{conv }3\times3,\ \text{conv }5\times5,\ \text{skip},\ \text{maxpool}). $$

The reward may be validation accuracy, or a deployment-aware score:

$$ R = \text{accuracy} - \alpha \cdot \text{latency} - \beta \cdot \text{memory}. $$

RL-based NAS can discover useful structures, but it is expensive and difficult to reproduce. For many projects, simpler methods give most of the benefit with less complexity.
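The controller loop can be sketched with a deliberately simplified policy: one independent categorical distribution per position, updated with the REINFORCE policy-gradient rule and a moving-average baseline. The text does not specify a controller architecture, so this structure is an assumption, and `toy_reward` with its per-operation accuracy and latency numbers is invented purely so that the example runs without training any network.

```python
import math
import random

OPS = ["conv3x3", "conv5x5", "skip", "maxpool"]
NUM_POSITIONS = 4

# Made-up per-operation contributions, used only by the toy reward.
ACCURACY_GAIN = {"conv3x3": 0.20, "conv5x5": 0.22, "skip": 0.05, "maxpool": 0.08}
LATENCY_COST = {"conv3x3": 1.0, "conv5x5": 2.5, "skip": 0.0, "maxpool": 0.3}

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_reward(arch, alpha=0.05):
    """Hypothetical reward: accuracy minus alpha times latency (no memory term)."""
    accuracy = sum(ACCURACY_GAIN[op] for op in arch)
    latency = sum(LATENCY_COST[op] for op in arch)
    return accuracy - alpha * latency

def run_controller(steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    logits = [[0.0] * len(OPS) for _ in range(NUM_POSITIONS)]
    baseline = 0.0
    for _ in range(steps):
        # Sample one operation per position from the current policy.
        probs = [softmax(row) for row in logits]
        choices = [rng.choices(range(len(OPS)), weights=p)[0] for p in probs]
        reward = toy_reward([OPS[c] for c in choices])

        # REINFORCE: move logits along (reward - baseline) * grad log pi(choice).
        advantage = reward - baseline
        baseline = 0.9 * baseline + 0.1 * reward
        for pos, c in enumerate(choices):
            for k in range(len(OPS)):
                grad_log_prob = (1.0 if k == c else 0.0) - probs[pos][k]
                logits[pos][k] += lr * advantage * grad_log_prob

    # Read out the most probable operation at each position.
    return [OPS[max(range(len(OPS)), key=row.__getitem__)] for row in logits]

best_arch = run_controller()
print(best_arch, toy_reward(best_arch))
```

In a real system each call to the reward function is a training run, which is the main reason RL-based NAS is expensive.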

Differentiable NAS

Differentiable NAS makes architecture choices continuous.

Suppose a layer may choose between several operations:

$$ o_1, o_2, \ldots, o_K. $$

Instead of choosing one operation directly, differentiable NAS computes a weighted mixture:

$$ \bar{o}(x) = \sum_{k=1}^{K} \alpha_k o_k(x), $$

where $\alpha_k$ are the mixing weights.

The mixing weights are usually obtained by applying a softmax to learnable architecture parameters $w_k$:

$$ \alpha_k = \frac{\exp(w_k)} {\sum_j \exp(w_j)}. $$

Training then optimizes both model parameters and architecture weights. After training, the strongest operations are selected to form a discrete architecture.

Differentiable NAS can be more efficient than training many architectures from scratch, but the continuous relaxation may introduce bias. The best relaxed architecture may not correspond to the best discrete architecture.
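A mixed operation of this kind might be written in PyTorch as follows. This is a minimal sketch, assuming three candidate operations that all preserve the input shape; the class name `MixedOp`, the candidate list, and the attribute `arch_logits` (the $w_k$ above) are choices made for illustration.

```python
import torch
from torch import nn

class MixedOp(nn.Module):
    """Softmax-weighted mixture of candidate operations with matching output shapes."""

    def __init__(self, dim):
        super().__init__()
        # Candidate operations o_1..o_K; each maps (batch, dim) -> (batch, dim).
        self.ops = nn.ModuleList([
            nn.Identity(),
            nn.Linear(dim, dim),
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU()),
        ])
        # Learnable logits w_k, one per candidate operation.
        self.arch_logits = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        alphas = torch.softmax(self.arch_logits, dim=0)
        return sum(alpha * op(x) for alpha, op in zip(alphas, self.ops))

    def strongest_op(self):
        """Index of the operation kept when the mixture is discretized."""
        return int(torch.argmax(self.arch_logits))

mixed = MixedOp(dim=16)
x = torch.randn(4, 16)
y = mixed(x)
y.sum().backward()  # gradients reach both the operation weights and arch_logits
```

In DARTS-style methods the operation weights are typically updated on training data and the architecture parameters on validation data, in alternating steps. That schedule is omitted here.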

Weight Sharing and Supernets

Weight-sharing NAS trains one large supernet that contains many possible subnetworks. Each candidate architecture is a path through the supernet.

Instead of training every architecture separately, we train the supernet and estimate candidate performance using inherited weights.

This reduces cost, but introduces approximation error. A subnetwork may look strong inside the supernet but perform differently when trained alone.

Weight sharing is common in efficient NAS systems because full training of every architecture is usually too expensive.
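One possible way to share weights is to let each subnetwork use a slice of the supernet's parameters. The sketch below assumes a two-layer MLP in which a candidate of width `hidden` uses the first `hidden` units of the shared layers. The class name, the candidate widths, and the synthetic data are all assumptions, and the candidates are scored on the training batch only to keep the example short.

```python
import random
import torch
from torch import nn
import torch.nn.functional as F

class SliceableMLP(nn.Module):
    """Supernet whose subnetworks use the first `hidden` units of the shared layers."""

    def __init__(self, input_dim, max_hidden, output_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, max_hidden)
        self.fc2 = nn.Linear(max_hidden, output_dim)

    def forward(self, x, hidden):
        # Slice the shared weights so that only `hidden` units participate.
        h = F.relu(F.linear(x, self.fc1.weight[:hidden], self.fc1.bias[:hidden]))
        return F.linear(h, self.fc2.weight[:, :hidden], self.fc2.bias)

torch.manual_seed(0)
random.seed(0)
widths = [16, 32, 64]
supernet = SliceableMLP(input_dim=8, max_hidden=max(widths), output_dim=3)
optimizer = torch.optim.SGD(supernet.parameters(), lr=0.1)

# Synthetic data, used only to make the sketch runnable.
x = torch.randn(256, 8)
y = torch.randint(0, 3, (256,))

# Train the supernet by sampling one subnetwork (width) per step.
for step in range(100):
    hidden = random.choice(widths)
    loss = F.cross_entropy(supernet(x, hidden), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Estimate each candidate with inherited weights instead of training it alone.
# A real system would score candidates on held-out validation data.
with torch.no_grad():
    estimates = {w: F.cross_entropy(supernet(x, w), y).item() for w in widths}
print(estimates)
```

The values in `estimates` are only proxies. As noted above, the ranking they induce can differ from the ranking obtained by training each width from scratch.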

PyTorch Example: Configurable MLP

A simple architecture search can be implemented by building models from configuration dictionaries.

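```python
import torch
from torch import nn

# Map config names to activation classes.
ACTIVATIONS = {"relu": nn.ReLU, "gelu": nn.GELU, "silu": nn.SiLU}


class SearchMLP(nn.Module):
    """MLP whose depth, width, activation, normalization, and dropout come from a config dict."""

    def __init__(self, input_dim, output_dim, config):
        super().__init__()

        activation_name = config["activation"]
        if activation_name not in ACTIVATIONS:
            raise ValueError(f"unknown activation: {activation_name}")
        activation = ACTIVATIONS[activation_name]

        # Reject unknown names instead of silently building a model without normalization.
        normalization = config["normalization"]
        if normalization not in ("none", "layernorm", "batchnorm"):
            raise ValueError(f"unknown normalization: {normalization}")

        layers = []
        dim = input_dim

        # One Linear -> (norm) -> activation -> (dropout) block per hidden layer.
        for hidden_dim in config["hidden_dims"]:
            layers.append(nn.Linear(dim, hidden_dim))

            if normalization == "layernorm":
                layers.append(nn.LayerNorm(hidden_dim))
            elif normalization == "batchnorm":
                layers.append(nn.BatchNorm1d(hidden_dim))

            layers.append(activation())

            if config["dropout"] > 0:
                layers.append(nn.Dropout(config["dropout"]))

            dim = hidden_dim

        # Final projection to the output dimension, with no activation.
        layers.append(nn.Linear(dim, output_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)
```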

A sampled architecture might be:

```python
config = {
    "hidden_dims": [512, 512, 256],
    "activation": "gelu",
    "normalization": "layernorm",
    "dropout": 0.1,
}
```

The training code can treat this like any other PyTorch model.

PyTorch Example: Sampling Architectures

A basic random architecture sampler:

```python
import random

def sample_architecture():
    """Sample one MLP configuration uniformly at random from a small search space."""
    num_layers = random.choice([2, 3, 4, 6])
    width = random.choice([128, 256, 512, 1024])

    return {
        # All hidden layers share the same width in this simple space.
        "hidden_dims": [width] * num_layers,
        "activation": random.choice(["relu", "gelu", "silu"]),
        "normalization": random.choice(["none", "layernorm"]),
        "dropout": random.choice([0.0, 0.1, 0.2, 0.3]),
    }
```

A search loop:

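```python
best_score = float("-inf")
best_config = None

for trial in range(50):
    # Draw a random candidate and build the corresponding model.
    config = sample_architecture()

    model = SearchMLP(
        input_dim=784,
        output_dim=10,
        config=config,
    )

    # train_and_evaluate is assumed to be user-supplied and to return a
    # validation score where higher is better (for example, accuracy).
    score = train_and_evaluate(model)

    # Keep the best configuration seen so far.
    if score > best_score:
        best_score = score
        best_config = config
```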

This is NAS in its simplest form. It searches over architecture configurations by repeatedly building, training, and evaluating candidate models.
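The loop above leaves `train_and_evaluate` undefined. One hypothetical version is sketched below, assuming a 784-feature, 10-class problem to match the example dimensions. The synthetic data, optimizer, step count, and learning rate are placeholders and would be replaced by a real dataset and training recipe.

```python
import torch
from torch import nn

def train_and_evaluate(model, steps=200, lr=1e-3, seed=0):
    """Hypothetical helper: train on synthetic data and return validation accuracy."""
    gen = torch.Generator().manual_seed(seed)

    # Synthetic task: labels come from a fixed random linear map of the inputs.
    true_w = torch.randn(784, 10, generator=gen)
    x_train = torch.randn(512, 784, generator=gen)
    y_train = (x_train @ true_w).argmax(dim=1)
    x_val = torch.randn(256, 784, generator=gen)
    y_val = (x_val @ true_w).argmax(dim=1)

    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()

    # Every candidate gets the same number of full-batch training steps.
    model.train()
    for _ in range(steps):
        optimizer.zero_grad()
        loss = loss_fn(model(x_train), y_train)
        loss.backward()
        optimizer.step()

    # Score on held-out data with dropout and batch statistics frozen.
    model.eval()
    with torch.no_grad():
        predictions = model(x_val).argmax(dim=1)
    return (predictions == y_val).float().mean().item()
```

Fixing `steps` and `seed` across trials gives every candidate the same data and the same budget, so differences in score are easier to attribute to the architecture.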

Cost-Aware Objectives

Architecture quality cannot be measured only by accuracy. Large models may be too slow or expensive to deploy.

Useful architecture objectives include:

| Objective | Meaning |
| --- | --- |
| Validation accuracy | Predictive quality |
| Validation loss | Calibration and optimization quality |
| Parameter count | Model size |
| FLOPs | Approximate compute cost |
| Latency | Real inference speed |
| Memory use | Deployment feasibility |
| Energy use | Operational cost |

A cost-aware objective may be:

$$ J(a) = \text{accuracy}(a) - \alpha \log(\text{params}(a)) - \beta \log(\text{latency}(a)). $$

This prefers architectures that are accurate and efficient.

Latency should be measured on the target hardware when possible. FLOPs do not always predict real speed because memory access, kernel fusion, batching, and hardware utilization matter.
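The cost-aware objective $J(a)$ can be computed directly once accuracy, parameter count, and latency are known. The sketch below counts MLP parameters analytically and compares two candidates. The accuracy and latency figures, and the values of $\alpha$ and $\beta$, are made up for illustration; in practice latency would be measured on the target hardware.

```python
import math

def count_mlp_params(input_dim, output_dim, hidden_dims, layernorm=False):
    """Parameter count of an MLP with biased Linear layers and optional LayerNorm."""
    total = 0
    dim = input_dim
    for hidden_dim in hidden_dims:
        total += dim * hidden_dim + hidden_dim  # Linear weight + bias
        if layernorm:
            total += 2 * hidden_dim             # LayerNorm scale + shift
        dim = hidden_dim
    total += dim * output_dim + output_dim      # output Linear
    return total

def cost_aware_score(accuracy, params, latency_ms, alpha=0.01, beta=0.01):
    """J(a) = accuracy - alpha * log(params) - beta * log(latency)."""
    return accuracy - alpha * math.log(params) - beta * math.log(latency_ms)

large_params = count_mlp_params(784, 10, [512, 512, 256], layernorm=True)
small_params = count_mlp_params(784, 10, [128, 128])
print(large_params, small_params)  # 801034 118282

# Illustrative numbers only: the large model is slightly more accurate but slower.
large_score = cost_aware_score(accuracy=0.975, params=large_params, latency_ms=4.0)
small_score = cost_aware_score(accuracy=0.970, params=small_params, latency_ms=1.0)
print(f"large={large_score:.3f}  small={small_score:.3f}")
```

With these settings the smaller model scores higher, because its half-point accuracy deficit is outweighed by the size and latency penalties. Different values of $\alpha$ and $\beta$ can reverse the ranking, so they should reflect the actual deployment constraints.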

Common Failure Modes

NAS can fail in several ways.

One failure mode is a poorly designed search space. If the space excludes strong architectures, no search algorithm can find them.

Another failure mode is unfair evaluation. If one architecture trains longer or uses stronger augmentation, the comparison becomes confounded.

A third failure mode is search overfitting. The algorithm may exploit noise in the validation set after testing many architectures.

A fourth failure mode is proxy mismatch. An architecture that performs well on a small proxy dataset may perform poorly on the full dataset.

A fifth failure mode is hardware mismatch. An architecture that is efficient on one device may be slow on another.
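Search overfitting can be illustrated with a small simulation. The sketch below assumes every candidate has the same true accuracy and that each validation measurement carries independent Gaussian noise. Both the accuracy of 0.80 and the noise level of 0.01 are arbitrary assumptions.

```python
import random
import statistics

def best_of_n(num_architectures, true_accuracy=0.80, noise_std=0.01, rng=None):
    """Highest observed validation score when every candidate is equally good."""
    rng = rng or random.Random()
    return max(rng.gauss(true_accuracy, noise_std) for _ in range(num_architectures))

rng = random.Random(0)
for n in [1, 10, 100, 1000]:
    # Average the selected score over 200 repeated searches of size n.
    mean_best = statistics.mean(best_of_n(n, rng=rng) for _ in range(200))
    print(f"candidates={n:5d}  mean best observed accuracy={mean_best:.4f}")
```

The selected score rises with the number of candidates even though no architecture is actually better than another. This is why the winner of a search should be re-evaluated on data that was not used for selection.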

Practical Guidelines

Use NAS only after strong baselines are established. A simple manually designed architecture with careful training often beats an automatically searched architecture with weak training.

Keep the search space small at first. Search over a few high-impact choices, such as width, depth, activation, normalization, and dropout.

Use fair training budgets. Each architecture should receive the same number of training steps or the same compute budget.

Log architecture, training configuration, score, parameter count, latency, and random seed. Without complete logs, NAS results are difficult to interpret.

Retain the top architectures and retrain them from scratch, ideally with several random seeds. This checks whether their performance came from the architecture itself or from lucky inherited weights, a favorable seed, or noisy validation.
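The logging and retraining guidelines might be combined as in the sketch below. The record fields follow the list above; the `TrialRecord` class, the `top_k` helper, and all numbers in the example records are hypothetical.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class TrialRecord:
    config: dict        # architecture description
    train_steps: int    # training budget used for this trial
    seed: int
    score: float        # validation score, higher is better
    param_count: int
    latency_ms: float

def top_k(records, k):
    """Return the k highest-scoring trials, to be retrained from scratch."""
    return sorted(records, key=lambda r: r.score, reverse=True)[:k]

# Illustrative records with made-up numbers.
records = [
    TrialRecord({"hidden_dims": [256, 256]}, 1000, 0, 0.912, 270_000, 0.8),
    TrialRecord({"hidden_dims": [512, 512, 256]}, 1000, 0, 0.931, 800_000, 2.1),
    TrialRecord({"hidden_dims": [128, 128]}, 1000, 0, 0.897, 118_000, 0.4),
]

# One JSON object per line is easy to append to and to parse later.
log_lines = [json.dumps(asdict(r)) for r in records]

finalists = top_k(records, k=2)
for record in finalists:
    print(record.config, record.score)
```

Each finalist would then be retrained with fresh weights and a few different seeds before any conclusion is drawn about the architecture.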

When NAS Is Useful

NAS is useful when architecture choices strongly affect the result and many training runs are affordable. It is especially relevant when deployment constraints matter.

Examples include mobile vision models, efficient transformers, speech models, recommendation systems, and specialized scientific models.

NAS is less useful when the dominant performance gains come from data quality, training recipe, pretrained weights, or scale. In many modern foundation model settings, data, compute, and optimization recipe matter more than small architecture changes.

Summary

Neural architecture search automates the exploration of model structures. It searches over layers, widths, operations, connections, normalization, activation functions, and other architectural choices.

NAS can use random search, evolutionary algorithms, reinforcement learning, Bayesian optimization, differentiable relaxation, or weight-sharing supernets. The main challenge is cost: each architecture may require expensive training and careful evaluation.

A practical NAS workflow starts with strong baselines, defines a constrained search space, uses fair evaluation, includes deployment costs, and retrains top candidates from scratch.