Random Search

Random search is a hyperparameter optimization method that samples configurations at random from a search space. Instead of evaluating every point on a fixed grid, random search chooses a fixed number of trials and draws each trial independently.

This method is simple, but it is often more effective than grid search in deep learning. The reason is that only a small number of hyperparameters usually dominate performance. Random search spends more trials exploring different values of important dimensions instead of repeatedly evaluating unimportant combinations.

The Basic Idea

Suppose we want to tune learning rate, batch size, hidden dimension, dropout, and optimizer. Grid search would require choosing a small finite set for each one and evaluating the full Cartesian product.

Random search instead defines a sampling distribution for each hyperparameter, for example:

$$ \eta \sim \text{LogUniform}(10^{-5},10^{-1}), $$

$$ B \sim \text{Choice}(\{32,64,128,256\}), $$

$$ p_{\text{drop}} \sim \text{Uniform}(0,0.5). $$

Each trial samples one configuration from these distributions.

For example:

config = {
    "learning_rate": 2.7e-4,
    "batch_size": 128,
    "hidden_dim": 512,
    "dropout": 0.18,
    "optimizer": "AdamW",
}

After training and validation, the result is recorded. The best configuration found so far is retained.

Why Random Search Helps

Grid search allocates equal attention to every search dimension. This is inefficient when some dimensions matter much more than others.

Assume validation performance depends strongly on learning rate but weakly on dropout. A grid with five learning rates and five dropout values costs

$$ 5 \times 5 = 25 $$

runs, yet it tests only five distinct learning rates.

Random search with 25 trials can evaluate 25 different learning rates. This gives much better coverage of the important dimension.

The advantage becomes larger as the number of weak or irrelevant dimensions increases.

Consider two hyperparameters:

| Method | Learning rate values explored | Dropout values explored | Total trials |
|---|---|---|---|
| Grid search | 5 fixed values | 5 fixed values | 25 |
| Random search | 25 sampled values | 25 sampled values | 25 |

With the same number of trials, random search explores more distinct values per dimension.

This matters especially for continuous hyperparameters. A grid imposes artificial resolution. Random sampling avoids this fixed lattice.
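The coverage argument can be checked with a small sketch. The grid values and the fixed seed below are illustrative assumptions, not recommendations; the only point is to count how many distinct learning rates each method tries in 25 runs.

```python
import itertools
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Grid search: 5 learning rates x 5 dropout values = 25 runs.
grid_lrs = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
grid_dropouts = [0.0, 0.1, 0.2, 0.3, 0.4]
grid_trials = list(itertools.product(grid_lrs, grid_dropouts))

# Random search: 25 runs, each drawing both values independently.
random_trials = [
    (
        math.exp(rng.uniform(math.log(1e-5), math.log(1e-1))),  # log-uniform learning rate
        rng.uniform(0.0, 0.5),                                  # uniform dropout
    )
    for _ in range(25)
]

distinct_grid_lrs = {lr for lr, _ in grid_trials}
distinct_random_lrs = {lr for lr, _ in random_trials}

print("grid:  ", len(grid_trials), "runs,", len(distinct_grid_lrs), "distinct learning rates")
print("random:", len(random_trials), "runs,", len(distinct_random_lrs), "distinct learning rates")
```

Because the learning rate is drawn from a continuous distribution, repeated values are practically impossible, so every random trial contributes a new learning rate.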

Defining Sampling Distributions

Random search requires distributions, not just candidate sets.

Common choices are:

| Hyperparameter | Suggested distribution |
|---|---|
| Learning rate | Log-uniform |
| Weight decay | Log-uniform |
| Dropout | Uniform |
| Batch size | Categorical |
| Hidden dimension | Categorical |
| Number of layers | Categorical |
| Warmup ratio | Uniform |
| Gradient clipping norm | Log-uniform |
| Label smoothing | Uniform |

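A log-uniform distribution is useful when the right scale is unknown. For example, the useful learning rate may be $10^{-4}$, $10^{-3}$, or $10^{-2}$. Sampling uniformly on the ordinary scale would over-sample large values: over the range $[10^{-5}, 10^{-1}]$, about 90% of uniform draws land in the top decade, between $10^{-2}$ and $10^{-1}$.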

A log-uniform draw can be written as:

$$ u \sim \text{Uniform}(\log a,\log b), \qquad x = \exp(u). $$

If base 10 is used:

$$ u \sim \text{Uniform}(\log_{10} a,\log_{10} b), \qquad x = 10^u. $$

In Python:

import random
import math

def log_uniform(low, high):
    return math.exp(random.uniform(math.log(low), math.log(high)))

learning_rate = log_uniform(1e-5, 1e-1)
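To see the difference between the two scales empirically, the following sketch draws learning rates both ways and measures how many fall below $10^{-2}$. The sample size, seed, and the helper name fraction_below are arbitrary choices made for this illustration.

```python
import math
import random

rng = random.Random(0)
low, high = 1e-5, 1e-1
n = 10_000

# Uniform on the ordinary scale versus uniform on the log scale.
uniform_draws = [rng.uniform(low, high) for _ in range(n)]
log_draws = [math.exp(rng.uniform(math.log(low), math.log(high))) for _ in range(n)]

def fraction_below(draws, threshold):
    """Fraction of draws that are strictly smaller than threshold."""
    return sum(x < threshold for x in draws) / len(draws)

frac_uniform = fraction_below(uniform_draws, 1e-2)
frac_log = fraction_below(log_draws, 1e-2)

print("uniform draws below 1e-2:    ", frac_uniform)
print("log-uniform draws below 1e-2:", frac_log)
```

With these bounds, roughly one in ten uniform draws falls below $10^{-2}$, while the log-uniform sampler spends about three quarters of its draws there, since three of the four decades lie below that threshold.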

A Minimal Random Search Implementation

A search space can be represented as a dictionary:

search_space = {
    "learning_rate": ("log_uniform", 1e-5, 1e-1),
    "weight_decay": ("log_uniform", 1e-6, 1e-1),
    "batch_size": ("choice", [32, 64, 128, 256]),
    "hidden_dim": ("choice", [128, 256, 512, 1024]),
    "dropout": ("uniform", 0.0, 0.5),
    "optimizer": ("choice", ["SGD", "Adam", "AdamW"]),
}

We can implement a sampler:

import random
import math

def sample_value(spec):
    kind = spec[0]

    if kind == "choice":
        return random.choice(spec[1])

    if kind == "uniform":
        low, high = spec[1], spec[2]
        return random.uniform(low, high)

    if kind == "log_uniform":
        low, high = spec[1], spec[2]
        return math.exp(random.uniform(math.log(low), math.log(high)))

    raise ValueError(f"unknown search distribution: {kind}")

def sample_config(search_space):
    return {
        name: sample_value(spec)
        for name, spec in search_space.items()
    }

Then random search becomes:

best_config = None
best_score = float("-inf")

num_trials = 50

for trial in range(num_trials):
    config = sample_config(search_space)

    score = train_and_evaluate(config)

    if score > best_score:
        best_score = score
        best_config = config

print("best score:", best_score)
print("best config:", best_config)

The function train_and_evaluate should construct the model, optimizer, scheduler, data loaders, and training loop from the configuration.
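To run the loop end to end without a real model, one option is to substitute a synthetic objective. In the sketch below, toy_train_and_evaluate, sample_toy_config, and random_search are hypothetical names; the scoring formula is invented so that the score depends strongly on learning rate and only weakly on dropout.

```python
import math
import random

def toy_train_and_evaluate(config):
    """Hypothetical stand-in for real training: returns a synthetic score.

    The score peaks at learning_rate = 1e-3 and depends only weakly on dropout.
    """
    lr_penalty = (math.log10(config["learning_rate"]) + 3.0) ** 2
    dropout_penalty = 0.1 * (config["dropout"] - 0.1) ** 2
    return 1.0 - 0.1 * lr_penalty - dropout_penalty

def sample_toy_config(rng):
    """Draw one configuration from a two-dimensional search space."""
    return {
        "learning_rate": math.exp(rng.uniform(math.log(1e-5), math.log(1e-1))),
        "dropout": rng.uniform(0.0, 0.5),
    }

def random_search(num_trials, seed=0):
    """Run num_trials independent trials and keep the best one."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(num_trials):
        config = sample_toy_config(rng)
        score = toy_train_and_evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

best_config, best_score = random_search(num_trials=50)
print("best score:", best_score)
print("best config:", best_config)
```

Swapping the synthetic objective for a real training function leaves the search loop unchanged, which is a large part of why random search is easy to adopt.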

Some hyperparameters only make sense under certain choices.

For example, momentum matters only for SGD:

if optimizer == "SGD":
    momentum = random.uniform(0.0, 0.99)

The Adam beta values matter only for Adam and AdamW:

if optimizer in {"Adam", "AdamW"}:
    beta1 = random.uniform(0.8, 0.99)
    beta2 = random.uniform(0.9, 0.9999)

A conditional sampler can encode this directly:

def sample_optimizer_config():
    optimizer = random.choice(["SGD", "Adam", "AdamW"])

    config = {"optimizer": optimizer}

    if optimizer == "SGD":
        config["momentum"] = random.uniform(0.0, 0.99)

    if optimizer in {"Adam", "AdamW"}:
        config["beta1"] = random.uniform(0.8, 0.99)
        config["beta2"] = random.uniform(0.9, 0.9999)

    return config

Conditional search spaces avoid meaningless parameters. This improves search efficiency and makes the result easier to interpret.

Number of Trials

Random search requires choosing a trial budget. The budget depends on training cost, available hardware, and search-space size.

For small models, hundreds of trials may be feasible. For large models, even ten trials may be expensive.
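One rough way to reason about the budget is a simple probability argument. Assume the trials are independent and call a configuration "good" if it lies in the best fraction $q$ of the search space, measured under the sampling distribution. Then the chance that at least one of $n$ trials is good is $1 - (1 - q)^n$. The sketch below evaluates this; the function names are hypothetical, and the calculation ignores evaluation noise.

```python
import math

def prob_hit_top_fraction(num_trials, top_fraction):
    """Probability that at least one of num_trials independent draws
    lands in the best top_fraction of the search space."""
    return 1.0 - (1.0 - top_fraction) ** num_trials

def trials_needed(top_fraction, confidence):
    """Smallest number of trials whose hit probability reaches confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - top_fraction))

print(prob_hit_top_fraction(60, 0.05))  # about 0.954
print(trials_needed(0.05, 0.95))        # 59
```

Under these assumptions, about 60 trials give a 95% chance of sampling at least one configuration from the top 5% of the space, regardless of the number of dimensions. How good the top 5% actually is depends on how well the ranges were chosen.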

A practical pattern is:

| Training cost per run | Reasonable initial trials |
|---|---|
| Seconds | 100 to 1000 |
| Minutes | 50 to 200 |
| Hours | 10 to 50 |
| Days | 3 to 10 |

The search should begin with wide ranges and a modest budget. After promising regions are found, a second random search can focus on narrower ranges.

Random search is often used in stages.

First, run a broad search:

broad_space = {
    "learning_rate": ("log_uniform", 1e-5, 1e-1),
    "weight_decay": ("log_uniform", 1e-6, 1e-1),
    "dropout": ("uniform", 0.0, 0.5),
}

Suppose the best results cluster near:

$$ \eta \in [10^{-4},10^{-3}], \qquad \lambda_{\text{wd}} \in [10^{-3},10^{-2}]. $$

Then define a narrower search:

narrow_space = {
    "learning_rate": ("log_uniform", 1e-4, 1e-3),
    "weight_decay": ("log_uniform", 1e-3, 1e-2),
    "dropout": ("uniform", 0.05, 0.2),
}

This second stage increases resolution where useful configurations are likely.
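The narrowing step can also be derived from the recorded trials instead of being read off by eye. The sketch below is one possible heuristic, not a standard rule: it takes the range spanned by the top-scoring trials and widens it by a multiplicative margin. The function name narrow_log_range, the margin, and the example records are all made up for illustration.

```python
def narrow_log_range(results, name, top_k=5, margin=2.0):
    """Return (low, high) spanning the top_k values of `name`,
    widened by a factor of `margin` on each side."""
    completed = [r for r in results if r["score"] is not None]
    top = sorted(completed, key=lambda r: r["score"], reverse=True)[:top_k]
    values = [r["config"][name] for r in top]
    return min(values) / margin, max(values) * margin

# Placeholder records, not real experimental results.
example_results = [
    {"config": {"learning_rate": 3e-4}, "score": 0.91},
    {"config": {"learning_rate": 8e-4}, "score": 0.90},
    {"config": {"learning_rate": 1e-2}, "score": 0.60},
    {"config": {"learning_rate": 5e-2}, "score": None},  # failed trial
]

low, high = narrow_log_range(example_results, "learning_rate", top_k=2)
print(low, high)  # bounds for the second-stage log-uniform range
```

The margin keeps the second stage from collapsing onto a handful of lucky trials. A multiplicative margin suits log-uniform parameters; an additive one would be more natural for uniform parameters such as dropout.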

Random Seeds and Reproducibility

Random search introduces randomness in two places.

First, the search algorithm samples configurations randomly. Second, model training itself is stochastic due to initialization, data shuffling, dropout, nondeterministic GPU kernels, and augmentation.

To improve reproducibility, save:

| Item | Purpose |
|---|---|
| Search seed | Reproduce sampled configurations |
| Training seed | Reproduce model initialization and data order |
| Full configuration | Rebuild the experiment |
| Validation metrics | Compare trials |
| Checkpoint path | Reload selected model |
| Code version | Match implementation |

Example:

import random
import torch

def set_seed(seed):
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

Each trial can receive its own seed:

base_seed = 1234

for trial in range(num_trials):
    trial_seed = base_seed + trial
    set_seed(trial_seed)

    config = sample_config(search_space)
    config["seed"] = trial_seed

    score = train_and_evaluate(config)

This makes the search easier to audit.
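A possible refinement is to keep the search seed and the training seed fully separate, so that changing one does not disturb the other. The sketch below assumes a dedicated random.Random instance for sampling; the function name sample_configs and the training_base_seed parameter are hypothetical.

```python
import math
import random

def sample_configs(search_seed, num_trials, training_base_seed=1000):
    """Draw all configurations from a dedicated generator.

    The search generator is independent of the global random state, so
    seeding the training code does not change which configurations are sampled.
    """
    search_rng = random.Random(search_seed)
    configs = []
    for trial in range(num_trials):
        configs.append({
            "learning_rate": math.exp(search_rng.uniform(math.log(1e-5), math.log(1e-1))),
            "dropout": search_rng.uniform(0.0, 0.5),
            "seed": training_base_seed + trial,  # training seed for this trial
        })
    return configs

configs = sample_configs(search_seed=1234, num_trials=5)
for config in configs:
    print(config)
```

With this layout, the same search seed always yields the same list of configurations, and a trial can be rerun with a different training seed to estimate how much of its score is noise.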

Handling Failed Trials

Random search may sample unstable or invalid configurations. A learning rate may be too large. A batch size may exceed memory. A transformer hidden size may be incompatible with the number of heads.

Failed trials should be recorded, not silently ignored.

results = []

for trial in range(num_trials):
    config = sample_config(search_space)

    try:
        score = train_and_evaluate(config)
        status = "ok"
        error = None

    except RuntimeError as e:
        # Keep the message so the cause of the failure can be analyzed later.
        score = None
        status = "failed"
        error = str(e)

    results.append({
        "trial": trial,
        "config": config,
        "score": score,
        "status": status,
        "error": error,
    })

This record helps identify bad regions of the search space. If many trials fail, the search space should be constrained.
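Some failures can be ruled out before any training starts. As a sketch, the hidden-size and head-count incompatibility mentioned above can be handled by rejection sampling: draw a configuration, check the constraint, and redraw if it fails. The candidate values and the names is_valid and sample_valid_config are assumptions made for this example.

```python
import random

def is_valid(config):
    """Hypothetical constraint: the number of attention heads must
    divide the hidden size evenly."""
    return config["hidden_dim"] % config["num_heads"] == 0

def sample_valid_config(rng, max_attempts=100):
    """Redraw until a configuration satisfies the constraint."""
    for _ in range(max_attempts):
        config = {
            "hidden_dim": rng.choice([128, 192, 256, 384, 512]),
            "num_heads": rng.choice([4, 6, 8, 12, 16]),
        }
        if is_valid(config):
            return config
    raise RuntimeError("no valid configuration found; tighten the search space")

rng = random.Random(0)
print(sample_valid_config(rng))
```

Rejection sampling changes the effective distribution: configurations are drawn from the original distribution conditioned on being valid. That is usually what is wanted, but it should be recorded alongside the search space.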

Comparing Configurations Fairly

All configurations should be compared under the same evaluation protocol.

This means:

| Factor | Should be fixed across trials |
|---|---|
| Training data | Same split |
| Validation data | Same split |
| Number of epochs | Same budget unless using early stopping |
| Evaluation metric | Same metric |
| Preprocessing | Same rules unless intentionally searched |
| Random seed policy | Same procedure |
| Hardware assumptions | Same precision and device type |

If one configuration receives more training steps than another, its score may reflect extra compute rather than better hyperparameters.

For budget-aware optimization, use an explicit objective such as validation accuracy after a fixed number of steps, or validation loss under a fixed GPU-hour budget.

Search Results as Data

Random search produces useful diagnostic data. Even failed or mediocre trials can show which hyperparameters matter.

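After running trials, we can sort the completed ones by validation score. Failed trials have a score of None, which cannot be compared with numbers, so they are filtered out first:

completed = [r for r in results if r["score"] is not None]
ranked = sorted(completed, key=lambda r: r["score"], reverse=True)

We can inspect the best configurations:

for r in ranked[:5]:
    print(r["score"], r["config"])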

Patterns are often more valuable than a single best configuration. For example, the top configurations may all use AdamW, learning rates near $3\times10^{-4}$, dropout below 0.2, and moderate weight decay. This suggests a stable region of the search space.
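A simple way to look for such patterns is to group trials by one hyperparameter and compare average scores. The sketch below assumes the record layout used earlier (a "config" dictionary and a "score" that may be None); the helper name mean_score_by and the sample records are placeholders, not real results.

```python
from collections import defaultdict
from statistics import mean

def mean_score_by(results, name):
    """Average score of completed trials, grouped by the value of `name`."""
    groups = defaultdict(list)
    for r in results:
        if r["score"] is not None:  # skip failed trials
            groups[r["config"][name]].append(r["score"])
    return {value: mean(scores) for value, scores in groups.items()}

# Placeholder records for illustration only.
sample_results = [
    {"config": {"optimizer": "AdamW"}, "score": 0.9},
    {"config": {"optimizer": "AdamW"}, "score": 0.8},
    {"config": {"optimizer": "SGD"}, "score": 0.7},
    {"config": {"optimizer": "SGD"}, "score": None},
]

by_optimizer = mean_score_by(sample_results, "optimizer")
print(by_optimizer)
```

Grouping works directly for categorical hyperparameters. Continuous ones, such as learning rate, would first need to be binned, for example by decade.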

Advantages and Disadvantages

| Advantages | Disadvantages |
|---|---|
| Simple to implement | No guarantee of finding the optimum |
| Works well in high-dimensional spaces | Results vary across random seeds |
| Efficient for continuous variables | Can waste trials in bad regions |
| Easy to parallelize | Does not learn from previous trials |
| Better than grid search for many DL tasks | Needs careful distribution design |

Random search is a strong default when the search space is moderately large and the cost per run is acceptable.

Random search is a good choice when:

| Situation | Reason |
|---|---|
| Several hyperparameters matter | Better coverage than grid search |
| Some dimensions are continuous | Avoids fixed grid resolution |
| Search budget is limited | Can stop after any number of trials |
| Parallel workers are available | Trials are independent |
| Baselines are needed quickly | Simple and robust |
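Because trials are independent, they can be dispatched to workers without any coordination. The sketch below illustrates the idea with a thread pool from the standard library and a synthetic objective; toy_objective and run_parallel_search are hypothetical names, and a real training job would more likely run in separate processes or on separate machines.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor

def toy_objective(config):
    """Hypothetical stand-in for a real training run; best at learning_rate = 1e-3."""
    return -abs(math.log10(config["learning_rate"]) + 3.0)

def run_parallel_search(num_trials, max_workers=4, seed=0):
    rng = random.Random(seed)
    # Sample every configuration up front so the outcome does not depend
    # on the order in which workers finish.
    configs = [
        {"learning_rate": math.exp(rng.uniform(math.log(1e-5), math.log(1e-1)))}
        for _ in range(num_trials)
    ]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as configs.
        scores = list(pool.map(toy_objective, configs))
    best = max(range(num_trials), key=lambda i: scores[i])
    return configs[best], scores[best]

best_config, best_score = run_parallel_search(num_trials=20)
print(best_score, best_config)
```

Sampling all configurations before dispatching them keeps the search reproducible: the number of workers affects wall-clock time but not which configurations are evaluated.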

Random search is less suitable when each run is extremely expensive and only a few trials are possible. In that case, expert tuning or Bayesian optimization may use the budget more effectively.

Summary

Random search samples hyperparameter configurations from predefined distributions. It replaces exhaustive enumeration with stochastic exploration.

Compared with grid search, random search often covers important dimensions more effectively under the same trial budget. It works especially well when only a few hyperparameters strongly influence performance.

A good random search depends on well-designed sampling distributions, proper logging, valid constraints, reproducible seeds, and fair evaluation. It is simple enough to be a baseline and strong enough to be useful in real deep learning systems.