Random Search

Random search is a hyperparameter optimization method that samples configurations at random from a search space. Instead of evaluating every point on a fixed grid, random search chooses a fixed number of trials and draws each trial independently.

This method is simple, but it is often more effective than grid search in deep learning. The reason is that only a small number of hyperparameters usually dominate performance. Random search spends more trials exploring different values of important dimensions instead of repeatedly evaluating unimportant combinations.

The Basic Idea

Suppose we want to tune learning rate, batch size, hidden dimension, dropout, and optimizer. Grid search would require choosing a small finite set for each one and evaluating the full Cartesian product.

Random search instead defines distributions:

$$ \eta \sim \text{LogUniform}(10^{-5},10^{-1}), $$

$$ B \sim \text{Choice}{32,64,128,256}, $$

$$ p_{\text{drop}} \sim \text{Uniform}(0,0.5). $$

Each trial samples one configuration from these distributions.

For example:

config = {
    "learning_rate": 2.7e-4,
    "batch_size": 128,
    "hidden_dim": 512,
    "dropout": 0.18,
    "optimizer": "AdamW",
}

After training and validation, the result is recorded. The best configuration found so far is retained.

Why Random Search Helps

Grid search allocates equal attention to every search dimension. This is inefficient when some dimensions matter much more than others.

Assume validation performance depends strongly on learning rate but weakly on dropout. A grid with five learning rates and five dropout values evaluates only five distinct learning rates:

$$ 5 \times 5 = 25 $$

runs, but the learning rate takes only five possible values.

Random search with 25 trials can evaluate 25 different learning rates. This gives much better coverage of the important dimension.

The advantage becomes larger as the number of weak or irrelevant dimensions increases.

Random Search Versus Grid Search

Consider two hyperparameters:

Method	Learning rate values explored	Dropout values explored	Total trials
Grid search	5 fixed values	5 fixed values	25
Random search	25 sampled values	25 sampled values	25

With the same number of trials, random search explores more distinct values per dimension.

This matters especially for continuous hyperparameters. A grid imposes artificial resolution. Random sampling avoids this fixed lattice.

Defining Sampling Distributions

Random search requires distributions, not just candidate sets.

Common choices are:

Hyperparameter	Suggested distribution
Learning rate	Log-uniform
Weight decay	Log-uniform
Dropout	Uniform
Batch size	Categorical
Hidden dimension	Categorical
Number of layers	Categorical
Warmup ratio	Uniform
Gradient clipping norm	Log-uniform
Label smoothing	Uniform

A log-uniform distribution is useful when the right scale is unknown. For example, the useful learning rate may be $10^{-4}$, $10^{-3}$, or $10^{-2}$. Sampling uniformly in ordinary space would over-sample large values.

A log-uniform draw can be written as:

$$ u \sim \text{Uniform}(\log a,\log b), \qquad x = \exp(u). $$

If base 10 is used:

$$ u \sim \text{Uniform}(\log_{10} a,\log_{10} b), \qquad x = 10^u. $$

In Python:

import random
import math

def log_uniform(low, high):
    return math.exp(random.uniform(math.log(low), math.log(high)))

learning_rate = log_uniform(1e-5, 1e-1)

A Minimal Random Search Implementation

A search space can be represented as a dictionary:

search_space = {
    "learning_rate": ("log_uniform", 1e-5, 1e-1),
    "weight_decay": ("log_uniform", 1e-6, 1e-1),
    "batch_size": ("choice", [32, 64, 128, 256]),
    "hidden_dim": ("choice", [128, 256, 512, 1024]),
    "dropout": ("uniform", 0.0, 0.5),
    "optimizer": ("choice", ["SGD", "Adam", "AdamW"]),
}

We can implement a sampler:

import random
import math

def sample_value(spec):
    kind = spec[0]

    if kind == "choice":
        return random.choice(spec[1])

    if kind == "uniform":
        low, high = spec[1], spec[2]
        return random.uniform(low, high)

    if kind == "log_uniform":
        low, high = spec[1], spec[2]
        return math.exp(random.uniform(math.log(low), math.log(high)))

    raise ValueError(f"unknown search distribution: {kind}")

def sample_config(search_space):
    return {
        name: sample_value(spec)
        for name, spec in search_space.items()
    }

Then random search becomes:

best_config = None
best_score = float("-inf")

num_trials = 50

for trial in range(num_trials):
    config = sample_config(search_space)

    score = train_and_evaluate(config)

    if score > best_score:
        best_score = score
        best_config = config

print("best score:", best_score)
print("best config:", best_config)

The function train_and_evaluate should construct the model, optimizer, scheduler, data loaders, and training loop from the configuration.

Conditional Random Search

Some hyperparameters only make sense under certain choices.

For example, momentum matters for SGD:

if optimizer == "SGD":
    momentum = sample_uniform(0.0, 0.99)

Adam beta values matter for Adam and AdamW:

if optimizer in {"Adam", "AdamW"}:
    beta1 = sample_uniform(0.8, 0.99)
    beta2 = sample_uniform(0.9, 0.9999)

A conditional sampler can encode this directly:

def sample_optimizer_config():
    optimizer = random.choice(["SGD", "Adam", "AdamW"])

    config = {"optimizer": optimizer}

    if optimizer == "SGD":
        config["momentum"] = random.uniform(0.0, 0.99)

    if optimizer in {"Adam", "AdamW"}:
        config["beta1"] = random.uniform(0.8, 0.99)
        config["beta2"] = log_uniform(0.9, 0.9999)

    return config

Conditional search spaces avoid meaningless parameters. This improves search efficiency and makes the result easier to interpret.

Number of Trials

Random search requires choosing a trial budget. The budget depends on training cost, available hardware, and search-space size.

For small models, hundreds of trials may be feasible. For large models, even ten trials may be expensive.

A practical pattern is:

Training cost per run	Reasonable initial trials
Seconds	100 to 1000
Minutes	50 to 200
Hours	10 to 50
Days	3 to 10

The search should begin with wide ranges and a modest budget. After promising regions are found, a second random search can focus on narrower ranges.

Coarse-to-Fine Random Search

Random search is often used in stages.

First, run a broad search:

broad_space = {
    "learning_rate": ("log_uniform", 1e-5, 1e-1),
    "weight_decay": ("log_uniform", 1e-6, 1e-1),
    "dropout": ("uniform", 0.0, 0.5),
}

Suppose the best results cluster near:

$$ \eta \in [10^{-4},10^{-3}], \qquad \lambda_{\text{wd}} \in [10^{-3},10^{-2}]. $$

Then define a narrower search:

narrow_space = {
    "learning_rate": ("log_uniform", 1e-4, 1e-3),
    "weight_decay": ("log_uniform", 1e-3, 1e-2),
    "dropout": ("uniform", 0.05, 0.2),
}

This second stage increases resolution where useful configurations are likely.

Random Seeds and Reproducibility

Random search introduces randomness in two places.

First, the search algorithm samples configurations randomly. Second, model training itself is stochastic due to initialization, data shuffling, dropout, nondeterministic GPU kernels, and augmentation.

To improve reproducibility, save:

Item	Purpose
Search seed	Reproduce sampled configurations
Training seed	Reproduce model initialization and data order
Full configuration	Rebuild the experiment
Validation metrics	Compare trials
Checkpoint path	Reload selected model
Code version	Match implementation

Example:

import random
import torch

def set_seed(seed):
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

Each trial can receive its own seed:

base_seed = 1234

for trial in range(num_trials):
    trial_seed = base_seed + trial
    set_seed(trial_seed)

    config = sample_config(search_space)
    config["seed"] = trial_seed

    score = train_and_evaluate(config)

This makes the search easier to audit.

Handling Failed Trials

Random search may sample unstable or invalid configurations. A learning rate may be too large. A batch size may exceed memory. A transformer hidden size may be incompatible with the number of heads.

Failed trials should be recorded, not silently ignored.

results = []

for trial in range(num_trials):
    config = sample_config(search_space)

    try:
        score = train_and_evaluate(config)
        status = "ok"

    except RuntimeError as e:
        score = None
        status = "failed"
        error = str(e)

    results.append({
        "trial": trial,
        "config": config,
        "score": score,
        "status": status,
    })

This record helps identify bad regions of the search space. If many trials fail, the search space should be constrained.

Comparing Configurations Fairly

A configuration should be compared under the same evaluation protocol.

This means:

Factor	Should be fixed across trials
Training data	Same split
Validation data	Same split
Number of epochs	Same budget unless using early stopping
Evaluation metric	Same metric
Preprocessing	Same rules unless intentionally searched
Random seed policy	Same procedure
Hardware assumptions	Same precision and device type

If one configuration receives more training steps than another, its score may reflect extra compute rather than better hyperparameters.

For budget-aware optimization, use an explicit objective such as validation accuracy after a fixed number of steps, or validation loss under a fixed GPU-hour budget.

Search Results as Data

Random search produces useful diagnostic data. Even failed or mediocre trials can show which hyperparameters matter.

After running trials, we can sort by validation score:

results = sorted(results, key=lambda r: r["score"], reverse=True)

We can inspect the best configurations:

for r in results[:5]:
    print(r["score"], r["config"])

Patterns are often more valuable than a single best configuration. For example, the top configurations may all use AdamW, learning rates near $3\times10^{-4}$, dropout below 0.2, and moderate weight decay. This suggests a stable region of the search space.

Advantages and Disadvantages

Advantages	Disadvantages
Simple to implement	No guarantee of finding the optimum
Works well in high-dimensional spaces	Results vary across random seeds
Efficient for continuous variables	Can waste trials in bad regions
Easy to parallelize	Does not learn from previous trials
Better than grid search for many DL tasks	Needs careful distribution design

Random search is a strong default when the search space is moderately large and the cost per run is acceptable.

When to Use Random Search

Random search is a good choice when:

Situation	Reason
Several hyperparameters matter	Better coverage than grid search
Some dimensions are continuous	Avoids fixed grid resolution
Search budget is limited	Can stop after any number of trials
Parallel workers are available	Trials are independent
Baselines are needed quickly	Simple and robust

Random search is less suitable when each run is extremely expensive and only a few trials are possible. In that case, expert tuning or Bayesian optimization may use the budget more effectively.

Summary

Random search samples hyperparameter configurations from predefined distributions. It replaces exhaustive enumeration with stochastic exploration.

Compared with grid search, random search often covers important dimensions more effectively under the same trial budget. It works especially well when only a few hyperparameters strongly influence performance.

A good random search depends on well-designed sampling distributions, proper logging, valid constraints, reproducible seeds, and fair evaluation. It is simple enough to be a baseline and strong enough to be useful in real deep learning systems.