Reservoir Sampling

Select a uniform random sample from a stream of unknown length using fixed memory.

Reservoir Sampling

Reservoir Sampling selects a uniform random sample from a stream without knowing the stream length in advance. It keeps only a fixed-size reservoir and updates it as new items arrive.

The simplest version keeps one item. The general version keeps $k$ items.

Problem

Given a stream:

$$ x_1, x_2, \dots, x_n $$

choose $k$ items uniformly at random from the stream, using only:

$$ O(k) $$

memory.

Algorithm

Fill the reservoir with the first $k$ items. For the $i$-th item after that, choose a random integer $j$ from $0$ to $i$. If $j < k$, replace reservoir position $j$.

Here $i$ is zero-based.

reservoir_sampling(stream, k):
    R = empty array

    for i, x in enumerate(stream):
        if i < k:
            append x to R
        else:
            j = random integer from 0 to i
            if j < k:
                R[j] = x

    return R

Example

Let:

$$ stream = [a, b, c, d, e] $$

and:

$$ k = 2 $$

The reservoir first becomes:

$$ [a, b] $$

When $c, d, e$ arrive, each new item has a chance to replace one existing reservoir item. After the stream ends, every 2-item subset has equal probability.

Correctness

For any item $x_i$, the probability that it enters the reservoir when processed is:

$$ \frac{k}{i} $$

using one-based position $i$.

After entering, it must avoid replacement by every later item $x_t$ for $t = i + 1, \dots, n$. At step $t$, the probability that a specific reservoir item is replaced is:

$$ \frac{1}{t} $$

so the probability it survives that step is:

$$ 1 - \frac{1}{t} $$

Therefore its final probability is:

$$ \frac{k}{i} \prod_{t=i+1}^{n}\left(1 - \frac{1}{t}\right)

\frac{k}{i} \prod_{t=i+1}^{n}\frac{t-1}{t}

\frac{k}{n} $$

Thus every item is included with equal probability $k/n$.

Complexity

operation cost
memory $O(k)$
update per item $O(1)$
total time $O(n)$

The algorithm is online and does not need to know $n$ in advance.

When to Use

Use Reservoir Sampling when:

  • the input is a stream
  • the stream length is unknown
  • storing all items is impossible or unnecessary
  • a uniform sample is required

It is common in logging, telemetry, randomized testing, large file sampling, and database query processing.

Implementation

import random

def reservoir_sampling(stream, k):
    reservoir = []

    for i, x in enumerate(stream):
        if i < k:
            reservoir.append(x)
        else:
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = x

    return reservoir
import "math/rand"

func ReservoirSampling[T any](stream []T, k int) []T {
	reservoir := make([]T, 0, k)

	for i, x := range stream {
		if i < k {
			reservoir = append(reservoir, x)
			continue
		}

		j := rand.Intn(i + 1)
		if j < k {
			reservoir[j] = x
		}
	}

	return reservoir
}