Reservoir Sampling
Select a uniform random sample from a stream of unknown length using fixed memory.
Reservoir Sampling
Reservoir Sampling selects a uniform random sample from a stream without knowing the stream length in advance. It keeps only a fixed-size reservoir and updates it as new items arrive.
The simplest version keeps one item. The general version keeps $k$ items.
Problem
Given a stream:
$$ x_1, x_2, \dots, x_n $$
choose $k$ items uniformly at random from the stream, using only:
$$ O(k) $$
memory.
Algorithm
Fill the reservoir with the first $k$ items. For the $i$-th item after that, choose a random integer $j$ from $0$ to $i$. If $j < k$, replace reservoir position $j$.
Here $i$ is zero-based.
reservoir_sampling(stream, k):
R = empty array
for i, x in enumerate(stream):
if i < k:
append x to R
else:
j = random integer from 0 to i
if j < k:
R[j] = x
return R
Example
Let:
$$ stream = [a, b, c, d, e] $$
and:
$$ k = 2 $$
The reservoir first becomes:
$$ [a, b] $$
When $c, d, e$ arrive, each new item has a chance to replace one existing reservoir item. After the stream ends, every 2-item subset has equal probability.
Correctness
For any item $x_i$, the probability that it enters the reservoir when processed is:
$$ \frac{k}{i} $$
using one-based position $i$.
After entering, it must avoid replacement by every later item $x_t$ for $t = i + 1, \dots, n$. At step $t$, the probability that a specific reservoir item is replaced is:
$$ \frac{1}{t} $$
so the probability it survives that step is:
$$ 1 - \frac{1}{t} $$
Therefore its final probability is:
$$ \frac{k}{i} \prod_{t=i+1}^{n}\left(1 - \frac{1}{t}\right)
\frac{k}{i} \prod_{t=i+1}^{n}\frac{t-1}{t}
\frac{k}{n} $$
Thus every item is included with equal probability $k/n$.
Complexity
| operation | cost |
|---|---|
| memory | $O(k)$ |
| update per item | $O(1)$ |
| total time | $O(n)$ |
The algorithm is online and does not need to know $n$ in advance.
When to Use
Use Reservoir Sampling when:
- the input is a stream
- the stream length is unknown
- storing all items is impossible or unnecessary
- a uniform sample is required
It is common in logging, telemetry, randomized testing, large file sampling, and database query processing.
Implementation
import random
def reservoir_sampling(stream, k):
reservoir = []
for i, x in enumerate(stream):
if i < k:
reservoir.append(x)
else:
j = random.randint(0, i)
if j < k:
reservoir[j] = x
return reservoir
import "math/rand"
func ReservoirSampling[T any](stream []T, k int) []T {
reservoir := make([]T, 0, k)
for i, x := range stream {
if i < k {
reservoir = append(reservoir, x)
continue
}
j := rand.Intn(i + 1)
if j < k {
reservoir[j] = x
}
}
return reservoir
}