Sample Sort

Divide and conquer sorting algorithm that uses sampling to partition data into balanced buckets.

Sample Sort

Sample sort is a generalization of quicksort and bucket sort. It selects a sample of elements, uses them to define partition boundaries, and distributes the data into multiple buckets that are sorted independently.

It is widely used in parallel and distributed systems because it produces well balanced partitions.

Problem

Given an array $A$ of length $n$, reorder it such that:

$$ A[0] \le A[1] \le \dots \le A[n-1] $$

Idea

Instead of choosing a single pivot, sample sort chooses multiple splitters:

  1. take a random sample from the array
  2. sort the sample
  3. choose evenly spaced elements as splitters
  4. partition the array into buckets using splitters
  5. sort each bucket recursively

This creates $k$ partitions instead of two, reducing recursion depth.

Algorithm

sample_sort(A):
    if length(A) small:
        use insertion sort
        return

    sample = select random elements from A
    sort(sample)

    splitters = choose k evenly spaced elements from sample

    buckets = partition A into k+1 buckets using splitters

    for each bucket:
        sample_sort(bucket)

    concatenate buckets

Example

Let:

$$ A = [9, 2, 7, 4, 6, 1, 8, 3, 5] $$

Sample:

$$ [2, 7, 4, 8] $$

Sorted sample:

$$ [2, 4, 7, 8] $$

Choose splitters:

$$ [4, 7] $$

Partition:

bucket values
< 4 [2, 1, 3]
4 to 7 [4, 6, 5]
> 7 [9, 8]

Recursively sort each bucket:

Final result:

$$ [1,2,3,4,5,6,7,8,9] $$

Correctness

The splitters divide the array into ranges such that all elements in earlier buckets are less than or equal to those in later buckets. Recursively sorting each bucket ensures that each bucket is internally sorted. Concatenating the sorted buckets yields a globally sorted array.

Complexity

case time
average $O(n \log n)$
worst $O(n^2)$

With good sampling:

  • partitions are balanced
  • recursion depth is reduced

In parallel settings:

metric value
work $O(n \log n)$
depth $O(\log n)$

Properties

property value
stable depends on implementation
in-place no
parallel friendly very high
partitioning multiway

Notes

Sample sort is widely used in high performance and distributed sorting systems. It provides better load balancing than quicksort because multiple splitters create more uniform partitions.

It is a key algorithm in parallel sorting frameworks and large scale data processing systems.

Implementation

import random

def sample_sort(a):
    if len(a) <= 16:
        return sorted(a)

    k = int(len(a) ** 0.5)

    sample = random.sample(a, min(len(a), k * 2))
    sample.sort()

    splitters = [sample[i * len(sample) // (k + 1)] for i in range(1, k + 1)]

    buckets = [[] for _ in range(k + 1)]

    for x in a:
        placed = False
        for i, s in enumerate(splitters):
            if x <= s:
                buckets[i].append(x)
                placed = True
                break
        if not placed:
            buckets[-1].append(x)

    result = []
    for b in buckets:
        result.extend(sample_sort(b))

    return result