External Merge Sort

External merge sort is a disk-based sorting algorithm designed for datasets that exceed main memory. It minimizes random I/O and relies on sequential reads and writes, which are efficient on disks and object storage.

You use it when the input size is much larger than RAM, for example multi-gigabyte or terabyte scale data.

Model

Assume:

Input size: $N$ elements
Memory capacity: $M$ elements
Block size: $B$ elements per I/O

The goal is to sort using minimal disk passes.

High-Level Idea

The algorithm has two phases:

Run generation Split data into chunks that fit in memory, sort each chunk, and write sorted runs to disk.
Merge phase Merge multiple sorted runs into one final sorted sequence.

Algorithm

Phase 1: Run Generation

Read chunks of size $M$, sort in memory, and write back.

generate_runs(file):
    runs = []
    while not end_of_file:
        chunk = read_next_M_elements(file)
        sort(chunk)
        run_file = write_to_disk(chunk)
        runs.append(run_file)
    return runs

Phase 2: K-way Merge

Merge multiple runs simultaneously using a min-heap.

merge_runs(runs):
    create min_heap
    open all run files

    for each run:
        read first element and push (value, run_id) into heap

    while heap not empty:
        (value, r) = extract_min(heap)
        output value

        if run r has more elements:
            read next element from r
            push into heap

Example

Suppose:

Memory holds 3 elements
Input: [8, 3, 5, 1, 9, 2, 7]

Step 1: Runs

Chunks:

[8, 3, 5] → [3, 5, 8]
[1, 9, 2] → [1, 2, 9]
[7] → [7]

Step 2: Merge

Merge:

[3,5,8], [1,2,9], [7]

Result:

[1,2,3,5,7,8,9]

Complexity

I/O Complexity

Number of passes:

$$ O\left(\log_k \frac{N}{M}\right) $$

where $k$ is number of runs merged at once.

Total I/O:

$$ O\left(\frac{N}{B} \log_k \frac{N}{M}\right) $$

CPU Complexity

Each element participates in heap operations:

$$ O(N \log k) $$

Design Choices

parameter	effect
run size $M$	larger runs reduce merge passes
merge degree $k$	larger $k$ reduces passes but increases heap cost
block size $B$	larger blocks reduce I/O overhead

Optimizations

Replacement selection Produces runs larger than memory, often about $2M$.
Buffered I/O Use read/write buffers to reduce system calls.
Multiway merge Merge many runs at once instead of pairwise merging.
Loser tree Replace heap with tournament tree for faster merging.

When to Use

External merge sort is appropriate when:

data does not fit in RAM
storage is disk, SSD, or object store
sequential I/O is much cheaper than random access

It is the standard sorting method in databases, large-scale data processing systems, and distributed frameworks.

Implementation (simplified Python)

import heapq

def merge_sorted_files(files):
    heap = []
    iters = [iter(f) for f in files]

    for i, it in enumerate(iters):
        try:
            value = next(it)
            heapq.heappush(heap, (value, i))
        except StopIteration:
            pass

    result = []

    while heap:
        value, i = heapq.heappop(heap)
        result.append(value)
        try:
            nxt = next(iters[i])
            heapq.heappush(heap, (nxt, i))
        except StopIteration:
            pass

    return result

Notes

External merge sort underlies many systems:

database sort operators
MapReduce shuffle phase
large-scale log processing

The key principle is simple: keep memory usage bounded, push bulk data to sequential disk operations, and merge efficiently.