External Merge Sort
Sort data that does not fit in memory by dividing it into sorted runs and merging them using external storage.
External Merge Sort
External merge sort is a disk-based sorting algorithm designed for datasets that exceed main memory. It minimizes random I/O and relies on sequential reads and writes, which are efficient on disks and object storage.
You use it when the input size is much larger than RAM, for example multi-gigabyte or terabyte scale data.
Model
Assume:
- Input size: $N$ elements
- Memory capacity: $M$ elements
- Block size: $B$ elements per I/O
The goal is to sort using minimal disk passes.
High-Level Idea
The algorithm has two phases:
-
Run generation Split data into chunks that fit in memory, sort each chunk, and write sorted runs to disk.
-
Merge phase Merge multiple sorted runs into one final sorted sequence.
Algorithm
Phase 1: Run Generation
Read chunks of size $M$, sort in memory, and write back.
generate_runs(file):
runs = []
while not end_of_file:
chunk = read_next_M_elements(file)
sort(chunk)
run_file = write_to_disk(chunk)
runs.append(run_file)
return runs
Phase 2: K-way Merge
Merge multiple runs simultaneously using a min-heap.
merge_runs(runs):
create min_heap
open all run files
for each run:
read first element and push (value, run_id) into heap
while heap not empty:
(value, r) = extract_min(heap)
output value
if run r has more elements:
read next element from r
push into heap
Example
Suppose:
- Memory holds 3 elements
- Input: [8, 3, 5, 1, 9, 2, 7]
Step 1: Runs
Chunks:
- [8, 3, 5] → [3, 5, 8]
- [1, 9, 2] → [1, 2, 9]
- [7] → [7]
Step 2: Merge
Merge:
[3,5,8], [1,2,9], [7]
Result:
[1,2,3,5,7,8,9]
Complexity
I/O Complexity
Number of passes:
$$ O\left(\log_k \frac{N}{M}\right) $$
where $k$ is number of runs merged at once.
Total I/O:
$$ O\left(\frac{N}{B} \log_k \frac{N}{M}\right) $$
CPU Complexity
Each element participates in heap operations:
$$ O(N \log k) $$
Design Choices
| parameter | effect |
|---|---|
| run size $M$ | larger runs reduce merge passes |
| merge degree $k$ | larger $k$ reduces passes but increases heap cost |
| block size $B$ | larger blocks reduce I/O overhead |
Optimizations
-
Replacement selection Produces runs larger than memory, often about $2M$.
-
Buffered I/O Use read/write buffers to reduce system calls.
-
Multiway merge Merge many runs at once instead of pairwise merging.
-
Loser tree Replace heap with tournament tree for faster merging.
When to Use
External merge sort is appropriate when:
- data does not fit in RAM
- storage is disk, SSD, or object store
- sequential I/O is much cheaper than random access
It is the standard sorting method in databases, large-scale data processing systems, and distributed frameworks.
Implementation (simplified Python)
import heapq
def merge_sorted_files(files):
heap = []
iters = [iter(f) for f in files]
for i, it in enumerate(iters):
try:
value = next(it)
heapq.heappush(heap, (value, i))
except StopIteration:
pass
result = []
while heap:
value, i = heapq.heappop(heap)
result.append(value)
try:
nxt = next(iters[i])
heapq.heappush(heap, (nxt, i))
except StopIteration:
pass
return result
Notes
External merge sort underlies many systems:
- database sort operators
- MapReduce shuffle phase
- large-scale log processing
The key principle is simple: keep memory usage bounded, push bulk data to sequential disk operations, and merge efficiently.