Two Phase Multiway Merge Sort
External sorting algorithm that performs run generation and a single multiway merge using limited memory.
Two Phase Multiway Merge Sort
Two Phase Multiway Merge Sort (TPMMS) is a practical external sorting algorithm used in database systems. It assumes that all sorted runs can be merged in a single pass using available memory buffers.
The method minimizes the number of disk passes, which dominates cost in external memory settings.
Model
Let:
- $N$ = total number of elements
- $M$ = memory capacity
- $B$ = block size
Memory is divided into buffers:
- 1 output buffer
- $k$ input buffers
- Total buffers $\leq M / B$
High-Level Structure
TPMMS consists of exactly two phases:
- Run generation
- Single multiway merge
The key constraint:
$$ \frac{N}{M} \leq k $$
so all runs can be merged in one pass.
Phase 1: Run Generation
Split input into chunks that fit into memory, sort each chunk, and write back as runs.
generate_runs(file):
runs = []
while not end_of_file:
chunk = read_next_M_elements(file)
sort(chunk)
run_file = write_to_disk(chunk)
runs.append(run_file)
return runs
Number of runs:
$$ R = \left\lceil \frac{N}{M} \right\rceil $$
Phase 2: Multiway Merge
Merge all runs simultaneously using $k$ input buffers and one output buffer.
multiway_merge(runs):
initialize k input buffers
initialize output buffer
build min-heap with first element from each run
while heap not empty:
(value, r) = extract_min(heap)
append value to output buffer
if output buffer full:
flush to disk
if run r has more data:
read next element from buffer
push into heap
Example
Suppose:
- Memory holds 3 elements
- Input: [8, 3, 5, 1, 9, 2, 7]
Phase 1
Runs:
- [3, 5, 8]
- [1, 2, 9]
- [7]
Phase 2
Merge all runs in one pass:
Result:
[1, 2, 3, 5, 7, 8, 9]
Complexity
I/O Cost
Phase 1 reads and writes all data once:
$$ 2 \cdot \frac{N}{B} $$
Phase 2 reads and writes once:
$$ 2 \cdot \frac{N}{B} $$
Total:
$$ O\left(\frac{N}{B}\right) $$
with two full passes.
CPU Cost
Heap operations:
$$ O(N \log R) $$
where $R$ is number of runs.
Constraint for Two-Phase Property
To ensure only one merge pass:
$$ R \leq k \quad \Rightarrow \quad \frac{N}{M} \leq \frac{M}{B} $$
which gives:
$$ N \leq \frac{M^2}{B} $$
This is the classic TPMMS condition.
Design Notes
| aspect | implication |
|---|---|
| fewer runs | larger memory improves performance |
| larger merge fan-in | fewer passes |
| buffer size | affects I/O efficiency |
When to Use
TPMMS is appropriate when:
- dataset satisfies $N \leq M^2 / B$
- you want exactly two passes over data
- you build database operators like ORDER BY or sort-merge join
If the condition fails, additional merge passes are required.
Implementation Sketch
import heapq
def multiway_merge(runs):
heap = []
iters = [iter(r) for r in runs]
for i, it in enumerate(iters):
try:
heapq.heappush(heap, (next(it), i))
except StopIteration:
pass
result = []
while heap:
val, i = heapq.heappop(heap)
result.append(val)
try:
heapq.heappush(heap, (next(iters[i]), i))
except StopIteration:
pass
return result
Notes
TPMMS is the standard model for:
- relational database sorting
- external GROUP BY and ORDER BY
- sort-merge join preparation
Its importance comes from the guarantee of minimal passes when memory is sufficient.