Hash Join Lookup

Hash join lookup is the probe phase of a hash join. One relation is first scanned and stored in a hash table by join key. The other relation is then scanned, and each row probes the hash table to find matching rows.

You use it when joining two datasets and the join key can be hashed efficiently.

Problem

Given two relations $R$ and $S$, find all pairs of rows whose join keys are equal:

$$ r.key = s.key $$

A hash join lookup assumes that one relation, usually the smaller one, has already been indexed by key.

Model

Build a hash table $H$ from relation $R$:

$$ H[r.key] \gets H[r.key] \cup {r} $$

Then, for each row $s \in S$, lookup:

$$ H[s.key] $$

Every row in that bucket is a candidate join match.

Algorithm

hash_join_lookup(H, s):
    bucket = H[s.key]

    for each row r in bucket:
        if r.key == s.key:
            emit join_pair(r, s)

The explicit equality check is still needed because different keys may collide into the same hash bucket.

Example

Build side $R$:

id	name
1	Ada
2	Linus
3	Grace

Probe side $S$:

id	order
2	Book
3	Pen
4	Bag

Hash table from $R$:

key	rows
1	Ada
2	Linus
3	Grace

Probe each row from $S$:

probe key	result
2	Linus joins with Book
3	Grace joins with Pen
4	no match

The output contains the matching pairs for keys 2 and 3.

Correctness

The build phase stores every row $r \in R$ in the bucket for its join key. During lookup, a probe row $s$ checks the bucket for $s.key$.

If a row $r$ satisfies $r.key = s.key$, then it is stored in exactly the bucket inspected by $s$. Therefore the lookup will find it. If a row in the bucket has a different key due to a hash collision, the equality check rejects it.

Complexity

Let $|R|$ be the build size and $|S|$ be the probe size.

phase	expected time
build	$O(	R	)$
probe	$O(	S	+ z)$
total	$O(	R	+	S	+ z)$

Here $z$ is the number of emitted join pairs.

In the worst case, poor hashing can put many rows in the same bucket, causing quadratic behavior.

Space

$$ O(|R|) $$

The hash table stores the build relation, usually the smaller input.

When to Use

Hash join lookup is appropriate when:

the join condition is equality
the build side fits in memory
input data is unsorted
many key lookups are required
output order is not important

For sorted inputs, sort merge join may be preferable. For indexed tables, index nested loop join may be better.

Implementation

from collections import defaultdict

def build_hash_table(rows, key):
    table = defaultdict(list)
    for row in rows:
        table[row[key]].append(row)
    return table

def hash_join(left, right, key):
    table = build_hash_table(left, key)
    result = []

    for r in right:
        for l in table.get(r[key], []):
            if l[key] == r[key]:
                result.append((l, r))

    return result

type Row map[string]string

func BuildHashTable(rows []Row, key string) map[string][]Row {
	table := make(map[string][]Row)

	for _, row := range rows {
		k := row[key]
		table[k] = append(table[k], row)
	}

	return table
}

func HashJoin(left []Row, right []Row, key string) [][2]Row {
	table := BuildHashTable(left, key)
	out := make([][2]Row, 0)

	for _, r := range right {
		k := r[key]
		for _, l := range table[k] {
			if l[key] == r[key] {
				out = append(out, [2]Row{l, r})
			}
		}
	}

	return out
}