SimHash Near Duplicate Search

Find near duplicate documents by comparing compact SimHash fingerprints under Hamming distance.

SimHash Near Duplicate Search

SimHash near duplicate search uses compact bit fingerprints to detect items with similar weighted features. It is commonly used for near duplicate documents, pages, records, and text fragments.

The main idea is to convert a high dimensional feature vector into a fixed width fingerprint, usually 64 bits. Similar inputs tend to produce fingerprints with small Hamming distance.

Problem

Given a collection of items and a query item $q$, find stored items whose SimHash fingerprints differ from $q$ by at most $t$ bits.

Formally, find all items $x$ such that:

$$ \operatorname{HammingDistance}(\operatorname{simhash}(q), \operatorname{simhash}(x)) \le t $$

Model

Each item is represented as weighted features:

$$ (f_1, w_1), (f_2, w_2), \dots, (f_n, w_n) $$

To compute a $b$ bit SimHash:

  1. Initialize an array $v$ of length $b$ with zeros.
  2. Hash each feature to a $b$ bit value.
  3. For each bit position, add the feature weight if the bit is 1, otherwise subtract the weight.
  4. Set output bit $j$ to 1 if $v[j] \ge 0$, otherwise 0.

Algorithm

simhash_search(index, query, t):
    fp = simhash(query)
    candidates = lookup_candidate_buckets(index, fp)

    result = empty list
    for each item in candidates:
        if hamming_distance(fp, item.fingerprint) <= t:
            append item to result

    return result

The candidate lookup step usually partitions the fingerprint into bands. If two fingerprints are within a small Hamming distance, they must share at least one band under suitable partitioning.

Example

Suppose fingerprints are 8 bits:

item fingerprint
query 11010110
doc1 11010100
doc2 01000111
doc3 11011110

Compare by Hamming distance:

item distance from query
doc1 1
doc2 3
doc3 1

For threshold $t = 1$, return doc1 and doc3.

Correctness

SimHash search has two layers.

The verification layer is exact for the fingerprints: every candidate whose Hamming distance is at most $t$ is accepted, and every candidate above $t$ is rejected.

The candidate generation layer may be approximate, depending on the indexing scheme. A full scan over all fingerprints is exact for the chosen SimHash threshold. A banded index is faster but must be configured carefully to avoid missing valid near duplicates.

Complexity

Let:

  • $b$ be the fingerprint width
  • $C$ be the number of candidates
  • $n$ be the number of stored items
operation time
full scan query $O(n)$
indexed query $O(C)$
Hamming check $O(1)$ for fixed width fingerprints

Space

$$ O(n) $$

Each item stores a fixed width fingerprint plus optional bucket index entries.

When to Use

SimHash near duplicate search is appropriate when:

  • items are represented by weighted features
  • near duplicate detection is needed
  • compact fingerprints are preferred
  • small differences should preserve similarity
  • approximate candidate generation is acceptable

It is common in web crawling, document deduplication, search indexing, spam detection, and content clustering.

Implementation

from collections import defaultdict

def feature_hash(feature):
    return hash(feature) & ((1 << 64) - 1)

def simhash(features):
    weights = [0] * 64

    for feature, weight in features:
        h = feature_hash(feature)
        for bit in range(64):
            if (h >> bit) & 1:
                weights[bit] += weight
            else:
                weights[bit] -= weight

    fp = 0
    for bit, value in enumerate(weights):
        if value >= 0:
            fp |= 1 << bit

    return fp

def hamming_distance(a, b):
    return (a ^ b).bit_count()

class SimHashIndex:
    def __init__(self, bands=4, band_bits=16):
        self.bands = bands
        self.band_bits = band_bits
        self.buckets = defaultdict(list)

    def _band_key(self, fp, band):
        mask = (1 << self.band_bits) - 1
        return (band, (fp >> (band * self.band_bits)) & mask)

    def add(self, item_id, features):
        fp = simhash(features)
        for band in range(self.bands):
            self.buckets[self._band_key(fp, band)].append((item_id, fp))

    def search(self, features, threshold):
        fp = simhash(features)
        candidates = {}

        for band in range(self.bands):
            key = self._band_key(fp, band)
            for item_id, other_fp in self.buckets.get(key, []):
                candidates[item_id] = other_fp

        result = []
        for item_id, other_fp in candidates.items():
            if hamming_distance(fp, other_fp) <= threshold:
                result.append(item_id)

        return result
type Feature struct {
	Name   string
	Weight int
}

type SimHashItem struct {
	ID string
	FP uint64
}

func hashFeature(s string) uint64 {
	var h uint64 = 1469598103934665603
	for i := 0; i < len(s); i++ {
		h ^= uint64(s[i])
		h *= 1099511628211
	}
	return h
}

func SimHash(features []Feature) uint64 {
	var weights [64]int

	for _, f := range features {
		h := hashFeature(f.Name)
		for bit := 0; bit < 64; bit++ {
			if (h>>uint(bit))&1 == 1 {
				weights[bit] += f.Weight
			} else {
				weights[bit] -= f.Weight
			}
		}
	}

	var fp uint64
	for bit := 0; bit < 64; bit++ {
		if weights[bit] >= 0 {
			fp |= uint64(1) << uint(bit)
		}
	}

	return fp
}

func HammingDistance(a, b uint64) int {
	x := a ^ b
	count := 0
	for x != 0 {
		x &= x - 1
		count++
	}
	return count
}