#simhash

SimHash Near Duplicate Search

SimHash near-duplicate search uses compact bit fingerprints to detect items with similar weighted features. It is commonly used to find near-duplicate documents, pages, records, and text fragments. The main idea is to convert a high-dimensional feature vector into a fixed-width fingerprint, usually 64 bits. Similar inputs tend to produce fingerprints with a small Hamming distance.

Problem

Given a collection of items and a query...
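
The fingerprint idea from the overview can be illustrated with a minimal Python sketch. It assumes features arrive as a dictionary mapping a feature string to a numeric weight, and it uses BLAKE2b from the standard library as the per-feature hash; both choices, and the names `simhash` and `hamming_distance`, are illustrative assumptions rather than anything the page prescribes. Each feature votes on every bit position with its weight, and the sign of the total decides the fingerprint bit.

```python
import hashlib


def simhash(features, bits=64):
    """Build a SimHash fingerprint from {feature: weight}.

    Assumption: `bits` is a multiple of 8 and at most 512, so that
    BLAKE2b can supply exactly `bits` hash bits per feature.
    """
    totals = [0.0] * bits
    for feature, weight in features.items():
        digest = hashlib.blake2b(
            feature.encode("utf-8"), digest_size=bits // 8
        ).digest()
        h = int.from_bytes(digest, "big")
        for i in range(bits):
            # A set hash bit votes +weight, a clear bit votes -weight.
            if (h >> i) & 1:
                totals[i] += weight
            else:
                totals[i] -= weight

    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    doc_a = {"simhash": 3.0, "near": 1.0, "duplicate": 2.0, "search": 1.0}
    doc_b = {"simhash": 3.0, "near": 1.0, "duplicate": 2.0, "lookup": 1.0}
    print(hamming_distance(simhash(doc_a), simhash(doc_b)))
```

In this sketch, two items would be treated as near-duplicate candidates when the Hamming distance between their fingerprints falls below a chosen threshold; the threshold itself is application-specific and is not fixed here.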