MinHash Similarity Search
MinHash similarity search estimates the similarity between sets and retrieves candidates that are likely to be similar to a query. It is widely used for near-duplicate detection when items are represented as sets, such as document shingles. The key idea is that MinHash signatures preserve Jaccard similarity in expectation: the fraction of positions at which two signatures agree is an unbiased estimate of the Jaccard similarity of the underlying sets.

Problem

Given a collection of sets and a query set $Q$, find sets $S$ such that:

$$ \operatorname{Jaccard}(Q, S)...