MinHash can be used to compress an unweighted set or binary vector and to estimate the unweighted Jaccard similarity. It is possible to adapt MinHash to weighted Jaccard on **multisets** by expanding each item (or dimension) by its weight (usually its count in the multiset). However, this approach does not support real-valued weights, and it can be very expensive when the weights are large. Weighted MinHash was created by Sergey Ioffe, and its performance does not depend on the weights, as long as the universe of all possible items (or dimensions, for vectors) is known. This makes it unsuitable for stream processing, where knowledge of unseen items cannot be assumed.
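The expansion idea described above can be illustrated in plain Python. This is only a sketch to show why expansion works and why it gets expensive; it is not how datasketch is implemented. The helper names `expand`, `jaccard`, and `weighted_jaccard` are made up for this example, and weighted Jaccard is taken to be the sum of element-wise minima divided by the sum of element-wise maxima.

```python
def expand(multiset):
    """Turn {item: count} into a plain set of (item, copy_index) pairs."""
    return {(item, i) for item, count in multiset.items() for i in range(count)}


def jaccard(a, b):
    """Unweighted Jaccard similarity of two sets."""
    union = a | b
    # Two empty sets are treated as identical (a convention chosen for this sketch).
    return len(a & b) / len(union) if union else 1.0


def weighted_jaccard(x, y):
    """Weighted Jaccard: sum of element-wise minima over sum of maxima."""
    keys = set(x) | set(y)
    num = sum(min(x.get(k, 0), y.get(k, 0)) for k in keys)
    den = sum(max(x.get(k, 0), y.get(k, 0)) for k in keys)
    return num / den if den else 1.0


# Two small multisets with integer counts (illustrative data).
a = {"apple": 3, "pear": 1}
b = {"apple": 1, "pear": 2, "plum": 1}

# Both lines print the same value: expansion reduces the weighted
# problem to the unweighted one.
print(jaccard(expand(a), expand(b)))
print(weighted_jaccard(a, b))
```

Note that the expanded set has one element per unit of weight, so its size grows with the counts, and there is no natural way to expand a weight such as 0.37.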
    from datasketch import WeightedMinHashGenerator

    # Using default sample_size 256 and seed 1
    wmg = WeightedMinHashGenerator(1000)
You can specify the number of samples (similar to the number of permutation functions in MinHash) and the random seed.
    wmg = WeightedMinHashGenerator(1000, sample_size=512, seed=12)
Here is a usage example.
    from datasketch import WeightedMinHashGenerator

    v1 = [1, 3, 4, 5, 6, 7, 8, 9, 10, 4]
    v2 = [2, 4, 3, 8, 4, 7, 10, 9, 0, 0]

    # WeightedMinHashGenerator requires dimension as the first argument
    wmg = WeightedMinHashGenerator(len(v1))
    wm1 = wmg.minhash(v1)  # wm1 is of the type WeightedMinHash
    wm2 = wmg.minhash(v2)
    print("Estimated Jaccard is", wm1.jaccard(wm2))
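To judge how close the estimate is, it can help to compute the exact weighted Jaccard similarity of the same two vectors. The sketch below uses only the standard library; `exact_weighted_jaccard` is a hypothetical helper written for this example, not part of datasketch, and it assumes non-negative weights.

```python
def exact_weighted_jaccard(x, y):
    """Exact weighted Jaccard: sum(min(x_i, y_i)) / sum(max(x_i, y_i))."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    num = sum(min(a, b) for a, b in zip(x, y))
    den = sum(max(a, b) for a, b in zip(x, y))
    if den == 0:
        raise ValueError("at least one vector must have a non-zero weight")
    return num / den


v1 = [1, 3, 4, 5, 6, 7, 8, 9, 10, 4]
v2 = [2, 4, 3, 8, 4, 7, 10, 9, 0, 0]

# Sum of minima is 40 and sum of maxima is 64, so this prints 0.625.
print("Exact weighted Jaccard is", exact_weighted_jaccard(v1, v2))
```

The estimate printed by `wm1.jaccard(wm2)` should land near this value, with the gap depending on the sample size.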
It is possible to make `datasketch.WeightedMinHash` have an `update` interface similar to `datasketch.MinHash` and use it for stream data processing. However, this makes the cost of `update` increase linearly with respect to the weight. Thus, `update` is not implemented for `datasketch.WeightedMinHash`.
Weighted MinHash has similar accuracy and performance profiles to MinHash. As you increase the number of samples, you get better accuracy, at the expense of slower speed.
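As a rough guide to this trade-off, the estimate can be modelled as the fraction of matching samples, where each sample is assumed to match independently with probability equal to the true similarity. Under that assumption (a simplification, not a measured property of datasketch), the standard error shrinks with the square root of the sample size. The function name `standard_error` is made up for this sketch.

```python
import math


def standard_error(jaccard, sample_size):
    """Standard error of the mean of `sample_size` independent
    Bernoulli(jaccard) trials: sqrt(J * (1 - J) / k)."""
    return math.sqrt(jaccard * (1.0 - jaccard) / sample_size)


# Quadrupling the sample size halves the expected error.
for k in (64, 256, 1024):
    print(k, standard_error(0.5, k))
```

In this model, halving the error costs roughly four times as many samples, and therefore roughly four times the hashing work per vector.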
    import numpy as np
    from datasketch import WeightedMinHashGenerator
    from datasketch import MinHashLSH

    # Three random 10-dimensional weight vectors
    v1 = np.random.uniform(1, 10, 10)
    v2 = np.random.uniform(1, 10, 10)
    v3 = np.random.uniform(1, 10, 10)
    mg = WeightedMinHashGenerator(10, sample_size=5)
    m1 = mg.minhash(v1)
    m2 = mg.minhash(v2)
    m3 = mg.minhash(v3)

    # Create weighted MinHash LSH index
    # (num_perm must match the generator's sample_size)
    lsh = MinHashLSH(threshold=0.1, num_perm=5)
    lsh.insert("m2", m2)
    lsh.insert("m3", m3)
    result = lsh.query(m1)
    print("Approximate neighbours with weighted Jaccard similarity > 0.1", result)