.. _weighted_minhash: Weighted MinHash ================ MinHash can be used to compress unweighted `set `__ or binary vector, and estimate the `unweighted Jaccard similarity `__. It is possible to modify MinHash for `weighted Jaccard `__ on `**multisets** `__ by expanding each item (or dimension) by its weight (usually its count in the multiset). However this approach does not support real number weights, and doing so can be very expensive if the weights are very large. `Weighted MinHash `__ is created by Sergey Ioffe, and its performance does not depend on the weights - as long as the universe of all possible items (or dimension for vectors) is known. This makes it unsuitable for stream processing, when the knowledge of unseen items cannot be assumed. In this library, :class:`datasketch.WeightedMinHash` objects can only be created from vectors using :class:`datasketch.WeightedMinHashGenerator`, which takes the dimension as a required parameter. .. code:: python # Using default sample_size 256 and seed 1 wmg = WeightedMinHashGenerator(1000) You can specify the number of samples (similar to number of permutation functions in MinHash) and the random seed. .. code:: python wmg = WeightedMinHashGenerator(1000, sample_size=512, seed=12) Here is a usage example. .. code:: python from datasketch import WeightedMinHashGenerator v1 = [1, 3, 4, 5, 6, 7, 8, 9, 10, 4] v2 = [2, 4, 3, 8, 4, 7, 10, 9, 0, 0] # WeightedMinHashGenerator requires dimension as the first argument wmg = WeightedMinHashGenerator(len(v1)) wm1 = wmg.minhash(v1) # wm1 is of the type WeightedMinHash wm2 = wmg.minhash(v2) print("Estimated Jaccard is", wm1.jaccard(wm2)) It is possible to make :class:`datasketch.WeightedMinHash` have a ``update`` interface similar to :class:`MinHash` and use it for stream data processing. However, this makes the cost of ``update`` increase linearly with respect to the weight. Thus, ``update`` is not implemented for :class:`datasketch.WeightedMinHash` in this library. Weighted MinHash as similar accuracy and performance profiles as MinHash. As you increase the number of samples, you get better accuracy, at the expense of slower speed. .. figure:: /_static/weighted_minhash_benchmark.png :alt: Weighted MinHash Benchmark :class:`datasketch.MinHashLSH` and :class:`datasketch.MinHashLSHForest` can also be used to index :class:`datasketch.WeightedMinHash`. .. code:: python import numpy as np from datasketch import WeightedMinHashGenerator from datasketch import MinHashLSH v1 = np.random.uniform(1, 10, 10) v2 = np.random.uniform(1, 10, 10) v3 = np.random.uniform(1, 10, 10) mg = WeightedMinHashGenerator(10, 5) m1 = mg.minhash(v1) m2 = mg.minhash(v2) m3 = mg.minhash(v3) # Create weighted MinHash LSH index lsh = MinHashLSH(threshold=0.1, sample_size=5) lsh.insert("m2", m2) lsh.insert("m3", m3) result = lsh.query(m1) print("Approximate neighbours with weighted Jaccard similarity > 0.1", result)