# Weighted MinHash

MinHash can be used to compress unweighted set or binary vector, and estimate the unweighted Jaccard similarity. It is possible to modify MinHash for weighted Jaccard on **multisets** by expanding each item (or dimension) by its weight (usually its count in the multiset). However this approach does not support real number weights, and doing so can be very expensive if the weights are very large. Weighted MinHash is created by Sergey Ioffe, and its performance does not depend on the weights - as long as the universe of all possible items (or dimension for vectors) is known. This makes it unsuitable for stream processing, when the knowledge of unseen items cannot be assumed.

In this library, `datasketch.WeightedMinHash`

objects can only be created from
vectors using `datasketch.WeightedMinHashGenerator`

, which takes the dimension as
a required parameter.

```
# Using default sample_size 256 and seed 1
wmg = WeightedMinHashGenerator(1000)
```

You can specify the number of samples (similar to number of permutation functions in MinHash) and the random seed.

```
wmg = WeightedMinHashGenerator(1000, sample_size=512, seed=12)
```

Here is a usage example.

```
from datasketch import WeightedMinHashGenerator
v1 = [1, 3, 4, 5, 6, 7, 8, 9, 10, 4]
v2 = [2, 4, 3, 8, 4, 7, 10, 9, 0, 0]
# WeightedMinHashGenerator requires dimension as the first argument
wmg = WeightedMinHashGenerator(len(v1))
wm1 = wmg.minhash(v1) # wm1 is of the type WeightedMinHash
wm2 = wmg.minhash(v2)
print("Estimated Jaccard is", wm1.jaccard(wm2))
```

It is possible to make `datasketch.WeightedMinHash`

have a `update`

interface
similar to `MinHash`

and use it for stream data processing. However,
this makes the cost of `update`

increase linearly with respect to the
weight. Thus, `update`

is not implemented for `datasketch.WeightedMinHash`

in
this library.

Weighted MinHash as similar accuracy and performance profiles as MinHash. As you increase the number of samples, you get better accuracy, at the expense of slower speed.

`datasketch.MinHashLSH`

and
`datasketch.MinHashLSHForest`

can also be used to index `datasketch.WeightedMinHash`

.

```
import numpy as np
from datasketch import WeightedMinHashGenerator
from datasketch import MinHashLSH
v1 = np.random.uniform(1, 10, 10)
v2 = np.random.uniform(1, 10, 10)
v3 = np.random.uniform(1, 10, 10)
mg = WeightedMinHashGenerator(10, 5)
m1 = mg.minhash(v1)
m2 = mg.minhash(v2)
m3 = mg.minhash(v3)
# Create weighted MinHash LSH index
lsh = MinHashLSH(threshold=0.1, sample_size=5)
lsh.insert("m2", m2)
lsh.insert("m3", m3)
result = lsh.query(m1)
print("Approximate neighbours with weighted Jaccard similarity > 0.1", result)
```