API Documentation
MinHash
- class datasketch.MinHash(num_perm: int = 128, seed: int = 1, hashfunc: Callable = sha1_hash32, hashobj: object | None = None, hashvalues: Iterable | None = None, permutations: Tuple[Iterable, Iterable] | None = None)[source]
MinHash is a probabilistic data structure for computing Jaccard similarity between sets.
- Parameters:
num_perm (int) – Number of random permutation functions. It will be ignored if hashvalues is not None.
seed (int) – The random seed controls the set of random permutation functions generated for this MinHash.
hashfunc (Callable) – The hash function used by this MinHash. It takes the input passed to the update() method and returns an integer that can be encoded with 32 bits. The default hash function is based on SHA1 from hashlib. Users can use farmhash for better performance. See the example in update().
hashobj (deprecated) – This argument is deprecated since version 1.4.0. It is a no-op and has been replaced by hashfunc.
hashvalues (Optional[Iterable]) – The hash values are the internal state of the MinHash. They can be specified for faster initialization using the existing hashvalues of another MinHash.
permutations (Optional[Tuple[Iterable, Iterable]]) – The permutation function parameters as a tuple of two lists. This argument can be specified for faster initialization using the existing permutations of another MinHash.
Note
To save memory, consider using datasketch.LeanMinHash.
Note
Since version 1.1.1, MinHash only supports serialization using pickle. The serialize and deserialize methods are removed, and are supported in datasketch.LeanMinHash instead. MinHash objects serialized before version 1.1.1 cannot be deserialized properly in newer versions (need to migrate?).
Note
Since version 1.1.3, MinHash uses Numpy's random number generator instead of Python's built-in random package. This change makes the hash values consistent across different Python versions. The side-effect is that a MinHash created before version 1.1.3 won't work (i.e., jaccard(), merge(), and union()) with one created after.
- __init__(num_perm: int = 128, seed: int = 1, hashfunc: Callable = sha1_hash32, hashobj: object | None = None, hashvalues: Iterable | None = None, permutations: Tuple[Iterable, Iterable] | None = None) None [source]
- update(b) None [source]
Update this MinHash with a new value. The value will be hashed using the hash function specified by the hashfunc argument in the constructor.
- Parameters:
b – The value to be hashed using the hash function specified.
Example
To update with a new string value (using the default SHA1 hash function, which requires bytes as input):
minhash = MinHash()
minhash.update("new value".encode('utf-8'))
We can also use a different hash function, for example, pyfarmhash:
import farmhash

def _hash_32(b):
    return farmhash.hash32(b)

minhash = MinHash(hashfunc=_hash_32)
minhash.update("new value")
- update_batch(b: Iterable) None [source]
Update this MinHash with new values. The values will be hashed using the hash function specified by the hashfunc argument in the constructor.
- Parameters:
b (Iterable) – Values to be hashed using the hash function specified.
Example
To update with new string values (using the default SHA1 hash function, which requires bytes as input):
minhash = MinHash()
minhash.update_batch([s.encode('utf-8') for s in ["token1", "token2"]])
- jaccard(other: MinHash) float [source]
Estimate the Jaccard similarity (resemblance) between the sets represented by this MinHash and the other.
- Parameters:
other (MinHash) – The other MinHash.
- Returns:
The Jaccard similarity, which is between 0.0 and 1.0.
- Return type:
float
- Raises:
ValueError – If the two MinHashes have different numbers of permutation functions or different seeds.
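Example
A minimal sketch (with illustrative values not taken from the library's documentation) of estimating the similarity of two small sets:

from datasketch import MinHash

m1, m2 = MinHash(num_perm=128), MinHash(num_perm=128)
for d in [b"apple", b"banana", b"cherry"]:
    m1.update(d)
for d in [b"apple", b"banana", b"durian"]:
    m2.update(d)
# Estimate of the true Jaccard similarity, here 2/4 = 0.5.
print(m1.jaccard(m2))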
- count() float [source]
Estimate the cardinality count based on the technique described in this paper.
- Returns:
The estimated cardinality of the set represented by this MinHash.
- Return type:
float
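Example
A minimal sketch (illustrative values) of cardinality estimation:

from datasketch import MinHash

m = MinHash(num_perm=128)
for i in range(1000):
    m.update(str(i).encode('utf-8'))
# Approximate number of distinct values seen; expect a value close to 1000.
print(m.count())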
- merge(other: MinHash) None [source]
Merge the other MinHash with this one, making this one the union of both.
- Parameters:
other (MinHash) – The other MinHash.
- Raises:
ValueError – If the two MinHashes have different numbers of permutation functions or different seeds.
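Example
A minimal sketch of an in-place union via merge():

from datasketch import MinHash

m1, m2 = MinHash(num_perm=128), MinHash(num_perm=128)
m1.update(b"token1")
m2.update(b"token2")
# m1 now represents the union of both sets; m2 is unchanged.
m1.merge(m2)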
- digest() ndarray [source]
Export the hash values, which are the internal state of the MinHash.
- Returns:
The hash values as a Numpy array.
- Return type:
numpy.ndarray
- is_empty() bool [source]
- Returns:
True if the current MinHash is empty, i.e., in the state of just being initialized.
- Return type:
bool
- __eq__(other: MinHash) bool [source]
- Returns:
True if the seeds and hash values of the two MinHash are both equal.
- Return type:
bool
- classmethod union(*mhs: MinHash) MinHash [source]
Create a MinHash which is the union of the MinHash objects passed as arguments.
- Parameters:
*mhs (MinHash) – The MinHash objects to be united. The argument list length is variable, but must be at least 2.
- Returns:
A new union MinHash.
- Return type:
MinHash
- Raises:
ValueError – If the number of MinHash objects passed as arguments is less than 2, or if the MinHash objects passed as arguments have different seeds or different numbers of permutation functions.
Example
from datasketch import MinHash
import numpy as np

m1 = MinHash(num_perm=128)
m1.update_batch(np.random.randint(low=0, high=30, size=10))
m2 = MinHash(num_perm=128)
m2.update_batch(np.random.randint(low=0, high=30, size=10))

# Union m1 and m2.
m = MinHash.union(m1, m2)
- __hash__ = None
- __weakref__
list of weak references to the object
- classmethod bulk(b: Iterable, **minhash_kwargs) List[MinHash] [source]
Compute MinHashes in bulk. This method avoids unnecessary overhead when initializing many minhashes by reusing the initialized state.
- Parameters:
b (Iterable) – An Iterable of lists of bytes; each list is hashed into one MinHash in the output.
**minhash_kwargs – Keyword arguments used to initialize MinHash; they will be used for all minhashes.
- Returns:
A list of computed MinHashes.
- Return type:
List[datasketch.MinHash]
Example
from datasketch import MinHash

data = [[b'token1', b'token2', b'token3'],
        [b'token4', b'token5', b'token6']]
minhashes = MinHash.bulk(data, num_perm=64)
- classmethod generator(b: Iterable, **minhash_kwargs) Generator[MinHash, None, None] [source]
Compute MinHashes in a generator. This method avoids unnecessary overhead when initializing many minhashes by reusing the initialized state.
- Parameters:
b (Iterable) – An Iterable of lists of bytes; each list is hashed into one MinHash in the output.
**minhash_kwargs – Keyword arguments used to initialize MinHash; they will be used for all minhashes.
- Returns:
a generator of computed MinHashes.
- Return type:
Generator[MinHash, None, None]
Example
from datasketch import MinHash

data = [[b'token1', b'token2', b'token3'],
        [b'token4', b'token5', b'token6']]
for minhash in MinHash.generator(data, num_perm=64):
    # do something useful
    minhash
Lean MinHash
- class datasketch.LeanMinHash(minhash: MinHash = None, seed: int = None, hashvalues: Iterable = None)[source]
Lean MinHash is MinHash with a smaller memory footprint and faster deserialization, but with its internal state frozen – no update().
Lean MinHash inherits all methods from datasketch.MinHash. It does not store the permutations and the hashfunc needed for updating. If a MinHash does not need further updates, convert it into a lean MinHash to save memory.
Example
To create a lean MinHash from an existing MinHash:
lean_minhash = LeanMinHash(minhash)

# You can compute the Jaccard similarity between two lean MinHash
lean_minhash.jaccard(lean_minhash2)

# Or between a lean MinHash and a MinHash
lean_minhash.jaccard(minhash2)
To create a lean MinHash from the hash values and seed of an existing MinHash:
lean_minhash = LeanMinHash(seed=minhash.seed, hashvalues=minhash.hashvalues)
To create a MinHash from a lean MinHash:
minhash = MinHash(seed=lean_minhash.seed, hashvalues=lean_minhash.hashvalues)

# Or, if you want to prevent further updates on minhash
# from affecting the state of lean_minhash:
minhash = MinHash(seed=lean_minhash.seed, hashvalues=lean_minhash.digest())
Note
Lean MinHash can also be used in datasketch.MinHashLSH, datasketch.MinHashLSHForest, and datasketch.MinHashLSHEnsemble.
- Parameters:
minhash (optional) – The datasketch.MinHash object used to initialize the LeanMinHash. If this is not set, then seed and hashvalues must be set.
seed (optional) – The random seed that controls the set of random permutation functions generated for this LeanMinHash. This parameter must be used together with hashvalues.
hashvalues (optional) – The hash values used to initialize the state of the LeanMinHash. This parameter must be used together with seed.
- update(b) None [source]
This method is not available on a LeanMinHash. Calling it raises a TypeError.
- copy() LeanMinHash [source]
- Returns:
A copy of this lean MinHash, created by exporting its state.
- Return type:
LeanMinHash
- bytesize(byteorder='@') int [source]
Compute the byte size after serialization.
- Parameters:
byteorder (str, optional) – The byte order of the serialized data. Use one of the byte order characters: @, =, <, >, and !. Default is @, the native order.
- Returns:
Size in number of bytes after serialization.
- Return type:
int
- serialize(buf, byteorder='@') None [source]
Serialize this lean MinHash and store the result in an allocated buffer.
- Parameters:
buf (buffer) – buf must implement the buffer interface. One such example is the built-in bytearray class.
byteorder (str, optional) – The byte order of the serialized data. Use one of the byte order characters: @, =, <, >, and !. Default is @, the native order.
This is preferred over using pickle if the serialized lean MinHash needs to be used by another program in a different programming language.
- The serialization schema:
The first 8 bytes are the seed integer
The next 4 bytes are the number of hash values
The rest are the serialized hash values, each using 4 bytes
Example
To serialize a single lean MinHash into a bytearray buffer.
buf = bytearray(lean_minhash.bytesize())
lean_minhash.serialize(buf)
To serialize multiple lean MinHash into a bytearray buffer.
# Assuming lean_minhashs is a list of LeanMinHash of the same size.
size = lean_minhashs[0].bytesize()
buf = bytearray(size * len(lean_minhashs))
# Use a memoryview so each slice writes into buf rather than into a copy.
view = memoryview(buf)
for i, lean_minhash in enumerate(lean_minhashs):
    lean_minhash.serialize(view[i*size:(i+1)*size])
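Since the schema is fixed, a serialized lean MinHash can also be decoded without datasketch. A minimal sketch, assuming native byte order and that the schema above maps to the struct formats q (8-byte seed), i (4-byte count), and I (4-byte hash values):

import struct

# buf holds a serialized lean MinHash in native byte order ('@').
seed, num = struct.unpack_from('@qi', buf, 0)
hashvalues = struct.unpack_from('@%dI' % num, buf, struct.calcsize('@qi'))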
- classmethod deserialize(buf, byteorder='@') LeanMinHash [source]
Deserialize a lean MinHash from a buffer.
- Parameters:
buf (buffer) – buf must implement the buffer interface. One such example is the built-in bytearray class.
byteorder (str, optional) – The byte order of the serialized data. Use one of the byte order characters: @, =, <, >, and !. Default is @, the native order.
- Returns:
The deserialized lean MinHash.
- Return type:
LeanMinHash
Example
To deserialize a lean MinHash from a buffer.
lean_minhash = LeanMinHash.deserialize(buf)
- classmethod union(*lmhs: LeanMinHash) LeanMinHash [source]
Create a new lean MinHash by taking the union of multiple lean MinHash objects.
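Example
A minimal sketch, assuming lean_m1 and lean_m2 are existing LeanMinHash objects with the same seed and number of hash values:

lean_m = LeanMinHash.union(lean_m1, lean_m2)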
Weighted MinHash
- class datasketch.WeightedMinHashGenerator(dim: int, sample_size: int = 128, seed: int = 1)[source]
The weighted MinHash generator is used for creating new WeightedMinHash objects.
This weighted MinHash implementation is based on Sergey Ioffe's paper, Improved Consistent Sampling, Weighted Minhash and L1 Sketching.
- Parameters:
dim (int) – The number of dimensions of the input Jaccard vectors.
sample_size (int) – The number of samples to use for creating weighted MinHash.
seed (int) – The random seed to use for generating permutation functions.
- minhash(v: ndarray) WeightedMinHash [source]
Create a new weighted MinHash given a weighted Jaccard vector. Each dimension is an integer frequency of the corresponding element in the multi-set represented by the vector.
- Parameters:
v (numpy.ndarray) – The Jaccard vector.
- Returns:
The weighted MinHash.
- Return type:
WeightedMinHash
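Example
A minimal sketch (illustrative values) of generating and comparing weighted MinHash:

from datasketch import WeightedMinHashGenerator
import numpy as np

mg = WeightedMinHashGenerator(dim=10, sample_size=128)
v1 = np.random.randint(1, 10, size=10)
v2 = np.random.randint(1, 10, size=10)
wm1 = mg.minhash(v1)
wm2 = mg.minhash(v2)
# Estimate the weighted Jaccard similarity of the two multi-sets.
print(wm1.jaccard(wm2))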
- minhash_many(X: spmatrix | ndarray) List[WeightedMinHash | None] [source]
Create new WeightedMinHash instances given a matrix of weighted Jaccard vectors. In the input matrix X, each row corresponds to a multi-set, and each column stores the integer frequency of the element of a dimension.
Note
This method is experimental and does not yield the same MinHash hash values as minhash().
- Parameters:
X (Union[scipy.sparse.spmatrix, numpy.ndarray]) – A matrix of Jaccard vectors (rows).
- Returns:
A list of length X.shape[0]. Each element is either a WeightedMinHash instance or None (if the corresponding row in X is empty).
- Return type:
List[Union[WeightedMinHash, None]]
- __weakref__
list of weak references to the object
- class datasketch.WeightedMinHash(seed: int, hashvalues: ndarray)[source]
A new weighted MinHash is generated by WeightedMinHashGenerator. You can also initialize a weighted MinHash using the state of an existing one.
- Parameters:
seed (int) – The random seed used to generate this weighted MinHash.
hashvalues (numpy.ndarray) – The internal state of this weighted MinHash.
- jaccard(other: WeightedMinHash) float [source]
Estimate the weighted Jaccard similarity between the multi-sets represented by this weighted MinHash and the other.
- Parameters:
other (WeightedMinHash) – The other weighted MinHash.
- Returns:
The weighted Jaccard similarity between 0.0 and 1.0.
- Return type:
float
- Raises:
ValueError – If the two weighted MinHash objects have different seeds or different numbers of hash values.
- digest() ndarray [source]
Export the hash values, which are the internal state of the weighted MinHash.
- Returns:
The hash values as a Numpy array.
- Return type:
numpy.ndarray
- copy() WeightedMinHash [source]
- Returns:
A copy of this weighted MinHash by exporting its state.
- Return type:
WeightedMinHash
- __eq__(other) bool [source]
- Returns:
True if the seeds and hash values of the two weighted MinHash are both equal.
- Return type:
bool
- __hash__ = None
- __weakref__
list of weak references to the object
MinHash LSH
- class datasketch.MinHashLSH(threshold: float = 0.9, num_perm: int = 128, weights: Tuple[float, float] = (0.5, 0.5), params: Tuple[int, int] | None = None, storage_config: Dict | None = None, prepickle: bool | None = None, hashfunc: Callable[[bytes], bytes] | None = None)[source]
The MinHash LSH index. It supports queries with a Jaccard similarity threshold. Reference: Chapter 3, Mining of Massive Datasets.
- Parameters:
threshold (float) – The Jaccard similarity threshold between 0.0 and 1.0. The initialized MinHash LSH will be optimized for the threshold by minimizing the false positive and false negative.
num_perm (int) – The number of permutation functions used by the MinHash to be indexed. For weighted MinHash, this is the sample size (sample_size).
weights (Tuple[float, float]) – Used to adjust the relative importance of minimizing false positives and false negatives when optimizing for the Jaccard similarity threshold. weights is a tuple in the format of (false_positive_weight, false_negative_weight).
params (Optional[Tuple[int, int]]) – The LSH parameters (i.e., the number of bands and the size of each band). This is used to bypass the parameter optimization step in the constructor. threshold and weights will be ignored if this is given.
storage_config (Optional[Dict]) – Type of storage service to use for storing hashtables and keys. basename is an optional property whose value will be used as the prefix to stored keys. If this is not set, a random string will be generated instead. If you set this, you will be responsible for ensuring there are no key collisions.
prepickle (Optional[bool]) – If True, all keys are pickled to bytes before insertion. If not specified, a default value is chosen based on the storage_config.
hashfunc (Optional[Callable[[bytes], bytes]]) – If a hash function is provided it will be used to compress the index keys to reduce the memory footprint. This could cause a higher false positive rate.
Note
weights must sum to 1.0, and the format is (false positive weight, false negative weight). For example, if minimizing false negative (or maintaining high recall) is more important, assign more weight toward false negative: weights=(0.4, 0.6). Try to live with a small difference between weights (i.e. < 0.5).
Examples
Create an index with 128 permutation functions optimized for Jaccard threshold 0.9:
lsh = MinHashLSH(threshold=0.9, num_perm=128)
print(lsh.b, lsh.r)  # 5 25
The built-in optimizer will try to minimize the weighted sum of probabilities of false positive and false negative. The algorithm is a simple grid search over the space of possible parameters.
Note that it is possible to get b (number of bands) and r (band size) whose product does not equal num_perm, leading to unused permutation values in the indexed MinHash. This is because the optimizer only considers bands of the same size, and the number of bands is not necessarily a divisor of num_perm.
Instead of using the built-in optimizer, you can customize the LSH parameters yourself. The snippet below creates an index with 128 permutation functions and 16 bands, each of size 8, skipping the optimization step:
lsh = MinHashLSH(num_perm=128, params=(16, 8))
print(lsh.b, lsh.r)  # 16 8
Create an index backed by Redis storage:
lsh = MinHashLSH(threshold=0.9, num_perm=128, storage_config={
    'type': 'redis',
    'basename': b'mylsh',  # optional, defaults to a random string.
    'redis': {'host': 'localhost', 'port': 6379},
})
The basename property is optional. It is used to generate key prefixes in the storage layer to uniquely identify data associated with this LSH. Thus, if you create a new LSH object with the same basename, you will be using the same underlying data in the storage layer associated with a previous LSH object. If you do not set this property, a random string will be generated instead.
- __init__(threshold: float = 0.9, num_perm: int = 128, weights: Tuple[float, float] = (0.5, 0.5), params: Tuple[int, int] | None = None, storage_config: Dict | None = None, prepickle: bool | None = None, hashfunc: Callable[[bytes], bytes] | None = None) None [source]
- insert(key: Hashable, minhash: MinHash | WeightedMinHash, check_duplication: bool = True)[source]
Insert a key into the index, together with a MinHash or Weighted MinHash of the set referenced by the key.
- Parameters:
key (Hashable) – The unique identifier of the set.
minhash (Union[MinHash, WeightedMinHash]) – The MinHash of the set.
check_duplication (bool) – To avoid duplicate keys in the storage (default=True). It’s recommended to not change the default, but if you want to avoid the overhead during insert you can set check_duplication = False.
- merge(other: MinHashLSH, check_overlap: bool = False)[source]
Merge the other MinHashLSH with this one, making this one the union of both.
Note
Only num_perm, the number of bands, and the size of each band are checked for equivalency of two MinHashLSH indexes. The other initialization parameters threshold, weights, storage_config, prepickle, and hashfunc are not checked.
- Parameters:
other (MinHashLSH) – The other MinHashLSH.
check_overlap (bool) – Check if there are any overlapping keys before merging and raise if there are any. (default=False)
- Raises:
ValueError – If the two MinHashLSH have different initialization parameters, or if check_overlap is True and there are overlapping keys.
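Example
A minimal sketch (illustrative keys) of merging two indexes created with the same parameters:

from datasketch import MinHash, MinHashLSH

m1, m2 = MinHash(num_perm=128), MinHash(num_perm=128)
m1.update(b"token1")
m2.update(b"token2")
lsh1 = MinHashLSH(threshold=0.5, num_perm=128)
lsh2 = MinHashLSH(threshold=0.5, num_perm=128)
lsh1.insert("a", m1)
lsh2.insert("b", m2)
# lsh1 now contains the keys of both indexes.
lsh1.merge(lsh2)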
- insertion_session(buffer_size: int = 50000) MinHashLSHInsertionSession [source]
Create a context manager for fast insertion into this index.
- Parameters:
buffer_size (int) – The buffer size for insert_session mode (default=50000).
- Returns:
The context manager.
- Return type:
MinHashLSHInsertionSession
Example
Insert 100 MinHashes into a Redis-backed index using a session:
from datasketch import MinHash, MinHashLSH
import numpy as np

minhashes = []
for i in range(100):
    m = MinHash(num_perm=128)
    m.update_batch(np.random.randint(low=0, high=30, size=10))
    minhashes.append(m)

lsh = MinHashLSH(threshold=0.5, num_perm=128, storage_config={
    'type': 'redis',
    'redis': {'host': 'localhost', 'port': 6379},
})
with lsh.insertion_session() as session:
    for i, m in enumerate(minhashes):
        session.insert(i, m)
- query(minhash) List[Hashable] [source]
Given the MinHash of the query set, retrieve the keys that reference sets with Jaccard similarities likely greater than the threshold.
Results are based on MinHash segment collisions and are thus approximate. For more accurate results, filter again with MinHash.jaccard(). For exact results, filter by computing the Jaccard similarity using the original sets.
- Parameters:
minhash (MinHash) – The MinHash of the query set.
- Returns:
a list of unique keys.
- Return type:
list
Example
Query and rank results using MinHash.jaccard():

from datasketch import MinHash, MinHashLSH
import numpy as np

# Generate 100 random MinHashes.
minhashes = MinHash.bulk(
    np.random.randint(low=0, high=30, size=(100, 10)),
    num_perm=128
)

# Create LSH index.
lsh = MinHashLSH(threshold=0.5, num_perm=128)
for i, m in enumerate(minhashes):
    lsh.insert(i, m)

# Get the initial results from LSH.
query = minhashes[0]
results = lsh.query(query)

# Rank results using Jaccard similarity estimated by MinHash.
results = [(query.jaccard(minhashes[key]), key) for key in results]
results.sort(reverse=True)
print(results)
Output:
[(1.0, 0), (0.421875, 4), (0.4140625, 19), (0.359375, 58), (0.3359375, 78), (0.265625, 62), (0.2578125, 11), (0.25, 98), (0.171875, 21)]
Note that although the threshold is set to 0.5, the results are not guaranteed to be above 0.5 because the LSH index is approximate and the Jaccard similarity is estimated by MinHash.
- add_to_query_buffer(minhash: MinHash | WeightedMinHash) None [source]
Given the MinHash of the query set, buffer the query to retrieve the keys that reference sets with Jaccard similarities greater than the threshold.
Buffered queries can be executed using collect_query_buffer(). The combination of these functions is significantly faster when the Cassandra backend is used with shared_buffer.
- Parameters:
minhash (MinHash) – The MinHash of the query set.
- collect_query_buffer() List[Hashable] [source]
Execute and return buffered queries given by add_to_query_buffer().
If multiple query MinHash were added to the query buffer, the intersection of the results of all query MinHash will be returned.
- Returns:
a list of unique keys.
- Return type:
list
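Example
A minimal sketch of buffered querying, assuming lsh is an existing index and m1 and m2 are query MinHash objects:

lsh.add_to_query_buffer(m1)
lsh.add_to_query_buffer(m2)
# Returns the intersection of the results for m1 and m2.
results = lsh.collect_query_buffer()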
- __contains__(key: Hashable) bool [source]
- Parameters:
key (Hashable) – The unique identifier of a set.
- Returns:
True only if the key exists in the index.
- Return type:
bool
- remove(key: Hashable) None [source]
Remove the key from the index.
- Parameters:
key (Hashable) – The unique identifier of a set.
- Raises:
ValueError – If the key does not exist.
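Example
A minimal sketch (illustrative key) of removal:

from datasketch import MinHash, MinHashLSH

m = MinHash(num_perm=128)
m.update(b"token1")
lsh = MinHashLSH(threshold=0.5, num_perm=128)
lsh.insert("a", m)
print("a" in lsh)  # True
lsh.remove("a")
print("a" in lsh)  # False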
- get_counts() List[Dict[Hashable, int]] [source]
Returns a list of length b (i.e., the number of hash tables), where each element is a dictionary mapping a hash table bucket key to the number of indexed keys stored under that bucket.
- Returns:
a list of dictionaries.
- Return type:
list
- get_subset_counts(*keys: Hashable) List[Dict[Hashable, int]] [source]
Returns the bucket allocation counts (see get_counts() above) restricted to the list of keys given.
- Parameters:
keys (Hashable) – the keys for which to get the bucket allocation counts.
- Returns:
a list of dictionaries.
- Return type:
list
- __weakref__
list of weak references to the object
Asynchronous MinHash LSH
MinHash LSH Forest
- class datasketch.MinHashLSHForest(num_perm: int = 128, l: int = 8)[source]
The LSH Forest for MinHash. It supports top-k queries in Jaccard similarity. Instead of using prefix trees as in the original paper, I use a sorted array to store the hash values in every hash table.
- Parameters:
num_perm (int) – The number of permutation functions used by the MinHash to be indexed. For weighted MinHash, this is the sample size (sample_size).
l (int) – The number of prefix trees as described in the paper.
Note
The MinHash LSH Forest also works with weighted Jaccard similarity and weighted MinHash without modification.
- add(key: Hashable, minhash: MinHash) None [source]
Add a unique key, together with a MinHash (or weighted MinHash) of the set referenced by the key.
Note
The key won’t be searchbale until the
index()
method is called.- Parameters:
key (Hashable) – The unique identifier of the set.
minhash (MinHash) – The MinHash of the set.
- query(minhash: MinHash, k: int) List[Hashable] [source]
Return the approximate top-k keys that have the (approximately) highest Jaccard similarities to the query set.
- Parameters:
minhash (MinHash) – The MinHash of the query set.
k (int) – The maximum number of keys to return.
- Returns:
list of at most k keys.
- Return type:
List[Hashable]
Note
Tip for improving accuracy: query with a multiple of k (e.g., 2*k), compute the exact (or MinHash-estimated) Jaccard similarities of the sets referenced by the returned keys, and then take the final top-k from those. This is often called "post-processing". Because the total number of similarity computations is still bounded by a constant multiple of k, the performance won't degrade much; however, you do have to keep the original sets (or MinHashes) around somewhere so that you can reference them.
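Example
A minimal sketch (illustrative tokens) of the add-index-query workflow, including the post-processing step described in the note above:

from datasketch import MinHash, MinHashLSHForest

data = {
    "a": [b"token1", b"token2"],
    "b": [b"token2", b"token3"],
    "c": [b"token3", b"token4"],
}
forest = MinHashLSHForest(num_perm=128)
minhashes = {}
for key, tokens in data.items():
    m = MinHash(num_perm=128)
    m.update_batch(tokens)
    forest.add(key, m)
    minhashes[key] = m
# Keys are not searchable until index() is called.
forest.index()

k = 2
query = minhashes["a"]
# Over-fetch with 2*k, then re-rank by estimated Jaccard (post-processing).
candidates = forest.query(query, 2 * k)
top_k = sorted(candidates, key=lambda key: query.jaccard(minhashes[key]), reverse=True)[:k]
print(top_k)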
- get_minhash_hashvalues(key: Hashable) ndarray [source]
Returns the hashvalues from the MinHash object that corresponds to the given key in the LSHForest, if it exists. This is useful for when we want to reconstruct the original MinHash object to manually check the Jaccard Similarity for the top-k results from a query.
- Parameters:
key (Hashable) – The key whose MinHash hashvalues we want to retrieve.
- Returns:
The hashvalues for the MinHash object corresponding to the given key.
- Return type:
numpy.ndarray
- is_empty() bool [source]
Check whether there are any searchable keys in the index. Note that keys won't be searchable until index() is called.
- Returns:
True if there is no searchable key in the index.
- Return type:
bool
- __contains__(key: Hashable) bool [source]
- Returns:
True only if the key has been added to the index.
- Return type:
bool
- __weakref__
list of weak references to the object
MinHash LSH Ensemble
- class datasketch.MinHashLSHEnsemble(threshold: float = 0.9, num_perm: int = 128, num_part: int = 16, m: int = 8, weights: Tuple[float, float] = (0.5, 0.5), storage_config: Dict | None = None, prepickle: bool | None = None)[source]
The MinHash LSH Ensemble index. It supports Containment queries. The implementation is based on E. Zhu et al.
- Parameters:
threshold (float) – The Containment threshold between 0.0 and 1.0. The initialized LSH Ensemble will be optimized for the threshold by minimizing the false positive and false negative.
num_perm (int) – The number of permutation functions used by the MinHash to be indexed. For weighted MinHash, this is the sample size (sample_size).
num_part (int) – The number of partitions in LSH Ensemble.
m (int) – The memory usage factor: an LSH Ensemble uses approximately m times more memory space than a MinHash LSH with the same number of sets indexed. The higher the m, the better the accuracy.
weights (Tuple[float, float]) – Used to adjust the relative importance of minimizing false positives and false negatives when optimizing for the Containment threshold. Similar to the weights parameter in datasketch.MinHashLSH.
storage_config (Optional[Dict]) – Type of storage service to use for storing hashtables and keys. basename is an optional property whose value will be used as the prefix to stored keys. If this is not set, a random string will be generated instead. If you set this, you will be responsible for ensuring there are no key collisions.
prepickle (Optional[bool]) – If True, all keys are pickled to bytes before insertion. If None, a default value is chosen based on the storage_config.
Note
Using more partitions (num_part) leads to better accuracy, at the expense of slower query performance. This is different from the paper and the Go implementation, in which more partitions leads to better accuracy AND faster query performance, due to parallelism.
Note
More information about the parameter m can be found in the Go implementation of LSH Ensemble, in which m is named MaxK.
- __init__(threshold: float = 0.9, num_perm: int = 128, num_part: int = 16, m: int = 8, weights: Tuple[float, float] = (0.5, 0.5), storage_config: Dict | None = None, prepickle: bool | None = None) None [source]
- index(entries: Iterable[Tuple[Hashable, MinHash, int]]) None [source]
Index all sets given their keys, MinHashes, and sizes. It can be called only once after the index is created.
- Parameters:
entries (Iterable[Tuple[Hashable, MinHash, int]]) – An iterable of tuples, each in the form of (key, minhash, size), where key is the unique identifier of a set, minhash is the MinHash of the set, and size is the size or number of unique items in the set.
- Raises:
ValueError – If the index is not empty or entries is empty.
- query(minhash: MinHash, size: int) Generator[Hashable, None, None] [source]
Given the MinHash and size of the query set, retrieve keys that reference sets whose containment with respect to the query set is greater than the threshold.
- Parameters:
minhash (MinHash) – The MinHash of the query set.
size (int) – The size (number of unique items) of the query set.
- Returns:
an iterator of keys.
- Return type:
Generator[Hashable, None, None]
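Example
A minimal sketch (illustrative sets) of indexing and querying for Containment:

from datasketch import MinHash, MinHashLSHEnsemble

set1 = [b"cat", b"dog"]
set2 = [b"cat", b"dog", b"fish", b"bird"]
m1, m2 = MinHash(num_perm=128), MinHash(num_perm=128)
m1.update_batch(set1)
m2.update_batch(set2)

lshensemble = MinHashLSHEnsemble(threshold=0.8, num_perm=128)
# index() can be called only once, with all entries.
lshensemble.index([("set1", m1, len(set1)), ("set2", m2, len(set2))])
# Retrieve keys of sets whose containment of the query set is above the threshold.
for key in lshensemble.query(m1, len(set1)):
    print(key)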
- __contains__(key: Hashable) bool [source]
- Parameters:
key (hashable) – The unique identifier of a set.
- Returns:
True only if the key exists in the index.
- Return type:
bool
- __weakref__
list of weak references to the object
HyperLogLog
- class datasketch.HyperLogLog(p: int = 8, reg: numpy.ndarray | None = None, hashfunc: Callable = sha1_hash32, hashobj: object | None = None)[source]
The HyperLogLog data sketch for estimating the cardinality of a very large dataset in a single pass. The original HyperLogLog is described here.
This HyperLogLog implementation is based on: https://github.com/svpcom/hyperloglog
- Parameters:
p (int) – The precision parameter. It is ignored if the reg is given.
reg (Optional[numpy.ndarray]) – The internal state. This argument is for initializing the HyperLogLog from an existing one.
hashfunc (Callable) – The hash function used by this HyperLogLog. It takes the input passed to the update() method and returns an integer that can be encoded with 32 bits. The default hash function is based on SHA1 from hashlib.
hashobj (deprecated) – This argument is deprecated since version 1.4.0. It is a no-op and has been replaced by hashfunc.
- __init__(p: int = 8, reg: numpy.ndarray | None = None, hashfunc: Callable = sha1_hash32, hashobj: object | None = None)[source]
- update(b) None [source]
Update the HyperLogLog with a new data value in bytes. The value will be hashed using the hash function specified by the hashfunc argument in the constructor.
- Parameters:
b – The value to be hashed using the hash function specified.
Example
To update with a new string value (using the default SHA1 hash function, which requires bytes as input):
hll = HyperLogLog()
hll.update("new value".encode('utf-8'))
We can also use a different hash function, for example, pyfarmhash:
import farmhash

def _hash_32(b):
    return farmhash.hash32(b)

hll = HyperLogLog(hashfunc=_hash_32)
hll.update("new value")
- count() float [source]
Estimate the cardinality of the data values seen so far.
- Returns:
The estimated cardinality.
- Return type:
float
- merge(other: HyperLogLog) None [source]
Merge the other HyperLogLog with this one, making this the union of the two.
- Parameters:
other (HyperLogLog) – The other HyperLogLog to be merged.
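Example
A minimal sketch (illustrative values) of merging two HyperLogLog counters:

from datasketch import HyperLogLog

hll1, hll2 = HyperLogLog(), HyperLogLog()
for d in [b"a", b"b"]:
    hll1.update(d)
for d in [b"b", b"c"]:
    hll2.update(d)
# hll1 now estimates the cardinality of the union {a, b, c}.
hll1.merge(hll2)
print(hll1.count())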
- copy() HyperLogLog [source]
Create a copy of the current HyperLogLog by exporting its state.
- Returns:
A copy of the current HyperLogLog.
- Return type:
HyperLogLog
- is_empty() bool [source]
- Returns:
True if the current HyperLogLog is empty, i.e., in the state of just being initialized.
- Return type:
bool
- __len__() int [source]
- Returns:
The size of the HyperLogLog, i.e., the size of reg.
- Return type:
int
- __eq__(other: HyperLogLog) bool [source]
Check equivalence between two HyperLogLogs
- Parameters:
other (HyperLogLog) –
- Returns:
True if both have the same internal state.
- Return type:
bool
- __hash__ = None
HyperLogLog++
- class datasketch.HyperLogLogPlusPlus(p: int = 8, reg: numpy.ndarray | None = None, hashfunc: Callable = sha1_hash64, hashobj: object | None = None)[source]
HyperLogLog++ is an enhanced HyperLogLog from Google. Main changes from the original HyperLogLog:
Use 64 bits instead of 32 bits for hash function
A new small-cardinality estimation scheme
Sparse representation (not implemented here)
- Parameters:
p (int) – The precision parameter. It is ignored if the reg is given.
reg (Optional[numpy.array]) – The internal state. This argument is for initializing the HyperLogLog from an existing one.
hashfunc (Callable) – The hash function used by this HyperLogLog++. It takes the input passed to the update() method and returns an integer that can be encoded with 64 bits. The default hash function is based on SHA1 from hashlib.
hashobj (deprecated) – This argument is deprecated since version 1.4.0. It is a no-op and has been replaced by hashfunc.
- __init__(p: int = 8, reg: numpy.ndarray | None = None, hashfunc: Callable = sha1_hash64, hashobj: object | None = None)[source]
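Example
A minimal sketch (illustrative values); the interface is the same as HyperLogLog:

from datasketch import HyperLogLogPlusPlus

hpp = HyperLogLogPlusPlus(p=12)
for d in [b"token1", b"token2", b"token1"]:
    hpp.update(d)
# Estimate the number of distinct values (here, 2).
print(hpp.count())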
- __weakref__
list of weak references to the object
HNSW
- class datasketch.HNSW(distance_func: Callable[[ndarray, ndarray], float], m: int = 16, ef_construction: int = 200, m0: int | None = None, seed: int | None = None, reversed_edges: bool = False)[source]
Hierarchical Navigable Small World (HNSW) graph index for approximate nearest neighbor search. This implementation is based on the paper “Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs” by Yu. A. Malkov, D. A. Yashunin (2016), https://arxiv.org/abs/1603.09320.
- Parameters:
distance_func – A function that takes two vectors and returns a float representing the distance between them.
m (int) – The number of neighbors to keep for each node.
ef_construction (int) – The number of neighbors to consider during construction.
m0 (Optional[int]) – The number of neighbors to keep for each node at the 0th level. If None, defaults to 2 * m.
seed (Optional[int]) – The random seed to use for the random number generator.
reversed_edges (bool) – Whether to maintain reverse edges in the graph. This speeds up hard remove (remove()) but increases memory usage and slows down insert().
Examples
Create an HNSW index with Euclidean distance and insert 1000 random vectors of dimension 10.
from datasketch.hnsw import HNSW
import numpy as np

data = np.random.random_sample((1000, 10))
index = HNSW(distance_func=lambda x, y: np.linalg.norm(x - y))
for i, d in enumerate(data):
    index.insert(i, d)

# Query the index for the 10 nearest neighbors of the first vector.
index.query(data[0], k=10)
Create an HNSW index with Jaccard distance and insert 1000 random sets of 10 elements each.
from datasketch.hnsw import HNSW
import numpy as np

# Each set is represented as a 10-element vector of random integers
# between 0 and 100.
# Deduplication is handled by the distance function.
data = np.random.randint(0, 100, size=(1000, 10))
jaccard_distance = lambda x, y: (
    1.0 - float(len(np.intersect1d(x, y, assume_unique=False)))
    / float(len(np.union1d(x, y)))
)
index = HNSW(distance_func=jaccard_distance)
for i, d in enumerate(data):
    index[i] = d

# Query the index for the 10 nearest neighbors of the first set.
index.query(data[0], k=10)
- __init__(distance_func: Callable[[ndarray, ndarray], float], m: int = 16, ef_construction: int = 200, m0: int | None = None, seed: int | None = None, reversed_edges: bool = False) None [source]
- __len__() int [source]
Return the number of points in the index excluding those that were soft-removed.
- __contains__(key: Hashable) bool [source]
Return True if the index contains the key and it was not soft-removed, else False.
- __getitem__(key: Hashable) ndarray [source]
Get the point associated with the key. Raises KeyError if the key does not exist in the index or it was soft-removed.
- __setitem__(key: Hashable, value: ndarray) None [source]
Set the point associated with the key and update the index. This is equivalent to calling insert() with the key and point.
- __delitem__(key: Hashable) None [source]
Soft remove the point associated with the key. Raises a KeyError if the key does not exist in the index. This is equivalent to calling remove() with the key.
- __iter__() Iterator[Hashable] [source]
Return an iterator over the keys of the index that were not soft-removed.
- reversed() Iterator[Hashable] [source]
Return a reverse iterator over the keys of the index that were not soft-removed.
- __eq__(_HNSW__value: object) bool [source]
Return True only if the index parameters, random states, keys, points, and index structures are equal, including deleted points.
- get(key: Hashable, default: ndarray | None = None) ndarray | None [source]
Return the point for key in the index, else default. If default is not given and key is not in the index or it was soft-removed, return None.
- items() Iterator[Tuple[Hashable, ndarray]] [source]
Return an iterator of the indexed points that were not soft-removed as (key, point) pairs.
- keys() Iterator[Hashable] [source]
Return an iterator of the keys of the index points that were not soft-removed.
- values() Iterator[ndarray] [source]
Return an iterator of the index points that were not soft-removed.
- pop(key: Hashable, default: ndarray | None = None, hard: bool = False) ndarray [source]
If key is in the index, remove it and return its associated point, else return default. If default is not given and key is not in the index or it was soft-removed, raise KeyError.
- popitem(last: bool = True, hard: bool = False) Tuple[Hashable, ndarray] [source]
Remove and return a (key, point) pair from the index. Pairs are returned in LIFO order if last is true or FIFO order if false. If the index is empty or all points are soft-removed, raise KeyError.
Note
In versions of Python before 3.7, the order of items in the index is not guaranteed. This method will remove and return an arbitrary (key, point) pair.
- clear() None [source]
Clear the index of all data points. This will not reset the random number generator.
- copy() HNSW [source]
Create a copy of the index. The copy will have the same parameters as the original index and the same keys and points, but will not share any index data structures (i.e., graphs) with the original index. The new index’s random state will start from a copy of the original index’s.
- update(other: Mapping | HNSW) None [source]
Update the index with the points from the other Mapping or HNSW object, overwriting existing keys.
- Parameters:
other (Union[Mapping, HNSW]) – The other Mapping or HNSW object.
Examples
Create an HNSW index with a dictionary of points.
from datasketch.hnsw import HNSW
import numpy as np

data = np.random.random_sample((1000, 10))
index = HNSW(distance_func=lambda x, y: np.linalg.norm(x - y))

# Batch insert 1000 points.
index.update({i: d for i, d in enumerate(data)})
Create an HNSW index with another HNSW index.
from datasketch.hnsw import HNSW
import numpy as np

data = np.random.random_sample((1000, 10))
index1 = HNSW(distance_func=lambda x, y: np.linalg.norm(x - y))
index2 = HNSW(distance_func=lambda x, y: np.linalg.norm(x - y))

# Batch insert 1000 points.
index1.update({i: d for i, d in enumerate(data)})

# Update index2 with the points from index1.
index2.update(index1)
- setdefault(key: Hashable, default: ndarray) ndarray [source]
If key is in the index and it was not soft-removed, return its associated point. If not, insert key with a value of default and return default. default cannot be None.
- insert(key: Hashable, new_point: ndarray, ef: int | None = None, level: int | None = None) None [source]
Add a new point to the index.
- Parameters:
key (Hashable) – The key of the new point. If the key already exists in the index, the point will be updated and the index will be repaired.
new_point (np.ndarray) – The new point to add to the index.
ef (Optional[int]) – The number of neighbors to consider during insertion. If None, use the construction ef.
level (Optional[int]) – The level at which to insert the new point. If None, the level will be chosen automatically.
- query(query_point: ndarray, k: int | None = None, ef: int | None = None) List[Tuple[Hashable, float]] [source]
Search for the k nearest neighbors of the query point.
- Parameters:
query_point (np.ndarray) – The query point.
k (Optional[int]) – The number of neighbors to return. If None, return all neighbors found.
ef (Optional[int]) – The number of neighbors to consider during search. If None, use the construction ef.
- Returns:
A list of (key, distance) pairs for the k nearest neighbors of the query point.
- Return type:
List[Tuple[Hashable, float]]
- Raises:
ValueError – If the entry point is not found.
- remove(key: Hashable, hard: bool = False, ef: int | None = None) None [source]
Remove a point from the index. This removal algorithm is based on the discussion on hnswlib issue #4. There are two versions:
soft remove: the point is marked as removed from the index, but its data and out-going edges are kept. Future queries will not return the point and no new edge will direct to this point, but the point will still be used for graph traversal. This is the default behavior.
hard remove: the point is removed from the index and its data and out-going edges are also removed. Points with out-going edges pointing to the deleted point will have their out-going edges re-assigned using the same pruning algorithm as insert() during point update.
In both versions, if the deleted point is the current entry point, the entry point will be re-assigned to the next point in the highest layer that has other points beside the current entry point.
Subsequent soft removes without a hard remove of the same point will not affect the index, unless the point was the only point in the index, as removing it clears the index. This is different from pop(), which will always raise a KeyError if the key was removed.
Subsequent hard removes of the same point will raise a KeyError. If the point is soft removed and then hard removed, the point will be removed from the index. Use clean() for removing all soft removed points from the index.
- Parameters:
key (Hashable) – The key of the point to remove.
hard (bool) – If True, perform a hard remove. Otherwise, perform a soft remove.
ef (Optional[int]) – The number of neighbors to consider during re-assignment. If None, use the construction ef. This argument is only used when hard is True.
- Raises:
KeyError – If the index is empty or the key does not exist in the index and was not soft removed.
Example
from datasketch.hnsw import HNSW
import numpy as np

data = np.random.random_sample((1000, 10))
index = HNSW(distance_func=lambda x, y: np.linalg.norm(x - y))
index.update({i: d for i, d in enumerate(data)})

# Soft remove a point with key = 0.
index.remove(0)

# Soft removing the same point again will not change the index
# because the index is not empty.
index.remove(0)

print(0 in index)  # False

# Hard remove the point.
index.remove(0, hard=True)

# Hard removing the same point again will raise a KeyError.
# index.remove(0, hard=True)

# Soft remove the rest of the points from the index.
for i in range(1, 1000):
    index.remove(i)

print(len(index))  # 0

# Clean the index to hard remove all soft removed points.
index.clean()
- clean(ef: int | None = None) None [source]
Remove all soft removed points from the index.
- Parameters:
ef (Optional[int]) – The number of neighbors to consider during re-assignment. If None, use the construction ef.
- merge(other: HNSW) HNSW [source]
Create a new index by merging the current index with another index. The new index will contain all points from both indexes. If a point exists in both, the point from the other index will be used. The new index will have the same parameters as the current index and a copy of the current index’s random state.
- Parameters:
other (HNSW) – The other index to merge with.
- Returns:
A new index containing all points from both indexes.
- Return type:
HNSW
Example
from datasketch.hnsw import HNSW
import numpy as np

data1 = np.random.random_sample((1000, 10))
data2 = np.random.random_sample((1000, 10))
index1 = HNSW(distance_func=lambda x, y: np.linalg.norm(x - y))
index2 = HNSW(distance_func=lambda x, y: np.linalg.norm(x - y))

# Batch insert data into the indexes.
index1.update({i: d for i, d in enumerate(data1)})
index2.update({i + len(data1): d for i, d in enumerate(data2)})

# Merge the indexes.
index = index1.merge(index2)
- __hash__ = None
- __weakref__
list of weak references to the object