API Documentation
MinHash
- class datasketch.MinHash(num_perm: int = 128, seed: int = 1, hashfunc: Callable = sha1_hash32, hashobj: object | None = None, hashvalues: Iterable | None = None, permutations: Tuple[Iterable, Iterable] | None = None)[source]
MinHash is a probabilistic data structure for computing Jaccard similarity between sets.
- Parameters:
num_perm (int) – Number of random permutation functions. It will be ignored if hashvalues is not None.
seed (int) – The random seed controls the set of random permutation functions generated for this MinHash.
hashfunc (Callable) – The hash function used by this MinHash. It takes the input passed to the update() method and returns an integer that can be encoded with 32 bits. The default hash function is based on SHA1 from hashlib. Users can use farmhash for better performance. See the example in update().
hashobj (deprecated) – This argument is deprecated since version 1.4.0. It is a no-op and has been replaced by hashfunc.
hashvalues (Optional[Iterable]) – The hash values are the internal state of the MinHash. They can be specified for faster initialization using the existing hashvalues of another MinHash.
permutations (Optional[Tuple[Iterable, Iterable]]) – The permutation function parameters as a tuple of two lists. This argument can be specified for faster initialization using the existing permutations of another MinHash.
Note
To save memory, consider using datasketch.LeanMinHash.
Note
Since version 1.1.1, MinHash only supports serialization using pickle. The serialize and deserialize methods are removed, and are supported in datasketch.LeanMinHash instead. MinHash objects serialized before version 1.1.1 cannot be deserialized properly in newer versions (need to migrate?).
Note
Since version 1.1.3, MinHash uses Numpy's random number generator instead of Python's built-in random package. This change makes the hash values consistent across different Python versions. The side-effect is that a MinHash created before version 1.1.3 won't work (i.e., jaccard(), merge(), and union()) with one created after.
- __init__(num_perm: int = 128, seed: int = 1, hashfunc: Callable = sha1_hash32, hashobj: object | None = None, hashvalues: Iterable | None = None, permutations: Tuple[Iterable, Iterable] | None = None) None [source]
- update(b) None [source]
Update this MinHash with a new value. The value will be hashed using the hash function specified by the hashfunc argument in the constructor.
- Parameters:
b – The value to be hashed using the hash function specified.
Example
To update with a new string value (using the default SHA1 hash function, which requires bytes as input):
minhash = MinHash()
minhash.update("new value".encode('utf-8'))
We can also use a different hash function, for example, pyfarmhash:
import farmhash

def _hash_32(b):
    return farmhash.hash32(b)

minhash = MinHash(hashfunc=_hash_32)
minhash.update("new value")
- update_batch(b: Iterable) None [source]
Update this MinHash with new values. The values will be hashed using the hash function specified by the hashfunc argument in the constructor.
- Parameters:
b (Iterable) – Values to be hashed using the hash function specified.
Example
To update with new string values (using the default SHA1 hash function, which requires bytes as input):
minhash = MinHash()
minhash.update_batch([s.encode('utf-8') for s in ["token1", "token2"]])
- jaccard(other: MinHash) float [source]
Estimate the Jaccard similarity (resemblance) between the sets represented by this MinHash and the other.
- Parameters:
other (MinHash) – The other MinHash.
- Returns:
The Jaccard similarity, which is between 0.0 and 1.0.
- Return type:
float
- Raises:
ValueError – If the two MinHashes have different numbers of permutation functions or different seeds.
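Example
A minimal sketch (with illustrative values not taken from the library's documentation) of estimating the similarity of two small sets:

from datasketch import MinHash

m1, m2 = MinHash(num_perm=128), MinHash(num_perm=128)
for d in [b"apple", b"banana", b"cherry"]:
    m1.update(d)
for d in [b"apple", b"banana", b"durian"]:
    m2.update(d)
# Estimate of the true Jaccard similarity, here 2/4 = 0.5.
print(m1.jaccard(m2))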
- count() float [source]
Estimate the cardinality count based on the technique described in this paper.
- Returns:
The estimated cardinality of the set represented by this MinHash.
- Return type:
float
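Example
A minimal sketch (illustrative values) of cardinality estimation:

from datasketch import MinHash

m = MinHash(num_perm=128)
for i in range(1000):
    m.update(str(i).encode('utf-8'))
# Approximate number of distinct values seen; expect a value close to 1000.
print(m.count())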
- merge(other: MinHash) None [source]
Merge the other MinHash with this one, making this one the union of both.
- Parameters:
other (MinHash) – The other MinHash.
- Raises:
ValueError – If the two MinHashes have different numbers of permutation functions or different seeds.
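Example
A minimal sketch of an in-place union via merge():

from datasketch import MinHash

m1, m2 = MinHash(num_perm=128), MinHash(num_perm=128)
m1.update(b"token1")
m2.update(b"token2")
# m1 now represents the union of both sets; m2 is unchanged.
m1.merge(m2)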
- digest() ndarray [source]
Export the hash values, which are the internal state of the MinHash.
- Returns:
The hash values as a Numpy array.
- Return type:
numpy.ndarray
- is_empty() bool [source]
- Returns:
True if the current MinHash is empty, i.e., in the state of just being initialized.
- Return type:
bool
- __eq__(other: MinHash) bool [source]
- Returns:
True if the seeds and hash values of the two MinHash are both equal.
- Return type:
bool
- classmethod union(*mhs: MinHash) MinHash [source]
Create a MinHash which is the union of the MinHash objects passed as arguments.
- Parameters:
*mhs (MinHash) – The MinHash objects to be united. The argument list length is variable, but must be at least 2.
- Returns:
A new union MinHash.
- Return type:
MinHash
- Raises:
ValueError – If the number of MinHash objects passed as arguments is less than 2, or if the MinHash objects passed as arguments have different seeds or different numbers of permutation functions.
Example
from datasketch import MinHash
import numpy as np

m1 = MinHash(num_perm=128)
m1.update_batch(np.random.randint(low=0, high=30, size=10))
m2 = MinHash(num_perm=128)
m2.update_batch(np.random.randint(low=0, high=30, size=10))

# Union m1 and m2.
m = MinHash.union(m1, m2)
- __hash__ = None
- __weakref__
list of weak references to the object
- classmethod bulk(b: Iterable, **minhash_kwargs) List[MinHash] [source]
Compute MinHashes in bulk. This method avoids unnecessary overhead when initializing many minhashes by reusing the initialized state.
- Parameters:
b (Iterable) – An Iterable of lists of bytes; each list is hashed into one MinHash in the output.
**minhash_kwargs – Keyword arguments used to initialize MinHash; they will be used for all minhashes.
- Returns:
A list of computed MinHashes.
- Return type:
List[datasketch.MinHash]
Example
from datasketch import MinHash

data = [[b'token1', b'token2', b'token3'],
        [b'token4', b'token5', b'token6']]
minhashes = MinHash.bulk(data, num_perm=64)
- classmethod generator(b: Iterable, **minhash_kwargs) Generator[MinHash, None, None] [source]
Compute MinHashes in a generator. This method avoids unnecessary overhead when initializing many minhashes by reusing the initialized state.
- Parameters:
b (Iterable) – An Iterable of lists of bytes; each list is hashed into one MinHash in the output.
**minhash_kwargs – Keyword arguments used to initialize MinHash; they will be used for all minhashes.
- Returns:
a generator of computed MinHashes.
- Return type:
Generator[MinHash, None, None]
Example
from datasketch import MinHash

data = [[b'token1', b'token2', b'token3'],
        [b'token4', b'token5', b'token6']]
for minhash in MinHash.generator(data, num_perm=64):
    # do something useful
    minhash
Lean MinHash
- class datasketch.LeanMinHash(minhash: MinHash = None, seed: int = None, hashvalues: Iterable = None)[source]
Lean MinHash is MinHash with a smaller memory footprint and faster deserialization, but with its internal state frozen – no update().
Lean MinHash inherits all methods from datasketch.MinHash. It does not store the permutations and the hashfunc needed for updating. If a MinHash does not need further updates, convert it into a lean MinHash to save memory.
Example
To create a lean MinHash from an existing MinHash:
lean_minhash = LeanMinHash(minhash)

# You can compute the Jaccard similarity between two lean MinHash
lean_minhash.jaccard(lean_minhash2)

# Or between a lean MinHash and a MinHash
lean_minhash.jaccard(minhash2)
To create a lean MinHash from the hash values and seed of an existing MinHash:
lean_minhash = LeanMinHash(seed=minhash.seed, hashvalues=minhash.hashvalues)
To create a MinHash from a lean MinHash:
minhash = MinHash(seed=lean_minhash.seed, hashvalues=lean_minhash.hashvalues)

# Or, if you want to prevent further updates on minhash
# from affecting the state of lean_minhash:
minhash = MinHash(seed=lean_minhash.seed, hashvalues=lean_minhash.digest())
Note
Lean MinHash can also be used in datasketch.MinHashLSH, datasketch.MinHashLSHForest, and datasketch.MinHashLSHEnsemble.
- Parameters:
minhash (optional) – The datasketch.MinHash object used to initialize the LeanMinHash. If this is not set, then seed and hashvalues must be set.
seed (optional) – The random seed that controls the set of random permutation functions generated for this LeanMinHash. This parameter must be used together with hashvalues.
hashvalues (optional) – The hash values used to initialize the state of the LeanMinHash. This parameter must be used together with seed.
- update(b) None [source]
This method is not available on a LeanMinHash. Calling it raises a TypeError.
- copy() LeanMinHash [source]
- Returns:
A copy of this lean MinHash, created by exporting its state.
- Return type:
LeanMinHash
- bytesize(byteorder='@') int [source]
Compute the byte size after serialization.
- Parameters:
byteorder (str, optional) – The byte order of the serialized data. Use one of the byte order characters: @, =, <, >, and !. Default is @, the native order.
- Returns:
Size in number of bytes after serialization.
- Return type:
int
- serialize(buf, byteorder='@') None [source]
Serialize this lean MinHash and store the result in an allocated buffer.
- Parameters:
buf (buffer) – buf must implement the buffer interface. One such example is the built-in bytearray class.
byteorder (str, optional) – The byte order of the serialized data. Use one of the byte order characters: @, =, <, >, and !. Default is @, the native order.
This is preferred over using pickle if the serialized lean MinHash needs to be used by another program in a different programming language.
- The serialization schema:
The first 8 bytes are the seed integer
The next 4 bytes are the number of hash values
The rest are the serialized hash values, each using 4 bytes
Example
To serialize a single lean MinHash into a bytearray buffer.
buf = bytearray(lean_minhash.bytesize())
lean_minhash.serialize(buf)
To serialize multiple lean MinHash into a bytearray buffer.
# Assuming lean_minhashs is a list of LeanMinHash of the same size.
size = lean_minhashs[0].bytesize()
buf = bytearray(size * len(lean_minhashs))
# Use a memoryview so each slice writes into buf rather than into a copy.
view = memoryview(buf)
for i, lean_minhash in enumerate(lean_minhashs):
    lean_minhash.serialize(view[i*size:(i+1)*size])
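Since the schema is fixed, a serialized lean MinHash can also be decoded without datasketch. A minimal sketch, assuming native byte order and that the schema above maps to the struct formats q (8-byte seed), i (4-byte count), and I (4-byte hash values):

import struct

# buf holds a serialized lean MinHash in native byte order ('@').
seed, num = struct.unpack_from('@qi', buf, 0)
hashvalues = struct.unpack_from('@%dI' % num, buf, struct.calcsize('@qi'))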
- classmethod deserialize(buf, byteorder='@') LeanMinHash [source]
Deserialize a lean MinHash from a buffer.
- Parameters:
buf (buffer) – buf must implement the buffer interface. One such example is the built-in bytearray class.
byteorder (str, optional) – The byte order of the serialized data. Use one of the byte order characters: @, =, <, >, and !. Default is @, the native order.
- Returns:
The deserialized lean MinHash.
- Return type:
LeanMinHash
Example
To deserialize a lean MinHash from a buffer.
lean_minhash = LeanMinHash.deserialize(buf)
- classmethod union(*lmhs: LeanMinHash) LeanMinHash [source]
Create a new lean MinHash by taking the union of multiple lean MinHash objects.
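Example
A minimal sketch, assuming lean_m1 and lean_m2 are existing LeanMinHash objects with the same seed and number of hash values:

lean_m = LeanMinHash.union(lean_m1, lean_m2)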
Weighted MinHash
- class datasketch.WeightedMinHashGenerator(dim: int, sample_size: int = 128, seed: int = 1)[source]
The weighted MinHash generator is used for creating new WeightedMinHash objects.
This weighted MinHash implementation is based on Sergey Ioffe's paper, Improved Consistent Sampling, Weighted Minhash and L1 Sketching.
- Parameters:
dim (int) – The number of dimensions of the input Jaccard vectors.
sample_size (int) – The number of samples to use for creating weighted MinHash.
seed (int) – The random seed to use for generating permutation functions.
- minhash(v: ndarray) WeightedMinHash [source]
Create a new weighted MinHash given a weighted Jaccard vector. Each dimension is an integer frequency of the corresponding element in the multi-set represented by the vector.
- Parameters:
v (numpy.ndarray) – The Jaccard vector.
- Returns:
The weighted MinHash.
- Return type:
WeightedMinHash
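Example
A minimal sketch (illustrative values) of generating and comparing weighted MinHash:

from datasketch import WeightedMinHashGenerator
import numpy as np

mg = WeightedMinHashGenerator(dim=10, sample_size=128)
v1 = np.random.randint(1, 10, size=10)
v2 = np.random.randint(1, 10, size=10)
wm1 = mg.minhash(v1)
wm2 = mg.minhash(v2)
# Estimate the weighted Jaccard similarity of the two multi-sets.
print(wm1.jaccard(wm2))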
- minhash_many(X: spmatrix | ndarray) List[WeightedMinHash | None] [source]
Create new WeightedMinHash instances given a matrix of weighted Jaccard vectors. In the input matrix X, each row corresponds to a multi-set, and each column stores the integer frequency of the element of a dimension.
Note
This method is experimental and does not yield the same MinHash hash values as minhash().
- Parameters:
X (Union[scipy.sparse.spmatrix, numpy.ndarray]) – A matrix of Jaccard vectors (rows).
- Returns:
A list of length X.shape[0]. Each element is either a WeightedMinHash instance or None (if the corresponding row in X is empty).
- Return type:
List[Union[WeightedMinHash, None]]
- __weakref__
list of weak references to the object
- class datasketch.WeightedMinHash(seed: int, hashvalues: ndarray)[source]
A new weighted MinHash is generated by WeightedMinHashGenerator. You can also initialize a weighted MinHash using the state of an existing one.
- Parameters:
seed (int) – The random seed used to generate this weighted MinHash.
hashvalues (numpy.ndarray) – The internal state of this weighted MinHash.
- jaccard(other: WeightedMinHash) float [source]
Estimate the weighted Jaccard similarity between the multi-sets represented by this weighted MinHash and the other.
- Parameters:
other (WeightedMinHash) – The other weighted MinHash.
- Returns:
The weighted Jaccard similarity between 0.0 and 1.0.
- Return type:
float
- Raises:
ValueError – If the two weighted MinHash objects have different seeds or different numbers of hash values.
- digest() ndarray [source]
Export the hash values, which are the internal state of the weighted MinHash.
- Returns:
The hash values as a Numpy array.
- Return type:
numpy.ndarray
- copy() WeightedMinHash [source]
- Returns:
A copy of this weighted MinHash by exporting its state.
- Return type:
WeightedMinHash
- __eq__(other) bool [source]
- Returns:
True if the seeds and hash values of the two weighted MinHash are both equal.
- Return type:
bool
- __hash__ = None
- __weakref__
list of weak references to the object
MinHash LSH
- class datasketch.MinHashLSH(threshold: float = 0.9, num_perm: int = 128, weights: Tuple[float, float] = (0.5, 0.5), params: Tuple[int, int] | None = None, storage_config: Dict | None = None, prepickle: bool | None = None, hashfunc: Callable[[bytes], bytes] | None = None)[source]
The MinHash LSH index. It supports queries with a Jaccard similarity threshold. Reference: Chapter 3, Mining of Massive Datasets.
- Parameters:
threshold (float) – The Jaccard similarity threshold between 0.0 and 1.0. The initialized MinHash LSH will be optimized for the threshold by minimizing the false positive and false negative.
num_perm (int) – The number of permutation functions used by the MinHash to be indexed. For weighted MinHash, this is the sample size (sample_size).
weights (Tuple[float, float]) – Used to adjust the relative importance of minimizing false positives and false negatives when optimizing for the Jaccard similarity threshold. weights is a tuple in the format of (false_positive_weight, false_negative_weight).
params (Optional[Tuple[int, int]]) – The LSH parameters (i.e., the number of bands and the size of each band). This is used to bypass the parameter optimization step in the constructor. threshold and weights will be ignored if this is given.
storage_config (Optional[Dict]) – Type of storage service to use for storing hashtables and keys. basename is an optional property whose value will be used as the prefix to stored keys. If this is not set, a random string will be generated instead. If you set this, you will be responsible for ensuring there are no key collisions.
prepickle (Optional[bool]) – If True, all keys are pickled to bytes before insertion. If not specified, a default value is chosen based on the storage_config.
hashfunc (Optional[Callable[[bytes], bytes]]) – If a hash function is provided it will be used to compress the index keys to reduce the memory footprint. This could cause a higher false positive rate.
Note
weights must sum to 1.0, and the format is (false positive weight, false negative weight). For example, if minimizing false negative (or maintaining high recall) is more important, assign more weight toward false negative: weights=(0.4, 0.6). Try to live with a small difference between weights (i.e. < 0.5).
Examples
Create an index with 128 permutation functions optimized for Jaccard threshold 0.9:
lsh = MinHashLSH(threshold=0.9, num_perm=128)
print(lsh.b, lsh.r)  # 5 25
The built-in optimizer will try to minimize the weighted sum of probabilities of false positive and false negative. The algorithm is a simple grid search over the space of possible parameters.
Note that it is possible to get b (number of bands) and r (band size) whose product does not equal num_perm, leading to unused permutation values in the indexed MinHash. This is because the optimizer only considers bands of the same size, and the number of bands is not necessarily a divisor of num_perm.
Instead of using the built-in optimizer, you can customize the LSH parameters yourself. The snippet below creates an index with 128 permutation functions and 16 bands, each of size 8, skipping the optimization step:
lsh = MinHashLSH(num_perm=128, params=(16, 8))
print(lsh.b, lsh.r)  # 16 8
Create an index backed by Redis storage:
lsh = MinHashLSH(threshold=0.9, num_perm=128, storage_config={
    'type': 'redis',
    'basename': b'mylsh',  # optional, defaults to a random string.
    'redis': {'host': 'localhost', 'port': 6379},
})
The basename property is optional. It is used to generate key prefixes in the storage layer to uniquely identify data associated with this LSH. Thus, if you create a new LSH object with the same basename, you will be using the same underlying data in the storage layer associated with a previous LSH object. If you do not set this property, a random string will be generated instead.
- __init__(threshold: float = 0.9, num_perm: int = 128, weights: Tuple[float, float] = (0.5, 0.5), params: Tuple[int, int] | None = None, storage_config: Dict | None = None, prepickle: bool | None = None, hashfunc: Callable[[bytes], bytes] | None = None) None [source]
- insert(key: Hashable, minhash: MinHash | WeightedMinHash, check_duplication: bool = True)[source]
Insert a key into the index, together with a MinHash or Weighted MinHash of the set referenced by the key.
- Parameters:
key (Hashable) – The unique identifier of the set.
minhash (Union[MinHash, WeightedMinHash]) – The MinHash of the set.
check_duplication (bool) – To avoid duplicate keys in the storage (default=True). It’s recommended to not change the default, but if you want to avoid the overhead during insert you can set check_duplication = False.
- merge(other: MinHashLSH, check_overlap: bool = False)[source]
Merge the other MinHashLSH with this one, making this one the union of both.
Note
Only num_perm, the number of bands, and the size of each band are checked for equivalency of two MinHashLSH indexes. The other initialization parameters threshold, weights, storage_config, prepickle, and hashfunc are not checked.
- Parameters:
other (MinHashLSH) – The other MinHashLSH.
check_overlap (bool) – Check if there are any overlapping keys before merging and raise if there are any. (default=False)
- Raises:
ValueError – If the two MinHashLSH have different initialization parameters, or if check_overlap is True and there are overlapping keys.
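Example
A minimal sketch (illustrative keys) of merging two indexes created with the same parameters:

from datasketch import MinHash, MinHashLSH

m1, m2 = MinHash(num_perm=128), MinHash(num_perm=128)
m1.update(b"token1")
m2.update(b"token2")
lsh1 = MinHashLSH(threshold=0.5, num_perm=128)
lsh2 = MinHashLSH(threshold=0.5, num_perm=128)
lsh1.insert("a", m1)
lsh2.insert("b", m2)
# lsh1 now contains the keys of both indexes.
lsh1.merge(lsh2)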
- insertion_session(buffer_size: int = 50000) MinHashLSHInsertionSession [source]
Create a context manager for fast insertion into this index.
- Parameters:
buffer_size (int) – The buffer size for insert_session mode (default=50000).
- Returns:
The context manager.
- Return type:
MinHashLSHInsertionSession
Example
Insert 100 MinHashes into a Redis-backed index using a session:
from datasketch import MinHash, MinHashLSH
import numpy as np

minhashes = []
for i in range(100):
    m = MinHash(num_perm=128)
    m.update_batch(np.random.randint(low=0, high=30, size=10))
    minhashes.append(m)

lsh = MinHashLSH(threshold=0.5, num_perm=128, storage_config={
    'type': 'redis',
    'redis': {'host': 'localhost', 'port': 6379},
})
with lsh.insertion_session() as session:
    for i, m in enumerate(minhashes):
        session.insert(i, m)
- query(minhash) List[Hashable] [source]
Given the MinHash of the query set, retrieve the keys that reference sets with Jaccard similarities likely greater than the threshold.
Results are based on MinHash segment collisions and are thus approximate. For more accurate results, filter again with MinHash.jaccard(). For exact results, filter by computing the Jaccard similarity using the original sets.
- Parameters:
minhash (MinHash) – The MinHash of the query set.
- Returns:
a list of unique keys.
- Return type:
list
Example
Query and rank results using MinHash.jaccard():

from datasketch import MinHash, MinHashLSH
import numpy as np

# Generate 100 random MinHashes.
minhashes = MinHash.bulk(
    np.random.randint(low=0, high=30, size=(100, 10)),
    num_perm=128
)

# Create LSH index.
lsh = MinHashLSH(threshold=0.5, num_perm=128)
for i, m in enumerate(minhashes):
    lsh.insert(i, m)

# Get the initial results from LSH.
query = minhashes[0]
results = lsh.query(query)

# Rank results using Jaccard similarity estimated by MinHash.
results = [(query.jaccard(minhashes[key]), key) for key in results]
results.sort(reverse=True)
print(results)
Output:
[(1.0, 0), (0.421875, 4), (0.4140625, 19), (0.359375, 58), (0.3359375, 78), (0.265625, 62), (0.2578125, 11), (0.25, 98), (0.171875, 21)]
Note that although the threshold is set to 0.5, the results are not guaranteed to be above 0.5 because the LSH index is approximate and the Jaccard similarity is estimated by MinHash.
- add_to_query_buffer(minhash: MinHash | WeightedMinHash) None [source]
Given the MinHash of the query set, buffer the query to retrieve the keys that reference sets with Jaccard similarities greater than the threshold.
Buffered queries can be executed using collect_query_buffer(). The combination of these functions is significantly faster when the Cassandra backend is used with shared_buffer.
- Parameters:
minhash (MinHash) – The MinHash of the query set.
- collect_query_buffer() List[Hashable] [source]
Execute and return buffered queries given by add_to_query_buffer().
If multiple query MinHash were added to the query buffer, the intersection of the results of all query MinHash will be returned.
- Returns:
a list of unique keys.
- Return type:
list
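Example
A minimal sketch of buffered querying, assuming lsh is an existing index and m1 and m2 are query MinHash objects:

lsh.add_to_query_buffer(m1)
lsh.add_to_query_buffer(m2)
# Returns the intersection of the results for m1 and m2.
results = lsh.collect_query_buffer()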
- __contains__(key: Hashable) bool [source]
- Parameters:
key (Hashable) – The unique identifier of a set.
- Returns:
True only if the key exists in the index.
- Return type:
bool
- remove(key: Hashable) None [source]
Remove the key from the index.
- Parameters:
key (Hashable) – The unique identifier of a set.
- Raises:
ValueError – If the key does not exist.
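Example
A minimal sketch (illustrative key) of removal:

from datasketch import MinHash, MinHashLSH

m = MinHash(num_perm=128)
m.update(b"token1")
lsh = MinHashLSH(threshold=0.5, num_perm=128)
lsh.insert("a", m)
print("a" in lsh)  # True
lsh.remove("a")
print("a" in lsh)  # False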
- get_counts() List[Dict[Hashable, int]] [source]
Returns a list of length b (i.e., the number of hash tables), where each element is a dictionary mapping a hash table bucket key to the number of indexed keys stored under that bucket.
- Returns:
a list of dictionaries.
- Return type:
list
- get_subset_counts(*keys: Hashable) List[Dict[Hashable, int]] [source]
Returns the bucket allocation counts (see get_counts() above) restricted to the list of keys given.
- Parameters:
keys (Hashable) – the keys for which to get the bucket allocation counts.
- Returns:
a list of dictionaries.
- Return type:
list
- __weakref__
list of weak references to the object
Asynchronous MinHash LSH
MinHash LSH Forest
- class datasketch.MinHashLSHForest(num_perm: int = 128, l: int = 8)[source]
The LSH Forest for MinHash. It supports top-k queries in Jaccard similarity. Instead of using prefix trees as in the original paper, I use a sorted array to store the hash values in every hash table.
- Parameters:
num_perm (int) – The number of permutation functions used by the MinHash to be indexed. For weighted MinHash, this is the sample size (sample_size).
l (int) – The number of prefix trees as described in the paper.
Note
The MinHash LSH Forest also works with weighted Jaccard similarity and weighted MinHash without modification.
- add(key: Hashable, minhash: MinHash) None [source]
Add a unique key, together with a MinHash (or weighted MinHash) of the set referenced by the key.
Note
The key won’t be searchbale until the
index()
method is called.- Parameters:
key (Hashable) – The unique identifier of the set.
minhash (MinHash) – The MinHash of the set.
- query(minhash: MinHash, k: int) List[Hashable] [source]
Return the approximate top-k keys that have the (approximately) highest Jaccard similarities to the query set.
- Parameters:
minhash (MinHash) – The MinHash of the query set.
k (int) – The maximum number of keys to return.
- Returns:
list of at most k keys.
- Return type:
List[Hashable]
Note
Tip for improving accuracy: query with a multiple of k (e.g., 2*k), compute the exact (or MinHash-estimated) Jaccard similarities of the sets referenced by the returned keys, and then take the final top-k from those. This is often called "post-processing". Because the total number of similarity computations is still bounded by a constant multiple of k, the performance won't degrade much; however, you do have to keep the original sets (or MinHashes) around somewhere so that you can reference them.
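Example
A minimal sketch (illustrative tokens) of the add-index-query workflow, including the post-processing step described in the note above:

from datasketch import MinHash, MinHashLSHForest

data = {
    "a": [b"token1", b"token2"],
    "b": [b"token2", b"token3"],
    "c": [b"token3", b"token4"],
}
forest = MinHashLSHForest(num_perm=128)
minhashes = {}
for key, tokens in data.items():
    m = MinHash(num_perm=128)
    m.update_batch(tokens)
    forest.add(key, m)
    minhashes[key] = m
# Keys are not searchable until index() is called.
forest.index()

k = 2
query = minhashes["a"]
# Over-fetch with 2*k, then re-rank by estimated Jaccard (post-processing).
candidates = forest.query(query, 2 * k)
top_k = sorted(candidates, key=lambda key: query.jaccard(minhashes[key]), reverse=True)[:k]
print(top_k)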
- get_minhash_hashvalues(key: Hashable) ndarray [source]
Returns the hashvalues from the MinHash object that corresponds to the given key in the LSHForest, if it exists. This is useful for when we want to reconstruct the original MinHash object to manually check the Jaccard Similarity for the top-k results from a query.
- Parameters:
key (Hashable) – The key whose MinHash hashvalues we want to retrieve.
- Returns:
The hashvalues for the MinHash object corresponding to the given key.
- Return type:
numpy.ndarray
- is_empty() bool [source]
Check whether there are any searchable keys in the index. Note that keys won't be searchable until index() is called.
- Returns:
True if there is no searchable key in the index.
- Return type:
bool
- __contains__(key: Hashable) bool [source]
- Returns:
True only if the key has been added to the index.
- Return type:
bool
- __weakref__
list of weak references to the object
MinHash LSH Ensemble
- class datasketch.MinHashLSHEnsemble(threshold: float = 0.9, num_perm: int = 128, num_part: int = 16, m: int = 8, weights: Tuple[float, float] = (0.5, 0.5), storage_config: Dict | None = None, prepickle: bool | None = None)[source]
The MinHash LSH Ensemble index. It supports Containment queries. The implementation is based on E. Zhu et al.
- Parameters:
threshold (float) – The Containment threshold between 0.0 and 1.0. The initialized LSH Ensemble will be optimized for the threshold by minimizing the false positive and false negative.
num_perm (int) – The number of permutation functions used by the MinHash to be indexed. For weighted MinHash, this is the sample size (sample_size).
num_part (int) – The number of partitions in LSH Ensemble.
m (int) – The memory usage factor: an LSH Ensemble uses approximately m times more memory space than a MinHash LSH with the same number of sets indexed. The higher the m, the better the accuracy.
weights (Tuple[float, float]) – Used to adjust the relative importance of minimizing false positives and false negatives when optimizing for the Containment threshold. Similar to the weights parameter in datasketch.MinHashLSH.
storage_config (Optional[Dict]) – Type of storage service to use for storing hashtables and keys. basename is an optional property whose value will be used as the prefix to stored keys. If this is not set, a random string will be generated instead. If you set this, you will be responsible for ensuring there are no key collisions.
prepickle (Optional[bool]) – If True, all keys are pickled to bytes before insertion. If None, a default value is chosen based on the storage_config.
Note
Using more partitions (num_part) leads to better accuracy, at the expense of slower query performance. This is different from the paper and the Go implementation, in which more partitions leads to better accuracy AND faster query performance, due to parallelism.
Note
More information about the parameter m can be found in the Go implementation of LSH Ensemble, in which m is named MaxK.
- __init__(threshold: float = 0.9, num_perm: int = 128, num_part: int = 16, m: int = 8, weights: Tuple[float, float] = (0.5, 0.5), storage_config: Dict | None = None, prepickle: bool | None = None) None [source]
- index(entries: Iterable[Tuple[Hashable, MinHash, int]]) None [source]
Index all sets given their keys, MinHashes, and sizes. It can be called only once after the index is created.
- Parameters:
entries (Iterable[Tuple[Hashable, MinHash, int]]) – An iterable of tuples, each in the form of (key, minhash, size), where key is the unique identifier of a set, minhash is the MinHash of the set, and size is the size or number of unique items in the set.
- Raises:
ValueError – If the index is not empty or entries is empty.
- query(minhash: MinHash, size: int) Generator[Hashable, None, None] [source]
Given the MinHash and size of the query set, retrieve keys that reference sets whose containment with respect to the query set is greater than the threshold.
- Parameters:
minhash (MinHash) – The MinHash of the query set.
size (int) – The size (number of unique items) of the query set.
- Returns:
an iterator of keys.
- Return type:
Generator[Hashable, None, None]
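Example
A minimal sketch (illustrative sets) of indexing and querying for Containment:

from datasketch import MinHash, MinHashLSHEnsemble

set1 = [b"cat", b"dog"]
set2 = [b"cat", b"dog", b"fish", b"bird"]
m1, m2 = MinHash(num_perm=128), MinHash(num_perm=128)
m1.update_batch(set1)
m2.update_batch(set2)

lshensemble = MinHashLSHEnsemble(threshold=0.8, num_perm=128)
# index() can be called only once, with all entries.
lshensemble.index([("set1", m1, len(set1)), ("set2", m2, len(set2))])
# Retrieve keys of sets whose containment of the query set is above the threshold.
for key in lshensemble.query(m1, len(set1)):
    print(key)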
- __contains__(key: Hashable) bool [source]
- Parameters:
key (hashable) – The unique identifier of a set.
- Returns:
True only if the key exists in the index.
- Return type:
bool
- __weakref__
list of weak references to the object
HyperLogLog
- class datasketch.HyperLogLog(p: int = 8, reg: numpy.ndarray | None = None, hashfunc: Callable = sha1_hash32, hashobj: object | None = None)[source]
The HyperLogLog data sketch for estimating the cardinality of a very large dataset in a single pass. The original HyperLogLog is described here.
This HyperLogLog implementation is based on: https://github.com/svpcom/hyperloglog
- Parameters:
p (int) – The precision parameter. It is ignored if the reg is given.
reg (Optional[numpy.ndarray]) – The internal state. This argument is for initializing the HyperLogLog from an existing one.
hashfunc (Callable) – The hash function used by this HyperLogLog. It takes the input passed to the update() method and returns an integer that can be encoded with 32 bits. The default hash function is based on SHA1 from hashlib.
hashobj (deprecated) – This argument is deprecated since version 1.4.0. It is a no-op and has been replaced by hashfunc.
- __init__(p: int = 8, reg: numpy.ndarray | None = None, hashfunc: Callable = sha1_hash32, hashobj: object | None = None)[source]
- update(b) None [source]
Update the HyperLogLog with a new data value in bytes. The value will be hashed using the hash function specified by the hashfunc argument in the constructor.
- Parameters:
b – The value to be hashed using the hash function specified.
Example
To update with a new string value (using the default SHA1 hash function, which requires bytes as input):
hll = HyperLogLog()
hll.update("new value".encode('utf-8'))
We can also use a different hash function, for example, pyfarmhash:
import farmhash

def _hash_32(b):
    return farmhash.hash32(b)

hll = HyperLogLog(hashfunc=_hash_32)
hll.update("new value")
- count() float [source]
Estimate the cardinality of the data values seen so far.
- Returns:
The estimated cardinality.
- Return type:
float
- merge(other: HyperLogLog) None [source]
Merge the other HyperLogLog with this one, making this the union of the two.
- Parameters:
other (HyperLogLog) – The other HyperLogLog to be merged.
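Example
A minimal sketch (illustrative values) of merging two HyperLogLog counters:

from datasketch import HyperLogLog

hll1, hll2 = HyperLogLog(), HyperLogLog()
for d in [b"a", b"b"]:
    hll1.update(d)
for d in [b"b", b"c"]:
    hll2.update(d)
# hll1 now estimates the cardinality of the union {a, b, c}.
hll1.merge(hll2)
print(hll1.count())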
- copy() HyperLogLog [source]
Create a copy of the current HyperLogLog by exporting its state.
- Returns:
A copy of the current HyperLogLog.
- Return type:
HyperLogLog
- is_empty() bool [source]
- Returns:
True if the current HyperLogLog is empty, i.e., in the state of just being initialized.
- Return type:
bool
- __len__() int [source]
- Returns:
The size of the HyperLogLog, i.e., the size of reg.
- Return type:
int
- __eq__(other: HyperLogLog) bool [source]
Check equivalence between two HyperLogLogs
- Parameters:
other (HyperLogLog) –
- Returns:
True if both have the same internal state.
- Return type:
bool
- __hash__ = None
HyperLogLog++
- class datasketch.HyperLogLogPlusPlus(p: int = 8, reg: numpy.ndarray | None = None, hashfunc: Callable = sha1_hash64, hashobj: object | None = None)[source]
HyperLogLog++ is an enhanced HyperLogLog from Google. Main changes from the original HyperLogLog:
Use 64 bits instead of 32 bits for hash function
A new small-cardinality estimation scheme
Sparse representation (not implemented here)
- Parameters:
p (int) – The precision parameter. It is ignored if the reg is given.
reg (Optional[numpy.array]) – The internal state. This argument is for initializing the HyperLogLog from an existing one.
hashfunc (Callable) – The hash function used by this HyperLogLog++. It takes the input passed to the update() method and returns an integer that can be encoded with 64 bits. The default hash function is based on SHA1 from hashlib.
hashobj (deprecated) – This argument is deprecated since version 1.4.0. It is a no-op and has been replaced by hashfunc.
- __init__(p: int = 8, reg: numpy.ndarray | None = None, hashfunc: Callable = sha1_hash64, hashobj: object | None = None)[source]
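Example
A minimal sketch (illustrative values); the interface is the same as HyperLogLog:

from datasketch import HyperLogLogPlusPlus

hpp = HyperLogLogPlusPlus(p=12)
for d in [b"token1", b"token2", b"token1"]:
    hpp.update(d)
# Estimate the number of distinct values (here, 2).
print(hpp.count())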
- __weakref__
list of weak references to the object
HNSW
- class datasketch.HNSW(distance_func: Callable[[ndarray, ndarray], float], m: int = 16, ef_construction: int = 200, m0: int | None = None, seed: int | None = None, reversed_edges: bool = False)[source]
Hierarchical Navigable Small World (HNSW) graph index for approximate nearest neighbor search. This implementation is based on the paper “Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs” by Yu. A. Malkov, D. A. Yashunin (2016), https://arxiv.org/abs/1603.09320.
- Parameters:
distance_func – A function that takes two vectors and returns a float representing the distance between them.
m (int) – The number of neighbors to keep for each node.
ef_construction (int) – The number of neighbors to consider during construction.
m0 (Optional[int]) – The number of neighbors to keep for each node at the 0th level. If None, defaults to 2 * m.
seed (Optional[int]) – The random seed to use for the random number generator.
reversed_edges (bool) – Whether to maintain reverse edges in the graph. This speeds up hard remove (remove()) but increases memory usage and slows down insert().
Examples
Create an HNSW index with Euclidean distance and insert 1000 random vectors of dimension 10.
from datasketch.hnsw import HNSW
import numpy as np

data = np.random.random_sample((1000, 10))
index = HNSW(distance_func=lambda x, y: np.linalg.norm(x - y))
for i, d in enumerate(data):
    index.insert(i, d)

# Query the index for the 10 nearest neighbors of the first vector.
index.query(data[0], k=10)
Create an HNSW index with Jaccard distance and insert 1000 random sets of 10 elements each.
from datasketch.hnsw import HNSW
import numpy as np

# Each set is represented as a 10-element vector of random integers
# between 0 and 100.
# Deduplication is handled by the distance function.
data = np.random.randint(0, 100, size=(1000, 10))
jaccard_distance = lambda x, y: (
    1.0 - float(len(np.intersect1d(x, y, assume_unique=False)))
    / float(len(np.union1d(x, y)))
)
index = HNSW(distance_func=jaccard_distance)
for i, d in enumerate(data):
    index[i] = d

# Query the index for the 10 nearest neighbors of the first set.
index.query(data[0], k=10)
- __init__(distance_func: Callable[[ndarray, ndarray], float], m: int = 16, ef_construction: int = 200, m0: int | None = None, seed: int | None = None, reversed_edges: bool = False) None [source]
- __len__() int [source]
Return the number of points in the index excluding those that were soft-removed.
- __contains__(key: Hashable) bool [source]
Return True if the index contains the key and it was not soft-removed, else False.
- __getitem__(key: Hashable) ndarray [source]
Get the point associated with the key. Raises KeyError if the key does not exist in the index or it was soft-removed.
- __setitem__(key: Hashable, value: ndarray) None [source]
Set the point associated with the key and update the index. This is equivalent to calling insert() with the key and point.
- __delitem__(key: Hashable) None [source]
Soft remove the point associated with the key. Raises a KeyError if the key does not exist in the index. This is equivalent to calling remove() with the key.
- __iter__() Iterator[Hashable] [source]
Return an iterator over the keys of the index that were not soft-removed.
- reversed() Iterator[Hashable] [source]
Return a reverse iterator over the keys of the index that were not soft-removed.
- __eq__(_HNSW__value: object) bool [source]
Return True only if the index parameters, random states, keys, points, and index structures are equal, including deleted points.
- get(key: Hashable, default: ndarray | None = None) ndarray | None [source]
Return the point for key in the index, else default. If default is not given and key is not in the index or it was soft-removed, return None.
- items() Iterator[Tuple[Hashable, ndarray]] [source]
Return an iterator of the indexed points that were not soft-removed as (key, point) pairs.
- keys() Iterator[Hashable] [source]
Return an iterator of the keys of the index points that were not soft-removed.
- values() Iterator[ndarray] [source]
Return an iterator of the index points that were not soft-removed.
- pop(key: Hashable, default: ndarray | None = None, hard: bool = False) ndarray [source]
If key is in the index, remove it and return its associated point, else return default. If default is not given and key is not in the index or it was soft-removed, raise KeyError.
- popitem(last: bool = True, hard: bool = False) Tuple[Hashable, ndarray] [source]
Remove and return a (key, point) pair from the index. Pairs are returned in LIFO order if last is true or FIFO order if false. If the index is empty or all points are soft-removed, raise KeyError.
Note
In versions of Python before 3.7, the order of items in the index is not guaranteed. This method will remove and return an arbitrary (key, point) pair.
- clear() None [source]
Clear the index of all data points. This will not reset the random number generator.
- copy() HNSW [source]
Create a copy of the index. The copy will have the same parameters as the original index and the same keys and points, but will not share any index data structures (i.e., graphs) with the original index. The new index’s random state will start from a copy of the original index’s.
- update(other: Mapping | HNSW) None [source]
Update the index with the points from the other Mapping or HNSW object, overwriting existing keys.
- Parameters:
other (Union[Mapping, HNSW]) – The other Mapping or HNSW object.
Examples
Create an HNSW index with a dictionary of points.
from datasketch.hnsw import HNSW
import numpy as np

data = np.random.random_sample((1000, 10))
index = HNSW(distance_func=lambda x, y: np.linalg.norm(x - y))

# Batch insert 1000 points.
index.update({i: d for i, d in enumerate(data)})
Create an HNSW index with another HNSW index.
from datasketch.hnsw import HNSW
import numpy as np

data = np.random.random_sample((1000, 10))
index1 = HNSW(distance_func=lambda x, y: np.linalg.norm(x - y))
index2 = HNSW(distance_func=lambda x, y: np.linalg.norm(x - y))

# Batch insert 1000 points.
index1.update({i: d for i, d in enumerate(data)})

# Update index2 with the points from index1.
index2.update(index1)
- setdefault(key: Hashable, default: ndarray) ndarray [source]
If key is in the index and it was not soft-removed, return its associated point. If not, insert key with a value of default and return default. default cannot be None.
- insert(key: Hashable, new_point: ndarray, ef: int | None = None, level: int | None = None) None [source]
Add a new point to the index.
- Parameters:
key (Hashable) – The key of the new point. If the key already exists in the index, the point will be updated and the index will be repaired.
new_point (np.ndarray) – The new point to add to the index.
ef (Optional[int]) – The number of neighbors to consider during insertion. If None, use the construction ef.
level (Optional[int]) – The level at which to insert the new point. If None, the level will be chosen automatically.
- query(query_point: ndarray, k: int | None = None, ef: int | None = None) List[Tuple[Hashable, float]] [source]
Search for the k nearest neighbors of the query point.
- Parameters:
query_point (np.ndarray) – The query point.
k (Optional[int]) – The number of neighbors to return. If None, return all neighbors found.
ef (Optional[int]) – The number of neighbors to consider during search. If None, use the construction ef.
- Returns:
A list of (key, distance) pairs for the k nearest neighbors of the query point.
- Return type:
List[Tuple[Hashable, float]]
- Raises:
ValueError – If the entry point is not found.
- remove(key: Hashable, hard: bool = False, ef: int | None = None) None [source]
Remove a point from the index. This removal algorithm is based on the discussion on hnswlib issue #4. There are two versions:
soft remove: the point is marked as removed from the index, but its data and out-going edges are kept. Future queries will not return the point and no new edge will direct to this point, but the point will still be used for graph traversal. This is the default behavior.
hard remove: the point is removed from the index and its data and out-going edges are also removed. Points with out-going edges pointing to the deleted point will have their out-going edges re-assigned using the same pruning algorithm as insert() during point update.
In both versions, if the deleted point is the current entry point, the entry point will be re-assigned to the next point in the highest layer that has other points beside the current entry point.
Subsequent soft removes without a hard remove of the same point will not affect the index, unless the point was the only point in the index, as removing it clears the index. This is different from pop(), which will always raise a KeyError if the key was removed.
Subsequent hard removes of the same point will raise a KeyError. If the point is soft removed and then hard removed, the point will be removed from the index. Use clean() for removing all soft removed points from the index.
- Parameters:
key (Hashable) – The key of the point to remove.
hard (bool) – If True, perform a hard remove. Otherwise, perform a soft remove.
ef (Optional[int]) – The number of neighbors to consider during re-assignment. If None, use the construction ef. This argument is only used when hard is True.
- Raises:
KeyError – If the index is empty or the key does not exist in the index and was not soft removed.
Example
from datasketch.hnsw import HNSW
import numpy as np

data = np.random.random_sample((1000, 10))
index = HNSW(distance_func=lambda x, y: np.linalg.norm(x - y))
index.update({i: d for i, d in enumerate(data)})

# Soft remove a point with key = 0.
index.remove(0)

# Soft removing the same point again will not change the index
# because the index is not empty.
index.remove(0)

print(0 in index)  # False

# Hard remove the point.
index.remove(0, hard=True)

# Hard removing the same point again will raise a KeyError.
# index.remove(0, hard=True)

# Soft remove the rest of the points from the index.
for i in range(1, 1000):
    index.remove(i)

print(len(index))  # 0

# Clean the index to hard remove all soft removed points.
index.clean()
- clean(ef: int | None = None) None [source]
Remove all soft removed points from the index.
- Parameters:
ef (Optional[int]) – The number of neighbors to consider during re-assignment. If None, use the construction ef.
- merge(other: HNSW) HNSW [source]
Create a new index by merging the current index with another index. The new index will contain all points from both indexes. If a point exists in both, the point from the other index will be used. The new index will have the same parameters as the current index and a copy of the current index’s random state.
- Parameters:
other (HNSW) – The other index to merge with.
- Returns:
A new index containing all points from both indexes.
- Return type:
HNSW
Example
from datasketch.hnsw import HNSW
import numpy as np

data1 = np.random.random_sample((1000, 10))
data2 = np.random.random_sample((1000, 10))
index1 = HNSW(distance_func=lambda x, y: np.linalg.norm(x - y))
index2 = HNSW(distance_func=lambda x, y: np.linalg.norm(x - y))

# Batch insert data into the indexes.
index1.update({i: d for i, d in enumerate(data1)})
index2.update({i + len(data1): d for i, d in enumerate(data2)})

# Merge the indexes.
index = index1.merge(index2)
- __hash__ = None
- __weakref__
list of weak references to the object