API Documentation

MinHash

class datasketch.MinHash(num_perm=128, seed=1, hashfunc=<function sha1_hash32>, hashobj=None, hashvalues=None, permutations=None)[source]

MinHash is a probabilistic data structure for estimating the Jaccard similarity between sets.

Parameters
  • num_perm (int, optional) – Number of random permutation functions. It will be ignored if hashvalues is not None.

  • seed (int, optional) – The random seed controls the set of random permutation functions generated for this MinHash.

  • hashfunc (optional) – The hash function used by this MinHash. It takes the input passed to the update method and returns an integer that can be encoded with 32 bits. The default hash function is based on SHA1 from hashlib.

  • hashobj (deprecated) – This argument is deprecated since version 1.4.0. It is a no-op and has been replaced by hashfunc.

  • hashvalues (numpy.array or list, optional) – The hash values are the internal state of the MinHash. They can be specified for faster initialization using the existing state from another MinHash.

  • permutations (optional) – The permutation function parameters. This argument can be specified for faster initialization using the existing state from another MinHash.

Note

To save memory usage, consider using datasketch.LeanMinHash.

Note

Since version 1.1.1, MinHash will only support serialization using pickle. serialize and deserialize methods are removed, and are supported in datasketch.LeanMinHash instead. MinHash serialized before version 1.1.1 cannot be deserialized properly in newer versions (need to migrate?).

Note

Since version 1.1.3, MinHash uses Numpy’s random number generator instead of Python’s built-in random package. This change makes the hash values consistent across different Python versions. The side-effect is that now MinHash created before version 1.1.3 won’t work (i.e., jaccard, merge and union) with those created after.

__init__(num_perm=128, seed=1, hashfunc=<function sha1_hash32>, hashobj=None, hashvalues=None, permutations=None)[source]

Initialize self. See help(type(self)) for accurate signature.

update(b)[source]

Update this MinHash with a new value. The value will be hashed using the hash function specified by the hashfunc argument in the constructor.

Parameters

b – The value to be hashed using the hash function specified.

Example

To update with a new string value (using the default SHA1 hash function, which requires bytes as input):

minhash = MinHash()
minhash.update("new value".encode('utf-8'))

We can also use a different hash function, for example, pyfarmhash:

import farmhash
def _hash_32(b):
    return farmhash.hash32(b)
minhash = MinHash(hashfunc=_hash_32)
minhash.update("new value")
update_batch(b)[source]

Update this MinHash with new values. The values will be hashed using the hash function specified by the hashfunc argument in the constructor.

Parameters

b (list) – List of values to be hashed using the hash function specified.

Example

To update with new string values (using the default SHA1 hash function, which requires bytes as input):

minhash = MinHash()
minhash.update_batch([s.encode('utf-8') for s in ["token1", "token2"]])
jaccard(other)[source]

Estimate the Jaccard similarity (resemblance) between the sets represented by this MinHash and the other.

Parameters

other (datasketch.MinHash) – The other MinHash.

Returns

The Jaccard similarity, which is between 0.0 and 1.0.

Return type

float

count()[source]

Estimate the cardinality count based on the technique described in this paper.

Returns

The estimated cardinality of the set represented by this MinHash.

Return type

int

merge(other)[source]

Merge the other MinHash with this one, making this one the union of both.

Parameters

other (datasketch.MinHash) – The other MinHash.

digest()[source]

Export the hash values, which are the internal state of the MinHash.

Returns

The hash values as a Numpy array.

Return type

numpy.array

is_empty()[source]
Returns

True if the current MinHash is empty, i.e., at the just-initialized state.

Return type

bool

clear()[source]

Clear the current state of the MinHash. All hash values are reset.

copy()[source]
Returns

datasketch.MinHash – A copy of this MinHash by exporting its state.

__len__()[source]
Returns

int – The number of hash values.

__eq__(other)[source]
Returns

bool – True if the seeds and hash values of the two MinHash are equal.

classmethod union(*mhs)[source]

Create a MinHash which is the union of the MinHash objects passed as arguments.

Parameters

*mhs – The MinHash objects to be united. The argument list length is variable, but must be at least 2.

Returns

A new union MinHash.

Return type

datasketch.MinHash

__weakref__

list of weak references to the object (if defined)

classmethod bulk(b, **minhash_kwargs)[source]

Compute MinHashes in bulk. This method avoids unnecessary overhead when initializing many minhashes by reusing the initialized state.

Parameters
  • b (Iterable) – An Iterable of lists of bytes; each list is hashed into one MinHash in the output.

  • minhash_kwargs – Keyword arguments used to initialize MinHash, will be used for all minhashes.

Returns

A list of computed MinHashes.

Return type

List[datasketch.MinHash]

Example

from datasketch import MinHash
data = [[b'token1', b'token2', b'token3'],
        [b'token4', b'token5', b'token6']]
minhashes = MinHash.bulk(data, num_perm=64)
classmethod generator(b, **minhash_kwargs)[source]

Compute MinHashes in a generator. This method avoids unnecessary overhead when initializing many minhashes by reusing the initialized state.

Parameters
  • b (Iterable) – An Iterable of lists of bytes; each list is hashed into one MinHash in the output.

  • minhash_kwargs – Keyword arguments used to initialize MinHash, will be used for all minhashes.

Returns

A generator of computed MinHashes.

Example

from datasketch import MinHash
data = [[b'token1', b'token2', b'token3'],
        [b'token4', b'token5', b'token6']]
for minhash in MinHash.generator(data, num_perm=64):
    # do something useful
    minhash

Lean MinHash

class datasketch.LeanMinHash(minhash=None, seed=None, hashvalues=None)[source]

Lean MinHash is MinHash with a smaller memory footprint and faster deserialization, but with its internal state frozen – no update().

Lean MinHash inherits all methods from datasketch.MinHash. It does not store the permutations and the hashfunc needed for updating. If a MinHash does not need further updates, convert it into a lean MinHash to save memory.

Example

To create a lean MinHash from an existing MinHash:

lean_minhash = LeanMinHash(minhash)

# You can compute the Jaccard similarity between two lean MinHash
lean_minhash.jaccard(lean_minhash2)

# Or between a lean MinHash and a MinHash
lean_minhash.jaccard(minhash2)

To create a lean MinHash from the hash values and seed of an existing MinHash:

lean_minhash = LeanMinHash(seed=minhash.seed,
                           hashvalues=minhash.hashvalues)

To create a MinHash from a lean MinHash:

minhash = MinHash(seed=lean_minhash.seed,
                  hashvalues=lean_minhash.hashvalues)

# Or if you want to prevent further updates on minhash
# from affecting the state of lean_minhash
minhash = MinHash(seed=lean_minhash.seed,
                  hashvalues=lean_minhash.digest())
Parameters
  • minhash (optional) – The datasketch.MinHash object used to initialize the LeanMinHash. If this is not set, then seed and hashvalues must be set.

  • seed (optional) – The random seed that controls the set of random permutation functions generated for this LeanMinHash. This parameter must be used together with hashvalues.

  • hashvalues (optional) – The hash values used to initialize the state of the LeanMinHash. This parameter must be used together with seed.

__init__(minhash=None, seed=None, hashvalues=None)[source]

Initialize self. See help(type(self)) for accurate signature.

update(b)[source]

This method is not available on a LeanMinHash. Calling it raises a TypeError.

copy()[source]
Returns

datasketch.LeanMinHash – A copy of this LeanMinHash by exporting its state.

bytesize(byteorder='@')[source]

Compute the byte size after serialization.

Parameters

byteorder (str, optional) – This is the byte order of the serialized data. Use one of the byte order characters: @, =, <, >, and !. Default is @ – the native order.

Returns

Size in number of bytes after serialization.

Return type

int

serialize(buf, byteorder='@')[source]

Serialize this lean MinHash and store the result in an allocated buffer.

Parameters
  • buf (buffer) – buf must implement the buffer interface. One such example is the built-in bytearray class.

  • byteorder (str, optional) –

This is the byte order of the serialized data. Use one of the byte order characters: @, =, <, >, and !. Default is @ – the native order.

This is preferred over using pickle if the serialized lean MinHash needs to be used by another program in a different programming language.

The serialization schema:
  1. The first 8 bytes is the seed integer

  2. The next 4 bytes is the number of hash values

  3. The rest is the serialized hash values, each uses 4 bytes

Example

To serialize a single lean MinHash into a bytearray buffer.

buf = bytearray(lean_minhash.bytesize())
lean_minhash.serialize(buf)

To serialize multiple lean MinHash into a bytearray buffer:

# assuming lean_minhashs is a list of LeanMinHash with the same size
size = lean_minhashs[0].bytesize()
buf = bytearray(size*len(lean_minhashs))
for i, lean_minhash in enumerate(lean_minhashs):
    # slicing a bytearray copies it; use a memoryview to write into buf in place
    lean_minhash.serialize(memoryview(buf)[i*size:])
classmethod deserialize(buf, byteorder='@')[source]

Deserialize a lean MinHash from a buffer.

Parameters
  • buf (buffer) – buf must implement the buffer interface. One such example is the built-in bytearray class.

  • byteorder (str, optional) –

    This is the byte order of the serialized data. Use one of the byte order characters: @, =, <, >, and !. Default is @ – the native order.

Returns

The deserialized lean MinHash

Return type

datasketch.LeanMinHash

Example

To deserialize a lean MinHash from a buffer.

lean_minhash = LeanMinHash.deserialize(buf)
__hash__()[source]

Return hash(self).

classmethod union(*lmhs)[source]

Create a LeanMinHash which is the union of the LeanMinHash objects passed as arguments.

Parameters

*lmhs – The LeanMinHash objects to be united. The argument list length is variable, but must be at least 2.

Returns

A new union LeanMinHash.

Return type

datasketch.LeanMinHash

Weighted MinHash

class datasketch.WeightedMinHashGenerator(dim, sample_size=128, seed=1)[source]

The weighted MinHash generator is used for creating new datasketch.WeightedMinHash objects.

This weighted MinHash implementation is based on Sergey Ioffe’s paper, Improved Consistent Sampling, Weighted Minhash and L1 Sketching.

Parameters
  • dim (int) – The number of dimensions of the input Jaccard vectors.

  • sample_size (int, optional) – The number of samples to use for creating weighted MinHash.

  • seed (int) – The random seed to use for generating permutation functions.

__init__(dim, sample_size=128, seed=1)[source]

Initialize self. See help(type(self)) for accurate signature.

minhash(v)[source]

Create a new weighted MinHash given a weighted Jaccard vector. Each dimension is an integer frequency of the corresponding element in the multi-set represented by the vector.

Parameters

v (numpy.array) – The Jaccard vector.

__weakref__

list of weak references to the object (if defined)

class datasketch.WeightedMinHash(seed, hashvalues)[source]

A new weighted MinHash is generated by datasketch.WeightedMinHashGenerator. You can also initialize a weighted MinHash using the state from an existing one.

Parameters
  • seed (int) – The random seed used to generate this weighted MinHash.

  • hashvalues – The internal state of this weighted MinHash.

__init__(seed, hashvalues)[source]

Initialize self. See help(type(self)) for accurate signature.

jaccard(other)[source]

Estimate the weighted Jaccard similarity between the multi-sets represented by this weighted MinHash and the other.

Parameters

other (datasketch.WeightedMinHash) – The other weighted MinHash.

Returns

The weighted Jaccard similarity between 0.0 and 1.0.

Return type

float

digest()[source]

Export the hash values, which is the internal state of the weighted MinHash.

Returns

The hash values which is a Numpy array.

Return type

numpy.array

copy()[source]
Returns

A copy of this weighted MinHash by exporting its state.

Return type

datasketch.WeightedMinHash

__len__()[source]
Returns

The number of hash values.

Return type

int

__eq__(other)[source]
Returns

True if the seeds and hash values of the two weighted MinHash are equal.

Return type

bool

__weakref__

list of weak references to the object (if defined)

MinHash LSH

class datasketch.MinHashLSH(threshold=0.9, num_perm=128, weights=(0.5, 0.5), params=None, storage_config=None, prepickle=None, hashfunc=None)[source]

The MinHash LSH index. It supports query with Jaccard similarity threshold. Reference: Chapter 3, Mining of Massive Datasets.

Parameters
  • threshold (float) – The Jaccard similarity threshold between 0.0 and 1.0. The initialized MinHash LSH will be optimized for the threshold by minimizing the false positive and false negative.

  • num_perm (int, optional) – The number of permutation functions used by the MinHash to be indexed. For weighted MinHash, this is the sample size (sample_size).

  • weights (tuple, optional) – Used to adjust the relative importance of minimizing false positive and false negative when optimizing for the Jaccard similarity threshold. weights is a tuple in the format of (false_positive_weight, false_negative_weight).

  • params (tuple, optional) – The LSH parameters (i.e., number of bands and size of each bands). This is used to bypass the parameter optimization step in the constructor. threshold and weights will be ignored if this is given.

  • storage_config (dict, optional) – Type of storage service to use for storing hashtables and keys. basename is an optional property whose value will be used as the prefix to stored keys. If this is not set, a random string will be generated instead. If you set this, you will be responsible for ensuring there are no key collisions.

  • prepickle (bool, optional) – If True, all keys are pickled to bytes before insertion. If None, a default value is chosen based on the storage_config.

  • hashfunc (function, optional) – If a hash function is provided it will be used to compress the index keys to reduce the memory footprint. This could cause a higher false positive rate.

Note

weights must sum to 1.0, and the format is (false positive weight, false negative weight). For example, if minimizing false negatives (i.e., maintaining high recall) is more important, assign more weight to the false negative side: weights=(0.4, 0.6). Keep the difference between the two weights small (i.e., < 0.5).

__init__(threshold=0.9, num_perm=128, weights=(0.5, 0.5), params=None, storage_config=None, prepickle=None, hashfunc=None)[source]

Initialize self. See help(type(self)) for accurate signature.

insert(key, minhash, check_duplication=True)[source]

Insert a key to the index, together with a MinHash (or weighted MinHash) of the set referenced by the key.

Parameters
  • key (str) – The identifier of the set.

  • minhash (datasketch.MinHash) – The MinHash of the set.

  • check_duplication (bool) – To avoid duplicate keys in the storage (default=True). It’s recommended to not change the default, but if you want to avoid the overhead during insert you can set check_duplication = False.

insertion_session(buffer_size=50000)[source]

Create a context manager for fast insertion into this index.

Parameters

buffer_size (int) – The buffer size for insert_session mode (default=50000).

Returns

datasketch.lsh.MinHashLSHInsertionSession

query(minhash)[source]

Given the MinHash of the query set, retrieve the keys that reference sets with Jaccard similarities greater than the threshold.

Parameters

minhash (datasketch.MinHash) – The MinHash of the query set.

Returns

list of unique keys.

__contains__(key)[source]
Parameters

key (hashable) – The unique identifier of a set.

Returns

True only if the key exists in the index.

Return type

bool

remove(key)[source]

Remove the key from the index.

Parameters

key (hashable) – The unique identifier of a set.

is_empty()[source]
Returns

True if the index is empty.

Return type

bool

get_counts()[source]

Returns a list of length self.b with elements representing the number of keys stored under each bucket for the given permutation.

get_subset_counts(*keys)[source]

Returns the bucket allocation counts (see get_counts() above) restricted to the list of keys given.

Parameters

keys (hashable) – the keys for which to get the bucket allocation counts

__weakref__

list of weak references to the object (if defined)

Asynchronous MinHash LSH

class datasketch.experimental.aio.lsh.AsyncMinHashLSH(threshold: float = 0.9, num_perm: int = 128, weights: Tuple[float, float] = (0.5, 0.5), params: Optional[Tuple[int, int]] = None, storage_config: Optional[Dict] = None)[source]

Asynchronous MinHashLSH index.

For example usage see Asynchronous MinHash LSH at scale.

Example of supported storage configuration:

MONGO = {'type': 'aiomongo', 'basename': 'base_name_1', 'mongo': {'host': 'localhost', 'port': 27017}}

__init__(threshold: float = 0.9, num_perm: int = 128, weights: Tuple[float, float] = (0.5, 0.5), params: Optional[Tuple[int, int]] = None, storage_config: Optional[Dict] = None)[source]

Initialize self. See help(type(self)) for accurate signature.

async close()[source]

Cleanup client resources and disconnect from AsyncMinHashLSH storage.

async insert(key, minhash, check_duplication=True)[source]

see datasketch.MinHashLSH.

insertion_session(batch_size=10000)[source]

Create an asynchronous context manager for fast insertion into this index.

Parameters

batch_size (int) – the size of chunks to use in insert_session mode (default=10000).

Returns

datasketch.experimental.aio.lsh.AsyncMinHashLSHSession

Example

import asyncio
import random
import string
from itertools import chain, islice

from datasketch.experimental.aio.lsh import AsyncMinHashLSH
from datasketch import MinHash

def chunk(it, size):
    it = iter(it)
    return iter(lambda: tuple(islice(it, size)), ())

_chunked_str = chunk((random.choice(string.ascii_lowercase) for _ in range(10000)), 4)
seq = frozenset(chain((''.join(s) for s in _chunked_str), ('aahhb', 'aahh', 'aahhc', 'aac', 'kld', 'bhg', 'kkd', 'yow', 'ppi', 'eer')))
objs = [MinHash(16) for _ in range(len(seq))]
for e, obj in zip(seq, objs):
    for i in e:
        obj.update(i.encode('utf-8'))
data = [(e, m) for e, m in zip(seq, objs)]

_storage_config_mongo = {'type': 'aiomongo', 'mongo': {'host': 'localhost', 'port': 27017}}
async def func():
    async with AsyncMinHashLSH(storage_config=_storage_config_mongo, threshold=0.5, num_perm=16) as lsh:
        async with lsh.insertion_session(batch_size=1000) as session:
            fs = (session.insert(key, minhash, check_duplication=True) for key, minhash in data)
            await asyncio.gather(*fs)
delete_session(batch_size=10000)[source]

Create an asynchronous context manager for fast removal of keys from the index.

Parameters

batch_size (int) – the size of chunks to use in delete session mode (default=10000).

Returns

datasketch.experimental.aio.lsh.AsyncMinHashLSHSession

Example

import asyncio
import random
import string
from itertools import chain, islice

from datasketch.experimental.aio.lsh import AsyncMinHashLSH
from datasketch import MinHash

def chunk(it, size):
    it = iter(it)
    return iter(lambda: tuple(islice(it, size)), ())

_chunked_str = chunk((random.choice(string.ascii_lowercase) for _ in range(10000)), 4)
seq = frozenset(chain((''.join(s) for s in _chunked_str), ('aahhb', 'aahh', 'aahhc', 'aac', 'kld', 'bhg', 'kkd', 'yow', 'ppi', 'eer')))
objs = [MinHash(16) for _ in range(len(seq))]
for e, obj in zip(seq, objs):
    for i in e:
        obj.update(i.encode('utf-8'))
data = [(e, m) for e, m in zip(seq, objs)]

_storage_config_mongo = {'type': 'aiomongo', 'mongo': {'host': 'localhost', 'port': 27017}}
async def func():
    async with AsyncMinHashLSH(storage_config=_storage_config_mongo, threshold=0.5, num_perm=16) as lsh:
        async with lsh.insertion_session(batch_size=1000) as session:
            fs = (session.insert(key, minhash, check_duplication=True) for key, minhash in data)
            await asyncio.gather(*fs)

        async with lsh.delete_session(batch_size=3) as session:
            keys_to_remove = [key for key, _ in data[:5]]  # e.g., drop the first five inserted keys
            fs = (session.remove(key) for key in keys_to_remove)
            await asyncio.gather(*fs)
async query(minhash)[source]

see datasketch.MinHashLSH.

async has_key(key)[source]

see datasketch.MinHashLSH.

async remove(key)[source]

see datasketch.MinHashLSH.

async is_empty()[source]

see datasketch.MinHashLSH.

async get_counts()[source]

see datasketch.MinHashLSH.

async get_subset_counts(*keys)[source]

see datasketch.MinHashLSH.

__weakref__

list of weak references to the object (if defined)

MinHash LSH Forest

class datasketch.MinHashLSHForest(num_perm=128, l=8)[source]

The LSH Forest for MinHash. It supports top-k queries in Jaccard similarity. Instead of using prefix trees as in the original paper, this implementation uses a sorted array to store the hash values in every hash table.

Parameters
  • num_perm (int, optional) – The number of permutation functions used by the MinHash to be indexed. For weighted MinHash, this is the sample size (sample_size).

  • l (int, optional) – The number of prefix trees as described in the paper.

Note

The MinHash LSH Forest also works with weighted Jaccard similarity and weighted MinHash without modification.

__init__(num_perm=128, l=8)[source]

Initialize self. See help(type(self)) for accurate signature.

add(key, minhash)[source]

Add a unique key, together with a MinHash (or weighted MinHash) of the set referenced by the key.

Note

The key won’t be searchable until the datasketch.MinHashLSHForest.index() method is called.

Parameters
  • key (hashable) – The unique identifier of the set.

  • minhash (datasketch.MinHash) – The MinHash of the set.

index()[source]

Index all the keys added so far and make them searchable.

query(minhash, k)[source]

Return the approximate top-k keys that have the (approximately) highest Jaccard similarities to the query set.

Parameters
  • minhash (datasketch.MinHash) – The MinHash of the query set.

  • k (int) – The maximum number of keys to return.

Returns

list of at most k keys.

Note

Tip for improving accuracy: you can use a multiple of k (e.g., 2*k) in the argument, compute the exact (or approximate, using MinHash) Jaccard similarities of the sets referenced by the returned keys, and then take the final top-k from those. This is often called “post-processing”. Because the total number of similarity computations is still bounded by a constant multiple of k, the performance won’t degrade too much – however, you do have to keep the original sets (or MinHashes) around somewhere so that you can make references to them.

is_empty()[source]

Check whether there are any searchable keys in the index. Note that keys won’t be searchable until index is called.

Returns

True if there is no searchable key in the index.

Return type

bool

__contains__(key)[source]
Returns

True only if the key has been added to the index.

Return type

bool

__weakref__

list of weak references to the object (if defined)

MinHash LSH Ensemble

class datasketch.MinHashLSHEnsemble(threshold=0.9, num_perm=128, num_part=16, m=8, weights=(0.5, 0.5), storage_config=None, prepickle=None)[source]

The MinHash LSH Ensemble index. It supports Containment queries. The implementation is based on E. Zhu et al.

Parameters
  • threshold (float) – The Containment threshold between 0.0 and 1.0. The initialized LSH Ensemble will be optimized for the threshold by minimizing the false positive and false negative.

  • num_perm (int, optional) – The number of permutation functions used by the MinHash to be indexed. For weighted MinHash, this is the sample size (sample_size).

  • num_part (int, optional) – The number of partitions in LSH Ensemble.

  • m (int, optional) – The memory usage factor: an LSH Ensemble uses approximately m times more memory space than a MinHash LSH with the same number of sets indexed. The higher the m the better the accuracy.

  • weights (tuple, optional) – Used to adjust the relative importance of minimizing false positive and false negative when optimizing for the Containment threshold. Similar to the weights parameter in datasketch.MinHashLSH.

  • storage_config (dict, optional) – Type of storage service to use for storing hashtables and keys. basename is an optional property whose value will be used as the prefix to stored keys. If this is not set, a random string will be generated instead. If you set this, you will be responsible for ensuring there are no key collisions.

  • prepickle (bool, optional) – If True, all keys are pickled to bytes before insertion. If None, a default value is chosen based on the storage_config.

Note

Using more partitions (num_part) leads to better accuracy, at the expense of slower query performance. This is different from the paper and the Go implementation, in which more partitions leads to better accuracy AND faster query performance, due to parallelism.

Note

More information about the parameter m can be found in the Go implementation of LSH Ensemble, in which m is named MaxK.

__init__(threshold=0.9, num_perm=128, num_part=16, m=8, weights=(0.5, 0.5), storage_config=None, prepickle=None)[source]

Initialize self. See help(type(self)) for accurate signature.

index(entries)[source]

Index all sets given their keys, MinHashes, and sizes. It can be called only once after the index is created.

Parameters

entries (iterable of tuple) – An iterable of tuples, each must be in the form of (key, minhash, size), where key is the unique identifier of a set, minhash is the MinHash of the set, and size is the size or number of unique items in the set.

Note

size must be positive.

query(minhash, size)[source]

Given the MinHash and size of the query set, retrieve keys that reference sets whose containment with respect to the query set is greater than the threshold.

Parameters
  • minhash (datasketch.MinHash) – The MinHash of the query set.

  • size (int) – The size (number of unique items) of the query set.

Returns

iterator of keys.

__contains__(key)[source]
Parameters

key (hashable) – The unique identifier of a set.

Returns

True only if the key exists in the index.

Return type

bool

is_empty()[source]
Returns

True if the index is empty.

Return type

bool

__weakref__

list of weak references to the object (if defined)

HyperLogLog

class datasketch.HyperLogLog(p=8, reg=None, hashfunc=<function sha1_hash32>, hashobj=None)[source]

The HyperLogLog data sketch for estimating the cardinality of a very large dataset in a single pass. The original HyperLogLog is described here.

This HyperLogLog implementation is based on: https://github.com/svpcom/hyperloglog

Parameters
  • p (int, optional) – The precision parameter. It is ignored if the reg is given.

  • reg (numpy.array, optional) – The internal state. This argument is for initializing the HyperLogLog from an existing one.

  • hashfunc (optional) – The hash function used by this HyperLogLog. It takes the input passed to the update method and returns an integer that can be encoded with 32 bits. The default hash function is based on SHA1 from hashlib.

  • hashobj (deprecated) – This argument is deprecated since version 1.4.0. It is a no-op and has been replaced by hashfunc.

__init__(p=8, reg=None, hashfunc=<function sha1_hash32>, hashobj=None)[source]

Initialize self. See help(type(self)) for accurate signature.

update(b)[source]

Update the HyperLogLog with a new data value in bytes. The value will be hashed using the hash function specified by the hashfunc argument in the constructor.

Parameters

b – The value to be hashed using the hash function specified.

Example

To update with a new string value (using the default SHA1 hash function, which requires bytes as input):

hll = HyperLogLog()
hll.update("new value".encode('utf-8'))

We can also use a different hash function, for example, pyfarmhash:

import farmhash
def _hash_32(b):
    return farmhash.hash32(b)
hll = HyperLogLog(hashfunc=_hash_32)
hll.update("new value")
count()[source]

Estimate the cardinality of the data values seen so far.

Returns

The estimated cardinality.

Return type

int

merge(other)[source]

Merge the other HyperLogLog with this one, making this the union of the two.

Parameters

other (datasketch.HyperLogLog) –

digest()[source]
Returns

The current internal state.

Return type

numpy.array

copy()[source]

Create a copy of the current HyperLogLog by exporting its state.

Returns

A copy of this HyperLogLog.

Return type

datasketch.HyperLogLog

is_empty()[source]
Returns

True if the current HyperLogLog is empty, i.e., at the just-initialized state.

Return type

bool

clear()[source]

Reset the current HyperLogLog to empty.

__len__()[source]
Returns

The size of the HyperLogLog, i.e., the number of registers (the size of reg).

Return type

int

__eq__(other)[source]

Check equivalence between two HyperLogLogs

Parameters

other (datasketch.HyperLogLog) –

Returns

True if both have the same internal state.

Return type

bool

HyperLogLog++

class datasketch.HyperLogLogPlusPlus(p=8, reg=None, hashfunc=<function sha1_hash64>, hashobj=None)[source]

HyperLogLog++ is an enhanced HyperLogLog from Google. Main changes from the original HyperLogLog:

  1. A 64-bit hash function instead of a 32-bit one

  2. A new small-cardinality estimation scheme

  3. Sparse representation (not implemented here)

Parameters
  • p (int, optional) – The precision parameter. It is ignored if the reg is given.

  • reg (numpy.array, optional) – The internal state. This argument is for initializing the HyperLogLog from an existing one.

  • hashfunc (optional) – The hash function used by this HyperLogLog++. It takes the input passed to the update method and returns an integer that can be encoded with 64 bits. The default hash function is based on SHA1 from hashlib.

  • hashobj (deprecated) – This argument is deprecated since version 1.4.0. It is a no-op and has been replaced by hashfunc.

__init__(p=8, reg=None, hashfunc=<function sha1_hash64>, hashobj=None)[source]

Initialize self. See help(type(self)) for accurate signature.

__weakref__

list of weak references to the object (if defined)

count()[source]

Estimate the cardinality of the data values seen so far.

Returns

The estimated cardinality.

Return type

int