HyperLogLog is capable of estimating the cardinality (the number of distinct values) of dataset in a single pass, using a small and fixed memory space. HyperLogLog is first introduced in this paper by Philippe Flajolet, Éric Fusy, Olivier Gandouet and Frédéric Meunier.
from datasketch import HyperLogLog data1 = ['hyperloglog', 'is', 'a', 'probabilistic', 'data', 'structure', 'for', 'estimating', 'the', 'cardinality', 'of', 'dataset', 'dataset', 'a'] h = HyperLogLog() for d in data1: h.update(d.encode('utf8')) print("Estimated cardinality is", h.count()) s1 = set(data1) print("Actual cardinality is", len(s1))
As in MinHash, you can also control the accuracy of HyperLogLog by changing the parameter p.
# This will give better accuracy than the default setting (8). h = HyperLogLog(p=12)
Interestingly, there is no speed penalty for using higher p value. However the memory usage is exponential to the p value.
As in MinHash, you can also merge two HyperLogLogs to create a union HyperLogLog.
h1 = HyperLogLog() h2 = HyperLogLog() h1.update('test'.encode('utf8')) # The makes h1 the union of h2 and the original h1. h1.merge(h2) # This will return the cardinality of the union h1.count()
HyperLogLog++ is an enhanced version of HyperLogLog by Google with the following changes: * Use 64-bit hash values instead of the 32-bit used by HyperLogLog * A more stable bias correction scheme based on experiments on many datasets * Sparse representation (not implemented here)
HyperLogLog++ object shares the same interface as HyperLogLog. So you can use all the HyperLogLog functions in HyperLogLog++.
from datasketch import HyperLogLogPlusPlus # Initialize an HyperLogLog++ object. hpp = HyperLogLogPlusPlus() # Everything else is the same as HyperLogLog