.. _hyperloglog: HyperLogLog =========== HyperLogLog is capable of estimating the cardinality (the number of distinct values) of dataset in a single pass, using a small and fixed memory space. HyperLogLog is first introduced in this `paper `__ by Philippe Flajolet, Éric Fusy, Olivier Gandouet and Frédéric Meunier. .. code:: python from datasketch import HyperLogLog data1 = ['hyperloglog', 'is', 'a', 'probabilistic', 'data', 'structure', 'for', 'estimating', 'the', 'cardinality', 'of', 'dataset', 'dataset', 'a'] h = HyperLogLog() for d in data1: h.update(d.encode('utf8')) print("Estimated cardinality is", h.count()) s1 = set(data1) print("Actual cardinality is", len(s1)) As in MinHash, you can also control the accuracy of HyperLogLog by changing the parameter p. .. code:: python # This will give better accuracy than the default setting (8). h = HyperLogLog(p=12) Interestingly, there is no speed penalty for using higher p value. However the memory usage is exponential to the p value. .. figure:: /_static/hyperloglog_benchmark.png :alt: HyperLogLog Benchmark As in MinHash, you can also merge two HyperLogLogs to create a union HyperLogLog. .. code:: python h1 = HyperLogLog() h2 = HyperLogLog() h1.update('test'.encode('utf8')) # The makes h1 the union of h2 and the original h1. h1.merge(h2) # This will return the cardinality of the union h1.count() .. _hyperloglog_plusplus: HyperLogLog++ ============= `HyperLogLog++ `__ is an enhanced version of HyperLogLog by Google with the following changes: \* Use 64-bit hash values instead of the 32-bit used by HyperLogLog \* A more stable bias correction scheme based on experiments on many datasets \* Sparse representation (not implemented here) HyperLogLog++ object shares the same interface as HyperLogLog. So you can use all the HyperLogLog functions in HyperLogLog++. .. code:: python from datasketch import HyperLogLogPlusPlus # Initialize an HyperLogLog++ object. hpp = HyperLogLogPlusPlus() # Everything else is the same as HyperLogLog