feat: Optimize approx_count_distinct and set up benchmarking #3162

Graphcalibur · 2022-06-13T06:48:28Z

Some ideas for optimizing approx_count_distinct were discussed in #3121:

Adding a sparse/dense transition. The HyperLogLog implementation in Redis uses a sparse representation for low cardinalities to save memory and only switches to the dense representation for larger cardinalities. Currently, approx_count_distinct only uses the dense representation.
Changing the max counts for the stream implementation of approx_count_distinct. Currently, each bucket in RegisterBucket can count at most 2^32, 2^16, 2^8, or 1 hashes with a given number of trailing zeroes, with the limit going down as the number of trailing zeroes increases. Perhaps it would be better to change the max count for all the buckets, though at the risk of increased memory usage?
Compressing the memory usage of the stream implementation of approx_count_distinct. Currently, a u32 is used to store the count for each of the hashes with 1 to 16 trailing zeroes. However, since the probability of a hash going into a given bucket decreases as the number of trailing zeroes increases, an array of u64s could perhaps be used to store the counts instead: the first 32 bits count the hashes with 1 trailing zero, the next 31 bits count the hashes with 2 trailing zeroes, and so on. Note that this approach conflicts with the second one.
Better handling of bucket overflow. Currently, the buckets throw an error when an overflow occurs. Maybe there's a better way to handle this?
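
To make the packed-counter idea and one possible overflow policy concrete, here is a minimal sketch. It is not the existing RegisterBucket code: the constants, the function names (`width`, `offset`, `get`, `set`, `increment_saturating`), and the choice of 16 counters with widths running from 32 bits down to 17 bits are all assumptions made for illustration. Saturating at the maximum instead of returning an error is only one candidate answer to the overflow question.

```rust
/// Hypothetical layout: counter `i` (0-based, i.e. `i + 1` trailing zeroes)
/// is `FIRST_WIDTH - i` bits wide, packed back to back into u64 words.
const FIRST_WIDTH: u32 = 32;
const NUM_COUNTERS: usize = 16;

/// Bit width of counter `i`.
fn width(i: usize) -> u32 {
    FIRST_WIDTH - i as u32
}

/// Bit offset of counter `i` from the start of the packed array.
fn offset(i: usize) -> u32 {
    (0..i).map(width).sum()
}

/// Number of u64 words needed to hold all counters.
fn words_needed() -> usize {
    (offset(NUM_COUNTERS) as usize + 63) / 64
}

/// Read counter `i`, which may straddle two adjacent words.
fn get(words: &[u64], i: usize) -> u64 {
    let (off, w) = (offset(i), width(i));
    let (word, bit) = ((off / 64) as usize, off % 64);
    let mask = (1u64 << w) - 1;
    let mut v = words[word] >> bit;
    if bit + w > 64 {
        // The high part of the counter lives in the next word.
        v |= words[word + 1] << (64 - bit);
    }
    v & mask
}

/// Write counter `i` without disturbing its neighbours.
fn set(words: &mut [u64], i: usize, value: u64) {
    let (off, w) = (offset(i), width(i));
    let (word, bit) = ((off / 64) as usize, off % 64);
    let mask = (1u64 << w) - 1;
    let value = value & mask;
    words[word] = (words[word] & !(mask << bit)) | (value << bit);
    if bit + w > 64 {
        let spill = 64 - bit; // number of bits that fit in the first word
        words[word + 1] = (words[word + 1] & !(mask >> spill)) | (value >> spill);
    }
}

/// Increment counter `i`, saturating at its maximum instead of erroring.
/// Returns `false` if the counter was already full.
fn increment_saturating(words: &mut [u64], i: usize) -> bool {
    let max = (1u64 << width(i)) - 1;
    let cur = get(words, i);
    if cur == max {
        return false;
    }
    set(words, i, cur + 1);
    true
}
```

Under these assumptions the 16 counters take 392 bits, so `words_needed()` is 7 u64 words, compared with 8 words for sixteen u32 counters. The saving is modest, and every access pays for the shifts and masks, which is one of the trade-offs the benchmarks would need to quantify.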

Benchmarks should also be set up to determine how each configuration affects the estimation error and performance of the algorithm.
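
As a rough starting point for the accuracy side of such a benchmark, the sketch below measures the relative estimation error of a textbook, insert-only dense HyperLogLog. It is not the approx_count_distinct implementation: `DenseHll`, `relative_error`, the use of `DefaultHasher`, and the precision parameter `p` are all stand-ins chosen for illustration, and the real benchmark would swap in the actual aggregator under each configuration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Textbook dense HyperLogLog with 2^p registers (illustrative stand-in).
struct DenseHll {
    p: u32,
    registers: Vec<u8>,
}

impl DenseHll {
    fn new(p: u32) -> Self {
        // The alpha approximation used in `estimate` assumes m >= 128.
        assert!((7..=16).contains(&p));
        Self { p, registers: vec![0; 1 << p] }
    }

    fn insert<T: Hash>(&mut self, item: &T) {
        let mut hasher = DefaultHasher::new();
        item.hash(&mut hasher);
        let h = hasher.finish();
        // Low p bits pick the register; the rest supply the trailing zeroes.
        let idx = (h & ((1u64 << self.p) - 1)) as usize;
        let rest = h >> self.p;
        let rank = (rest.trailing_zeros().min(64 - self.p) + 1) as u8;
        if rank > self.registers[idx] {
            self.registers[idx] = rank;
        }
    }

    fn estimate(&self) -> f64 {
        let m = self.registers.len() as f64;
        let alpha = 0.7213 / (1.0 + 1.079 / m);
        let sum: f64 = self.registers.iter().map(|&r| 2f64.powi(-(r as i32))).sum();
        let raw = alpha * m * m / sum;
        let zeros = self.registers.iter().filter(|&&r| r == 0).count();
        if raw <= 2.5 * m && zeros > 0 {
            // Small-range correction: fall back to linear counting.
            m * (m / zeros as f64).ln()
        } else {
            raw
        }
    }
}

/// Relative error |estimate - n| / n after inserting the integers 0..n.
fn relative_error(p: u32, n: u64) -> f64 {
    let mut hll = DenseHll::new(p);
    for i in 0..n {
        hll.insert(&i);
    }
    (hll.estimate() - n as f64).abs() / n as f64
}
```

Calling `relative_error` for a grid of precisions and cardinalities gives the error side of the comparison. Wrapping the insert loop with `std::time::Instant` or a benchmarking harness would give the performance side.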

Graphcalibur added the type/enhancement Improvements to existing implementation. label Jun 13, 2022

jon-chuang mentioned this issue Jun 18, 2022

feat(executor): streaming hyperloglog improvements #3315

Merged

neverchanje added the no-issue-activity label Aug 14, 2022