Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Optimize approx_count_distinct and set up benchmarking #3162

Open
Graphcalibur opened this issue Jun 13, 2022 · 0 comments
Open

feat: Optimize approx_count_distinct and set up benchmarking #3162

Graphcalibur opened this issue Jun 13, 2022 · 0 comments
Labels
no-issue-activity type/enhancement Improvements to existing implementation.

Comments

@Graphcalibur
Copy link
Contributor

Graphcalibur commented Jun 13, 2022

Some ideas for optimizing approx_count_distinct were discussed in #3121 :

  1. Adding sparse/dense transition. The HyperLogLog implementation in Redis uses a sparse implementation for low cardinalities to save on memory and only switches to the dense implementation for larger cardinalities. Currently, approx_count_distinct only uses the dense implementation.
  2. Changing the max counts for the stream implementation of approx_count_distinct. Currently, each bucket in RegisterBucket can only count up to 2^32/2^16/2^8/1 hashes with a certain number of trailing zeroes with the limit going down as the number of trailing zeroes increase. Perhaps it would be better to change the max count for all the buckets, though at the risk of increased memory usage?
  3. Compress memory usage of stream implementation of approx_count_distinct. Currently, u32 is used to store the counts for hashes with 1 to 16 trailing zeroes. However, since the probability of a hash going into a certain bucket decreases as the number of trailing zeroes increases, then perhaps an array of u64s can be used to store the counts. The first 32 bits count the hashes with 1 trailing zero, the next 31 bits count the hashes with 2 trailing zeroes, and so on. Note that this approach conflicts with the second one.
  4. Better handling of bucket overflow. Currently, the buckets throw an error when an overflow occurs. Maybe there's a better way to handle this?

Benchmarks should also be set up to determine how each configuration affects the estimation error and performance of the algorithm.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
no-issue-activity type/enhancement Improvements to existing implementation.
Projects
None yet
Development

No branches or pull requests

2 participants