Some ideas for optimizing `approx_count_distinct` were discussed in #3121:
1. Add a sparse/dense transition. The HyperLogLog implementation in Redis uses a sparse representation for low cardinalities to save memory and only switches to the dense representation for larger cardinalities. Currently, `approx_count_distinct` only uses the dense implementation.
2. Change the max counts for the stream implementation of `approx_count_distinct`. Currently, each bucket in `RegisterBucket` can only count up to 2^32/2^16/2^8/1 hashes with a given number of trailing zeroes, with the limit going down as the number of trailing zeroes increases. Perhaps it would be better to use the same max count for all the buckets, at the risk of increased memory usage?
3. Compress the memory usage of the stream implementation of `approx_count_distinct`. Currently, a `u32` is used to store the count for hashes with 1 to 16 trailing zeroes. However, since the probability of a hash landing in a given bucket decreases as the number of trailing zeroes increases, perhaps an array of `u64`s could be used to store the counts instead: the first 32 bits count the hashes with 1 trailing zero, the next 31 bits count the hashes with 2 trailing zeroes, and so on. Note that this approach conflicts with idea 2.
4. Handle bucket overflow better. Currently, the buckets throw an error when an overflow occurs. Maybe there's a better way to handle this?
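To make the packed-counter idea and the overflow question more concrete, here is a rough sketch rather than a proposal for the actual code. `PackedCounts` and its methods are hypothetical names, not part of the existing implementation. It assumes counter `i` (0-based) is `32 - i` bits wide, lays the counters out back to back in a `Vec<u64>`, and saturates on overflow instead of returning an error.

```rust
/// Hypothetical bit-packed counters: counter `i` (0-based) is `32 - i` bits wide,
/// so the counter for 1 trailing zero gets 32 bits, the next one 31 bits, and so on.
struct PackedCounts {
    words: Vec<u64>,
    offsets: Vec<usize>, // starting bit position of each counter
    widths: Vec<u32>,    // width in bits of each counter
}

impl PackedCounts {
    fn new(num_counters: usize) -> Self {
        assert!(num_counters <= 32, "widths would drop to zero");
        let mut offsets = Vec::with_capacity(num_counters);
        let mut widths = Vec::with_capacity(num_counters);
        let mut bit = 0usize;
        for i in 0..num_counters {
            let w = 32 - i as u32;
            offsets.push(bit);
            widths.push(w);
            bit += w as usize;
        }
        Self { words: vec![0u64; (bit + 63) / 64], offsets, widths }
    }

    fn get(&self, i: usize) -> u64 {
        let (off, w) = (self.offsets[i], self.widths[i]);
        let (word, shift) = (off / 64, (off % 64) as u32);
        let mask = (1u64 << w) - 1;
        let mut v = self.words[word] >> shift;
        if shift + w > 64 {
            // The counter straddles two words: pull the high part from the next word.
            v |= self.words[word + 1] << (64 - shift);
        }
        v & mask
    }

    fn set(&mut self, i: usize, value: u64) {
        let (off, w) = (self.offsets[i], self.widths[i]);
        let (word, shift) = (off / 64, (off % 64) as u32);
        let mask = (1u64 << w) - 1;
        let value = value & mask;
        self.words[word] &= !(mask << shift);
        self.words[word] |= value << shift;
        if shift + w > 64 {
            let spill = 64 - shift; // number of bits that fit in the first word
            self.words[word + 1] &= !(mask >> spill);
            self.words[word + 1] |= value >> spill;
        }
    }

    /// Saturating increment: returns `false` (and leaves the counter at its
    /// maximum) instead of raising an error when the counter is already full.
    fn increment(&mut self, i: usize) -> bool {
        let max = (1u64 << self.widths[i]) - 1;
        let cur = self.get(i);
        if cur == max {
            return false;
        }
        self.set(i, cur + 1);
        true
    }
}
```

Under this assumed layout, 16 counters take 392 bits, i.e. seven `u64` words (56 bytes) versus 64 bytes for sixteen `u32`s, so the saving per bucket is modest. Whether saturating is acceptable for the stream implementation is a separate design question that this sketch does not settle.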
Benchmarks should also be set up to determine how each configuration affects the estimation error and performance of the algorithm.
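As a starting point for the error side of such benchmarks, the sketch below measures the relative error of a plain dense HyperLogLog on a known number of distinct inputs. It is a textbook-style stand-in, not the existing `approx_count_distinct` code: the precision `P`, the use of `DefaultHasher`, and the helper names `hash64`, `estimate`, and `relative_error` are all assumptions made for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const P: u32 = 12;       // assumed precision: 2^12 registers
const M: usize = 1 << P; // number of registers

fn hash64(x: u64) -> u64 {
    let mut h = DefaultHasher::new();
    x.hash(&mut h);
    h.finish()
}

/// Standard HyperLogLog estimate with the small-range (linear counting) correction.
fn estimate(registers: &[u8; M]) -> f64 {
    let m = M as f64;
    let alpha = 0.7213 / (1.0 + 1.079 / m); // bias constant for m >= 128
    let sum: f64 = registers.iter().map(|&r| 2f64.powi(-(r as i32))).sum();
    let raw = alpha * m * m / sum;
    let zeros = registers.iter().filter(|&&r| r == 0).count();
    if raw <= 2.5 * m && zeros > 0 {
        m * (m / zeros as f64).ln()
    } else {
        raw
    }
}

/// Feeds the integers `0..n` (all distinct) through the sketch and returns
/// |estimate - n| / n.
fn relative_error(n: u64) -> f64 {
    let mut regs = [0u8; M];
    for x in 0..n {
        let h = hash64(x);
        let idx = (h as usize) & (M - 1);
        // Rank = trailing zeroes of the remaining bits, plus one, capped at the
        // number of remaining bits plus one.
        let rank = ((h >> P).trailing_zeros() + 1).min(64 - P + 1) as u8;
        regs[idx] = regs[idx].max(rank);
    }
    (estimate(&regs) - n as f64).abs() / n as f64
}
```

Calling `relative_error` over a range of cardinalities (for example 1,000 and 100,000) for each candidate configuration would give an error curve to compare. Timing is better left to a proper benchmark harness than to ad hoc measurements.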