
feat(expr): Implement approx_count_distinct for stream processing #3121

Merged · 19 commits · Jun 13, 2022

Conversation

Graphcalibur
Contributor

What's changed and what's your intention?

Checklist

  • I have written necessary docs and comments
  • I have added necessary unit tests and integration tests
  • All checks passed in ./risedev check (or alias, ./risedev c)

Refer to a related PR or issue link (optional)

Fixes #2727


codecov bot commented Jun 10, 2022

Codecov Report

Merging #3121 (adb165b) into main (97177a9) will increase coverage by 0.03%.
The diff coverage is 88.79%.

@@            Coverage Diff             @@
##             main    #3121      +/-   ##
==========================================
+ Coverage   73.51%   73.54%   +0.03%     
==========================================
  Files         737      738       +1     
  Lines      101665   101887     +222     
==========================================
+ Hits        74735    74937     +202     
- Misses      26930    26950      +20     
Flag Coverage Δ
rust 73.54% <88.79%> (+0.03%) ⬆️


Impacted Files Coverage Δ
.../src/executor/aggregation/approx_count_distinct.rs 88.44% <88.44%> (ø)
...rc/expr/src/vector_op/agg/approx_count_distinct.rs 87.83% <100.00%> (+3.73%) ⬆️
src/stream/src/executor/aggregation/mod.rs 94.29% <100.00%> (ø)


@jon-chuang
Contributor

jon-chuang commented Jun 10, 2022

Could you provide the estimates for error rate and max number of rows for the chosen parameters?

For instance, since we have 2^10 registers, we can estimate a maximum of about ~16 trillion distinct rows (16T / 2^10 ≈ 16B; a u32 can count 4B elements, and the probability of the first bit being 0 and the second being 1 is 1/4, so we can count ~16B distinct elements in a register). Of course, this assumes there is zero chance of overflowing a count, so it's just a rough estimate.

This is useful info for our users and developers as 16T rows, about 1.6PB, is not unheard of for some users...
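To make the parameter discussion concrete, here is a sketch of how a hash could split into a bucket index and a count position; the helper name and the 2^10 setting are assumptions for illustration, not the PR's actual code:

```rust
// Hypothetical sketch: the low INDEX_BITS of the hash pick the bucket,
// and the number of trailing zeros in the remaining bits picks which
// per-bucket count to bump (1-based, saturating at the highest position).
const INDEX_BITS: u32 = 10; // 2^10 buckets, as discussed above

fn bucket_and_count_index(hash: u64) -> (usize, u32) {
    let bucket = (hash & ((1 << INDEX_BITS) - 1)) as usize;
    let rest = hash >> INDEX_BITS;
    // An all-zero remainder would report 64 trailing zeros, so clamp to
    // the number of bits actually left after the bucket index.
    let count_index = rest.trailing_zeros().min(63 - INDEX_BITS) + 1;
    (bucket, count_index)
}

fn main() {
    // hash 1024 = 0b100_0000_0000: bucket 0, remainder 1 -> position 1
    assert_eq!(bucket_and_count_index(1024), (0, 1));
    // hash 3: bucket 3, remainder 0 -> clamped to the last position
    assert_eq!(bucket_and_count_index(3), (3, 54));
}
```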

Also, what happens if we overflow a count?


Further, the original spec is also wrong.

For instance, if the max count for the first bit is 2^32, the second bit should be 2^31, ..., and the 17th bit should be 2^16, since we successively halve the counts.

So the struct should be:

struct RegisterBucket {
    count_1_to_16: [u32; 16],
    count_17_to_24: [u16; 8],
    count_25_to_32: [u8; 8],
}

So the total cost per bucket is 704 bits. 2^10 buckets => ~90KB, 2^16 buckets => ~5.8MB.
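The arithmetic above can be sanity-checked with a few constants, assuming the RegisterBucket layout sketched in this comment (16 u32 + 8 u16 + 8 u8 counts):

```rust
// 16 u32 + 8 u16 + 8 u8 counts per bucket.
const BITS_PER_BUCKET: usize = 16 * 32 + 8 * 16 + 8 * 8; // 704 bits

fn total_bytes(buckets: usize) -> usize {
    buckets * BITS_PER_BUCKET / 8
}

fn main() {
    assert_eq!(BITS_PER_BUCKET, 704);
    println!("2^10 buckets: {} bytes (~90 KB)", total_bytes(1 << 10)); // 90112
    println!("2^16 buckets: {} bytes (~5.8 MB)", total_bytes(1 << 16)); // 5767168
}
```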

We could add further buckets for count > 32, but I'm not sure it would be appropriate, since if we were in that regime of magnitude, then we would be overflowing buckets everywhere and things would stop working..


2^10 also doesn't seem that good, it has, according to the paper, a ~1/32 (1.04 / sqrt(2^10)) error rate. Why not stick to 2^16 (1/128 error rate) like in Redis? 5.8MB seems quite reasonable for a better estimate.

I guess larger number of buckets also helps increase the max distinct count. (2^16 buckets & u32 for first bit => 1 quadrillion rows)

Maybe we could add some notes about future improvements (sparse representation, better estimate for small number of elements)?


Finally, I'd like to see some results or tests that show that the estimates are within the error bounds for some large number of distinct elements.

Something that can run in reasonable time for unit test but still demonstrates it works as expected.
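A test along those lines could look like this toy dense HLL (illustrative only, not the PR's RegisterBucket code): insert N distinct values, then assert the estimate lands well within the expected error band.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const INDEX_BITS: u32 = 14;
const M: usize = 1 << INDEX_BITS;

// Toy dense HLL used only to demonstrate the bound-checking test.
struct Hll {
    registers: Vec<u8>, // max "position of first one" seen per register
}

impl Hll {
    fn new() -> Self {
        Hll { registers: vec![0; M] }
    }

    fn insert<T: Hash>(&mut self, value: &T) {
        let mut h = DefaultHasher::new();
        value.hash(&mut h);
        let hash = h.finish();
        let idx = (hash & (M as u64 - 1)) as usize;
        let rest = hash >> INDEX_BITS;
        let rank = (rest.trailing_zeros() + 1).min(64 - INDEX_BITS) as u8;
        if rank > self.registers[idx] {
            self.registers[idx] = rank;
        }
    }

    fn estimate(&self) -> f64 {
        // Raw HLL estimator with the paper's bias-correction constant;
        // valid here because n is well above 2.5 * M.
        let m = M as f64;
        let alpha = 0.7213 / (1.0 + 1.079 / m);
        let sum: f64 = self.registers.iter().map(|&r| 2f64.powi(-(r as i32))).sum();
        alpha * m * m / sum
    }
}

fn main() {
    let mut hll = Hll::new();
    let n = 100_000u64;
    for i in 0..n {
        hll.insert(&i);
    }
    let est = hll.estimate();
    let err = (est - n as f64).abs() / n as f64;
    println!("estimate = {est:.0}, relative error = {:.3}%", err * 100.0);
    // With m = 2^14 the standard error is ~0.81%, so 5% is a generous bound.
    assert!(err < 0.05);
}
```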

Contributor

@lmatz lmatz left a comment


  1. We may put the optimization, i.e. counting huge datasets and sparse-dense transition, into future work.

  2. A benchmark can be set up in a future PR to show empirically how each configuration affects the error bounds, verify correctness, and help us understand the performance.

  3. The number of bits for each register can be corrected.

  4. Since we may use more registers to deal with the high-cardinality situation, calculating the bias correction accurately is helpful.

rest LGTM

@Graphcalibur
Contributor Author

2^10 also doesn't seem that good, it has, according to the paper, a ~1/32 (1.04 / sqrt(2^10)) error rate. Why not stick to 2^16 (1/128 error rate) like in Redis? 5.8MB seems quite reasonable for a better estimate.

I can update the implementation to use 2^14 registers, which is the same as Redis's implementation of HyperLogLog. This gives it an error rate of ~1/128 and it only uses about 1.44 MB.
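Checking those figures, assuming 2^14 registers, the paper's 1.04/sqrt(m) error formula, and the 704-bit RegisterBucket layout discussed earlier in this thread:

```rust
fn main() {
    let m = (1u32 << 14) as f64;
    // Standard HLL relative-error formula from the paper.
    let err = 1.04 / m.sqrt();
    println!("error ≈ {:.4} (≈ 1/{:.0})", err, 1.0 / err); // ≈ 0.0081 (≈ 1/123)
    // 704 bits per bucket, 2^14 buckets.
    let bytes = (1usize << 14) * 704 / 8;
    println!("memory ≈ {:.2} MB", bytes as f64 / 1e6); // ≈ 1.44 MB
}
```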

After updating the RegisterBucket to your suggestion, the maximum number of distinct rows that can be stored should be about 2^47 (~141 trillion). There are 2^14 RegisterBuckets and each RegisterBucket can store ~2^33 distinct rows (since 1/2 of the rows will have 1 as their last bit).

Should this information be documented within the code itself?

Also, what happens if we overflow a count?

One idea I had is to simply stop incrementing a counter once it reaches its maximum value (and likewise stop decrementing once it reaches zero). This makes it less accurate for extremely large volumes of data, but I believe the error should be negligible for most use cases.
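The saturating behaviour described here can be sketched with the standard library's saturating arithmetic; the helper name is hypothetical, not the PR's code:

```rust
// The counter pins at its bounds instead of wrapping, trading a small
// bias at the extremes for safety.
fn update_count(count: &mut u32, retract: bool) {
    *count = if retract {
        count.saturating_sub(1) // stays at 0 once empty
    } else {
        count.saturating_add(1) // stays at u32::MAX once full
    };
}

fn main() {
    let mut c = u32::MAX;
    update_count(&mut c, false);
    assert_eq!(c, u32::MAX); // no overflow
    let mut z = 0u32;
    update_count(&mut z, true);
    assert_eq!(z, 0); // no underflow
}
```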

Also, would it be a good idea to store bools as the "count" for 33 to 64?

@jon-chuang
Contributor

jon-chuang commented Jun 11, 2022

Also, would it be a good idea to store bools as the "count" for 33 to 64?

Well, bool takes the same space as u8. It could be a bitmap I guess if we want to save space?

One idea I had is to simply not increment a counter once it reaches its maximum value

Perhaps we can throw an error? I think it is behaviour outside the bounds of the design.

So we can say: HyperLogLog: count exceeds maximum value. You may be trying to run approx_count_distinct on a stream that has too high overall cardinality, or too many repeated values.
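The error-on-overflow alternative can be sketched with checked arithmetic; the error type here is a stand-in (a real implementation would use the project's own error enum):

```rust
// Surface overflow as an error instead of saturating or wrapping.
#[derive(Debug, PartialEq)]
struct HllOverflow;

fn try_increment(count: &mut u32) -> Result<(), HllOverflow> {
    *count = count.checked_add(1).ok_or(HllOverflow)?;
    Ok(())
}

fn main() {
    let mut c = u32::MAX - 1;
    assert!(try_increment(&mut c).is_ok()); // now at u32::MAX
    assert_eq!(try_increment(&mut c), Err(HllOverflow)); // overflow is reported
}
```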


I think it's also worth considering (this could be done at a future validation step): if we have a value that is repeated quite a lot, i.e. has a high count, but which happens to hash with a large number of trailing zeros for a particular bucket, we will face this error. In other words, some of the assumptions about the counts aren't sound, since repeated values are correlated.

So it's probably worth taking another look at the bounds for the counts.

Personally, I am in favour of increasing the count max so we don't have to think about the issue, and we can deal with both high row cardinality (independent counts) and high duplicate cardinality (correlated counts).

So we could start by simply having all the counts be u32 or u64 (2048/4096 bits per bucket).

Even with all counts using u64 and 2^16 registers, the cost is 33 MB, at an error rate of ~0.4%. And we have a guarantee that the algorithm works for any total cardinality < 2^64. The memory cost is < 6x of:

struct RegisterBucket {
    count_1_to_16: [u32; 16],
    count_17_to_24: [u16; 8],
    count_25_to_32: [u8; 8],
}

@fuyufjh @lmatz what do you guys think? Is this memory cost acceptable to have a deterministic guarantee?


To optimize the memory cost without giving up a deterministic guarantee for inputs, we can utilize a sparse representation:

  1. Resizeable count i.e. bigint.
  2. Have a sparse representation of each register:
struct Register {
    counts_1_to_16: [u64; 16],
    counts_17_to_64: Option<Vec<(u8, u64)>>,
}

(u8, u64): the first number represents the number of trailing zeros, the second is the count.
Since with 2^16 buckets, for total cardinality < 2^32, we should reasonably expect most buckets to have max count < 2^16, we will not have to deal with the vec representation in most cases.
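Updating the sparse tail might look like the following sketch; the field names come from the struct above, but the logic is an assumption, not PR code:

```rust
struct Register {
    counts_1_to_16: [u64; 16],
    counts_17_to_64: Option<Vec<(u8, u64)>>,
}

impl Register {
    // `pos` is the 1-based "position of first one" (1..=64). Positions
    // 1..=16 use the dense array; higher positions lazily allocate the
    // sparse (position, count) vec.
    fn increment(&mut self, pos: u8) {
        if pos <= 16 {
            self.counts_1_to_16[(pos - 1) as usize] += 1;
        } else {
            let tail = self.counts_17_to_64.get_or_insert_with(Vec::new);
            match tail.iter_mut().find(|(p, _)| *p == pos) {
                Some((_, c)) => *c += 1,
                None => tail.push((pos, 1)),
            }
        }
    }
}

fn main() {
    let mut r = Register { counts_1_to_16: [0; 16], counts_17_to_64: None };
    r.increment(3);
    r.increment(3);
    r.increment(20);
    assert_eq!(r.counts_1_to_16[2], 2);
    assert_eq!(r.counts_17_to_64, Some(vec![(20, 1)]));
}
```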


But I think for this PR, we can just have the simple model:

struct RegisterBucket {
    count_1_to_16: [u32; 16],
    count_17_to_24: [u16; 8],
    count_25_to_32: [u8; 8],
}

Maybe we should open an issue on choosing appropriate max counts and discuss further there.

Should this information be documented within the code itself?

I think you should put a descriptive doc comment replacing:

/// `StreamingApproxCountDistinct` approximates the count of non-null rows using `HyperLogLog`.

We should probably document in developer/user facing docs in the future as well.

@jon-chuang
Contributor

jon-chuang commented Jun 11, 2022

I think the statement

each RegisterBucket can store ~2^33 distinct rows (since 1/2 of the rows will have 1 as its last bit).

is closer to:

each RegisterBucket can store at most ~2^33 non-distinct rows

So we have an upper bound, but nothing like a practical bound for what cardinality our choice of parameters can handle.


Technically, any bound we give if we want to optimize the size of counts based on probability would be probabilistic. The deterministic bound (worst-case) is the size of the smallest count.

So if our size of the smallest count is 1 bit, we can only guarantee a deterministic result if the total number of rows counted is at most 1.

* Update number of registers in both batch and stream implementation of
  approx_count_distinct to 2^14
* Change storage of RegisterBucket
* Add documentation of estimation error and # of rows that can be counted
* Add error handling for register overflow and invalid registers
@Graphcalibur Graphcalibur requested review from jon-chuang and lmatz June 13, 2022 02:50
@lmatz
Contributor

lmatz commented Jun 13, 2022

HyperLogLog: count exceeds maximum value.

We may throw a warning in non-strict mode, as the user may not want to fail the entire job.

Is this memory cost acceptable to have a deterministic guarantee?

It is good.

Maybe we should open an issue on choosing appropriate max counts and discuss further there.

Great, we may revisit this sort of extreme case later if we don't have a strong and clear motivation to choose either way for the moment.

Successfully merging this pull request may close these issues.

feature: support approx_count_distinct