-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement groups accumulator for stddev and variance #12095
Implement groups accumulator for stddev and variance #12095
Conversation
Thanks @eejbyfeldt. Do you have any before/after benchmark results? |
Yes some benchmarks sounds like a great idea. I updated the PR description with something basic queries. FAs far as I could tell stddev/varaiance was not used in any current benchmarks, so I modified two clickbench queries to use stddev instead of avg. The queries were choosen based on them having big improvements when the GroupsAccumulator api was first implemented (according to https://www.influxdata.com/blog/aggregating-millions-groups-fast-apache-arrow-datafusion/) so they are probably showing best case improvement. @andygrove Let me know if you think we should be doing something more advanced. |
847e122
to
525106d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @eejbyfeldt and @andygrove -- I reviewed this PR carefully and I think it is really nicely written and tested. I had some small suggestions on adding comments, but otherwise it is really nice: it uses the existing code beautifully.
🚀
I also made a PR with a benchmark for STDDEV
and VAR
in #12146
Using that benchmark, I confirmed @eejbyfeldt 's results that this PR makes stddev and variance significantly faster.
On main
Elapsed 0.451 seconds.
Elapsed 0.469 seconds.
With this branch
Elapsed 0.331 seconds.
Elapsed 0.336 seconds.
datafusion/functions-aggregate-common/src/aggregate/groups_accumulator/accumulate.rs
Show resolved
Hide resolved
emit_to: datafusion_expr::EmitTo, | ||
) -> (Vec<f64>, NullBuffer) { | ||
let mut counts = emit_to.take_needed(&mut self.counts); | ||
let _ = emit_to.take_needed(&mut self.means); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it took me a while to understand why self.means
was ignored in the final calculation (and thus is there any need to carry it through).
However, with some study I see that the means are needed to calculate m2
during accumulation but are not used in the final output
Maybe we could point that out in a comment (I can do it as a follow on PR too)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added comment in 0f1b64c
*count -= 1; | ||
}); | ||
} | ||
let nulls = NullBuffer::from_iter(counts.iter().map(|&count| count != 0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 it took me a while to figure out how nulls were handled -- LGTM
@@ -500,6 +500,85 @@ select stddev(sq.column1) from (values (1.1), (2.0), (3.0)) as sq | |||
---- | |||
0.950438495292 | |||
|
|||
# csv_query_stddev_7 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯 for test coverage
emit_to: datafusion_expr::EmitTo, | ||
) -> (Vec<f64>, NullBuffer) { | ||
let mut counts = emit_to.take_needed(&mut self.counts); | ||
let _ = emit_to.take_needed(&mut self.means); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps there could be a take_needed
that doesn't generate / return the Vec
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although I think it won't gain much, so probably better to leave as is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, there is probably no performance gains as it would need to do the same work as take_needed.
I guess it might be worth adding if this becomes a common pattern in more implementations.
Add tests cases for stddev_samp/pop and var_smap/pop the includes a group_by clause.
525106d
to
0f1b64c
Compare
I'll plan to merge this tomorrow unless there are additional comments |
Thanks again @eejbyfeldt and @Dandandan |
The bug was in the orginal implementation in apache#12095. This fixes the issue and modify a test case such that it would have caught it.
The bug was in the orginal implementation in #12095. This fixes the issue and modify a test case such that it would have caught it.
The bug was in the orginal implementation in apache#12095. This fixes the issue and modify a test case such that it would have caught it.
Which issue does this PR close?
Closes #12094.
Rationale for this change
Hopefully improve performance of queries using stddev and variance aggregates.
I did not find any of the current benchmarks containing stddev and/or variance. So I modified clickbench query 31 and 32 to use STDDEV instead of AVG. And these were the results:
Before
After
What changes are included in this PR?
An implementation of GroupsAccumulator for the stddev and variance aggregates.
It extracts the core logic from
NullState::acummulate
to a free functionaccumulate
to make it reusable when implementingGroupsAccumulators
that do not need separate null tracking as can be determine from some other value (in this case the count).Are these changes tested?
Added more test cases in aggregates.slt
Are there any user-facing changes?
No.