[Epic] High cardinality aggregation performance wishlist #11679
Labels: enhancement (New feature or request)
Is your feature request related to a problem or challenge?
DataFusion uses a two phase approach to aggregation (see Accumulator::state for details).
For low cardinality aggregates (where there are a few distinct groups), this works great 👌 👨‍🍳
However, for high cardinality aggregates (where there are many millions of groups), we can do better by optimizing the path. See the background and ASCII art on #7957 for why the intermediate cardinality increases.
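To make the two phases concrete, here is a minimal sketch in plain Rust using only the standard library. It is not DataFusion's implementation: AvgState, partial_aggregate, final_aggregate and intermediate_rows are hypothetical names chosen for illustration, and the (sum, count) pair only stands in for the kind of intermediate state that Accumulator::state exposes. The point it tries to show is that every partition emits one intermediate row per group it has seen, so with many distinct groups the intermediate row count approaches partitions × groups.

```rust
use std::collections::HashMap;

/// Hypothetical intermediate state for an AVG-like aggregate.
/// Assumption: a (sum, count) pair is enough to merge partial results.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct AvgState {
    pub sum: f64,
    pub count: u64,
}

/// Phase 1 (partial): each partition groups its own rows independently
/// and produces one intermediate state per group it has seen.
pub fn partial_aggregate(rows: &[(u64, f64)]) -> HashMap<u64, AvgState> {
    let mut groups = HashMap::new();
    for &(key, value) in rows {
        let state = groups
            .entry(key)
            .or_insert(AvgState { sum: 0.0, count: 0 });
        state.sum += value;
        state.count += 1;
    }
    groups
}

/// Phase 2 (final): merge the intermediate states from all partitions
/// and compute the final average for each group.
pub fn final_aggregate(partials: Vec<HashMap<u64, AvgState>>) -> HashMap<u64, f64> {
    let mut merged: HashMap<u64, AvgState> = HashMap::new();
    for partial in partials {
        for (key, state) in partial {
            let m = merged
                .entry(key)
                .or_insert(AvgState { sum: 0.0, count: 0 });
            m.sum += state.sum;
            m.count += state.count;
        }
    }
    merged
        .into_iter()
        .map(|(key, s)| (key, s.sum / s.count as f64))
        .collect()
}

/// Total number of intermediate rows handed from phase 1 to phase 2.
/// With few groups this is tiny; with millions of groups spread over
/// every partition it can approach the size of the input itself.
pub fn intermediate_rows(partials: &[HashMap<u64, AvgState>]) -> usize {
    partials.iter().map(|p| p.len()).sum()
}
```

With two partitions holding rows for groups {1, 2} and {1, 3}, phase 1 hands four intermediate rows to phase 2 even though there are only three distinct groups. In the low cardinality case that overhead is negligible, while in the high cardinality case the partial phase reduces the data very little before the final phase has to hash it all again.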
This is my wishlist for improving high cardinality aggregates (ideally for the next blog post in a few months, #11631).
Together with the StringView work in #10918 that @XiangpengHao, @a10y and others are working on, I think it would provide some very compelling overall speedups in ClickBench and TPCH queries.
Also, I hear that @avantgardnerio may be interested in helping here.
Describe the solution you'd like
Here is my wishlist:
CoalesceBatchesExec to improve performance #7957 (I have a prototype and some ideas)
Describe alternatives you've considered
Do nothing and let DuckDB pass us by ;)
Additional context
Other potential things to do: