[Datafusion] GROUP BY with a high cardinality doesn't seem to finish #107
Labels
datafusion
Changes in the datafusion crate
Comments
Comment from Andy Grove (andygrove) @ 2020-10-12T21:26:28.047+0000: I have seen the same behavior. We have mostly been testing hash aggregates with queries that produce low-cardinality results. We will need to spend time testing high-cardinality results and see how we can optimize this.
FYI @joshuataylor
The good news is that this doesn't hang anymore... The bad news is that datafusion doesn't yet seem to support enough functionality to handle the Decimal type.
Hooray! And fine about the decimal type, that's another issue. Fine to mark this as closed.
This was referenced Apr 27, 2021
Filed follow-on tickets for decimal numbers. Thanks @joshuataylor
Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-10275
GROUP BY queries on high-cardinality columns (columns with lots of unique values) don't seem to finish.
I've tried with both datafusion-cli and this:
https://github.com/joshuataylor/parquet-group-by/blob/main/src/main.rs
When grouping by O_ORDERKEY, which has ~15 000 000 unique values, the query seems to stall. I've tried adding a LIMIT, but that doesn't help either.
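For anyone wondering why cardinality matters here, below is a rough sketch of a hash-based GROUP BY count in plain Rust. It is not DataFusion's implementation. The helper `group_count` and the synthetic keys are hypothetical and only illustrate the general technique. The point is that the hash table ends up with one entry per distinct key, so a column like O_ORDERKEY with ~15 000 000 distinct values means ~15 000 000 entries, however few rows a LIMIT would eventually return.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not a DataFusion API): counts rows per key,
/// roughly what `SELECT key, COUNT(*) ... GROUP BY key` has to compute.
fn group_count(keys: &[i64]) -> HashMap<i64, u64> {
    let mut groups: HashMap<i64, u64> = HashMap::new();
    for &k in keys {
        // One hash-table entry is created per distinct key.
        *groups.entry(k).or_insert(0) += 1;
    }
    groups
}

fn main() {
    // Low cardinality: 1_000 rows but only 4 distinct keys.
    let low: Vec<i64> = (0..1_000).map(|i| i % 4).collect();
    // High cardinality: every row carries its own key.
    let high: Vec<i64> = (0..1_000).collect();

    println!("low cardinality:  {} groups", group_count(&low).len());
    println!("high cardinality: {} groups", group_count(&high).len());
}
```

With the same number of input rows, the second case keeps far more state in memory, which is why an aggregate that was only exercised on low-cardinality data can behave very differently on a key like this one.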
My parquet file: https://drive.google.com/file/d/1aCW7SW2rUVioSePduhgo_91F5-xDMyjp/view?usp=sharing
datafusion-cli: