[Datafusion] GROUP BY with a high cardinality doesn't seem to finish #107

Closed
alamb opened this issue Apr 26, 2021 · 5 comments
Labels
datafusion Changes in the datafusion crate

Comments

@alamb
Contributor

alamb commented Apr 26, 2021

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-10275

GROUP BY queries on high-cardinality columns (columns with lots of unique values) don't seem to finish.

I've tried with both datafusion-cli and this:

https://github.com/joshuataylor/parquet-group-by/blob/main/src/main.rs

When grouping by O_ORDERKEY there are ~15 000 000 unique values, and the query seems to stall. I've tried adding a LIMIT, but that doesn't help either.

My parquet file: https://drive.google.com/file/d/1aCW7SW2rUVioSePduhgo_91F5-xDMyjp/view?usp=sharing

datafusion-cli:

CREATE EXTERNAL TABLE something STORED AS PARQUET LOCATION 'demo.parquet';
select O_ORDERKEY from something group by O_ORDERKEY;

 

@alamb alamb added the datafusion Changes in the datafusion crate label Apr 26, 2021
@alamb
Contributor Author

alamb commented Apr 26, 2021

Comment from Andy Grove(andygrove) @ 2020-10-12T21:26:28.047+0000:

I have seen the same behavior. We have mostly been testing hash aggregates with queries that produce low-cardinality results. We will need to spend time testing high-cardinality results and see how we can optimize this.
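
To make the low- versus high-cardinality distinction concrete, here is a minimal sketch of what a hash aggregate does conceptually for a query like `SELECT k, COUNT(*) ... GROUP BY k`. This is not DataFusion's implementation: it uses only the Rust standard library, and `group_count` is a hypothetical name chosen for illustration. The point it shows is that the hash table holds one entry per distinct key, so with ~15 000 000 unique keys the table has about as many entries as there are rows, and any per-group overhead is paid once per row.

```rust
use std::collections::HashMap;

/// Hypothetical helper: count rows per distinct key, the way a hash
/// aggregate conceptually evaluates `SELECT k, COUNT(*) ... GROUP BY k`.
fn group_count(keys: &[i64]) -> HashMap<i64, u64> {
    // Assumption for this sketch: reserve for the worst case of one group
    // per row, so the table never has to rehash while it is being filled.
    let mut groups: HashMap<i64, u64> = HashMap::with_capacity(keys.len());
    for &k in keys {
        // One lookup per input row; a new entry is created for each unseen key.
        *groups.entry(k).or_insert(0) += 1;
    }
    groups
}

fn main() {
    // High-cardinality input: every key is unique, so groups == rows.
    let keys: Vec<i64> = (0..1_000_000).collect();
    let groups = group_count(&keys);
    println!("{} rows -> {} groups", keys.len(), groups.len());
}
```

With a low-cardinality key the same loop touches only a handful of entries, which may be why tests focused on that case would not surface the problem.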

@alamb
Contributor Author

alamb commented Apr 26, 2021

FYI @joshuataylor

@alamb
Contributor Author

alamb commented Apr 26, 2021

The good news is that this no longer hangs. The bad news is that DataFusion doesn't yet seem to support enough functionality to handle the Decimal type:

> CREATE EXTERNAL TABLE something STORED AS PARQUET LOCATION '/Users/alamb/Downloads/demo.parquet';
0 rows in set. Query took 0 seconds.
> select count(*) from foo;
Plan("Table or CTE with name \'foo\' not found")
> select count(*) from something;
+-----------------+
| COUNT(UInt8(1)) |
+-----------------+
| 15000000        |
+-----------------+
1 rows in set. Query took 1 seconds.
> select O_ORDERKEY from something limit 10;
ArrowError(InvalidArgumentError("Pretty printing not implemented for Decimal(9, 0) type"))
> select O_ORDERKEY from something group by O_ORDERKEY;
ArrowError(ExternalError(ArrowError(ExternalError(Internal("Unsupported GROUP BY type creating key Decimal(9, 0)")))))
> 

@joshuataylor

Hooray! And fine about the decimal type, that's another issue. Fine to mark this as closed.

@alamb
Contributor Author

alamb commented Apr 27, 2021

Filed follow-on tickets for decimal numbers. Thanks @joshuataylor

@alamb alamb closed this as completed Apr 27, 2021