[Datafusion] GROUP BY with a high cardinality doesn't seem to finish #107

alamb · 2021-04-26T13:21:59Z

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-10275

GROUP BY on a high-cardinality column (one with lots of unique values) doesn't seem to finish.

I've tried with both datafusion-cli and this:

https://github.com/joshuataylor/parquet-group-by/blob/main/src/main.rs

Grouping by O_ORDERKEY, which has ~15,000,000 unique values, seems to stall. I've tried adding a LIMIT, but that doesn't help either.

My parquet file: https://drive.google.com/file/d/1aCW7SW2rUVioSePduhgo_91F5-xDMyjp/view?usp=sharing

datafusion-cli:

CREATE EXTERNAL TABLE something STORED AS PARQUET LOCATION 'demo.parquet';
select O_ORDERKEY from something group by O_ORDERKEY;

alamb · 2021-04-26T13:22:00Z

Comment from Andy Grove(andygrove) @ 2020-10-12T21:26:28.047+0000:

I have seen the same behavior. We have mostly been testing hash aggregates with queries that produce low cardinality results and will need to spend time testing for high cardinality results and see how we can optimize this.
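
As a rough illustration of why result cardinality matters to a hash aggregate, here is a minimal sketch using only the Rust standard library. It is not DataFusion's actual implementation, and `group_by_count` is a hypothetical name chosen for this example. The point it shows is that the aggregate keeps one entry per distinct key, so its state grows with the number of groups rather than the number of rows:

```rust
use std::collections::HashMap;

/// Toy hash aggregate (illustrative only, not DataFusion code):
/// counts rows per group key, keeping one map entry per distinct key.
fn group_by_count(keys: &[i64]) -> HashMap<i64, u64> {
    let mut groups: HashMap<i64, u64> = HashMap::new();
    for &key in keys {
        // Insert the key on first sight, then bump its row count.
        *groups.entry(key).or_insert(0) += 1;
    }
    groups
}
```

With four distinct keys the map holds four entries no matter how many rows are read; with 15,000,000 distinct keys it would hold 15,000,000 entries.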

alamb · 2021-04-26T13:24:57Z

FYI @joshuataylor

alamb · 2021-04-26T13:34:09Z

The good news is that this no longer hangs. The bad news is that DataFusion doesn't yet seem to support enough functionality to handle the Decimal type:

> CREATE EXTERNAL TABLE something STORED AS PARQUET LOCATION '/Users/alamb/Downloads/demo.parquet';
0 rows in set. Query took 0 seconds.
> select count(*) from foo;
Plan("Table or CTE with name \'foo\' not found")
> select count(*) from something;
+-----------------+
| COUNT(UInt8(1)) |
+-----------------+
| 15000000        |
+-----------------+
1 rows in set. Query took 1 seconds.
> select O_ORDERKEY from something limit 10;
ArrowError(InvalidArgumentError("Pretty printing not implemented for Decimal(9, 0) type"))
> select O_ORDERKEY from something group by O_ORDERKEY;
ArrowError(ExternalError(ArrowError(ExternalError(Internal("Unsupported GROUP BY type creating key Decimal(9, 0)")))))
>

joshuataylor · 2021-04-27T02:51:11Z

Hooray! And fine about the decimal type, that's another issue. Fine to mark this as closed.

alamb · 2021-04-27T10:41:36Z

Filed follow-on tickets for decimal numbers. Thanks @joshuataylor

alamb added the datafusion label (Changes in the datafusion crate) on Apr 26, 2021

This was referenced Apr 27, 2021

Add support for pretty-printing Decimal numbers apache/arrow-rs#230

Closed

Add support for group by Decimal numbers #210

Closed

alamb closed this as completed Apr 27, 2021
