Prune columns / pages that are all `null` in `ParquetExec` by connecting up `row_counts` in pruning statistics #9961
Is your feature request related to a problem or challenge?
@appletreeisyellow added `PruningStatistics::row_counts()` in #9223, which allows better pruning of columns that are all null.
However, I believe we have not hooked that API up into `ParquetExec`, so it won't prune row groups based on this information.
For example, if column `a` is all NULL, a predicate `a > 5` can never be true, but `ParquetExec` won't be able to prune row groups or pages for this case.
Describe the solution you'd like
Implement `RowGroupPruningStatistics::row_counts`:
https://github.com/apache/arrow-datafusion/blob/2dad90425bacb98a3c2a4214faad53850c93104e/datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs#L345-L347
And `PagesPruningStatistics::row_counts`:
https://github.com/apache/arrow-datafusion/blob/2dad90425bacb98a3c2a4214faad53850c93104e/datafusion/core/src/datasource/physical_plan/parquet/page_filter.rs#L550-L552
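To illustrate why the row counts matter, here is a minimal sketch of the pruning decision, not the actual DataFusion code. `ContainerStats`, `is_all_null`, and `containers_to_keep` are hypothetical names invented for this example; a "container" stands for a row group or a page. The idea is that once both the row count and the null count are known, a container whose null count equals its row count can be skipped for a predicate like `a > 5`, which is never true for a NULL input.

```rust
/// Hypothetical, simplified stand-in for the statistics of one container
/// (a row group or a page). This is NOT the DataFusion `PruningStatistics` trait.
#[derive(Debug, Clone, Copy)]
struct ContainerStats {
    /// Total number of rows in the container, if known.
    row_count: Option<u64>,
    /// Number of NULL values of the column in the container, if known.
    null_count: Option<u64>,
}

/// Returns true only when the column is known to be entirely NULL.
/// If either statistic is missing we cannot prove anything, so return false.
fn is_all_null(stats: &ContainerStats) -> bool {
    match (stats.row_count, stats.null_count) {
        (Some(rows), Some(nulls)) => nulls == rows,
        _ => false,
    }
}

/// For a predicate that is never true on NULL input (such as `a > 5`),
/// returns one flag per container: `true` means the container must be read,
/// `false` means it can be pruned.
fn containers_to_keep(stats: &[ContainerStats]) -> Vec<bool> {
    stats.iter().map(|s| !is_all_null(s)).collect()
}

fn main() {
    let stats = [
        // All 10 rows are NULL: safe to prune.
        ContainerStats { row_count: Some(10), null_count: Some(10) },
        // Only some rows are NULL: must be kept.
        ContainerStats { row_count: Some(10), null_count: Some(3) },
        // Row count unknown (the situation this issue describes): must be kept.
        ContainerStats { row_count: None, null_count: Some(10) },
    ];
    // Prints [false, true, true]
    println!("{:?}", containers_to_keep(&stats));
}
```

The third container shows the current limitation: without a row count, the null count alone cannot prove the column is all NULL, so nothing is pruned.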
Describe alternatives you've considered
I think the row counts can be found on https://docs.rs/parquet/latest/parquet/format/struct.ColumnMetaData.html
So this ticket should be a matter of copying the row counts correctly and writing some tests in https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/parquet/row_group_pruning.rs / https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/parquet/page_pruning.rs
Additional context
No response