Reusable "row group pruning" logic #363

alamb · 2021-05-19T17:43:50Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

DataFusion contains logic (originally contributed by @yordan-pavlov in apache/arrow#9064 🎉 ) to perform Row Group Pruning, which skips scanning of entire row groups within a parquet file, based on pushed down predicates (source link in arrow-datafusion: parquet.rs).

The algorithm behind the Row Group Pruning implementation is general: it can be applied to any storage system that maintains min/max statistics for sets of files or chunks of data and wants to quickly rule out chunks that cannot match a predicate.
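To make the idea concrete, here is a minimal sketch of min/max pruning for a single integer column. The `MinMax` and `Predicate` types and the function names are hypothetical and exist only for illustration; this is not the DataFusion implementation, which handles arbitrary expressions and data types.

```rust
/// Min/max statistics for one column in one chunk (hypothetical type).
#[derive(Debug, Clone, Copy)]
pub struct MinMax {
    pub min: i64,
    pub max: i64,
}

/// A simple comparison predicate on that column: `col <op> value`.
#[derive(Debug, Clone, Copy)]
pub enum Predicate {
    Eq(i64),
    Gt(i64),
    Lt(i64),
}

/// Returns true if some row in the chunk *might* satisfy the predicate.
/// A false result proves that no row can match, so the chunk can be skipped.
pub fn might_match(stats: MinMax, pred: Predicate) -> bool {
    match pred {
        // `col = v` can only match if v lies within [min, max].
        Predicate::Eq(v) => stats.min <= v && v <= stats.max,
        // `col > v` can only match if the largest value exceeds v.
        Predicate::Gt(v) => stats.max > v,
        // `col < v` can only match if the smallest value is below v.
        Predicate::Lt(v) => stats.min < v,
    }
}

/// Returns the indices of the chunks that still need to be scanned.
pub fn chunks_to_scan(chunks: &[MinMax], pred: Predicate) -> Vec<usize> {
    chunks
        .iter()
        .enumerate()
        .filter(|(_, stats)| might_match(**stats, pred))
        .map(|(idx, _)| idx)
        .collect()
}
```

Note that the check is conservative: a chunk that passes may still contain no matching rows, but a chunk that fails is guaranteed to contain none.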

We would like to reuse the row group pruning logic from DataFusion rather than writing our own. To that end, we want to make this logic easier to reuse, both by other parts of DataFusion (e.g. pruning entire parquet files rather than just row groups) and by downstream projects. We also hope to benefit ourselves as the community works to improve this code.

In addition, there are other use cases, such as the one mentioned by @returnString, where you have a bunch of parquet files in some object store along with statistics about their min/max values, and you could skip entire files based on those statistics alone.

Describe the solution you'd like

Refactor what is currently called RowGroupPredicateBuilder into something more generic related to Pruning
Rework the implementation so it is generic for a Statistics trait so that the predicates can be evaluated against any type (not just the Parquet RowGroupMetadata)
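As a rough sketch of what "generic for a Statistics trait" could look like, the example below defines a hypothetical `ChunkStatistics` trait and one pruning function written against it. The trait name, its methods, the `FileStats` type, and the restriction to `i64` values and `col > value` predicates are all assumptions made for illustration, not a proposed or existing DataFusion API.

```rust
use std::collections::HashMap;

/// Hypothetical trait: anything that can report per-container min/max
/// values for a named column (row groups, files, in-memory chunks, ...).
pub trait ChunkStatistics {
    /// Number of containers (e.g. row groups or files) described.
    fn num_containers(&self) -> usize;
    /// Minimum value of `column` in `container`, if known.
    fn min_value(&self, column: &str, container: usize) -> Option<i64>;
    /// Maximum value of `column` in `container`, if known.
    fn max_value(&self, column: &str, container: usize) -> Option<i64>;
}

/// Example implementor: statistics kept outside the files themselves,
/// one map of column name -> (min, max) per file.
pub struct FileStats {
    pub files: Vec<HashMap<String, (i64, i64)>>,
}

impl ChunkStatistics for FileStats {
    fn num_containers(&self) -> usize {
        self.files.len()
    }
    fn min_value(&self, column: &str, container: usize) -> Option<i64> {
        self.files.get(container)?.get(column).map(|(min, _)| *min)
    }
    fn max_value(&self, column: &str, container: usize) -> Option<i64> {
        self.files.get(container)?.get(column).map(|(_, max)| *max)
    }
}

/// For the predicate `column > value`, returns one flag per container:
/// `true` means the container must be scanned, `false` means it can be skipped.
/// Containers with unknown statistics are always kept.
pub fn prune_gt<S: ChunkStatistics>(stats: &S, column: &str, value: i64) -> Vec<bool> {
    (0..stats.num_containers())
        .map(|i| match stats.max_value(column, i) {
            Some(max) => max > value,
            None => true,
        })
        .collect()
}
```

Because `prune_gt` only depends on the trait, the same function would work for Parquet row group metadata, per-file statistics in an object store catalog, or any other source that can implement the trait.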

Additional context

You can see more about the use case on the IOx ticket https://github.com/influxdata/influxdb_iox/issues/736 and in the design document.

alamb added the enhancement (New feature or request) and datafusion (Changes in the datafusion crate) labels May 19, 2021

alamb self-assigned this May 19, 2021

alamb closed this as completed in #426 May 28, 2021

alamb mentioned this issue Aug 5, 2022

Minor: improve some docstrings about pruning #3041

Merged
