Is your feature request related to a problem or challenge? Please describe what you are trying to do.
DataFusion contains logic (originally contributed by @yordan-pavlov in apache/arrow#9064 🎉 ) to perform Row Group Pruning, which skips scanning of entire row groups within a parquet file, based on pushed down predicates (source link in arrow-datafusion: parquet.rs).
The algorithm behind the Row Group Pruning implementation is general: it can be applied to any storage system that maintains min/max statistics for sets of files or chunks of data and wants to quickly rule out chunks that cannot match a predicate.
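To make the idea concrete, here is a minimal sketch of min/max pruning over a single integer column. The types `MinMax` and `Predicate` and the functions `might_match` and `chunks_to_scan` are hypothetical names made up for illustration; this is not the code in parquet.rs, which works on real expressions and Parquet statistics.

```rust
/// Min/max statistics for one column of one chunk (hypothetical type).
#[derive(Debug, Clone, Copy)]
pub struct MinMax {
    pub min: i64,
    pub max: i64,
}

/// A tiny single-column predicate language (hypothetical, for illustration only).
#[derive(Debug, Clone, Copy)]
pub enum Predicate {
    Eq(i64),
    Gt(i64),
    Lt(i64),
}

/// Returns false only when the statistics prove that no row in the chunk
/// can satisfy the predicate. A `true` result means "might match", so the
/// chunk still has to be scanned.
pub fn might_match(stats: MinMax, pred: Predicate) -> bool {
    match pred {
        // `col = v` can only match if v lies within [min, max].
        Predicate::Eq(v) => stats.min <= v && v <= stats.max,
        // `col > v` can only match if the largest value exceeds v.
        Predicate::Gt(v) => stats.max > v,
        // `col < v` can only match if the smallest value is below v.
        Predicate::Lt(v) => stats.min < v,
    }
}

/// Indexes of the chunks that cannot be ruled out and must be scanned.
pub fn chunks_to_scan(chunks: &[MinMax], pred: Predicate) -> Vec<usize> {
    chunks
        .iter()
        .enumerate()
        .filter(|(_, stats)| might_match(**stats, pred))
        .map(|(i, _)| i)
        .collect()
}
```

Nothing here depends on the chunks being parquet row groups: the same check applies to whole files or any other unit that carries min/max statistics.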
We would like to reuse DataFusion's row group pruning logic rather than writing our own. To that end, we want to make this logic easier to reuse both by other parts of DataFusion (e.g. pruning whole parquet files rather than just row groups) and by downstream projects. We also hope to benefit ourselves as the community works to improve this code.
In addition, there are other use cases, such as the one mentioned by @returnString, where you have a bunch of parquet files in some object store, along with statistics about their min/max values, and you could skip entire files based on those statistics alone.
Describe the solution you'd like
Refactor what is currently called RowGroupPredicateBuilder into something more generic related to Pruning
Rework the implementation so it is generic over a Statistics trait, so that the predicates can be evaluated against any type (not just the Parquet RowGroupMetadata)
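One possible shape for such a trait is sketched below. Everything in it is an assumption made for illustration: the names `PruningStatistics`, `ColumnGt`, `prune` and `FileStats` are hypothetical, statistics are limited to `i64`, and only a single `column > value` predicate is supported. An actual design would need to handle arbitrary expressions and Arrow data types.

```rust
use std::collections::HashMap;

/// Hypothetical trait: any source of per-column min/max statistics for a
/// set of "containers" (row groups, files, chunks, ...).
pub trait PruningStatistics {
    /// Number of containers the statistics describe.
    fn num_containers(&self) -> usize;
    /// Minimum value of `column` in `container`, if known.
    fn min(&self, column: &str, container: usize) -> Option<i64>;
    /// Maximum value of `column` in `container`, if known.
    fn max(&self, column: &str, container: usize) -> Option<i64>;
}

/// Hypothetical predicate of the form `column > value`.
pub struct ColumnGt<'a> {
    pub column: &'a str,
    pub value: i64,
}

/// Returns one flag per container: `false` means the container is provably
/// free of matching rows and can be skipped, `true` means it must be read.
pub fn prune<S: PruningStatistics>(stats: &S, pred: &ColumnGt<'_>) -> Vec<bool> {
    (0..stats.num_containers())
        .map(|i| match stats.max(pred.column, i) {
            Some(max) => max > pred.value,
            // Missing statistics prove nothing, so keep the container.
            None => true,
        })
        .collect()
}

/// Example implementor: per-file statistics kept outside of parquet,
/// mapping column name to (min, max) for each file.
pub struct FileStats {
    pub files: Vec<HashMap<String, (i64, i64)>>,
}

impl PruningStatistics for FileStats {
    fn num_containers(&self) -> usize {
        self.files.len()
    }

    fn min(&self, column: &str, container: usize) -> Option<i64> {
        self.files.get(container)?.get(column).map(|(min, _)| *min)
    }

    fn max(&self, column: &str, container: usize) -> Option<i64> {
        self.files.get(container)?.get(column).map(|(_, max)| *max)
    }
}
```

With this shape, parquet row group metadata would be just one more implementor of the trait, and the pruning function would never need to know where the statistics came from. Treating missing statistics as "keep" is the conservative choice: pruning may only skip containers it can prove are irrelevant.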
Additional context
You can see more about the usecase on the IOx ticket https://github.com/influxdata/influxdb_iox/issues/736 and design document