While profiling a DataFusion query I found that the code spends a lot of time reading data from parquet files. Predicate / filter push-down is a commonly used performance optimization, where statistics stored in parquet files (such as min / max values for columns in a parquet row group) are evaluated against query filters to determine which row groups could contain data requested by the query. By pushing query filters all the way down to the parquet data source, entire row groups or even entire parquet files can be skipped, often resulting in significant performance improvements.
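As a minimal illustration of the idea (a hypothetical helper, not DataFusion's actual code), here is a sketch of the row-group skip decision for a simple equality filter, assuming integer min / max statistics are available for the column:

```rust
/// Minimal sketch (hypothetical helper, not DataFusion's API): decide whether
/// a row group can be skipped for an equality filter `column = value`, given
/// the min / max statistics stored for that row group in the parquet footer.
fn can_skip_row_group(min: i64, max: i64, value: i64) -> bool {
    // If the requested value lies outside [min, max], no row in this
    // row group can match, so the whole row group can be skipped.
    value < min || value > max
}

fn main() {
    // Row group stats say the column ranges from 100 to 200; the filter is column = 42.
    assert!(can_skip_row_group(100, 200, 42)); // skipped without reading any data
    // Row group stats say the column ranges from 0 to 50; 42 may be present.
    assert!(!can_skip_row_group(0, 50, 42)); // must be read
}
```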
I have been working on an implementation for a few weeks and initial results look promising - with predicate push-down, DataFusion is now faster than Apache Spark (140ms for DataFusion vs 200ms for Spark) for the same query against the same parquet files. I suspect that with the latest improvements to the filter kernel, DataFusion performance will be even better.
My work is based on the following key ideas:
- it's best to reuse the existing physical expression evaluation code already implemented in DataFusion
- filter expressions pushed down to a parquet table are rewritten in terms of parquet statistics; for example, `(column / 2) = 4` becomes `(column_min / 2) <= 4 && 4 <= (column_max / 2)` - this rewrite is done once for all files in a parquet table
- for each parquet file, a RecordBatch containing all required statistics columns is produced, and the predicate expression from the previous step is evaluated against it, producing a boolean array which is finally used to filter the row groups in each parquet file (a rough sketch of these two steps follows after this list)
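The following is a rough sketch of how those two steps could fit together, using the Arrow Rust crate. The column names, helper functions, and the hand-written predicate evaluation are simplified stand-ins; the real implementation would reuse DataFusion's physical expression evaluation rather than the hard-coded loop shown here:

```rust
use std::sync::Arc;

use arrow::array::{Array, BooleanArray, Int64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

/// Build one RecordBatch with a row per row group, holding the min / max
/// statistics for the column referenced by the predicate.
fn statistics_batch(mins: Vec<i64>, maxs: Vec<i64>) -> Result<RecordBatch, ArrowError> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("column_min", DataType::Int64, false),
        Field::new("column_max", DataType::Int64, false),
    ]));
    RecordBatch::try_new(
        schema,
        vec![
            Arc::new(Int64Array::from(mins)),
            Arc::new(Int64Array::from(maxs)),
        ],
    )
}

/// Evaluate the rewritten predicate
///   (column_min / 2) <= 4 && 4 <= (column_max / 2)
/// against the statistics batch, producing one boolean per row group.
/// In the actual implementation this evaluation would be performed by
/// DataFusion's existing physical expression code, not a hand-written loop.
fn evaluate_rewritten_predicate(stats: &RecordBatch) -> BooleanArray {
    let mins = stats
        .column(0)
        .as_any()
        .downcast_ref::<Int64Array>()
        .expect("column_min is Int64");
    let maxs = stats
        .column(1)
        .as_any()
        .downcast_ref::<Int64Array>()
        .expect("column_max is Int64");
    (0..stats.num_rows())
        .map(|i| Some(mins.value(i) / 2 <= 4 && 4 <= maxs.value(i) / 2))
        .collect()
}

fn main() -> Result<(), ArrowError> {
    // Three row groups with different min / max statistics for `column`.
    let stats = statistics_batch(vec![0, 100, 8], vec![50, 200, 8])?;
    let keep = evaluate_rewritten_predicate(&stats);

    // Only row groups whose statistics admit `(column / 2) = 4` are read.
    let selected: Vec<usize> = (0..keep.len()).filter(|&i| keep.value(i)).collect();
    println!("row groups to read: {:?}", selected); // prints [0, 2]
    Ok(())
}
```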
Next steps are: integrate this work with the latest changes from the master branch, publish a WIP PR, and implement more unit tests.
@andygrove, @alamb let me know what you think
Reporter: Yordan Pavlov / @yordan-pavlov
Assignee: Yordan Pavlov / @yordan-pavlov
PRs and other links:
Note: This issue was originally created as ARROW-11074. Please see the migration documentation for further details.