ARROW-11074: [Rust][DataFusion] Implement predicate push-down for parquet tables #9064
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #9064      +/-   ##
==========================================
- Coverage   82.57%   81.66%   -0.92%
==========================================
  Files         204      215      +11
  Lines       50327    52093    +1766
==========================================
+ Hits        41560    42540     +980
- Misses       8767     9553     +786
Continue to review full report at Codecov.
I think this is looking great - thank you @yordan-pavlov
I like the high level approach and algorithm and the implementation looks good.
👍 thank you
@@ -209,6 +251,479 @@ impl ParquetExec {
    }
}

#[derive(Debug, Clone)]
/// Predicate builder used for generating of predicate functions, used to filter row group metadata
pub struct PredicateExpressionBuilder {
nit: probably this doesn't need to be a `pub` struct given that it seems to be tied to the parquet scan implementation
I don't feel strongly about this either way, but at the moment it has to be public because it is used as a parameter in pub ParquetExec::new
I don't feel strongly either way either -- no need to change it
thinking some more about this, it could be done by moving the creation of `PredicateExpressionBuilder` from `ParquetExec::try_from_files` into `(ExecutionPlan for ParquetExec)::execute`, but then this work would be repeated for each partition, whereas currently it's only done once; at this point I don't think it's worth it.
Operator::Eq => {
    let min_column_name = expr_builder.add_min_column();
    let max_column_name = expr_builder.add_max_column();
    // column = literal => column = (min, max) => min <= literal && literal <= max
these comments are quite helpful. Thank you
this particular comment should be `// column = literal => (min, max) = literal => min <= literal && literal <= max` :), but yes, it does require some thinking so I thought it would be good to add these comments to help with the process
}

/// Translate logical filter expression into parquet statistics physical filter expression
fn build_predicate_expression(
I suggest copying the (nicely written) summary of your algorithm from this PR's description somewhere into this file.
It is probably good to mention the assumptions of this predicate expression -- which I think are that it will return `true` if a row group may contain rows that match the predicate, and will return `false` if and only if all rows in the row group cannot match the predicate.
The idea of creating arrays of `(col1_min, col1_max, col2_min, col2_max ...)` is clever (and could likely be applied to sources other than parquet files).
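To make that `true`/`false` assumption concrete, here is a minimal, self-contained sketch of the pruning idea (not the code in this PR; the column `c`, the hand-written per-row-group min/max values, and the use of the arrow compute kernels `lt_eq_scalar` / `gt_eq_scalar` / `and` are illustrative assumptions):

```rust
use arrow::array::{BooleanArray, Int32Array};
use arrow::compute::kernels::boolean::and;
use arrow::compute::kernels::comparison::{gt_eq_scalar, lt_eq_scalar};

fn main() -> arrow::error::Result<()> {
    // One entry per row group, taken from parquet statistics metadata.
    let c_min = Int32Array::from(vec![0, 10, 5]);
    let c_max = Int32Array::from(vec![9, 20, 6]);

    // Query filter `c = 7` rewritten as `c_min <= 7 AND 7 <= c_max`.
    let keep: BooleanArray = and(&lt_eq_scalar(&c_min, 7)?, &gt_eq_scalar(&c_max, 7)?)?;

    // keep == [true, false, false]: only the first row group may contain c = 7,
    // the other two can be skipped entirely.
    println!("{:?}", keep);
    Ok(())
}
```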
// (column / 2) = 4 => (column_min / 2) <= 4 && 4 <= (column_max / 2)
expr_builder
    .scalar_expr()
    .gt_eq(rewrite_column_expr(
stylistically, you might be able to hoist out the repeated calls to

    rewrite_column_expr(
        expr_builder.column_expr(),
        expr_builder.column_name(),
        max_column_name.as_str())

and

    rewrite_column_expr(
        expr_builder.column_expr(),
        expr_builder.column_name(),
        min_column_name.as_str())

by evaluating them once before the `match` expression:

    let min_col_expr = rewrite_column_expr(
        expr_builder.column_expr(),
        expr_builder.column_name(),
        min_column_name.as_str());
    let max_col_expr = rewrite_column_expr(
        expr_builder.column_expr(),
        expr_builder.column_name(),
        max_column_name.as_str());

But the way you have it works well too
@alamb I have just pushed a change which removes this repetition and makes that part of the code cleaner; not long to go now - some more tests to add for the execution of the row group predicate in the next couple of days, and this work should be ready to merge
Awesome -- thanks @yordan-pavlov -- I am excited for this one. When it is ready I'll re-review the code and get it merged asap.
Thanks again for introducing this feature. 🎉
    Max,
}

fn build_null_array(data_type: &DataType, length: usize) -> ArrayRef {
I wonder if you could use `NullArray` here instead: https://github.com/apache/arrow/blob//rust/arrow/src/array/null.rs
that's what I thought at first, and then realized that `NullArray` returns a data type of `DataType::Null`, which doesn't work when the statistics record batch is created, as it checks that types from the schema fields and from the arrays are the same; that's why I wrote the `build_null_array` function
makes sense
@alamb I have now changed the code to use `NullArray` but have had to add a new `new_with_type` constructor function (for the reason explained in my previous comment)
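As a small illustration of the mismatch being described (the field name is hypothetical; the example only relies on `RecordBatch::try_new` validating that each array's data type matches its schema field):

```rust
use std::sync::Arc;
use arrow::array::{ArrayRef, NullArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn main() {
    let schema = Arc::new(Schema::new(vec![Field::new("c1_min", DataType::Int32, true)]));
    // A plain NullArray reports DataType::Null, not Int32.
    let nulls: ArrayRef = Arc::new(NullArray::new(3));
    // The schema/type validation rejects it, which is why a typed all-null
    // array (via the new new_with_type constructor) is needed here.
    assert!(RecordBatch::try_new(schema, vec![nulls]).is_err());
}
```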
@@ -137,6 +137,22 @@ impl<R: 'static + ChunkReader> SerializedFileReader<R> {
            metadata,
        })
    }

    pub fn filter_row_groups(
this is a fancy way to filter out the row groups -- it is probably worth adding documentation here.
I don't know if there are assumptions in the parquet reader code that the row group metadata matches what was read from the file or not.
I suggest you consider filtering the row groups at the DataFusion level (i.e. skip them in the DataFusion physical operator) rather than at the parquet reader level, and avoid that potential problem completely.
Yeah I think we can either move this to the application layer (i.e., DataFusion), or expose it as a util function from footer.rs.
Good point about documentation - will add some shortly.
As long as the row group metadata is filtered immediately after creating a SerializedFileReader, this approach will work.
That's the simplest way I could think of to allow filtering of row groups using statistics metadata; I am not sure how this could be done within DataFusion itself, because it reads data in batches (of configurable size) which could potentially span multiple row groups. It could be done, but it would probably move a lot of complexity into DataFusion which today is nicely abstracted away in the parquet library. It would also expose a lot more of the internals of the parquet file format, as the user would have to be aware of row groups rather than just requesting batches of data.
Maybe I misunderstand what you are suggesting?
there is another possibility - I have just noticed `FilePageIterator::with_row_groups`, which could be used to filter row groups based on a list of row group indexes; this could replace the `filter_row_groups` method but would require the row group indexes to be passed down all the way to `build_for_primitive_type_inner`, where `FilePageIterator` is created; this could be done through a new field in `ArrayReaderBuilderContext`.
It's a deeper change but would mean that the `filter_row_groups` method is no longer necessary. @sunchao do you think this would be a better way to go about filtering of row groups? I am not sure the complexity is worth it.
What I was thinking is that we can have another constructor for `SerializedFileReader` which takes custom metadata:

    pub fn new_with_metadata(chunk_reader: R, metadata: ParquetMetaData) -> Result<Self> {
        Ok(Self {
            chunk_reader: Arc::new(chunk_reader),
            metadata,
        })
    }

and we move the metadata filtering part to DataFusion, or a util function in footer.rs.

In the long term though, I think we should do something similar to what parquet-mr is doing, that is, having a `ParquetReadOptions`-like struct which allows the user to specify various configs, properties, filters, etc. when reading a parquet file. The struct is extensible as well, to accommodate new features in the future such as filtering with column indexes or bloom filters, so we don't need to have multiple constructors. The constructor can become like this:

    pub fn new(chunk_reader: R, options: ParquetReadOptions) -> Result<Self> {
        Ok(Self {
            chunk_reader: Arc::new(chunk_reader),
            options,
        })
    }
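As a rough sketch of that direction (every name below is hypothetical - no such struct exists in the Rust parquet crate yet), the options struct could start small and grow without adding constructors:

```rust
/// Hypothetical, extensible read options, modeled on parquet-mr's ParquetReadOptions;
/// none of this exists in the Rust parquet crate today.
#[derive(Default)]
pub struct ParquetReadOptions {
    /// Row group indexes to read; `None` means read every row group.
    pub row_groups: Option<Vec<usize>>,
    // Future knobs could be added here without breaking the constructor:
    // column-index filters, bloom filters, etc.
}

impl ParquetReadOptions {
    /// Builder-style setter used by callers that have already pruned row groups.
    pub fn with_row_groups(mut self, row_groups: Vec<usize>) -> Self {
        self.row_groups = Some(row_groups);
        self
    }
}
```

DataFusion could then prune row groups against the statistics predicate and pass the surviving indexes via `with_row_groups`, with the reader skipping everything else internally.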
the second option, with the `ParquetReadOptions` parameter, sounds better (compared to the `new_with_metadata` method) - more extensible, as you have described; however, I think this falls outside the scope of this PR.
One issue I can think of, though, is that the code needs to read the statistics metadata from the parquet file in order to create the statistics record batch, execute the predicate expression on it, and then use the results to filter the parquet row groups; this could still work if the parquet metadata can be read before the `SerializedFileReader` is created using the proposed constructor.
> however I think this falls outside of the scope of this PR

I agree -- this is already a large enough PR (and important enough). If we need to add some non-ideal API to parquet and then upgrade it later, I think that is the better approach.
Yeah I didn't mean we should tackle it here - which is why I said "in the long term" :-)
        None
    } else {
        Some(
            filters
Could you add a comment explaining the logic here? It isn't immediately obvious to me.
Immediately after posting that comment I see how it works now, but I think a comment would still be helpful
I took a quick look at this, and while I am not the resident parquet expert (👀 @nevi-me), I was able to follow the idea and understand what is happening. I think it is a great design (using `min, max`) and implementation so far.
I left some comments on the implementation of building the Arrow arrays, but other than that, really good work here so far! 💯
    Box::new(move |_, i| predicate_values[i])
}
// predicate result is not a BooleanArray
_ => Box::new(|_r, _i| true),
I would error or `panic!` here or before that, or validate that the predicate is a boolean array.
My thinking in designing this has been that pushing the predicate down to parquet is optional, because even if it fails the query will still compute, just more slowly; because of that, the code tries to avoid panicking and instead returns a predicate which returns true - it doesn't filter any row groups and lets them be processed by downstream operators.
It is even possible to have a partial predicate expression, where multiple conditions are joined using a logical AND and some of them can't be translated to physical expressions for some reason; those will be replaced by `true`, but the rest will still be evaluated and could still filter some row groups.
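A condensed sketch of that fall-back behaviour (the `RowGroupPredicate` alias, the helper name and the null handling here are illustrative rather than the exact code in this PR):

```rust
use arrow::array::{Array, ArrayRef, BooleanArray};
use parquet::file::metadata::RowGroupMetaData;

type RowGroupPredicate = Box<dyn Fn(&RowGroupMetaData, usize) -> bool>;

fn predicate_from_result(result: ArrayRef) -> RowGroupPredicate {
    match result.as_any().downcast_ref::<BooleanArray>() {
        Some(values) => {
            // Null results (e.g. missing statistics) are treated as "may match",
            // so the corresponding row group is kept.
            let keep: Vec<bool> = (0..values.len())
                .map(|i| values.is_null(i) || values.value(i))
                .collect();
            Box::new(move |_metadata, i| keep[i])
        }
        // Predicate did not evaluate to a BooleanArray: keep every row group
        // instead of failing the query.
        None => Box::new(|_metadata, _i| true),
    }
}
```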
    })
});

if arrow_type == DataType::Utf8 {
I would use a match, as this is a bit brittle against matching specific datatypes.
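For example (hypothetical helper name), a match keeps the handled variants explicit:

```rust
use arrow::datatypes::DataType;

fn uses_string_statistics(arrow_type: &DataType) -> bool {
    match arrow_type {
        // Only Utf8 gets the special byte-array handling discussed here.
        DataType::Utf8 => true,
        _ => false,
    }
}
```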
this may not be very idiomatic Rust, but it allows the code to handle this single special case separately
    make_array(array_data)
}

fn build_statistics_array(
I would have split this into N functions, one per array type (via generics), and written `build_statistics_array` simply as `match data_type { each implementation }`.
This would follow the convention in other places and reduce the risk of mistakes, particularly in matching data types.
let mut builder = ArrayData::builder(arrow_type)
    .len(statistics_count)
    .add_buffer(data_buffer.into());
if null_count > 0 {
    builder = builder.null_bit_buffer(bitmap_builder.finish());
}
let array_data = builder.build();
let statistics_array = make_array(array_data);
if statistics_array.data_type() == data_type {
    return statistics_array;
}
This is only valid for primitive types. In general, I would recommend using `PrimitiveArray<T>::from_iter`, `BooleanArray::from_iter` and `StringArray::from_iter`. Using `MutableBuffer` at this high level is prone to errors. E.g. if we add a filter for boolean types (e.g. eq and neq), this does not panic but the array is not valid (as the size is measured in bits, not bytes).
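A small sketch of that shape (the `StatisticsValues` enum and function name are hypothetical, only there to keep the example self-contained): one typed arm per Arrow type, with `collect`/`from_iter` taking care of the null bitmap and buffer sizing:

```rust
use std::sync::Arc;
use arrow::array::{ArrayRef, Int32Array, Int64Array, StringArray};

/// Hypothetical container for the min or max values of one column across row groups.
enum StatisticsValues {
    Int32(Vec<Option<i32>>),
    Int64(Vec<Option<i64>>),
    Utf8(Vec<Option<String>>),
}

fn statistics_to_array(values: StatisticsValues) -> ArrayRef {
    match values {
        // Missing statistics become nulls in the resulting array.
        StatisticsValues::Int32(v) => Arc::new(v.into_iter().collect::<Int32Array>()),
        StatisticsValues::Int64(v) => Arc::new(v.into_iter().collect::<Int64Array>()),
        StatisticsValues::Utf8(v) => Arc::new(v.into_iter().collect::<StringArray>()),
    }
}
```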
Thank you for your feedback. I was looking for a (mostly) generic approach to building statistics arrays and this is the simplest implementation I could come up with. Using `MutableBuffer` may be prone to errors, but I have added a test to confirm it's working. Your questions make me wonder if this could be done with generics, though.
From what I have seen, parquet statistics are only stored as Int32, Int64, Float, Double or ByteArray (used for strings and other complex types); maybe someone with more experience with parquet can advise on how statistics work for boolean columns.
@jorgecarleitao actually, what I wrote in my previous comment is incorrect - boolean is a valid statistics type, although in most cases I suspect it may not provide very helpful statistics (because it only has two values - true and false); anyway, I will look into a better implementation for the `build_statistics_array` method and support for more types, but probably in a separate PR as this one is already fairly large.
/// Scan the file(s), using the provided projection, and return one BatchIterator per
/// partition.
fn scan(
    &self,
    projection: &Option<Vec<usize>>,
    batch_size: usize,
-   _filters: &[Expr],
+   filters: &[Expr],
Ideally, I feel we should have a proper filter API defined in DataFusion which can be shared among the various data sources. On the other hand, the actual filtering logic should be implemented by the different data sources / formats, probably by converting DataFusion's filter API to the corresponding ones from the latter.
But this is a very good start and we can probably do these as follow-ups (if we don't care much about API changes).
I agree that starting with this PR and then extending it to something more generic is a good approach.
…orks, plus one more test
I went over the changes in this PR since I last reviewed it and I think it is looking good. Thank you so much for all the work @yordan-pavlov
I recommend we merge this PR soon and handle the remaining improvements (e.g. adding the algorithm description into comments and creating parquet reader options) as follow-up PRs -- this one is already large and has a lot of commentary.
    .downcast_ref::<StringArray>()
    .unwrap();
let string_vec = string_array.into_iter().collect::<Vec<_>>();
// here the first max value is None and not the Some("10") value which was actually set
👍
@yordan-pavlov do you think this PR is ready to merge?
@alamb yes I think this is ready to merge and, as you said, already large enough
Sounds good. I'll plan to merge it once master is opened for 4.0 commits (eta tomorrow). Thanks again
I apologize for the delay in merging Rust PRs -- the 3.0 release is being finalized now and we are planning to minimize entropy by postponing merging changes not critical for the release until the process is complete. I hope the process is complete in the next few days. There is more discussion on the mailing list.
Thanks again @yordan-pavlov -- I am totally stoked for this feature
I think this is one of the big features of 4.0 already! Thanks @yordan-pavlov, great work
ARROW-11074: [Rust][DataFusion] Implement predicate push-down for parquet tables

While profiling a DataFusion query I found that the code spends a lot of time in reading data from parquet files. Predicate / filter push-down is a commonly used performance optimization, where statistics data stored in parquet files (such as min / max values for columns in a parquet row group) is evaluated against query filters to determine which row groups could contain data requested by a query. In this way, by pushing down query filters all the way to the parquet data source, entire row groups or even parquet files can be skipped, often resulting in significant performance improvements.

I have been working on an implementation for a few weeks and initial results look promising - with predicate push-down, DataFusion is now faster than Apache Spark (`140ms for DataFusion vs 200ms for Spark`) for the same query against the same parquet files. Without predicate push-down into parquet, DataFusion takes about 2 - 3s (depending on concurrency) for the same query, because the data is ordered and most files don't contain data that satisfies the query filters, but are still loaded and processed in vain.

This work is based on the following key ideas:

* predicate push-down is implemented by filtering row group metadata entries to only those which could contain data that could satisfy query filters
* it's best to reuse the existing code for evaluating physical expressions already implemented in DataFusion
* filter expressions pushed down to a parquet table are rewritten to use parquet statistics (instead of the actual column data), for example `(column / 2) = 4` becomes `(column_min / 2) <= 4 && 4 <= (column_max / 2)` - this is done once for all files in a parquet table
* for each parquet file, a RecordBatch containing all required statistics columns ([`column_min`, `column_max`] in the example above) is produced, and the predicate expression from the previous step is evaluated, producing a binary array which is finally used to filter the row groups in each parquet file

This is still work in progress - more tests left to write; I am publishing this now to gather feedback.

@andygrove let me know what you think

Closes #9064 from yordan-pavlov/parquet_predicate_push_down

Authored-by: Yordan Pavlov <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>