Allow Overriding AsyncFileReader used by ParquetExec #3051
Conversation
Codecov Report
@@ Coverage Diff @@
## master #3051 +/- ##
==========================================
+ Coverage 85.93% 85.95% +0.01%
==========================================
Files 289 290 +1
Lines 52118 52243 +125
==========================================
+ Hits 44788 44903 +115
- Misses 7330 7340 +10
I think this is a good step forward, left some minor nits
struct ThinFileReader {
We should probably fix the need for this upstream. In the meantime, perhaps we could call this BoxedAsyncFileReader to make clear what it is for.
Might also make it a tuple struct to hammer home the fact that it is just a newtype wrapper. A doc comment would also help.
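A minimal sketch of the suggested shape (the trait path is the parquet crate's public one; the exact bounds are illustrative):

```rust
use parquet::arrow::async_reader::AsyncFileReader;

/// Newtype wrapper around a boxed [`AsyncFileReader`], needed only because
/// the parquet stream builder requires a sized, concrete reader type.
/// The single unnamed field keeps the "just a wrapper" intent obvious.
struct BoxedAsyncFileReader(Box<dyn AsyncFileReader + Send>);
```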
Sure. By the way, do you know any better way to solve that than, for example, making upstream accept Box<dyn AsyncFileReader + Send> in the constructor?
Within parquet we could impl AsyncFileReader for Box<dyn AsyncFileReader>, but tbh I wonder if we should just type-erase by default 🤔
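For illustration, a sketch of what such an impl could look like; coherence rules mean it only compiles inside the parquet crate itself, and the method list here is assumed from the trait's public surface (this is roughly what apache/arrow-rs#2368 later did):

```rust
use std::ops::Range;
use std::sync::Arc;

use bytes::Bytes;
use futures::future::BoxFuture;
use parquet::arrow::async_reader::AsyncFileReader; // would be `crate::...` upstream
use parquet::errors::Result;
use parquet::file::metadata::ParquetMetaData;

// Delegate every method to the boxed inner reader, so that a
// `Box<dyn AsyncFileReader + Send>` itself satisfies the trait.
impl AsyncFileReader for Box<dyn AsyncFileReader + Send> {
    fn get_bytes(&mut self, range: Range<usize>) -> BoxFuture<'_, Result<Bytes>> {
        self.as_mut().get_bytes(range)
    }

    fn get_byte_ranges(
        &mut self,
        ranges: Vec<Range<usize>>,
    ) -> BoxFuture<'_, Result<Vec<Bytes>>> {
        self.as_mut().get_byte_ranges(ranges)
    }

    fn get_metadata(&mut self) -> BoxFuture<'_, Result<Arc<ParquetMetaData>>> {
        self.as_mut().get_metadata()
    }
}
```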
Fix in apache/arrow-rs#2368
Looks good to me, thank you 👍
Thank you @tustvold for the code review, it helped me a lot to make the code nicer.
Failing clippy lint, then this can go in.
I have added fixes, so the PR might pass the clippy lints now. But it might also fail, because the compiler reports that there is something wrong with
Oops, let me fix that upstream. In the meantime can you add an allow? Edit: Upstream fix apache/arrow-rs#2366
I have added allows in a few places, related to where the issue originates, and at some top levels, but the compiler won't go silent about that. I have tried such an allow at the top of the file too. By the way, the compiler points to the parquet crate,
Let me have a play and see if I can find some way to placate clippy.
I'll have a chat with @alamb and see what would be the circumstances for a patch release.
Hacky workaround in Cheappie#1. @alamb, if you could give this a once-over when you have a moment: this is the least terrible way I have found to make this work...
LGTM
My biggest recommendation is to add an end-to-end example (like https://github.com/apache/arrow-datafusion/blob/master/datafusion-examples/examples/custom_datasource.rs#L160) for two reasons:
- Possibly help others use this feature
- (More important) ensure that the plumbing (especially the user defined extensions) continues to work. I think this code is likely to undergo significant churn over the next few months as we push more predicates down closer to parquet, and without an end-to-end test we may end up breaking something inadvertently
cc @thinkharderdev and @yjshen
@@ -401,6 +404,33 @@ fn create_dict_array(
/// A single file or part of a file that should be read, along with its schema, statistics
Maybe I am missing something, but I don't see "schema" and "statistics" on this struct. Perhaps we should refer to "extensions" instead?
This description has been copied from PartitionedFile. At first I had a similar impression that something was wrong, but on a second read it actually doesn't say that schema and statistics are part of the struct, but rather that they should be read from the file along with the data.
/// A single file or part of a file that should be read, along with its schema, statistics
pub struct FileMeta {
This struct makes a lot of sense to me
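For reference, a sketch of roughly what this struct carries, pieced together from the surrounding discussion (the field names and the FileRange shape are assumptions, not a verbatim copy of the PR):

```rust
use std::any::Any;
use std::sync::Arc;

use object_store::ObjectMeta;

/// Byte range within a file (assumed shape; start inclusive, end exclusive).
pub struct FileRange {
    pub start: i64,
    pub end: i64,
}

/// A single file, or part of one, that should be read.
pub struct FileMeta {
    /// Location, size, and last-modified time reported by the object store.
    pub object_meta: ObjectMeta,
    /// Optionally restrict the read to a sub-range of the file.
    pub range: Option<FileRange>,
    /// Opaque user-defined payload, threaded through from PartitionedFile
    /// down to a custom reader factory.
    pub extensions: Option<Arc<dyn Any + Send + Sync>>,
}
```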
fn create_reader(
    &self,
    partition_index: usize,
    file_meta: FileMeta,
The file meta contains user-defined extensions too, right? So users can pass information down into their custom readers?
Yes, exactly that. But afaik they will be able to achieve that only through a custom TableProvider, because extensions are passed through PartitionedFile into FileMeta, whereas a higher-level abstraction like ObjectStore produces ObjectMeta in listing operations.
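A rough sketch of that flow, building on the FileMeta sketch above (TenantTag and the exact field access on PartitionedFile are hypothetical):

```rust
use std::any::Any;
use std::sync::Arc;

/// Hypothetical per-file payload a custom TableProvider could attach.
#[derive(Debug)]
struct TenantTag {
    tenant_id: u64,
}

/// When planning the scan, a custom TableProvider would set
/// `PartitionedFile::extensions` to something like this:
fn make_extensions() -> Arc<dyn Any + Send + Sync> {
    Arc::new(TenantTag { tenant_id: 42 })
}

/// A custom ParquetFileReaderFactory can then recover the payload from the
/// FileMeta it receives (the downcast returns None for foreign payloads).
fn tenant_of(file_meta: &FileMeta) -> Option<u64> {
    file_meta
        .extensions
        .as_ref()
        .and_then(|ext| ext.downcast_ref::<TenantTag>())
        .map(|tag| tag.tenant_id)
}
```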
makes sense
/// BoxedAsyncFileReader has been created to satisfy type requirements of
/// parquet stream builder constructor.
Is this necessary after apache/arrow-rs#2366? If not perhaps we can add a reference here so we can eventually clean it up?
apache/arrow-rs#2368 is the one that will remove the need for this
Here is another example of a test that ensures user defined plan nodes don't get messed up: https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/tests/user_defined_plan.rs (this has saved me several times from breaking them downstream)
I have added a test that checks whether data access ops are routed to the parquet file reader factory. It seems that it works as expected.
The only other thing I would suggest is making it a "cargo integration test" by putting it in
#[derive(Debug)]
struct InMemoryParquetFileReaderFactory(Arc<dyn ObjectStore>);
❤️
this test looks great
I have moved that test to the integration tests, but for users' convenience I had to expose two things: ParquetFileMetrics and fn fetch_parquet_metadata(...). I have added docs that describe that these are subject to change in the near future and are exposed for low-level integrations. Should we worry about exposing impl details, or is having such a doc in place, saying these things might change, enough?
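To make the usage concrete, a hedged sketch of how a downstream factory might call the newly exposed helper; the module path and exact signature of fetch_parquet_metadata are assumptions here, so check the docs of the release you use:

```rust
use datafusion::error::Result;
// Path assumed; the helper may be re-exported elsewhere.
use datafusion::physical_plan::file_format::fetch_parquet_metadata;
use object_store::{ObjectMeta, ObjectStore};
use parquet::file::metadata::ParquetMetaData;

/// Prefetch the parquet footer for a file before constructing a reader,
/// rather than re-implementing footer parsing downstream.
async fn load_footer(
    store: &dyn ObjectStore,
    meta: &ObjectMeta,
) -> Result<ParquetMetaData> {
    // The third argument is a size hint: over-fetching the file tail can
    // avoid a second round trip when the footer is large.
    fetch_parquet_metadata(store, meta, None).await
}
```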
Thanks @Cheappie -- I think the comments are good enough. In fact it is perhaps good to see what was required to be made pub to use these new APIs. Hopefully it makes downstream integrations easier.
Disable where_clauses_object_safety
I took the liberty of updating the RAT in e18d26f to fix CI: https://github.com/apache/arrow-datafusion/runs/7750069190?check_suite_focus=true
Looking into the Clippy failure....
thanks
@alamb I forgot to add
This time clippy should be happy, fingers crossed.
Thank you for sticking with this @Cheappie
Benchmark runs are scheduled for baseline = 01202d6 and contender = 9956f80. 9956f80 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
First steps towards overriding AsyncFileReader used by ParquetExec
#2992