Allow Overriding AsyncFileReader used by ParquetExec #3051
Conversation
Codecov Report
@@ Coverage Diff @@
## master #3051 +/- ##
==========================================
+ Coverage 85.93% 85.95% +0.01%
==========================================
Files 289 290 +1
Lines 52118 52243 +125
==========================================
+ Hits 44788 44903 +115
- Misses 7330 7340 +10
I think this is a good step forward, left some minor nits
struct ThinFileReader {
We should probably fix the need for this upstream. In the meantime, perhaps we could call this BoxedAsyncFileReader to make clear what it is for.
Might also make it a tuple struct to hammer home the fact that it is just a newtype wrapper. A doc comment would also help.
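A minimal sketch of the suggested shape (the trait path is the parquet crate's public one; the exact bounds are illustrative):

```rust
use parquet::arrow::async_reader::AsyncFileReader;

/// Newtype wrapper around a boxed [`AsyncFileReader`], needed only because
/// the parquet stream builder requires a sized, concrete reader type.
/// The single unnamed field keeps the "just a wrapper" intent obvious.
struct BoxedAsyncFileReader(Box<dyn AsyncFileReader + Send>);
```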
Sure. By the way, do you know any better way to solve that than, for example, making upstream accept Box<dyn AsyncFileReader + Send> in the constructor?
Within parquet we could impl AsyncFileReader for Box<dyn AsyncFileReader>, but tbh I wonder if we should just type-erase by default 🤔
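For illustration, a sketch of what such an impl could look like; coherence rules mean it only compiles inside the parquet crate itself, and the method list here is assumed from the trait's public surface (this is roughly what apache/arrow-rs#2368 later did):

```rust
use std::ops::Range;
use std::sync::Arc;

use bytes::Bytes;
use futures::future::BoxFuture;
use parquet::arrow::async_reader::AsyncFileReader; // would be `crate::...` upstream
use parquet::errors::Result;
use parquet::file::metadata::ParquetMetaData;

// Delegate every method to the boxed inner reader, so that a
// `Box<dyn AsyncFileReader + Send>` itself satisfies the trait.
impl AsyncFileReader for Box<dyn AsyncFileReader + Send> {
    fn get_bytes(&mut self, range: Range<usize>) -> BoxFuture<'_, Result<Bytes>> {
        self.as_mut().get_bytes(range)
    }

    fn get_byte_ranges(
        &mut self,
        ranges: Vec<Range<usize>>,
    ) -> BoxFuture<'_, Result<Vec<Bytes>>> {
        self.as_mut().get_byte_ranges(ranges)
    }

    fn get_metadata(&mut self) -> BoxFuture<'_, Result<Arc<ParquetMetaData>>> {
        self.as_mut().get_metadata()
    }
}
```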
Fix in apache/arrow-rs#2368
Looks good to me, thank you 👍
Thank you @tustvold for the code review, it helped me a lot to make the code nicer.
Failing clippy lint, then this can go in.
I have added fixes, so the PR might pass the clippy lints now. But it might also fail, because the compiler reports that there is something wrong with
Oops, let me fix that upstream. In the meantime can you add an allow? Edit: Upstream fix apache/arrow-rs#2366
I have added allows in a few places, related to where the issue originates, and at some top levels, but the compiler won't go silent about that. I have tried such an allow at the top of the file too. By the way, the compiler points to the parquet crate,
Let me have a play and see if I can find some way to placate clippy.
I'll have a chat with @alamb and see what would be the circumstances for a patch release.
Hacky workaround in Cheappie#1. @alamb, if you could give this a once-over when you have a moment: this is the least terrible way I have found to make this work...
LGTM
My biggest recommendation is to add an end-to-end example (like https://github.com/apache/arrow-datafusion/blob/master/datafusion-examples/examples/custom_datasource.rs#L160) for two reasons:
- Possibly help others use this feature
- (More important) ensure that the plumbing (especially the user defined extensions) continues to work. I think this code is likely to undergo significant churn over the next few months as we push more predicates down closer to parquet, and without an end-to-end test we may end up breaking something inadvertently
cc @thinkharderdev and @yjshen
@@ -401,6 +404,33 @@ fn create_dict_array(
/// A single file or part of a file that should be read, along with its schema, statistics
Maybe I am missing something, but I don't see "schema" and "statistics" on this struct. Perhaps we should refer to "extensions" instead?
This description has been copied from PartitionedFile. At first I had a similar impression that something was wrong, but on a second read it actually doesn't say that schema and statistics are part of the struct, but rather that they should be read from the file along with the data.
/// A single file or part of a file that should be read, along with its schema, statistics
pub struct FileMeta {
This struct makes a lot of sense to me
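For reference, a sketch of roughly what this struct carries, pieced together from the surrounding discussion (the field names and the FileRange shape are assumptions, not a verbatim copy of the PR):

```rust
use std::any::Any;
use std::sync::Arc;

use object_store::ObjectMeta;

/// Byte range within a file (assumed shape; start inclusive, end exclusive).
pub struct FileRange {
    pub start: i64,
    pub end: i64,
}

/// A single file, or part of one, that should be read.
pub struct FileMeta {
    /// Location, size, and last-modified time reported by the object store.
    pub object_meta: ObjectMeta,
    /// Optionally restrict the read to a sub-range of the file.
    pub range: Option<FileRange>,
    /// Opaque user-defined payload, threaded through from PartitionedFile
    /// down to a custom reader factory.
    pub extensions: Option<Arc<dyn Any + Send + Sync>>,
}
```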
fn create_reader(
    &self,
    partition_index: usize,
    file_meta: FileMeta,
The file meta contains user-defined extensions too, right? So users can pass information down into their custom readers?
Yes, exactly that. But afaik they will be able to achieve that only through a custom TableProvider, because extensions are passed through PartitionedFile into FileMeta, whereas a higher-level abstraction like ObjectStore produces ObjectMeta in listing operations.
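A rough sketch of that flow, building on the FileMeta sketch above (TenantTag and the exact field access on PartitionedFile are hypothetical):

```rust
use std::any::Any;
use std::sync::Arc;

/// Hypothetical per-file payload a custom TableProvider could attach.
#[derive(Debug)]
struct TenantTag {
    tenant_id: u64,
}

/// When planning the scan, a custom TableProvider would set
/// `PartitionedFile::extensions` to something like this:
fn make_extensions() -> Arc<dyn Any + Send + Sync> {
    Arc::new(TenantTag { tenant_id: 42 })
}

/// A custom ParquetFileReaderFactory can then recover the payload from the
/// FileMeta it receives (the downcast returns None for foreign payloads).
fn tenant_of(file_meta: &FileMeta) -> Option<u64> {
    file_meta
        .extensions
        .as_ref()
        .and_then(|ext| ext.downcast_ref::<TenantTag>())
        .map(|tag| tag.tenant_id)
}
```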
makes sense
/// BoxedAsyncFileReader has been created to satisfy type requirements of
/// parquet stream builder constructor.
Is this necessary after apache/arrow-rs#2366? If not perhaps we can add a reference here so we can eventually clean it up?
apache/arrow-rs#2368 is the one that will remove the need for this
Here is another example of a test that ensures user defined plan nodes don't get messed up: https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/tests/user_defined_plan.rs (this has saved me several times from breaking them downstream)
I have added a test that checks whether data access ops are routed to the parquet file reader factory. It seems that it works as expected.
The only other thing I would suggest is making it a "cargo integration test" by putting it in
#[derive(Debug)]
struct InMemoryParquetFileReaderFactory(Arc<dyn ObjectStore>);
❤️
this test looks great
I have moved that test to the integration tests, but for users' convenience I had to expose two things: ParquetFileMetrics and fn fetch_parquet_metadata(...). I have added docs that describe that these are subject to change in the near future and are exposed for low-level integrations. Should we worry about exposing impl details, or is having such a doc in place, saying these things might change, enough?
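To make the usage concrete, a hedged sketch of how a downstream factory might call the newly exposed helper; the module path and exact signature of fetch_parquet_metadata are assumptions here, so check the docs of the release you use:

```rust
use datafusion::error::Result;
// Path assumed; the helper may be re-exported elsewhere.
use datafusion::physical_plan::file_format::fetch_parquet_metadata;
use object_store::{ObjectMeta, ObjectStore};
use parquet::file::metadata::ParquetMetaData;

/// Prefetch the parquet footer for a file before constructing a reader,
/// rather than re-implementing footer parsing downstream.
async fn load_footer(
    store: &dyn ObjectStore,
    meta: &ObjectMeta,
) -> Result<ParquetMetaData> {
    // The third argument is a size hint: over-fetching the file tail can
    // avoid a second round trip when the footer is large.
    fetch_parquet_metadata(store, meta, None).await
}
```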
Thanks @Cheappie -- I think the comments are good enough. In fact it is perhaps good to see what was required to be made pub to use these new APIs. Hopefully it makes downstream integrations easier.
Disable where_clauses_object_safety
I took the liberty of updating the RAT in e18d26f to fix CI: https://github.com/apache/arrow-datafusion/runs/7750069190?check_suite_focus=true
Looking into the Clippy failure....
thanks
@alamb I forgot to add
This time clippy should be happy, fingers crossed.
Thank you for sticking with this @Cheappie
Benchmark runs are scheduled for baseline = 01202d6 and contender = 9956f80. 9956f80 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
First steps towards overriding AsyncFileReader used by ParquetExec
#2992