feat(rust,python): Enable object store in scan_parquet python #6426
Conversation
Force-pushed from `8e63d30` to `d16c4db`
Thanks a lot for the PR @winding-lines. I hope to get to this one tomorrow. 💪
Sorry @ritchie46, can you wait until the weekend/next week? The code is working, but the speedup is not significant: 6 seconds faster on the 42-second test that I am using. It may well be that the speedup will not be significant when done and that the only benefit is that we can make the planner more efficient, but I want to push a bit more.
Force-pushed from `d16c4db` to `58cafde`
@ritchie46 here is an update on this PR:
This Dask thread dask/community#234 (comment) has some good pointers on what we could do next. The general idea is to load the metadata from a centralized system, and for a general library as

There is a secondary long-term design question: assuming that 1 year from now
Great work @winding-lines! Some unsolicited feedback on the function signature: what do you think of making it a little more generic, e.g.

or

If the `storage_handler` (or maybe `storage_backend`) param is missing, it defaults to fsspec.
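A minimal sketch of what that more generic signature could look like. This is purely hypothetical: `storage_handler`, the fsspec fallback, and the returned plan shape are all assumptions for illustration, not the actual Polars API.

```python
# Hypothetical sketch of the suggested signature -- not the real Polars API.
# `storage_handler` picks the I/O backend; when omitted it falls back to fsspec.
def scan_parquet(uri, storage_handler=None, storage_options=None, **kwargs):
    handler = storage_handler or "fsspec"  # default backend per the suggestion
    options = storage_options or {}
    # A real implementation would dispatch to the chosen backend here;
    # we only return the resolved plan for illustration.
    return {"uri": uri, "handler": handler, "options": options}

plan = scan_parquet("s3://bucket/data.parquet", storage_handler="object_store")
```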
It's not a part of the parquet spec, but systems that write directories of Parquet files often produce both

This is especially helpful because you can read the

On the Rust side,
```rust
pub fn scan_parquet_async(
    uri: String,
    n_rows: Option<usize>,
    cache: bool,
    parallel: polars_io::parquet::ParallelStrategy,
    row_count: Option<RowCount>,
    rechunk: bool,
    low_memory: bool,
    cloud_options: Option<CloudOptions>,
) -> BoxFuture<'static, PolarsResult<Self>> {
```
I'm guessing that this would also likely be much faster once jorgecarleitao/parquet2#159 is implemented. Parquet stores its metadata in the footer of the file, and the last bytes of the file describe how many bytes the metadata take up. Right now arrow2/parquet2 require a HEAD request first to get the content length of the parquet file and then try to load the metadata. But this would be faster by:

- Fetching the last `N` bytes of a Parquet file using a range request, where `N` is something like 64 KB, which often includes the entire metadata.
- Checking the last 4 bytes of the buffer to get the length of the metadata.
- If `N` wasn't enough bytes, doing one more fetch to get all the metadata.
- Then parsing the metadata.
Maybe I'll find some time to try and make a PR for that in parquet2
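The fetch sequence above can be sketched as follows, simulating the range requests on an in-memory buffer instead of HTTP. Per the Parquet format, the file ends with a 4-byte little-endian metadata length followed by the `PAR1` magic bytes; the fake Thrift payload and the `fetch_range` callable are stand-ins for illustration.

```python
import struct

def read_footer(fetch_range, file_size, guess=64 * 1024):
    """fetch_range(start, end) returns bytes [start, end) of the 'file'."""
    # First range request: grab the last `guess` bytes of the file.
    start = max(0, file_size - guess)
    tail = fetch_range(start, file_size)
    # The file must end with the 4-byte PAR1 magic...
    assert tail[-4:] == b"PAR1", "not a Parquet file"
    # ...preceded by a 4-byte little-endian metadata length.
    meta_len = struct.unpack("<I", tail[-8:-4])[0]
    needed = meta_len + 8
    if needed > len(tail):
        # First guess was too small: one more range request for the rest.
        tail = fetch_range(file_size - needed, file_size)
    return tail[-needed:-8]  # raw (here: fake) Thrift-encoded metadata

# Simulated "file": column data, metadata, metadata length, magic.
metadata = b"fake-thrift-metadata"
blob = b"column-data" + metadata + struct.pack("<I", len(metadata)) + b"PAR1"
footer = read_footer(lambda s, e: blob[s:e], len(blob))
```

With the default 64 KB guess a single request suffices; passing a tiny `guess` exercises the second-fetch path.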
@kylebarron I am working on a native polars parquet reader. Hope we can include this.
Force-pushed from `2df2697` to `c0cb61a`
Current status:
Force-pushed from `dbbf410` to `5af7516`
Force-pushed from `5af7516` to `b46b979`
@ritchie46 this is ready for review. I am happy to change the API as per @talawahtech's great feedback, if you agree with it. I am not yet familiar with the Python API you desire for Polars :)
```rust
])
.with_predicate_pushdown(true)
```
All those optimizations are default. We don't have to specify them.
Removed, thanks for letting me know.
```rust
// List all the files in the bucket.
let files = async_glob(url, cloud_options.as_ref()).await?;
if files.len() > 10 {
    println!(
```
We should not print to stdout.
Makes sense. The intent is to give some feedback for what is likely a slow operation. I changed this to stderr, but if that is not acceptable I am happy to remove it.
```rust
.unwrap_or_else(|| read::read_metadata(&mut reader))?;
let reader = ReaderBytes::from(&reader);
let bytes = reader.deref();
let column_store = mmap::ColumnStore::Local(bytes);
```
I don't think `ColumnStore` is a good name. Can't `local` store whole row-groups?
@ritchie46 thanks for looking at the PR, and I agree it's not a good name. To specifically answer your question: in `local` mode we don't actually store anything; it's just a pass-through to the mmap-ed file.
MORE CONTEXT (sorry this is a bit long)
We are a bit in a tight spot:
- when running locally we want to use mmap
- when running remotely we want to download only the parts of the file of interest (footer, metadata and then columns; this can be optimized a bit as per one of the comments)
- we currently want the `async` code to be as self-contained as possible

The `ColumnStore` is an abstraction that I introduced in an earlier PR to allow us to keep most of the code as-is and especially limit the spread of `async`. To achieve this, the layer provides a level of indirection where the reader uses linear addressing, so the code stays as-is.
For the local use case the `ColumnStore` is a pass-through:
- it maps the address requested by the reader to the actual address in the file
- the OS does its magic
- the code that you are looking at here is a no-op.
For the remote use case:
- we pre-download the regions of the file that we know will be accessed later on.
- This allows us to go into async mode for the duration of the download and then come back to non-async mode.
- All the downloaded areas get stored in a hashmap.
- When the reader executes and accesses an address, we return the downloaded buffer.
I see a couple of ways forward:
- allow the reader to run in async mode and actually do the download in the reader when the region of the file is needed
- keep the current async island but come up with a better name for the `ColumnStore`

Option 1 is a bit of a departure from the current code architecture, and you seem to prefer threads to async. Over the long run it may be inevitable though, since we need to teach the planner the cost of operations.
If you want to go with option 2 I could rename this to `ColumnMapper` or `CloudMapper`, or some other name that resonates more with you.
Let me know what you think 🤗
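The two modes described above can be sketched roughly like this. This is a Python analogy of the Rust abstraction, not the actual polars internals; the class names, the `(start, length)` fetch interface, and the sample data are all illustrative.

```python
# Illustrative analogy of the ColumnStore indirection (not real polars code).
# The reader always asks for a byte range; where the bytes come from is
# hidden behind this one seam, so the reader code stays synchronous.

class LocalStore:
    """Pass-through: ranges map straight onto the mmap-ed file."""
    def __init__(self, data: bytes):
        self.data = data  # stands in for the memory map

    def fetch(self, start: int, length: int) -> bytes:
        # No-op indirection: the OS does the real work via mmap.
        return self.data[start:start + length]

class RemoteStore:
    """Ranges were pre-downloaded (during the async phase) into a hashmap."""
    def __init__(self, downloaded: dict):
        self.downloaded = downloaded  # start offset -> downloaded buffer

    def fetch(self, start: int, length: int) -> bytes:
        # The reader hits a region that was fetched ahead of time.
        buf = self.downloaded[start]
        return buf[:length]

local = LocalStore(b"abcdef")
remote = RemoteStore({2: b"cdef"})
```

Either store satisfies the same `fetch` contract, which is what lets the rest of the reader stay unchanged.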
Sorry for the delay (really busy). Great explanation. Could we add this as comments on top of the `ColumnStore`? Maybe we call it `CloudMapper` indeed.

I think we should first go for option 2. I see some opportunities for some `async` in the streaming engine, but I don't want to overhaul our design; I'd rather experiment with this gradually.
Thanks a lot Ritchie! Let's chain this PR on PR-6830. After we evaluate and possibly land PR-6830 we can come here and do a better job with the integration. This code will be much simpler if we have a Tokio runtime available when making it down to the download layer.
Force-pushed from `3cb1df1` to `ff360a6`
Force-pushed from `ff360a6` to `fb5502b`
I'm closing this pull request due to inactivity. Feel free to rebase and reopen and continue your work! For reference, this is now possible in Python through
This PR enables the object_store downloads in Python when using scan_parquet. This is opt-in, and `"object_store": True` needs to be passed in the `storage_options` parameter. For example:
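A usage sketch along those lines. The bucket URL is a placeholder, and the call is wrapped in a guard so the snippet stays self-contained without polars installed, network access, or cloud credentials; only the `storage_options` opt-in flag comes from the PR description itself.

```python
# Opt in to the object_store download path via storage_options,
# as described in the PR.
storage_options = {"object_store": True}

try:
    import polars as pl

    # Lazy scan over a placeholder S3 path; nothing is materialized
    # until .collect() is called on the resulting LazyFrame.
    lf = pl.scan_parquet(
        "s3://my-bucket/data/*.parquet",
        storage_options=storage_options,
    )
except Exception:
    pass  # polars missing or no cloud access; the call shape is the point
```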