
Enable object_store reading for all the file types #6177

Open
winding-lines opened this issue Jan 11, 2023 · 3 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@winding-lines
Contributor

Problem description

Right now the object_store crate is integrated for reading only on the Parquet streaming path. Enable cloud-URL reading for all the file types.
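To illustrate the dispatch this implies: object_store-style backends are selected by URL scheme, so each scan/read entry point would first need to recognize cloud URLs and route them away from the local filesystem reader. A minimal stdlib-only sketch — the scheme list below is illustrative, not the exact set object_store or Polars supports:

```python
from urllib.parse import urlparse

# Hypothetical helper: decide whether a path should be routed to an
# object-store backend or to the local filesystem reader. This scheme
# list is an assumption for illustration only.
CLOUD_SCHEMES = {"s3", "gs", "gcs", "az", "abfs", "http", "https"}

def is_cloud_url(path: str) -> bool:
    return urlparse(path).scheme in CLOUD_SCHEMES

print(is_cloud_url("s3://bucket/data.parquet"))  # True
print(is_cloud_url("data/local.parquet"))        # False
```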

@winding-lines winding-lines added the enhancement New feature or an improvement of an existing feature label Jan 11, 2023
@chitralverma
Contributor

@winding-lines is there an open PR or any update on this?

@winding-lines
Contributor Author

@chitralverma I am still working on #6830, to fully integrate async in Python. After some dead ends I now see a possible architecture that marries the thread-heavy code in Polars with async capabilities. See today's comment for my current thinking.

@josevalim

josevalim commented Apr 12, 2023

@winding-lines we have also been thinking about this problem. In particular, we are trying to understand what needs to be done by Polars itself for performance reasons and what could be done externally to avoid adding too much to Polars.

Here is a table summarizing our understanding so far (✓ means supported, ? means unsupported, x means Polars does not need to support it):

| Read format | eager file | lazy file | lazy s3 |
| --- | --- | --- | --- |
| csv | ✓ | ✓ | x |
| parquet | ✓ | ✓ | ✓ |
| ipc | ✓ | ✓ | ? |
| ipc stream | ✓ | ? | x |
| ndjson | ✓ | ✓ | x |

The reason why csv, ipc_stream, and ndjson do not need "lazy s3" support within Polars is that those formats must be fully loaded upfront. So we might as well download them to disk (or memory) and use the other existing APIs. Also note I didn't list "eager s3" because it is equivalent to calling the lazy version followed by a collect() call.
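The "download fully upfront, then reuse the existing eager reader" fallback described above can be sketched with the stdlib alone. Here a `data:` URL stands in for a remote CSV object; in practice the bytes would come from object_store and the buffer would be handed to an existing eager reader such as Polars' `read_csv`:

```python
import io
from urllib.request import urlopen

def fetch_to_buffer(url: str) -> io.BytesIO:
    """Download the whole object into memory; the buffer can then be
    passed to an existing eager reader (e.g. pl.read_csv(buf))."""
    with urlopen(url) as resp:
        return io.BytesIO(resp.read())

# A data: URL stands in for a remote CSV object in this sketch.
buf = fetch_to_buffer("data:text/csv,a%2Cb%0A1%2C2")
print(buf.read())  # b'a,b\n1,2'
```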


We are also interested in streaming to object storage, both parquet and ipc. See #6178. For write operations we have this table:

| Write format | eager file | eager S3 | streaming file | streaming s3 |
| --- | --- | --- | --- | --- |
| csv | ✓ | ? | x | x |
| parquet | ✓ | ? | ✓ | ? |
| ipc | ✓ | ? | ✓ | ? |
| ipc stream | ✓ | ? | x | x |
| ndjson | ✓ | ? | x | x |

I believe the "eager S3" operations can be implemented today using the underlying {Format}Writer APIs. And, once again, it doesn't make sense to stream csv, ipc_stream, and ndjson: since they are row-based, the eager versions are enough.
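That "eager S3" pattern — serialize with an existing writer into memory, then upload the buffer in one call — can be sketched as follows. `csv.writer` stands in for a Polars {Format}Writer here, and the upload call is left as a comment since it depends on the object-store client in use:

```python
import csv
import io

def rows_to_csv_bytes(rows) -> bytes:
    """Serialize rows with an in-memory writer; the resulting buffer
    can then be uploaded to object storage in a single PUT."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue().encode()

payload = rows_to_csv_bytes([("a", "b"), (1, 2)])
# e.g. with boto3: s3.put_object(Bucket="my-bucket", Key="out.csv", Body=payload)
print(payload)  # b'a,b\r\n1,2\r\n'
```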


So, according to our (limited) understanding, the pending operations which must happen on the Polars side are not that many. Please let me know if I missed anything. Thank you!
