
Enable object_store reading for all the file types #6177

Open
winding-lines opened this issue Jan 11, 2023 · 3 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@winding-lines
Contributor

Problem description

Right now the object_store crate is integrated for reading only on the Parquet streaming path. Enable cloud-URL reading for all the file types.
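To illustrate the dispatch this implies: object_store-style backends are selected by URL scheme, so each scan/read entry point would first need to recognize cloud URLs and route them away from the local filesystem reader. A minimal stdlib-only sketch — the scheme list below is illustrative, not the exact set object_store or Polars supports:

```python
from urllib.parse import urlparse

# Hypothetical helper: decide whether a path should be routed to an
# object-store backend or to the local filesystem reader. This scheme
# list is an assumption for illustration only.
CLOUD_SCHEMES = {"s3", "gs", "gcs", "az", "abfs", "http", "https"}

def is_cloud_url(path: str) -> bool:
    return urlparse(path).scheme in CLOUD_SCHEMES

print(is_cloud_url("s3://bucket/data.parquet"))  # True
print(is_cloud_url("data/local.parquet"))        # False
```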

@winding-lines winding-lines added the enhancement New feature or an improvement of an existing feature label Jan 11, 2023
@chitralverma
Contributor

@winding-lines is there an open PR or any update on this?

@winding-lines
Contributor Author

@chitralverma I am still working on #6830, to fully integrate async in Python. After some dead ends I now see a possible architecture that marries the thread-heavy code in Polars with async capabilities. See today's comment for my current thinking.

@josevalim

josevalim commented Apr 12, 2023

@winding-lines we have also been thinking about this problem. In particular, we are trying to understand what needs to be done by Polars itself for performance reasons and what could be done externally to avoid adding too much to Polars.

Here is a table summarizing our understanding so far (✓ means supported, ? means unsupported, x means Polars does not need to support it):

| Read format | eager file | lazy file | lazy s3 |
| --- | --- | --- | --- |
| csv | ✓ | ✓ | x |
| parquet | ✓ | ✓ | ✓ |
| ipc | ✓ | ✓ | ? |
| ipc stream | ✓ | ? | x |
| ndjson | ✓ | ✓ | x |

The reason why csv, ipc_stream, and ndjson do not need "lazy s3" support within Polars is that those formats must be fully loaded upfront. So we might as well download them to disk (or memory) and use the other existing APIs. Also note I didn't list "eager s3" because it is equivalent to calling the lazy version followed by a collect() call.
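The "download fully upfront, then reuse the existing eager reader" fallback described above can be sketched with the stdlib alone. Here a `data:` URL stands in for a remote CSV object; in practice the bytes would come from object_store and the buffer would be handed to an existing eager reader such as Polars' `read_csv`:

```python
import io
from urllib.request import urlopen

def fetch_to_buffer(url: str) -> io.BytesIO:
    """Download the whole object into memory; the buffer can then be
    passed to an existing eager reader (e.g. pl.read_csv(buf))."""
    with urlopen(url) as resp:
        return io.BytesIO(resp.read())

# A data: URL stands in for a remote CSV object in this sketch.
buf = fetch_to_buffer("data:text/csv,a%2Cb%0A1%2C2")
print(buf.read())  # b'a,b\n1,2'
```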


We are also interested in streaming to object storage, both parquet and ipc. See #6178. For write operations we have this table:

| Write format | eager file | eager S3 | streaming file | streaming s3 |
| --- | --- | --- | --- | --- |
| csv | ✓ | ? | x | x |
| parquet | ✓ | ? | ✓ | ? |
| ipc | ✓ | ? | ✓ | ? |
| ipc stream | ✓ | ? | x | x |
| ndjson | ✓ | ? | x | x |

I believe the "eager S3" operations can be implemented today using the underlying {Format}Writer APIs. And, once again, it doesn't make sense to stream csv, ipc_stream, and ndjson: since they are row-based, the eager versions are enough.
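That "eager S3" pattern — serialize with an existing writer into memory, then upload the buffer in one call — can be sketched as follows. `csv.writer` stands in for a Polars {Format}Writer here, and the upload call is left as a comment since it depends on the object-store client in use:

```python
import csv
import io

def rows_to_csv_bytes(rows) -> bytes:
    """Serialize rows with an in-memory writer; the resulting buffer
    can then be uploaded to object storage in a single PUT."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue().encode()

payload = rows_to_csv_bytes([("a", "b"), (1, 2)])
# e.g. with boto3: s3.put_object(Bucket="my-bucket", Key="out.csv", Body=payload)
print(payload)  # b'a,b\r\n1,2\r\n'
```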


So, according to our (limited) understanding, the pending operations which must happen on the Polars side are not that many. Please let me know if I missed anything. Thank you!
