It would be nice to have some method available in PyArrow to lazily read a collection of Arrow IPC stream files. This would be a great complement to the Arrow over HTTP project, because a common use case is for the user to download multiple Arrow IPC stream files from the HTTP server and then read them into Python.
The dataset API works with files in the Arrow IPC file format, but it does not currently work with files in the Arrow IPC stream format.
Also, as far as I can tell, it is not currently possible to directly create a record batch stream reader from a collection of multiple Arrow IPC stream files with the same schema.
Component(s)
Python
I can confirm from testing that this works lazily: constructing the reader does not read any of the record batches into memory. To actually read the record batches into memory, you call reader.read_next_batch() or reader.read_all() after the above.
Reading the batches will typically be faster if you use open(path, "rb") instead of pa.memory_map(path) in the definition of get_batches, but the tradeoff is that it uses a lot more memory.
Regardless of this, it would be nice to have a method in PyArrow that expresses this more concisely.