
[Python] Method to lazily read a collection of multiple Arrow IPC stream files #44561

Open
ianmcook opened this issue Oct 29, 2024 · 1 comment

ianmcook (Member) commented Oct 29, 2024

Describe the enhancement requested

It would be nice to have a method in PyArrow to lazily read a collection of Arrow IPC stream files. This would be a great complement to the Arrow over HTTP project, because a common use case is downloading multiple Arrow IPC stream files from an HTTP server and then reading them into Python.

The dataset API works with files in the Arrow IPC file format, but it does not currently work with files in the Arrow IPC stream format.
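
For illustration, the case that already works might look like this (a minimal sketch; the file names are hypothetical, and format="arrow" here refers to the IPC file format):

import pyarrow.dataset as ds

# The dataset API can lazily scan Arrow IPC *file* format files,
# but there is no analogous format option for IPC *stream* files.
dataset = ds.dataset(["part-0.arrow", "part-1.arrow"], format="arrow")
table = dataset.to_table()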

Also, as far as I can tell, it is not currently possible to directly create a record batch stream reader from a collection of multiple Arrow IPC stream files with the same schema.

Component(s)

Python

ianmcook (Member Author) commented Oct 29, 2024

It's easy enough to create a record batch reader from a collection of multiple Arrow IPC stream files that have the same schema using code like this:

import pyarrow as pa
import glob

def get_schema(paths):
    # Read the schema from the first file; all files are assumed
    # to share the same schema.
    with open(paths[0], "rb") as file:
        reader = pa.ipc.open_stream(file)
        return reader.schema

def get_batches(paths):
    for path in paths:
        with pa.memory_map(path) as file:  # or use: open(path, "rb") 
            reader = pa.ipc.open_stream(file)
            for batch in reader:
                yield batch

paths = sorted(glob.glob("*.arrows"))

reader = pa.ipc.RecordBatchStreamReader.from_batches(
    get_schema(paths),
    get_batches(paths)
)

I can confirm from testing that this works lazily: it doesn't read any of the record batches into memory. To materialize the batches, call reader.read_next_batch() or reader.read_all() after the above.
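
For example, using just the two calls named above:

# Materialize one batch at a time; only that batch is read into memory:
batch = reader.read_next_batch()

# Or materialize everything remaining as a single Table:
table = reader.read_all()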

Reading the batches is typically faster if you use open(path, "rb") instead of pa.memory_map(path) in get_batches, but the tradeoff is that it uses much more memory.

Regardless of this, it would be nice to have a method in PyArrow that expresses this more concisely.
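
One possible shape for such a helper (purely hypothetical; pa.ipc.open_streams does not exist in PyArrow today):

# Hypothetical convenience API, not part of PyArrow:
reader = pa.ipc.open_streams(sorted(glob.glob("*.arrows")))
table = reader.read_all()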
