
[Python] Method to lazily read a collection of multiple Arrow IPC stream files #44561

Open
ianmcook opened this issue Oct 29, 2024 · 1 comment

ianmcook (Member) commented Oct 29, 2024

Describe the enhancement requested

It would be nice to have a method in PyArrow to lazily read a collection of Arrow IPC stream files. This would be a great complement to the Arrow over HTTP project, because a common use case is downloading multiple Arrow IPC stream files from an HTTP server and then reading them into Python.

The dataset API works with files in the Arrow IPC file format, but it does not currently work with files in the Arrow IPC stream format.
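
For illustration, the case that already works might look like this (a minimal sketch; the file names are hypothetical, and format="arrow" here refers to the IPC file format):

import pyarrow.dataset as ds

# The dataset API can lazily scan Arrow IPC *file* format files,
# but there is no analogous format option for IPC *stream* files.
dataset = ds.dataset(["part-0.arrow", "part-1.arrow"], format="arrow")
table = dataset.to_table()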

Also, as far as I can tell, it is not currently possible to directly create a record batch stream reader from a collection of multiple Arrow IPC stream files with the same schema.

Component(s)

Python

ianmcook (Member Author) commented Oct 29, 2024

It's easy enough to create a record batch reader from a collection of multiple Arrow IPC stream files that have the same schema using code like this:

import pyarrow as pa
import glob

def get_schema(paths):
    # Read the schema from the first file; all files are assumed
    # to share the same schema.
    with open(paths[0], "rb") as file:
        reader = pa.ipc.open_stream(file)
        return reader.schema

def get_batches(paths):
    for path in paths:
        with pa.memory_map(path) as file:  # or use: open(path, "rb") 
            reader = pa.ipc.open_stream(file)
            for batch in reader:
                yield batch

paths = sorted(glob.glob("*.arrows"))

reader = pa.ipc.RecordBatchStreamReader.from_batches(
    get_schema(paths),
    get_batches(paths)
)

I can confirm from testing that this works lazily: it doesn't read any of the record batches into memory. To materialize the batches, call reader.read_next_batch() or reader.read_all() after the above.
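
For example, using just the two calls named above:

# Materialize one batch at a time; only that batch is read into memory:
batch = reader.read_next_batch()

# Or materialize everything remaining as a single Table:
table = reader.read_all()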

Reading the batches is typically faster if you use open(path, "rb") instead of pa.memory_map(path) in get_batches, but the tradeoff is that it uses much more memory.

Regardless of this, it would be nice to have a method in PyArrow that expresses this more concisely.
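
One possible shape for such a helper (purely hypothetical; pa.ipc.open_streams does not exist in PyArrow today):

# Hypothetical convenience API, not part of PyArrow:
reader = pa.ipc.open_streams(sorted(glob.glob("*.arrows")))
table = reader.read_all()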
