Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-16430: [Python] Add support for reading record batch custom metadata API #13041

Merged
merged 4 commits into from
Dec 12, 2022

Conversation

niyue
Copy link
Contributor

@niyue niyue commented May 1, 2022

In ARROW-16131, C++ APIs were added so that users can read/write record batch custom metadata for IPC file. In this PR, pyarrow APIs are added so that python users can take advantage of these APIs to address ARROW-16430.

@github-actions
Copy link

github-actions bot commented May 1, 2022

@github-actions
Copy link

github-actions bot commented May 1, 2022

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

cpp/src/arrow/ipc/writer.h Outdated Show resolved Hide resolved
@niyue niyue force-pushed the feature/py-batch-custom-meta branch from 1a7e908 to 620ff15 Compare May 1, 2022 07:06
@niyue niyue force-pushed the feature/py-batch-custom-meta branch 3 times, most recently from 34b556a to 38e45dd Compare May 1, 2022 08:59
@kou kou changed the title ARROW-16430, support reading record batch custom metadata API in pyarrow ARROW-16430: [Python] Add support for reading record batch custom metadata API May 6, 2022
@niyue niyue force-pushed the feature/py-batch-custom-meta branch from 38e45dd to c1fb6ad Compare May 7, 2022 01:35
python/pyarrow/_flight.pyx Outdated Show resolved Hide resolved
python/pyarrow/ipc.pxi Outdated Show resolved Hide resolved
Comment on lines 939 to 940
batch : RecordBatch
custom_metadata: dict
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche Is this the right Numpydoc syntax when a namedtuple is returned?

@pitrou
Copy link
Member

pitrou commented May 10, 2022

By the way, this PR addresses RecordBatchFileReader but not RecordBatchReader. Do you plan to do that as well?

@niyue niyue force-pushed the feature/py-batch-custom-meta branch from c1fb6ad to 78f492f Compare May 12, 2022 05:41
@niyue
Copy link
Contributor Author

niyue commented May 12, 2022

By the way, this PR addresses RecordBatchFileReader but not RecordBatchReader. Do you plan to do that as well?

I tried adding the pyarrow API for RecordBatchReader locally previously, but I found if I want to test the new pyarrow API for RecordBatchReader, I probably have to add the corresponding API implementation in C++ RecordBatchStreamReader so that I can unit test it in a meaningful way since it seems only RecordBatchStreamReader and RecordBatchFileReader provides the capability to read record batch custom metadata (is this correct?). And since this PR is more about adding new pyarrow API for existing C++ capability, I would like to make non necessary C++ API change less in this PR.

  1. if there is other way to unit test pyarrow RecordBatchReader.read_next_batch_with_metadata API besides implementing it in RecordBatchStreamReader?
  2. do you recommend to put all these changes together in single PR or keep them as separated PRs? I initially submitted this issue as a pyarrow enhancement PR, but I am okay to include more API changes if it is preferred.

@niyue niyue force-pushed the feature/py-batch-custom-meta branch from 78f492f to 3f040a0 Compare May 12, 2022 07:24
@niyue
Copy link
Contributor Author

niyue commented May 17, 2022

@pitrou I made some additional changes to the PR, could you please help to review? Thanks.

@pitrou
Copy link
Member

pitrou commented May 18, 2022

@niyue Sorry for the late replay, but the C++ RecordBatchReader already has a method Result<RecordBatchWithMetadata> ReadNext(), so you wouldn't need to add anything more to the C++ side AFAICT?

@pitrou
Copy link
Member

pitrou commented Aug 9, 2022

@niyue Are you planning to work on this?
Otherwise @milesgranger you might be interested in taking this up?

@milesgranger
Copy link
Contributor

I can certainly pick it up if needed!
Presently, I'm working locally on ARROW-13763.

@niyue
Copy link
Contributor Author

niyue commented Aug 9, 2022

@niyue Are you planning to work on this?

Otherwise @milesgranger you might be interested in taking this up?

Hi @pitrou, sorry I got too busy since last commit and forgot this issue previously. I am still interested working on this, and can start working on it from next week.

@niyue niyue force-pushed the feature/py-batch-custom-meta branch from 3f040a0 to a676680 Compare August 17, 2022 06:24
@pitrou
Copy link
Member

pitrou commented Aug 17, 2022

@niyue If/when this is ready for review, please say so :-)

@niyue
Copy link
Contributor Author

niyue commented Aug 17, 2022

@pitrou I pushed a new commit a minute ago. I followed your suggestion to add corresponding API for RecordBatchReader API in pyarrow, and a basic unit test to cover this API. I ran all pyarrow tests (using the minimum build only) and all tests passed locally. Could you please help to review? Thanks.

@niyue
Copy link
Contributor Author

niyue commented Sep 8, 2022

@pitrou I am not sure if you see my previous comment. I made some change previously, and this PR is ready for review, could you please help? Thanks.

@pitrou
Copy link
Member

pitrou commented Sep 12, 2022

@niyue Sorry, I had overlooked it. I'll take a look when I can. @jorisvandenbossche would you like to review this too?

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay! Here are some more comments.

python/pyarrow/ipc.pxi Outdated Show resolved Hide resolved
python/pyarrow/ipc.pxi Outdated Show resolved Hide resolved
python/pyarrow/ipc.pxi Outdated Show resolved Hide resolved
python/pyarrow/ipc.pxi Outdated Show resolved Hide resolved
python/pyarrow/ipc.pxi Outdated Show resolved Hide resolved
python/pyarrow/ipc.pxi Outdated Show resolved Hide resolved
python/pyarrow/tests/test_ipc.py Outdated Show resolved Hide resolved
python/pyarrow/tests/test_ipc.py Outdated Show resolved Hide resolved
python/pyarrow/tests/test_ipc.py Outdated Show resolved Hide resolved
python/pyarrow/_flight.pyx Show resolved Hide resolved
@niyue
Copy link
Contributor Author

niyue commented Sep 22, 2022

No problem. I will check them out and see if I can get them addressed. Thanks for the review.

@niyue niyue force-pushed the feature/py-batch-custom-meta branch 2 times, most recently from 6498958 to ef95d59 Compare September 27, 2022 10:18
python/pyarrow/ipc.pxi Outdated Show resolved Hide resolved
python/pyarrow/tests/test_ipc.py Outdated Show resolved Hide resolved
@pitrou
Copy link
Member

pitrou commented Oct 5, 2022

There were some timeouts in some CI jobs. I've restarted them just in case. If that still reproduces, should first rebase from master to catch up with any upstream fixes.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just two minor doc formatting comments

python/pyarrow/ipc.pxi Outdated Show resolved Hide resolved
python/pyarrow/ipc.pxi Show resolved Hide resolved
@jorisvandenbossche
Copy link
Member

This looks like a nice addition. Something I am wondering more in general (potentially for future JIRA/PRs), when working with custom metadata, would it be useful to also allow to inspect the custom metadata of a certain batch, without also loading the batch? (so you could for example check the metadata before deciding whether to read the batch or not).
I suppose that would only make sense for the IPC file format? (although not sure if that makes sense at the format / C++ level) Like the RecordBatchFileReader now has a get_batch(i)/ get_batch_with_custom_metadata(i), one could also have a get_custom_metadata(i)?

@pitrou
Copy link
Member

pitrou commented Oct 5, 2022

Something I am wondering more in general (potentially for future JIRA/PRs), when working with custom metadata, would it be useful to also allow to inspect the custom metadata of a certain batch, without also loading the batch?

Perhaps, but that would have to be done on the C++ side first. Also, I would wait for users to actually request it.

@niyue niyue force-pushed the feature/py-batch-custom-meta branch from ef95d59 to af3cd34 Compare October 10, 2022 06:13
@niyue
Copy link
Contributor Author

niyue commented Oct 10, 2022

@pitrou sorry for the late response. There are two failed CI builds, and they don't seem relevant in this issue. I rebased onto the latest master branch anyway.

…w so that pyarrow can read record batch along with its custom metadata.
@niyue niyue force-pushed the feature/py-batch-custom-meta branch from af3cd34 to 8f5d119 Compare October 10, 2022 06:20
@niyue
Copy link
Contributor Author

niyue commented Nov 2, 2022

@pitrou could you please help to review the PR to see if there is still anything we desire to change? Thanks.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really sorry for the delay @niyue ! I've pushed a couple minor changes and will merge if CI is green.

Thank you for the work!

@pitrou pitrou merged commit 45d83fe into apache:master Dec 12, 2022
@niyue
Copy link
Contributor Author

niyue commented Dec 13, 2022

No problem @pitrou . Thanks for the review and the fixes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants