Skip to content
This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Async sink interfaces for IPC and Parquet IO #876

Closed
wants to merge 15 commits into from

Conversation

sydduckworth
Copy link
Contributor

Following up on discussion here: jorgecarleitao/parquet2#78.

This PR implements new types that expose the futures::Sink trait for Arrow IPC and Parquet writers. It also implements an async stream type for reading the IPC file format (since currently async reading is only available for the IPC stream format). Specific new types:

  • io::ipc::read::file_async::FileStream - implements futures::Stream for IPC files.
  • io::ipc::write::file_async::FileSink - implements futures::Sink for IPC files.
  • io::ipc::write::file_async::StreamSink - implements futures::Sink for IPC streams.
  • io::parquet::write::FileSink - implements futures::Sink for Parquet files.

Definitely interested in feedback on these changes!
In particular I felt pretty unsure about file structure and naming for the new types so happy to change those in any way that makes sense. I also added tests directly into some of the modules, but I wasn't sure if there was a different place I should be putting those.

@codecov
Copy link

codecov bot commented Mar 2, 2022

Codecov Report

Merging #876 (9e54566) into main (eb4bc5d) will increase coverage by 0.14%.
The diff coverage is 67.49%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #876      +/-   ##
==========================================
+ Coverage   71.50%   71.65%   +0.14%     
==========================================
  Files         335      338       +3     
  Lines       18147    18474     +327     
==========================================
+ Hits        12976    13237     +261     
- Misses       5171     5237      +66     
Impacted Files Coverage Δ
src/io/ipc/write/mod.rs 92.30% <ø> (ø)
src/io/parquet/write/mod.rs 61.08% <ø> (ø)
src/io/ipc/read/file_async.rs 57.03% <57.03%> (ø)
src/io/ipc/write/stream_async.rs 72.72% <71.15%> (+17.17%) ⬆️
src/io/ipc/write/file_async.rs 74.69% <74.69%> (ø)
src/io/parquet/write/sink.rs 76.27% <76.27%> (ø)
src/io/ipc/read/reader.rs 71.42% <100.00%> (ø)
src/bitmap/utils/slice_iterator.rs 87.93% <0.00%> (+1.72%) ⬆️
src/io/parquet/write/file.rs 76.00% <0.00%> (+8.00%) ⬆️
... and 3 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update eb4bc5d...9e54566. Read the comment docs.

@jorgecarleitao
Copy link
Owner

Holy molly amazing PR ❤️

IMO this is basically ready to merge; the API looks great, it is ready easy to follow it, and overall really great work.

Some comments:

  • In the parquet, we have a FileStreamer. Could you comment what is the difference / whether we can replace that by this? If they are solving the same problem, imo we should just remove the other one. ^^
  • Could you move all the tests to tests/it/io/*? This crate has most of its tests outside the main crate, following https://matklad.github.io/2021/02/27/delete-cargo-integration-tests.html
  • Could you split this PR in two: parquet and IPC; imo they deserve two entries in the change log and corresponding PRs.

@sydduckworth
Copy link
Contributor Author

Sure, I'll split it up, move the tests, and re-submit!

FileSink essentially does the same thing as FileStreamer, so it could replace FileStreamer if we're okay with the breaking change to the interface.

@jorgecarleitao
Copy link
Owner

Thanks a lot, @dexterduck 🙇

yes, for now it is ok to break APIs in io/ - you found/proposed a better one; let's use it :)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants