This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Add support to read from Apache ORC #759

Closed
jorgecarleitao opened this issue Jan 12, 2022 · 4 comments
Labels
investigation: Issues or PRs that are investigations. PRs may or may not be merged.
no-changelog: Issues whose changes are covered by a PR and thus should not be shown in the changelog.

Comments

@jorgecarleitao
Owner

The core development for this is being carried out here: https://github.com/jorgecarleitao/orc-rs. The hope is that once we can read stripes there, we can plug that in here and deserialize to Arrow, just like we do for Parquet.

PRs over there are of course very welcome 🙇

@jorgecarleitao jorgecarleitao added the investigation Issues or PRs that are investigations. Prs may or may not be merged. label Jan 12, 2022
@Igosuki
Contributor

Igosuki commented Jan 12, 2022

AFAIK ORC is a better Parquet, made out of frustration with using Parquet on object storage. ORC has indexing, which makes it a lot easier to distribute chunks to partitions for distributed computing, for instance.
It would be interesting to have ORC's advantageous features appear in the higher-level API as something that dependent libraries (e.g. DataFusion) could then use?
Let me know what you think!

@iajoiner

iajoiner commented Jul 4, 2022

@jorgecarleitao Actually I’m also working on an ORC reader in Rust. I plan to add the writer as well.

See
https://issues.apache.org/jira/projects/ORC/issues/ORC-1180
https://issues.apache.org/jira/projects/ORC/issues/ORC-1181

@jorgecarleitao
Owner Author

Hey @iajoiner, that is awesome to know! I have been working on this in the past, and I finally got the time and mental space (vacations!) to publish it.

The implementation is available at https://github.com/DataEngineeringLabs/orc-format (https://crates.io/crates/orc-format) and contains the bare bones needed to read ORC - I added integration tests against pyorc (the official implementation) for the things that work.

I wrote it to be as performant as I could. The only sub-performant piece is "bitunpacking": AFAIK there is no performant implementation in Rust for u64 (for u32 there is bitpacking), so I just implemented a non-performant one that passes tests.
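To make "bitunpacking" concrete, here is a minimal sketch of a naive, bit-at-a-time unpacker for `u64`. It is not the code in orc-format: the function name `unpack_u64`, its signature, and the assumption that bits are read most-significant first are all illustrative. It shows why a straightforward version is slow, since it touches every bit individually instead of working on whole words.

```rust
/// Hypothetical helper (not part of orc-format): unpacks `count` unsigned
/// integers of `width` bits each (1..=64) from `packed`, reading bits
/// most-significant first within each byte.
/// Returns `None` if `width` is out of range or `packed` is too short.
pub fn unpack_u64(packed: &[u8], width: u32, count: usize) -> Option<Vec<u64>> {
    if width == 0 || width > 64 {
        return None;
    }
    // Total number of bits needed, guarding against arithmetic overflow.
    let total_bits = (width as usize).checked_mul(count)?;
    if packed.len() < (total_bits + 7) / 8 {
        return None;
    }

    let mut out = Vec::with_capacity(count);
    let mut bit_pos = 0usize;
    for _ in 0..count {
        let mut value = 0u64;
        // Naive inner loop: one input bit per iteration.
        for _ in 0..width {
            let byte = packed[bit_pos / 8];
            let bit = (byte >> (7 - (bit_pos % 8))) & 1;
            value = (value << 1) | u64::from(bit);
            bit_pos += 1;
        }
        out.push(value);
    }
    Some(out)
}
```

For example, `unpack_u64(&[0b1010_1100], 4, 2)` splits the byte into two 4-bit values. A performant version would instead load several bytes at once and extract each value with shifts and masks specialized per bit width.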

There are of course a lot of things missing from the spec. If you want, we can pair up and work on it. I think the main difference is that it does not use build.rs. The reason is that I really like having the generated code easily available via the IDE's "click on struct/function", something that build.rs takes away (since the code is embedded via an include clause).

Note that, as I am doing with parquet2 and avro-schema, I do not declare an in-memory format in the crate and instead provide a toolkit (e.g. iterators, generics) to decompress, decode and deserialize from ORC (and use them in integration tests, where I use an in-memory format for testing purposes).
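As an illustration of the "toolkit, not in-memory format" idea, the sketch below shows a lazy iterator that decodes unsigned base-128 varints (the variable-length integer encoding ORC uses) and leaves the choice of container to the caller. The names `VarintIter` and `DecodeError` are hypothetical and are not taken from orc-format's API.

```rust
/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum DecodeError {
    /// Input ended in the middle of a varint.
    Truncated,
    /// Varint was longer than ten bytes, so it cannot fit in a u64.
    Overflow,
}

/// Hypothetical iterator that lazily decodes unsigned base-128 varints
/// (7 payload bits per byte, least-significant group first, high bit set
/// on every byte except the last). It does not allocate or pick a container.
pub struct VarintIter<'a> {
    bytes: &'a [u8],
}

impl<'a> VarintIter<'a> {
    pub fn new(bytes: &'a [u8]) -> Self {
        Self { bytes }
    }
}

impl<'a> Iterator for VarintIter<'a> {
    type Item = Result<u64, DecodeError>;

    fn next(&mut self) -> Option<Self::Item> {
        let bytes = self.bytes;
        if bytes.is_empty() {
            return None;
        }
        let mut value = 0u64;
        let mut shift = 0u32;
        for (i, &byte) in bytes.iter().enumerate() {
            if shift >= 64 {
                // Stop iterating after an error.
                self.bytes = &[];
                return Some(Err(DecodeError::Overflow));
            }
            value |= u64::from(byte & 0x7F) << shift;
            if byte & 0x80 == 0 {
                // Last byte of this varint: advance past it.
                self.bytes = &bytes[i + 1..];
                return Some(Ok(value));
            }
            shift += 7;
        }
        self.bytes = &[];
        Some(Err(DecodeError::Truncated))
    }
}
```

A consumer decides where the values go, for example `VarintIter::new(&data).collect::<Result<Vec<u64>, _>>()` in a test, or pushing directly into an Arrow buffer in a downstream crate.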

I am planning to start integrating that dependency into this project so that we can read into Arrow. This will offer important input about the API and whether we need bridge structs from proto to help users, and it will provide further testing.

Let me know your thoughts (here or preferably on https://github.com/DataEngineeringLabs/orc-format).

@jorgecarleitao
Owner Author

This has been closed by #1189 🎉🎉🎉

@jorgecarleitao jorgecarleitao added the no-changelog Issues whose changes are covered by a PR and thus should not be shown in the changelog label Jul 31, 2022

3 participants