Add support to read from Apache ORC #759
Comments
AFAIK ORC is a better Parquet, made out of frustration with using Parquet on object storage. ORC has indexing, which makes it a lot easier to distribute chunks to partitions in distributed computing, for instance.
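To make the point about distributing chunks concrete, here is a minimal sketch of how a reader might spread stripes over partitions once the per-stripe metadata is known. It assumes the offset, byte length and row count of each stripe have already been read from the file footer; `StripeInfo` and `assign_stripes` are hypothetical names for illustration, not part of any existing crate.

```rust
/// Hypothetical, simplified view of one stripe entry from an ORC file footer.
#[derive(Debug, Clone, PartialEq)]
pub struct StripeInfo {
    /// Byte offset of the stripe within the file.
    pub offset: u64,
    /// Total byte length of the stripe.
    pub length: u64,
    /// Number of rows stored in the stripe.
    pub rows: u64,
}

/// Greedily assigns stripes to `n` partitions: stripes are visited from
/// largest to smallest row count, and each one goes to the partition that
/// currently holds the fewest rows. This is a simple balancing heuristic,
/// not an optimal split.
pub fn assign_stripes(stripes: &[StripeInfo], n: usize) -> Vec<Vec<StripeInfo>> {
    let mut partitions: Vec<Vec<StripeInfo>> = vec![Vec::new(); n];
    if n == 0 {
        return partitions;
    }
    let mut row_counts = vec![0u64; n];

    // Largest stripes first, so the small ones can even out the totals.
    let mut sorted: Vec<&StripeInfo> = stripes.iter().collect();
    sorted.sort_by(|a, b| b.rows.cmp(&a.rows));

    for stripe in sorted {
        // Index of the partition with the fewest rows so far (first on ties).
        let (idx, _) = row_counts
            .iter()
            .enumerate()
            .min_by_key(|(_, rows)| **rows)
            .unwrap(); // safe: n > 0
        partitions[idx].push(stripe.clone());
        row_counts[idx] += stripe.rows;
    }
    partitions
}
```

Each worker can then seek directly to `offset` and read `length` bytes for its stripes, which is what makes this kind of split cheap on object storage.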
@jorgecarleitao Actually, I'm also working on an ORC reader in Rust, and I plan to add a writer as well. See
Hey @iajoiner, that is awesome to know! I have been working on this in the past, and I finally got the time and mental space (vacations!) to publish it.

The implementation is available at https://github.com/DataEngineeringLabs/orc-format (https://crates.io/crates/orc-format) and contains the bare bones needed to read ORC. I added integration tests against pyorc (the official implementation) for the things that work. I wrote it to be as performant as I could; the only sub-performant piece is "bitunpacking", for which, afaik, there is no performant implementation in Rust for u64 (for u32 there is

There are of course a lot of things missing from the spec. If you want, we can pair up and work on it. I think that the main difference is that it is not using

Note that, as I am doing with parquet2 and avro-schema, I do not declare an in-memory format in the crate. Instead, I provide a toolkit (e.g. iterators, generics) to decompress, decode and deserialize from ORC, and use it in the integration tests, where I use an in-memory format for testing purposes.

I am planning to start integrating that dependency in this project so that we can read into Arrow. This will offer important input about the API (for example, whether we need bridge structs from proto to help users) and allow further testing.

Let me know your thoughts (here or preferably on https://github.com/DataEngineeringLabs/orc-format).
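For readers unfamiliar with the "bitunpacking" step mentioned above, here is a minimal, deliberately unoptimized sketch of unpacking fixed-width values into `u64`. It assumes values are packed most-significant-bit first, with no padding between values. The function name `unpack_u64` is hypothetical and this is not the code used in orc-format; a performant version would process whole words per bit width rather than looping byte by byte.

```rust
/// Unpacks `count` unsigned integers of `width` bits each (1..=64) from
/// `packed`, reading bits most-significant-bit first.
/// Returns `None` if `width` is out of range or `packed` is too short.
pub fn unpack_u64(packed: &[u8], width: u32, count: usize) -> Option<Vec<u64>> {
    if width == 0 || width > 64 {
        return None;
    }
    let total_bits = (width as usize).checked_mul(count)?;
    if packed.len().checked_mul(8)? < total_bits {
        return None;
    }

    let mut out = Vec::with_capacity(count);
    let mut bit_pos = 0usize; // absolute position of the next unread bit

    for _ in 0..count {
        let mut value = 0u64;
        let mut remaining = width as usize;
        while remaining > 0 {
            let byte = packed[bit_pos / 8];
            let avail = 8 - bit_pos % 8; // unread bits left in this byte
            let take = avail.min(remaining);
            // Move the `take` bits we want to the bottom of the byte,
            // then mask off anything above them.
            let bits = ((byte >> (avail - take)) as u64) & ((1u64 << take) - 1);
            value = (value << take) | bits;
            bit_pos += take;
            remaining -= take;
        }
        out.push(value);
    }
    Some(out)
}
```

For example, the two bytes `0b1010_1100, 0b1111_0000` read with a width of 3 yield the values 5, 3, 1, 7 and 0, with the final bit left unused.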
This has been closed by #1189 🎉🎉🎉
The core development for this is being carried out here: https://github.com/jorgecarleitao/orc-rs. The hope is that once we can read stripes there, we can plug that in here and deserialize to Arrow, just like we do for Parquet.
PRs over there are of course very welcome 🙇