Improvements for dictionary encodings and nested values #7
Conversation
… dictionary ids and nested primitives where some levels are required
let values = vec![
    4, 0, 0, 0, // length
    0b00001000, // data
    0b00000001, 0b00000011, 0b00001010,
These values are according to my understanding of the encoding; I did not yet validate them.
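For context, the indicator byte in the test data above can be interpreted per the RLE/bit-packed hybrid rule from the Parquet format spec: the least-significant bit of the header selects the variant, and the remaining bits carry either the repeat count (RLE) or the number of 8-value groups (bit-packed). The sketch below assumes the ULEB128 header fits in a single byte; the names are illustrative, not parquet2's API.

```rust
// Minimal sketch of hybrid header parsing, per the Parquet encoding spec.
// Assumes a single-byte header; real headers are ULEB128-encoded.

#[derive(Debug, PartialEq)]
enum Run {
    /// `count` repetitions of a single value
    Rle { count: usize },
    /// `groups` groups of 8 bit-packed values
    Bitpacked { groups: usize },
}

fn parse_indicator(indicator: u8) -> Run {
    if indicator & 1 == 1 {
        // LSB set: bit-packed run, remaining bits = number of 8-value groups
        Run::Bitpacked { groups: (indicator >> 1) as usize }
    } else {
        // LSB clear: RLE run, remaining bits = repeat count
        Run::Rle { count: (indicator >> 1) as usize }
    }
}

fn main() {
    // 0b00001000 from the test data: LSB = 0, so an RLE run of 8 >> 1 = 4 values
    assert_eq!(parse_indicator(0b0000_1000), Run::Rle { count: 4 });
    // 0b00000011: LSB = 1, so a bit-packed run of one 8-value group
    assert_eq!(parse_indicator(0b0000_0011), Run::Bitpacked { groups: 1 });
}
```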
let bytes = (indicator as usize >> 1) * self.num_bits as usize;
let result = Some(HybridEncoded::Bitpacked(&self.values[..bytes]));
self.values = &self.values[bytes..];
let num_bits = self.num_bits as usize;
These changes mostly correspond to what was previously in `rle_decode` in `levels.rs`.
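The byte count computed in the diff follows from the spec: a bit-packed run stores values in groups of 8, so each group at `num_bits` bits per value occupies exactly `num_bits` bytes. A standalone sketch of that arithmetic (illustrative names, not the PR's code):

```rust
// For a bit-packed run, the header's upper bits encode the number of
// 8-value groups; 8 values * num_bits bits / 8 bits-per-byte = num_bits
// bytes per group. Assumes the caller already checked the variant bit.
fn bitpacked_byte_len(indicator: u8, num_bits: usize) -> usize {
    let groups = (indicator as usize) >> 1;
    groups * num_bits
}

fn main() {
    // indicator 0b0000_0111: 3 groups (24 values) at 2 bits each -> 6 bytes
    assert_eq!(bitpacked_byte_len(0b0000_0111, 2), 6);
    // 1 group at 1 bit per value -> 1 byte
    assert_eq!(bitpacked_byte_len(0b0000_0011, 1), 1);
}
```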
match max_def - def {
    2 => inner.push(None),
    1 => outer.push(Some(Array::Int64(vec![]))),
    0 => outer.push(None),
I'm not fully sure this is correct; it should behave the same as before, but I think only the repetition levels should influence when a new array starts.
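To make the `max_def - def` dispatch concrete: for a nullable list of required i64 values (max definition level 2), the difference tells us whether a row is a present value, an empty list, or a null list. The sketch below is a simplified illustration of that idea; it ignores repetition levels (each present value becomes its own single-element list), which is exactly the subtlety the comment above raises.

```rust
// Hedged sketch: decode definition levels for list<i64 required>,
// max_def = 2. def == 2 -> value present, def == 1 -> empty list,
// def == 0 -> null list. Repetition levels (which decide whether a
// value extends the previous list) are deliberately omitted.
fn decode_lists(
    defs: &[i16],
    values: &mut impl Iterator<Item = i64>,
) -> Vec<Option<Vec<i64>>> {
    let max_def: i16 = 2;
    let mut out: Vec<Option<Vec<i64>>> = Vec::new();
    for &def in defs {
        match max_def - def {
            0 => out.push(Some(vec![values.next().unwrap()])), // leaf value present
            1 => out.push(Some(vec![])),                       // empty list
            _ => out.push(None),                               // null list
        }
    }
    out
}

fn main() {
    let mut vals = [7i64, 9].into_iter();
    let out = decode_lists(&[2, 1, 0, 2], &mut vals);
    assert_eq!(out, vec![Some(vec![7]), Some(vec![]), None, Some(vec![9])]);
}
```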
Wow, thank you so much for this PR, @jhorstmann, what an honor!
I have reviewed it and left a comment. I think that we need to think this through to cover both our use-cases.
I left some comments. Do you think it would be possible to divide this PR into smaller parts? It is touching a couple of things at the same time.
let pack = &self.values[0..rle_bytes];
let mut value_bytes = [0u8; std::mem::size_of::<u32>()];
pack.iter()
    .enumerate()
    .for_each(|(i, byte)| value_bytes[i] = *byte);
let value = u32::from_le_bytes(value_bytes);
So, the reason I do not want to push this here is that, in the arrow format, validity is in bytes as LSB, which is the same as parquet when rep level = 0. This particular change would require a round trip <u8 in LSB> -> i32 -> <u8 in LSB> for non-nested types, which has a (significant) performance hit. This is why I tried to offer the "raw" bytes here, and provide a higher-level API based on `i32` closer to the deserialization.
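As an aside, the little-endian read in the diff above can be expressed with `copy_from_slice`, which does the same thing as the `enumerate`/`for_each` loop: copy up to 4 run-value bytes into a zeroed buffer, then decode with `u32::from_le_bytes`. A minimal standalone sketch (not the PR's exact code):

```rust
// The RLE run value occupies ceil(num_bits / 8) little-endian bytes.
// Zero-padding a 4-byte buffer lets u32::from_le_bytes handle any width
// up to 32 bits without reading past the run.
fn rle_value(pack: &[u8]) -> u32 {
    let mut value_bytes = [0u8; std::mem::size_of::<u32>()];
    value_bytes[..pack.len()].copy_from_slice(pack);
    u32::from_le_bytes(value_bytes)
}

fn main() {
    // one-byte run value (num_bits <= 8)
    assert_eq!(rle_value(&[0b0000_0001]), 1);
    // two-byte run value (num_bits <= 16): 0x03 + (0x01 << 8)
    assert_eq!(rle_value(&[0x03, 0x01]), 0x0103);
}
```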
So, the idea was: if people want the raw bytes, use the decoder; otherwise, use `rle_decode` in `levels.rs`. However, I do agree that `rle_decode` is suboptimal as it allocates. What do you think about keeping this decoder as is, and instead have `rle_decode` implemented as returning an iterator? It may solve both of our use-cases?
> What do you think about keeping this decoder as is, and instead have `rle_decode` implemented as returning an iterator? It may solve both of our use-cases?
That sounds like it should work; it seems all users call `into_iter` on the rle-decoded vec anyway. It should then even be possible to expose the iterator directly in the API instead of first creating a `Vec<Option<Vec<u8>>>`, for example, whose contents then have to be copied again into the arrow format.
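The iterator-based `rle_decode` discussed here could look roughly like the sketch below: runs of either variant are flattened into a lazy stream of levels, with no intermediate `Vec` allocation. All types and names are illustrative, not parquet2's actual API, and bit-unpacking is shown only for `num_bits == 1` to keep the sketch short.

```rust
// Hedged sketch of an iterator-returning rle_decode. The enum mirrors
// the RLE/bit-packed variants discussed in this thread, but is not the
// crate's real type.
enum HybridEncoded<'a> {
    Rle { value: u32, count: usize },
    Bitpacked(&'a [u8]),
}

fn rle_iter<'a>(
    runs: impl Iterator<Item = HybridEncoded<'a>> + 'a,
    num_bits: usize,
) -> impl Iterator<Item = u32> + 'a {
    runs.flat_map(move |run| -> Box<dyn Iterator<Item = u32>> {
        match run {
            // an RLE run expands to `count` copies of `value`
            HybridEncoded::Rle { value, count } => {
                Box::new(std::iter::repeat(value).take(count))
            }
            // a bit-packed run is unpacked LSB-first; only the 1-bit
            // case (boolean levels) is implemented in this sketch
            HybridEncoded::Bitpacked(bytes) => {
                assert_eq!(num_bits, 1);
                let bytes = bytes.to_vec();
                Box::new(
                    (0..bytes.len() * 8)
                        .map(move |i| ((bytes[i / 8] >> (i % 8)) & 1) as u32),
                )
            }
        }
    })
}

fn main() {
    let runs = vec![
        HybridEncoded::Rle { value: 1, count: 3 },
        HybridEncoded::Bitpacked(&[0b0000_0101]),
    ];
    let decoded: Vec<u32> = rle_iter(runs.into_iter(), 1).collect();
    assert_eq!(decoded, vec![1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0]);
}
```

A caller that only needs the levels can consume this lazily, while callers that want the raw runs keep using the lower-level decoder, which matches the two-tier API discussed above.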
Now that I realize that arrow2 is using the lower-level API, I understand what you actually meant :) Do you think the code in arrow2 could be simplified by exposing an iterator API in parquet2, without losing performance?
So, we already have an iterator API that returns either variant, RLE or Bitpacked. This is made so that arrow2 can use it without having to incur the cost of bit-unpacking the bitpacked values just to compare an `i32` to 1 (max def level). There will be a performance degradation if we move to iterators over `i32`, which is why I was trying to suggest keeping the decoder as is, and offering a higher-level iterator to iterate over it with `i32` (the one in `levels.rs`).
Yes, agree, this turned into a bit of a yak-shaving exercise :) I'll split out the plain binary part as that seems least controversial. Also, the nested decoding belongs in a separate PR, and it seems I have to check the python tests for that too.
@jorgecarleitao I saw the code for reading into the …
Sorry for not taking this PR onwards :( I tried to split the changes out of the code because it was a bit confusing what that serialization was about; I hope that is ok for you. Definitely, I think that with the code there it will be easier to use it for testing. Note that I refactored the …
@jhorstmann, most of the changes are implemented with #46 and #43, and the remaining ones are done in the arrow2 crate in jorgecarleitao/arrow2#352. Do you think we can close this?
Agree, the issues are now fixed and this can be closed.
Impressed by the code structure and how easy the logic is to understand, especially compared to other parquet libraries.
I tried it on some of the files we use at work and ran into some issues that I tried to fix here. The hybrid encoding can probably be solved in a cleaner way; I'm open to suggestions. The fixes are: