
Improvements for dictionary encodings and nested values #7

Conversation

jhorstmann
Contributor

Impressed by the code structure and how easy to understand the logic is, especially compared to other parquet libraries.

I tried it on some of the files we use at work and ran into some issues that I have tried to fix here. The hybrid encoding can probably be handled in a cleaner way; I'm open to suggestions. The fixes are:

  • support for plain encoded binary columns (used when all/most values in the column are distinct)
  • handle null values in the hybrid rle/bitpacked dictionary encoding
  • my understanding of the format spec is that dictionary keys always use the hybrid rle/bitpacked encoding, not only bitpacking
  • the max definition level of nested columns depends on whether the column or the nested elements are required and known to be non-null
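The hybrid run header mentioned in the second and third points can be sketched as follows. This is a minimal illustration of one reading of the spec, not parquet2's actual code: the names `Run` and `read_run` are hypothetical, and the ULEB128 header is assumed to fit in a single byte.

```rust
// Minimal sketch of the hybrid RLE/bit-packed run header, assuming the
// ULEB128 header fits in one byte. The lowest bit selects the encoding.
#[derive(Debug, PartialEq)]
enum Run<'a> {
    /// (repeated value bytes, number of repetitions)
    Rle(&'a [u8], usize),
    /// raw bit-packed bytes covering `groups * 8` values
    Bitpacked(&'a [u8]),
}

fn read_run<'a>(data: &'a [u8], num_bits: usize) -> (Run<'a>, &'a [u8]) {
    let indicator = data[0] as usize;
    let rest = &data[1..];
    if indicator & 1 == 1 {
        // Bit-packed: the remaining header bits give the number of
        // 8-value groups, each occupying `num_bits` bytes.
        let bytes = (indicator >> 1) * num_bits;
        (Run::Bitpacked(&rest[..bytes]), &rest[bytes..])
    } else {
        // RLE: the remaining header bits give the repetition count;
        // the repeated value follows in ceil(num_bits / 8) bytes.
        let width = (num_bits + 7) / 8;
        (Run::Rle(&rest[..width], indicator >> 1), &rest[width..])
    }
}

fn main() {
    // Header 0b00001000 encodes an RLE run: count = 8 >> 1 = 4,
    // followed by the 1-byte repeated value 0b00000001.
    let data = [0b0000_1000u8, 0b0000_0001];
    let (run, rest) = read_run(&data, 1);
    assert_eq!(run, Run::Rle(&[1][..], 4));
    assert!(rest.is_empty());
}
```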

… dictionary ids and nested primitives where some levels are required
let values = vec![
4, 0, 0, 0, // length
0b00001000, // data
0b00000001, 0b00000011, 0b00001010,
Contributor Author

These values reflect my understanding of the encoding; I have not yet validated them.

let bytes = (indicator as usize >> 1) * self.num_bits as usize;
let result = Some(HybridEncoded::Bitpacked(&self.values[..bytes]));
self.values = &self.values[bytes..];
let num_bits = self.num_bits as usize;
Contributor Author

These changes mostly correspond to what was previously in rle_decode in levels.rs

match max_def - def {
    2 => inner.push(None),
    1 => outer.push(Some(Array::Int64(vec![]))),
    0 => outer.push(None),
Contributor Author

I'm not fully sure this is correct. It should behave the same as before, but I think only the repetition levels should influence when a new array starts.
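For context, here is a worked sketch of the semantics under discussion, assuming a hypothetical `optional list<optional int64>` column (max definition level 3) and using repetition level 0 as the only trigger for starting a new outer entry. The names are illustrative, not the PR's actual code:

```rust
// Hypothetical sketch: decoding (rep, def) level pairs for an
// `optional list<optional int64>` column, where max_def = 3:
//   def 0 -> the list itself is null
//   def 1 -> the list is present but empty
//   def 2 -> an element exists but is null
//   def 3 -> the element is present
// Only rep == 0 starts a new outer entry.
fn decode(
    levels: &[(i16, i16)],
    mut values: impl Iterator<Item = i64>,
) -> Vec<Option<Vec<Option<i64>>>> {
    let max_def = 3;
    let mut outer: Vec<Option<Vec<Option<i64>>>> = Vec::new();
    for &(rep, def) in levels {
        if rep == 0 {
            match max_def - def {
                3 => { outer.push(None); continue; }         // null list
                2 => { outer.push(Some(vec![])); continue; } // empty list
                _ => outer.push(Some(vec![])),               // list with items
            }
        }
        let inner = outer.last_mut().unwrap().as_mut().unwrap();
        if def == max_def {
            inner.push(Some(values.next().unwrap()));
        } else {
            inner.push(None); // null element inside the list
        }
    }
    outer
}

fn main() {
    // rows: [1, null], null, [], [2]
    let levels = [(0, 3), (1, 2), (0, 0), (0, 1), (0, 3)];
    let out = decode(&levels, vec![1i64, 2].into_iter());
    assert_eq!(
        out,
        vec![Some(vec![Some(1), None]), None, Some(vec![]), Some(vec![Some(2)])]
    );
}
```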

Owner

@jorgecarleitao jorgecarleitao left a comment

Wow, thank you so much for this PR, @jhorstmann, what an honor!

I have reviewed it and left a comment. I think that we need to think this through to cover both our use-cases.

I left some comments. Do you think it would be possible to divide this PR in smaller parts? It is touching a couple of things at the same time.

Comment on lines +64 to +70

let pack = &self.values[0..rle_bytes];
let mut value_bytes = [0u8; std::mem::size_of::<u32>()];
pack.iter()
.enumerate()
.for_each(|(i, byte)| value_bytes[i] = *byte);
let value = u32::from_le_bytes(value_bytes);
Owner

So, the reason I do not want to push this here is that, in the arrow format, validity is stored in bytes as LSB, which is the same as parquet when rep level = 0. This particular change would require a round trip <u8 in LSB> -> i32 -> <u8 in LSB> for non-nested types, which incurs a (significant) performance hit.

This is why I tried to offer the "raw" bytes here, and provide a higher-level API based on i32 closer to the deserialization.

So, the idea was: if people want the raw stuff, use the decoder, else, use the rle_decode in levels.rs. However, I do agree that rle_decode is suboptimal as it allocates. What do you think about keeping this decoder as is, and instead have rle_decode implemented as returning an iterator? It may solve both of our use-cases?
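One possible shape for that iterator, sketched under simplifying assumptions: the `HybridEncoded` variants and the `levels` function are illustrative rather than parquet2's API, and `num_bits` is assumed to be 1, 2, 4, or 8 so that no value straddles a byte boundary.

```rust
// Sketch of rle_decode as a lazy iterator over levels, built on top of
// runs of either variant. Names are illustrative, not parquet2's API.
enum HybridEncoded<'a> {
    /// (value, number of repetitions)
    Rle(u32, usize),
    /// raw bytes holding `num_bits`-wide values, LSB first
    Bitpacked(&'a [u8]),
}

fn levels<'a>(
    runs: impl Iterator<Item = HybridEncoded<'a>> + 'a,
    num_bits: usize,
) -> impl Iterator<Item = u32> + 'a {
    runs.flat_map(move |run| -> Box<dyn Iterator<Item = u32> + 'a> {
        match run {
            HybridEncoded::Rle(value, count) => {
                Box::new(std::iter::repeat(value).take(count))
            }
            HybridEncoded::Bitpacked(bytes) => {
                // assumes num_bits divides 8, so no value crosses a byte
                let mask = (1u32 << num_bits) - 1;
                Box::new((0..bytes.len() * 8 / num_bits).map(move |i| {
                    let bit = i * num_bits;
                    (bytes[bit / 8] as u32 >> (bit % 8)) & mask
                }))
            }
        }
    })
}

fn main() {
    let bytes = [0b0000_1010u8];
    let runs = vec![HybridEncoded::Rle(1, 3), HybridEncoded::Bitpacked(&bytes)];
    let out: Vec<u32> = levels(runs.into_iter(), 1).collect();
    assert_eq!(out, vec![1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0]);
}
```

Because the runs are only expanded lazily, a consumer that wants the raw bytes can still match on the runs directly, while level-based deserializers get the i32-style view without an intermediate allocation.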

Contributor Author

What do you think about keeping this decoder as is, and instead have rle_decode implemented as returning an iterator? It may solve both of our use-cases?

That sounds like it should work; all users seem to call into_iter on the rle-decoded vec anyway. It should then even be possible to expose the iterator directly in the API, instead of first creating e.g. a Vec<Option<Vec<u8>>> whose contents then have to be copied again into the arrow format.

Contributor Author

Now that I've realized that arrow2 is using the lower-level API, I understand what you actually meant :)
Do you think the code in arrow2 could be simplified by exposing an iterator API in parquet2, without losing performance?

Owner

So, we already have an iterator API, that returns either variant, RLE or Bitpacked. This is made so that arrow2 can use it without having to incur the cost of bit-unpacking the bitpacked to then compare an i32 to 1 (max def level).

There will be a performance degradation if we move to iterators over i32, which is why I was trying to suggest keeping the decoder as is and offering a higher-level iterator over it with i32 (the levels.rs).
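To illustrate the point being made here, a consumer can answer questions like "how many nulls?" from the raw runs without unpacking any individual levels. This is purely a sketch with illustrative names, assuming a flat column (max definition level 1) and bit-packed runs covering a multiple of 8 levels:

```rust
// Sketch: counting nulls directly from the raw runs when the max
// definition level is 1 (flat column), without bit-unpacking to i32.
enum HybridEncoded<'a> {
    /// (value, number of repetitions)
    Rle(u32, usize),
    /// one definition level per bit, LSB first
    Bitpacked(&'a [u8]),
}

fn count_nulls(runs: &[HybridEncoded], len: usize) -> usize {
    let mut nulls = 0;
    let mut remaining = len;
    for run in runs {
        match run {
            HybridEncoded::Rle(value, count) => {
                let n = (*count).min(remaining);
                if *value == 0 {
                    nulls += n; // a run of zero levels is a run of nulls
                }
                remaining -= n;
            }
            HybridEncoded::Bitpacked(bytes) => {
                let n = (bytes.len() * 8).min(remaining);
                // popcount per byte instead of per-value unpacking
                let valid: usize =
                    bytes.iter().map(|b| b.count_ones() as usize).sum();
                nulls += n - valid;
                remaining -= n;
            }
        }
    }
    nulls
}

fn main() {
    let runs = [
        HybridEncoded::Rle(1, 3),                 // 3 valid
        HybridEncoded::Rle(0, 2),                 // 2 nulls
        HybridEncoded::Bitpacked(&[0b0000_1010]), // 2 valid, 6 nulls
    ];
    assert_eq!(count_nulls(&runs, 13), 8);
}
```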

@jhorstmann
Contributor Author

Do you think it would be possible to divide this PR in smaller parts? It is touching a couple of things at the same time.

Yes, agreed; this turned into a bit of a yak-shaving exercise :) I'll split out the plain binary encoding, as that seems least controversial. The nested decoding also belongs in a separate PR, and it seems I have to check the Python tests for that too.

@jhorstmann
Contributor Author

@jorgecarleitao I saw that the code for reading into the Array enum with nested vectors was moved to the integration-tests crate, making it clearer that it is not the implementation used in production. Would you still be interested in a PR for those tests that adds support for required lists and lists with required items?

@jorgecarleitao
Owner

Sorry for not taking this PR onwards :(

I tried to split it out from the code because it was a bit confusing what that serialization was about; I hope that is OK with you.

Definitely, I think that with the code there it will be easier to use it for testing.

Note that I refactored the levels module to remain in the main package and included an iterator over the levels, which can now be used by other deserializers (including the integration tests).

@jorgecarleitao
Owner

@jhorstmann, most of the changes are implemented with #46 and #43, and the remaining ones are done in the arrow2 crate in jorgecarleitao/arrow2#352. Do you think we can close this?

@jhorstmann
Contributor Author

Agreed, the issues are now fixed and this can be closed.

@jhorstmann jhorstmann closed this Aug 28, 2021