
Improvements for dictionary encodings and nested values #7

Conversation

jhorstmann
Contributor

Impressed by the code structure and how easy to understand the logic is, especially compared to other parquet libraries.

I tried it on some of the files we use at work and ran into some issues that I have tried to fix here. The hybrid encoding can probably be handled in a cleaner way; I'm open to suggestions. The fixes are:

  • support for plain encoded binary columns (used when all/most values in the column are distinct)
  • handle null values in the hybrid rle/bitpacked dictionary encoding
  • my understanding of the format spec is that dictionary keys always use the hybrid rle/bitpacked encoding, not only bitpacking
  • the max definition level of nested columns depends on whether the column or the nested elements are required and known to be non-null
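The hybrid run header mentioned in the second and third points can be sketched as follows. This is a minimal illustration of one reading of the spec, not parquet2's actual code: the names `Run` and `read_run` are hypothetical, and the ULEB128 header is assumed to fit in a single byte.

```rust
// Minimal sketch of the hybrid RLE/bit-packed run header, assuming the
// ULEB128 header fits in one byte. The lowest bit selects the encoding.
#[derive(Debug, PartialEq)]
enum Run<'a> {
    /// (repeated value bytes, number of repetitions)
    Rle(&'a [u8], usize),
    /// raw bit-packed bytes covering `groups * 8` values
    Bitpacked(&'a [u8]),
}

fn read_run<'a>(data: &'a [u8], num_bits: usize) -> (Run<'a>, &'a [u8]) {
    let indicator = data[0] as usize;
    let rest = &data[1..];
    if indicator & 1 == 1 {
        // Bit-packed: the remaining header bits give the number of
        // 8-value groups, each occupying `num_bits` bytes.
        let bytes = (indicator >> 1) * num_bits;
        (Run::Bitpacked(&rest[..bytes]), &rest[bytes..])
    } else {
        // RLE: the remaining header bits give the repetition count;
        // the repeated value follows in ceil(num_bits / 8) bytes.
        let width = (num_bits + 7) / 8;
        (Run::Rle(&rest[..width], indicator >> 1), &rest[width..])
    }
}

fn main() {
    // Header 0b00001000 encodes an RLE run: count = 8 >> 1 = 4,
    // followed by the 1-byte repeated value 0b00000001.
    let data = [0b0000_1000u8, 0b0000_0001];
    let (run, rest) = read_run(&data, 1);
    assert_eq!(run, Run::Rle(&[1][..], 4));
    assert!(rest.is_empty());
}
```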

… dictionary ids and nested primitives where some levels are required
let values = vec![
4, 0, 0, 0, // length
0b00001000, // data
0b00000001, 0b00000011, 0b00001010,
Contributor Author

These values reflect my understanding of the encoding; I have not yet validated them.

let bytes = (indicator as usize >> 1) * self.num_bits as usize;
let result = Some(HybridEncoded::Bitpacked(&self.values[..bytes]));
self.values = &self.values[bytes..];
let num_bits = self.num_bits as usize;
Contributor Author

These changes mostly correspond to what was previously in rle_decode in levels.rs

match max_def - def {
    2 => inner.push(None),
    1 => outer.push(Some(Array::Int64(vec![]))),
    0 => outer.push(None),
Contributor Author

I'm not fully sure this is correct. It should behave the same as before, but I think only the repetition levels should influence when a new array starts.
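For context, here is a worked sketch of the semantics under discussion, assuming a hypothetical `optional list<optional int64>` column (max definition level 3) and using repetition level 0 as the only trigger for starting a new outer entry. The names are illustrative, not the PR's actual code:

```rust
// Hypothetical sketch: decoding (rep, def) level pairs for an
// `optional list<optional int64>` column, where max_def = 3:
//   def 0 -> the list itself is null
//   def 1 -> the list is present but empty
//   def 2 -> an element exists but is null
//   def 3 -> the element is present
// Only rep == 0 starts a new outer entry.
fn decode(
    levels: &[(i16, i16)],
    mut values: impl Iterator<Item = i64>,
) -> Vec<Option<Vec<Option<i64>>>> {
    let max_def = 3;
    let mut outer: Vec<Option<Vec<Option<i64>>>> = Vec::new();
    for &(rep, def) in levels {
        if rep == 0 {
            match max_def - def {
                3 => { outer.push(None); continue; }         // null list
                2 => { outer.push(Some(vec![])); continue; } // empty list
                _ => outer.push(Some(vec![])),               // list with items
            }
        }
        let inner = outer.last_mut().unwrap().as_mut().unwrap();
        if def == max_def {
            inner.push(Some(values.next().unwrap()));
        } else {
            inner.push(None); // null element inside the list
        }
    }
    outer
}

fn main() {
    // rows: [1, null], null, [], [2]
    let levels = [(0, 3), (1, 2), (0, 0), (0, 1), (0, 3)];
    let out = decode(&levels, vec![1i64, 2].into_iter());
    assert_eq!(
        out,
        vec![Some(vec![Some(1), None]), None, Some(vec![]), Some(vec![Some(2)])]
    );
}
```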

Owner

@jorgecarleitao jorgecarleitao left a comment

Wow, thank you so much for this PR, @jhorstmann, what an honor!

I have reviewed it and left a comment. I think that we need to think this through to cover both our use-cases.

I left some comments. Do you think it would be possible to divide this PR in smaller parts? It is touching a couple of things at the same time.

Comment on lines +64 to +70

let pack = &self.values[0..rle_bytes];
let mut value_bytes = [0u8; std::mem::size_of::<u32>()];
pack.iter()
.enumerate()
.for_each(|(i, byte)| value_bytes[i] = *byte);
let value = u32::from_le_bytes(value_bytes);
Owner

So, the reason I do not want to push this here is that, in the arrow format, validity is stored in bytes as LSB, which is the same as parquet when rep level = 0. This particular change would require a round trip <u8 in LSB> -> i32 -> <u8 in LSB> for non-nested types, which incurs a (significant) performance hit.

This is why I tried to offer the "raw" bytes here, and provide a higher-level API based on i32 closer to the deserialization.

So, the idea was: if people want the raw stuff, use the decoder, else, use the rle_decode in levels.rs. However, I do agree that rle_decode is suboptimal as it allocates. What do you think about keeping this decoder as is, and instead have rle_decode implemented as returning an iterator? It may solve both of our use-cases?
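One possible shape for that iterator, sketched under simplifying assumptions: the `HybridEncoded` variants and the `levels` function are illustrative rather than parquet2's API, and `num_bits` is assumed to be 1, 2, 4, or 8 so that no value straddles a byte boundary.

```rust
// Sketch of rle_decode as a lazy iterator over levels, built on top of
// runs of either variant. Names are illustrative, not parquet2's API.
enum HybridEncoded<'a> {
    /// (value, number of repetitions)
    Rle(u32, usize),
    /// raw bytes holding `num_bits`-wide values, LSB first
    Bitpacked(&'a [u8]),
}

fn levels<'a>(
    runs: impl Iterator<Item = HybridEncoded<'a>> + 'a,
    num_bits: usize,
) -> impl Iterator<Item = u32> + 'a {
    runs.flat_map(move |run| -> Box<dyn Iterator<Item = u32> + 'a> {
        match run {
            HybridEncoded::Rle(value, count) => {
                Box::new(std::iter::repeat(value).take(count))
            }
            HybridEncoded::Bitpacked(bytes) => {
                // assumes num_bits divides 8, so no value crosses a byte
                let mask = (1u32 << num_bits) - 1;
                Box::new((0..bytes.len() * 8 / num_bits).map(move |i| {
                    let bit = i * num_bits;
                    (bytes[bit / 8] as u32 >> (bit % 8)) & mask
                }))
            }
        }
    })
}

fn main() {
    let bytes = [0b0000_1010u8];
    let runs = vec![HybridEncoded::Rle(1, 3), HybridEncoded::Bitpacked(&bytes)];
    let out: Vec<u32> = levels(runs.into_iter(), 1).collect();
    assert_eq!(out, vec![1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0]);
}
```

Because the runs are only expanded lazily, a consumer that wants the raw bytes can still match on the runs directly, while level-based deserializers get the i32-style view without an intermediate allocation.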

Contributor Author

What do you think about keeping this decoder as is, and instead have rle_decode implemented as returning an iterator? It may solve both of our use-cases?

That sounds like it should work; all users seem to call into_iter on the rle-decoded vec anyway. It should then even be possible to expose the iterator directly in the API, instead of first creating e.g. a Vec<Option<Vec<u8>>> whose contents then have to be copied again into the arrow format.

Contributor Author

Now that I've realized that arrow2 is using the lower-level API, I understand what you actually meant :)
Do you think the code in arrow2 could be simplified by exposing an iterator API in parquet2, without losing performance?

Owner

So, we already have an iterator API, that returns either variant, RLE or Bitpacked. This is made so that arrow2 can use it without having to incur the cost of bit-unpacking the bitpacked to then compare an i32 to 1 (max def level).

There will be a performance degradation if we move to iterators over i32, which is why I was trying to suggest keeping the decoder as is and offering a higher-level iterator over it with i32 (the levels.rs).
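To illustrate the point being made here, a consumer can answer questions like "how many nulls?" from the raw runs without unpacking any individual levels. This is purely a sketch with illustrative names, assuming a flat column (max definition level 1) and bit-packed runs covering a multiple of 8 levels:

```rust
// Sketch: counting nulls directly from the raw runs when the max
// definition level is 1 (flat column), without bit-unpacking to i32.
enum HybridEncoded<'a> {
    /// (value, number of repetitions)
    Rle(u32, usize),
    /// one definition level per bit, LSB first
    Bitpacked(&'a [u8]),
}

fn count_nulls(runs: &[HybridEncoded], len: usize) -> usize {
    let mut nulls = 0;
    let mut remaining = len;
    for run in runs {
        match run {
            HybridEncoded::Rle(value, count) => {
                let n = (*count).min(remaining);
                if *value == 0 {
                    nulls += n; // a run of zero levels is a run of nulls
                }
                remaining -= n;
            }
            HybridEncoded::Bitpacked(bytes) => {
                let n = (bytes.len() * 8).min(remaining);
                // popcount per byte instead of per-value unpacking
                let valid: usize =
                    bytes.iter().map(|b| b.count_ones() as usize).sum();
                nulls += n - valid;
                remaining -= n;
            }
        }
    }
    nulls
}

fn main() {
    let runs = [
        HybridEncoded::Rle(1, 3),                 // 3 valid
        HybridEncoded::Rle(0, 2),                 // 2 nulls
        HybridEncoded::Bitpacked(&[0b0000_1010]), // 2 valid, 6 nulls
    ];
    assert_eq!(count_nulls(&runs, 13), 8);
}
```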

@jhorstmann
Contributor Author

Do you think it would be possible to divide this PR in smaller parts? It is touching a couple of things at the same time.

Yes, agreed; this turned into a bit of a yak-shaving exercise :) I'll split out the plain binary encoding, as that seems least controversial. The nested decoding also belongs in a separate PR, and it seems I have to check the Python tests for that too.

@jhorstmann
Contributor Author

@jorgecarleitao I saw that the code for reading into the Array enum with nested vectors was moved to the integration-tests crate, making it clearer that it is not the implementation used in production. Would you still be interested in a PR for those tests that adds support for required lists and lists with required items?

@jorgecarleitao
Owner

Sorry for not taking this PR onwards :(

I tried to split it out from the code because it was a bit confusing what that serialization was about; I hope that is OK with you.

Definitely, I think that with the code there it will be easier to use it for testing.

Note that I refactored the levels module to remain in the main package and included an iterator over the levels, which can now be used by other deserializers (including the integration tests).

@jorgecarleitao
Owner

@jhorstmann, most of the changes are implemented with #46 and #43, and the remaining ones are done in the arrow2 crate in jorgecarleitao/arrow2#352. Do you think we can close this?

@jhorstmann
Contributor Author

Agreed, the issues are now fixed and this can be closed.

@jhorstmann jhorstmann closed this Aug 28, 2021