-
Notifications
You must be signed in to change notification settings - Fork 867
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix DeltaBitPack MiniBlock Bit Width Padding #1418
Conversation
Ignore non-zero padded bit widths in DeltaBitPackDecoder (apache#1417)
I have tested Datafusion against this cherry-picked onto the 10.0.0 release and can confirm it fixes apache/datafusion#1976. The branch is here. I do wonder if it might be an interesting exercise for me to help cut a patch release for this so I can get some exposure to this process, thoughts @alamb? |
@@ -668,9 +667,9 @@ where | |||
|
|||
read += batch_read; | |||
self.mini_block_remaining -= batch_read; | |||
self.values_left -= batch_read; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the fix for #1417 - the problem was that as self.value_left
was only decremented outside the loop, the code in next_block
was unable to correctly determine when the bit widths were padding
@@ -581,6 +581,13 @@ impl<T: DataType> DeltaBitPackEncoder<T> { | |||
// values left | |||
let n = cmp::min(self.mini_block_size, self.values_in_block); | |||
if n == 0 { | |||
// Decoders should be agnostic to the padding value, we therefore use 0xFF |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the fix for #1416, I opted to pad with a non-zero value for tests so we have coverage of that scenario, but not produce such data outside of tests for better ecosystem-interop
@tustvold is this a regression introduced in parquet? If so, which version was it introduced in (9.1 maybe?) |
Was it introduced in #1284 maybe? |
Maybe we wait to make a backport to see if anyone else hits the issue before we spend time cutting a new release? I have no idea how prevalent it will be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this PR needs some tests
@@ -610,11 +617,7 @@ impl<T: DataType> DeltaBitPackEncoder<T> { | |||
self.values_in_block -= n; | |||
} | |||
|
|||
assert!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this change -- the older assert seemed to be more specific, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aah - I was under the impression assert_eq! logged the textual representation of the arguments, but it does not - will fix
By making it pad with Edit: I will see if I can come up with a standalone test for #1417 |
Codecov Report
@@ Coverage Diff @@
## master #1418 +/- ##
==========================================
- Coverage 83.17% 83.17% -0.01%
==========================================
Files 182 182
Lines 53439 53443 +4
==========================================
+ Hits 44449 44451 +2
- Misses 8990 8992 +2
Continue to review full report at Codecov.
|
I added a test for #1417 and whilst doing so fixed a bug where a truncated page could get the decoder stuck in an infinite loop |
@@ -652,6 +652,14 @@ where | |||
.bit_reader | |||
.get_batch(&mut buffer[read..read + batch_to_read], bit_width); | |||
|
|||
if batch_read != batch_to_read { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prior to this change, the decoder would get stuck in an infinite loop if given a truncated page. I discovered this by accident whilst writing test_delta_bit_packed_padding
6c02096
to
0b047cc
Compare
let mut decoder = DeltaBitPackDecoder::<Int32Type>::new(); | ||
decoder.set_data(ptr.clone(), 0).unwrap(); | ||
assert_eq!(decoder.get(&mut output).unwrap(), 419); | ||
assert_eq!(decoder.get_offset(), length); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without the fix for #1417 this would fail
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fix looks good to me. Thanks for identifying the bug!
+1 on having 9.1.1 and 10.0.1 branches for the fix.
parquet/src/encodings/decoding.rs
Outdated
@@ -444,7 +443,6 @@ pub struct DeltaBitPackDecoder<T: DataType> { | |||
values_per_mini_block: usize, | |||
|
|||
// Per block info | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: unnecessary change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought it was cleaner, but don't feel strongly so reverted
I am happy to make 9.1.1 and 10.0.1 branches and prepare the releases. @sunchao would you recommend we proactively release 9.1.1 and 10.0.1? If so I'll make a ticket and start the process. It is probably the right thing to do I suppose -- I was just dreading the voting process :) |
#1442 tracking the patch releases |
Which issue does this PR close?
Closes #1416
Closes #1417
Rationale for this change
Fixes apache/datafusion#1976
What changes are included in this PR?
Makes DeltaBitPackEncoder padding deterministic, and makes DeltaBitPackDecoder resilient to non-zero padding
Are there any user-facing changes?
Output that was previously somewhat non-deterministic, is now deterministic.