perf: elide unneeded offset checks in reading parquet binary/utf8 columns #1307
pub fn into_inner(self) -> (ValidOffsets<O>, Vec<u8>) {
    // Safety:
    // the invariant that all offsets are monotonically increasing is upheld.
    let offsets = unsafe { ValidOffsets::new_unchecked(self.offsets.0.into()) }.unwrap();
The `offsets` field is private and only constructed by us, so the invariant is guaranteed.
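To make the argument in this comment concrete, here is a minimal sketch of the encapsulation it relies on. It is not the PR's code: `BinaryBuilder` is a hypothetical name, `ValidOffsets` is a simplified stand-in fixed to `i32`, and `new_unchecked` is assumed to return the value directly. The point is only that when the offsets buffer is private and every mutation appends a non-decreasing value, the unchecked conversion in `into_inner` cannot observe invalid offsets.

```rust
mod binary {
    use std::convert::TryFrom;

    /// Hypothetical, simplified stand-in for the PR's `ValidOffsets<O>` (fixed to `i32`).
    pub struct ValidOffsets(Vec<i32>);

    impl ValidOffsets {
        /// # Safety
        /// `offsets` must be non-empty and monotonically non-decreasing.
        pub unsafe fn new_unchecked(offsets: Vec<i32>) -> Self {
            Self(offsets)
        }

        pub fn as_slice(&self) -> &[i32] {
            &self.0
        }
    }

    /// Hypothetical builder. Both fields are private, so code outside this
    /// module cannot write an out-of-order offset.
    pub struct BinaryBuilder {
        offsets: Vec<i32>,
        values: Vec<u8>,
    }

    impl BinaryBuilder {
        pub fn new() -> Self {
            // The first offset is always 0.
            Self { offsets: vec![0], values: Vec::new() }
        }

        /// Appends one value. The new offset is the new total byte length,
        /// which is never smaller than the previous offset.
        pub fn push(&mut self, value: &[u8]) {
            // Panic before mutating anything rather than wrap around and break the invariant.
            let end = i32::try_from(self.values.len() + value.len())
                .expect("offset overflows i32");
            self.values.extend_from_slice(value);
            self.offsets.push(end);
        }

        pub fn into_inner(self) -> (ValidOffsets, Vec<u8>) {
            // SAFETY: `offsets` starts as `[0]` and `push` only appends
            // non-decreasing values, so the invariant holds by construction.
            let offsets = unsafe { ValidOffsets::new_unchecked(self.offsets) };
            (offsets, self.values)
        }
    }
}

pub use binary::{BinaryBuilder, ValidOffsets};
```

Pushing `b"a"`, `b""` and `b"bc"` and then calling `into_inner` yields the offsets `[0, 1, 1, 3]` alongside the bytes `abc`, without any validation pass over the offsets.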
Codecov Report
Base: 83.14% // Head: 83.11% // Decreases project coverage by
Additional details and impacted files
@@ Coverage Diff @@
## main #1307 +/- ##
==========================================
- Coverage 83.14% 83.11% -0.04%
==========================================
Files 369 370 +1
Lines 40234 40314 +80
==========================================
+ Hits 33454 33508 +54
- Misses 6780 6806 +26
There are some issues with building
I was thinking about this and I think that what we are looking for here is to introduce In other words, we elevate the importance of What do you think about this approach? I think in json we have this same situation atm. I can work on this.
Yep, I have been thinking in the same direction, so it must be a good idea. ^^
Draft PR for that: #1316
Superseded by #1316
This would close #1306 and uses an `unsafe` conversion which is actually sound because we have carefully constructed the `Offsets` and uphold its invariant. This saves 5% runtime in my local benchmark of reading utf8 columns. I expect the maximum performance increase is ~10% when we deal with one-character strings.
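One plausible reading of the one-character estimate, sketched below with hypothetical helpers (none of these names come from the PR): the check being elided is a linear pass over the offsets, so its cost relative to the string data is highest when every value is a single byte. The sketch only counts comparisons per byte; it does not measure runtime, so it illustrates the shape of the argument rather than the 5% or ~10% figures.

```rust
/// The kind of O(n) validation a checked constructor would run
/// (hypothetical helper, not the PR's implementation).
pub fn is_monotonic(offsets: &[i32]) -> bool {
    offsets.windows(2).all(|w| w[0] <= w[1])
}

/// Offsets for `count` strings of `len` bytes each: 0, len, 2 * len, ...
pub fn offsets_for(count: usize, len: usize) -> Vec<i32> {
    (0..=count).map(|i| (i * len) as i32).collect()
}

/// Offset comparisons needed per byte of string data.
/// The result is not finite when the offsets span zero bytes.
pub fn checks_per_byte(offsets: &[i32]) -> f64 {
    let comparisons = offsets.len().saturating_sub(1) as f64;
    let bytes = match (offsets.first(), offsets.last()) {
        (Some(first), Some(last)) => (last - first) as f64,
        _ => 0.0,
    };
    comparisons / bytes
}
```

For four one-byte strings, `checks_per_byte(&offsets_for(4, 1))` is `1.0`, one comparison per byte of data. For four eight-byte strings it drops to `0.125`, so the longer the strings, the smaller the share of work the skipped check represents.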