Skip to content
This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

perf: elide unneeded offset checks in reading parquet binary/utf8 columns #1307

Closed
wants to merge 2 commits into from

Conversation

ritchie46
Copy link
Collaborator

This would close #1306 and uses a unsafe conversion which is actually sound because we have carefully constructed the Offsets and uphold its invariant. This saves 5% runtime in my local benchmark of reading utf8 columns. I expect the maximum performance increase is ~10% when we deal with one character strings.

pub fn into_inner(self) -> (ValidOffsets<O>, Vec<u8>) {
// Safety:
// the invariant that all offsets are monotonically increasing is upheld.
let offsets = unsafe { ValidOffsets::new_unchecked(self.offsets.0.into()) }.unwrap();
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The offsets is private and only constructed by us, so the invariant is guaranteed.

@codecov
Copy link

codecov bot commented Nov 28, 2022

Codecov Report

Base: 83.14% // Head: 83.11% // Decreases project coverage by -0.03% ⚠️

Coverage data is based on head (85110d5) compared to base (368aacc).
Patch coverage: 75.55% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1307      +/-   ##
==========================================
- Coverage   83.14%   83.11%   -0.04%     
==========================================
  Files         369      370       +1     
  Lines       40234    40314      +80     
==========================================
+ Hits        33454    33508      +54     
- Misses       6780     6806      +26     
Impacted Files Coverage Δ
src/array/mod.rs 64.93% <ø> (ø)
src/array/offsets.rs 55.55% <55.55%> (ø)
src/array/binary/mod.rs 90.30% <72.72%> (-2.23%) ⬇️
src/array/utf8/mod.rs 84.90% <73.91%> (-0.78%) ⬇️
src/array/specification.rs 91.50% <85.71%> (-0.89%) ⬇️
src/io/parquet/read/deserialize/binary/basic.rs 80.25% <100.00%> (-0.19%) ⬇️
src/io/parquet/read/deserialize/binary/utils.rs 68.80% <100.00%> (+1.81%) ⬆️
src/io/ipc/read/file.rs 96.42% <0.00%> (-0.45%) ⬇️
src/io/ipc/read/schema.rs 95.58% <0.00%> (-0.30%) ⬇️
... and 1 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@ritchie46
Copy link
Collaborator Author

There are some issues with building memchr that are unrelated to this PR. I shall try to run the pipeline later today.

@jorgecarleitao
Copy link
Owner

I was thinking about this and I think that what we are looking for here is to introduce struct Offsets(Vec<O>) and struct OffsetsBuffer(Buffer<O>) that upheld the invariants. We then change all arrays that contain offsets to use these structures accordingly, and stop validating offsets on the arrays themselves (they are validated by OffsetsBuffer and Offsets).

In other words, we elevate the importance of ValidOffsets in this PR to a core struct of the arrays (and introduce the corresponding mutable version).

What do you think about this approach? I think in json we have this same situation atm.

I can work on this.

@ritchie46
Copy link
Collaborator Author

struct Offsets(Vec) and struct OffsetsBuffer(Buffer) that upheld the invariants.

Yeap, I have been thinking in the same direction, so must be a good idea. ^^

@jorgecarleitao
Copy link
Owner

Draft PR for that: #1316

@ritchie46
Copy link
Collaborator Author

Superseeded by #1316

@ritchie46 ritchie46 closed this Dec 6, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Reconsider forbid unsafe
2 participants