Improved bitpacking #176

Merged
merged 4 commits into from
Aug 15, 2022

Conversation

jorgecarleitao
Owner

@jorgecarleitao jorgecarleitao commented Aug 12, 2022

This PR ports apache/arrow-rs#2278 to parquet2. Credit for the design and implementation of the unpacking path goes to @tustvold; it is 5-10% faster than the bitpacking crate 🚀

Additionally, it adds the corresponding packing code path, thereby completely replacing the dependency on bitpacking.
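For readers unfamiliar with the encoding, here is a minimal sketch of what packing and unpacking mean, assuming the least-significant-bit-first layout that Parquet uses for bit-packed runs. The function names `pack` and `unpack` are hypothetical and this is not the code in this PR: the real implementation works on fixed-size blocks and is much faster, while this version goes bit by bit for clarity. Like the PR, it uses only safe Rust.

```rust
/// Packs each value into `num_bits` bits, least-significant bit first.
/// Illustrative only: assumes 1 <= num_bits <= 32 and that every value
/// fits in `num_bits` bits (higher bits are silently dropped).
fn pack(values: &[u32], num_bits: usize) -> Vec<u8> {
    assert!((1..=32).contains(&num_bits));
    // Round the total number of bits up to whole bytes.
    let mut out = vec![0u8; (values.len() * num_bits + 7) / 8];
    for (i, &value) in values.iter().enumerate() {
        for bit in 0..num_bits {
            if (value >> bit) & 1 == 1 {
                let pos = i * num_bits + bit;
                out[pos / 8] |= 1 << (pos % 8);
            }
        }
    }
    out
}

/// Reads `length` values of `num_bits` bits each back out of `packed`.
fn unpack(packed: &[u8], num_bits: usize, length: usize) -> Vec<u32> {
    assert!((1..=32).contains(&num_bits));
    assert!(packed.len() * 8 >= length * num_bits);
    (0..length)
        .map(|i| {
            let mut value = 0u32;
            for bit in 0..num_bits {
                let pos = i * num_bits + bit;
                if (packed[pos / 8] >> (pos % 8)) & 1 == 1 {
                    value |= 1u32 << bit;
                }
            }
            value
        })
        .collect()
}
```

With this layout, the values 0 through 7 packed with 3 bits each occupy exactly three bytes, and `unpack(&pack(&values, 3), 3, 8)` returns the original values.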

It also adds some traits that allow code to be written generically.

A curious observation is that, with this PR, parquet2 no longer executes unsafe code (bitpacking had some) 🎉

Backward incompatible changes:

  • renamed parquet2::encoding::bitpacking to parquet2::encoding::bitpacked
  • parquet2::encoding::bitpacked::Decoder now has a generic parameter (output type)
  • parquet2::encoding::bitpacked::Decoder::new's second parameter is now a usize
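To give a feel for what a decoder that is generic over its output type can look like, here is a rough sketch. Everything in it is an assumption made for illustration: the trait `Unpackable`, its `set_bit` method, the struct fields, and the third `length` parameter of `new` are hypothetical and are not parquet2's actual definitions. Only the idea of a generic output type and a `usize` second parameter comes from the list above.

```rust
use std::marker::PhantomData;

/// Hypothetical trait for integer types that can receive unpacked bits.
trait Unpackable: Copy + Default {
    /// Width of the type in bits.
    const BITS: usize;
    /// Returns `self` with bit number `bit` set.
    fn set_bit(self, bit: usize) -> Self;
}

impl Unpackable for u32 {
    const BITS: usize = 32;
    fn set_bit(self, bit: usize) -> Self {
        self | (1u32 << bit)
    }
}

impl Unpackable for u64 {
    const BITS: usize = 64;
    fn set_bit(self, bit: usize) -> Self {
        self | (1u64 << bit)
    }
}

/// Hypothetical decoder: an iterator over `length` values of `num_bits` bits.
struct Decoder<'a, T: Unpackable> {
    packed: &'a [u8],
    num_bits: usize,
    index: usize,
    length: usize,
    _marker: PhantomData<T>,
}

impl<'a, T: Unpackable> Decoder<'a, T> {
    /// The second parameter is a `usize`, mirroring the change listed above.
    fn new(packed: &'a [u8], num_bits: usize, length: usize) -> Self {
        assert!(num_bits <= T::BITS);
        assert!(packed.len() * 8 >= num_bits * length);
        Self { packed, num_bits, index: 0, length, _marker: PhantomData }
    }
}

impl<'a, T: Unpackable> Iterator for Decoder<'a, T> {
    type Item = T;

    fn next(&mut self) -> Option<T> {
        if self.index == self.length {
            return None;
        }
        let start = self.index * self.num_bits;
        let mut value = T::default();
        // Bit-by-bit for clarity; a real decoder would unpack whole blocks.
        for bit in 0..self.num_bits {
            let pos = start + bit;
            if (self.packed[pos / 8] >> (pos % 8)) & 1 == 1 {
                value = value.set_bit(bit);
            }
        }
        self.index += 1;
        Some(value)
    }
}
```

Under these assumptions, the caller picks the output type at the call site, for example `Decoder::<u32>::new(&bytes, 3, 8)` or `Decoder::<u64>::new(&bytes, 3, 8)`, and collects the iterator into a `Vec` of that type.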

@codecov-commenter

codecov-commenter commented Aug 12, 2022

Codecov Report

Merging #176 (1f597af) into main (d1c012c) will increase coverage by 0.03%.
The diff coverage is 89.81%.

@@            Coverage Diff             @@
##             main     #176      +/-   ##
==========================================
+ Coverage   85.29%   85.32%   +0.03%     
==========================================
  Files          78       82       +4     
  Lines        7916     8110     +194     
==========================================
+ Hits         6752     6920     +168     
- Misses       1164     1190      +26     
| Impacted Files | Coverage Δ |
|---|---|
| src/encoding/mod.rs | 100.00% <ø> (ø) |
| src/write/file.rs | 92.89% <ø> (-0.05%) ⬇️ |
| src/encoding/bitpacked/mod.rs | 71.27% <71.27%> (ø) |
| src/encoding/bitpacked/unpack.rs | 88.57% <88.57%> (ø) |
| src/encoding/bitpacked/pack.rs | 89.39% <89.39%> (ø) |
| src/encoding/bitpacked/encode.rs | 96.77% <96.77%> (ø) |
| src/encoding/bitpacked/decode.rs | 98.47% <98.47%> (ø) |
| src/encoding/delta_bitpacked/decoder.rs | 99.22% <100.00%> (ø) |
| src/encoding/delta_bitpacked/encoder.rs | 100.00% <100.00%> (ø) |
| src/encoding/delta_bitpacked/mod.rs | 100.00% <100.00%> (+27.69%) ⬆️ |

... and 5 more

@jorgecarleitao
Owner Author

jorgecarleitao commented Aug 12, 2022

@tustvold, you may wish to port the packing part to parquet.

If that is ok with you, the packing and unpacking functions that we wrote could live in a separate crate. I think they are well encapsulated, and folks who want this encoding elsewhere could benefit from it. I, for one, would like to use them in https://github.com/DataEngineeringLabs/orc-format, since ORC also has bitpacked runs.

@jorgecarleitao jorgecarleitao marked this pull request as draft August 12, 2022 19:00
@jorgecarleitao jorgecarleitao marked this pull request as ready for review August 14, 2022 21:45
@jorgecarleitao jorgecarleitao merged commit f11f3d9 into main Aug 15, 2022
jorgecarleitao added a commit that referenced this pull request Aug 15, 2022