Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for plain encoded binary in data pages #8

Conversation

jhorstmann
Copy link
Contributor

Support for plain encoded binary columns (used when all/most values in the column are distinct), split out from #7.

I now remember that this is the same logic as in page_dict::binary::read_plain so it might make sense to reuse the iterator there. The iterator should get a size_hint function so the correct capacity gets allocated.

@jorgecarleitao
Copy link
Owner

This would allow to push this down to parquet2, right?

@jhorstmann
Copy link
Contributor Author

I have to admit I did not notice the parquet specific code in arrow2 before, I was assuming arrow2 used the serialization::read::Array enum from parquet2 but it's actually using the lower level apis. I was already wondering why the representation in that enum looked so different from the arrow data model.

Adding support for writing plain binary data here surely makes sense. I can base it on the Array enum for now but it should also be possible to make it work with Iterator or IntoIterator later.

@jorgecarleitao
Copy link
Owner

ahaha, yeah, the Array representation is used here because we need something to run tests against pyarrow and the like, which is useful to verify that the encoding / decoding logic makes sense. Arrow uses the low-level API. Sorry that this wasn't very clear, I need to document better this crate.

I do not want to tie parquet2 to a specific in-memory model so that other people can build on top of this without having to go through arrow. In other words, keep this crate faithful to the parquet format alone.

src/serialization/read/mod.rs Show resolved Hide resolved
src/encoding/plain_byte_array/decoder.rs Show resolved Hide resolved
src/encoding/plain_byte_array/decoder.rs Outdated Show resolved Hide resolved
src/encoding/plain_byte_array/decoder.rs Outdated Show resolved Hide resolved
src/encoding/plain_byte_array/decoder.rs Show resolved Hide resolved
@@ -85,3 +89,9 @@ def write_pyarrow(case, size = 1, page_version = 1):
write_pyarrow(case_basic_required, 1, 2) # V2

write_pyarrow(case_nested, 1, 1)

# pyarrow seems to write corrupt file when disabling dictionary encoding for nullable strings
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorgecarleitao I added tests but seem to have run into a problem with pyarrow itself. Have you seen something like this before? For now I have ignored the corresponding test.

@codecov-commenter
Copy link

Codecov Report

Merging #8 (5a9ec2c) into main (acc4763) will decrease coverage by 0.46%.
The diff coverage is 73.52%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main       #8      +/-   ##
==========================================
- Coverage   77.53%   77.07%   -0.47%     
==========================================
  Files          57       60       +3     
  Lines        2711     2822     +111     
==========================================
+ Hits         2102     2175      +73     
- Misses        609      647      +38     
Impacted Files Coverage Δ
src/encoding/mod.rs 100.00% <ø> (ø)
src/lib.rs 76.64% <55.88%> (-7.17%) ⬇️
src/encoding/plain_byte_array/decoder.rs 76.92% <76.92%> (ø)
src/serialization/read/binary.rs 97.43% <100.00%> (+1.43%) ⬆️
src/serialization/read/mod.rs 87.06% <100.00%> (+0.83%) ⬆️
src/read/page.rs 50.00% <0.00%> (-18.75%) ⬇️
src/read/page_dict/mod.rs 80.00% <0.00%> (-6.67%) ⬇️
src/error.rs 18.75% <0.00%> (-6.25%) ⬇️
src/serialization/read/primitive_nested.rs 57.69% <0.00%> (-3.07%) ⬇️
src/serialization/read/levels.rs 82.22% <0.00%> (-2.23%) ⬇️
... and 13 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update acc4763...5a9ec2c. Read the comment docs.

@jorgecarleitao jorgecarleitao merged commit 7aa9e4f into jorgecarleitao:main May 19, 2021
@jorgecarleitao
Copy link
Owner

Thanks a lot, @jhorstmann , reall great stuff and testing. I will report that pyarrow bug to JIRA

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants