Support for plain encoded binary in data pages #8
This would allow pushing this down to parquet2, right?
I have to admit I did not notice the parquet specific code in arrow2 before, I was assuming arrow2 used the Adding support for writing plain binary data here surely makes sense. I can base it on the
ahaha, yeah, the I do not want to tie
@@ -85,3 +89,9 @@ def write_pyarrow(case, size = 1, page_version = 1):
write_pyarrow(case_basic_required, 1, 2) # V2

write_pyarrow(case_nested, 1, 1)

# pyarrow seems to write corrupt file when disabling dictionary encoding for nullable strings
@jorgecarleitao I added tests but seem to have run into a problem with pyarrow itself. Have you seen something like this before? For now I have ignored the corresponding test.
Codecov Report
@@ Coverage Diff @@
## main #8 +/- ##
==========================================
- Coverage 77.53% 77.07% -0.47%
==========================================
Files 57 60 +3
Lines 2711 2822 +111
==========================================
+ Hits 2102 2175 +73
- Misses 609 647 +38
Thanks a lot, @jhorstmann, really great stuff and testing. I will report that pyarrow bug to JIRA.
Support for plain encoded binary columns (used when all/most values in the column are distinct), split out from #7.
I now remember that this is the same logic as in page_dict::binary::read_plain, so it might make sense to reuse the iterator there. The iterator should get a size_hint function so the correct capacity gets allocated.
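
To illustrate the size_hint suggestion, here is a minimal sketch of an iterator over plain encoded binary values. In the Parquet PLAIN encoding, each BYTE_ARRAY value is stored as a 4-byte little-endian length followed by that many bytes. The type name PlainBinaryIter and its constructor are hypothetical and are not taken from parquet2; the sketch also assumes the number of values is already known, for example from the page header.

```rust
/// Hypothetical iterator over PLAIN-encoded BYTE_ARRAY values.
/// This is an illustration, not the actual parquet2 implementation.
pub struct PlainBinaryIter<'a> {
    /// Bytes that have not been decoded yet.
    values: &'a [u8],
    /// Number of values still expected (assumed known from the page header).
    remaining: usize,
}

impl<'a> PlainBinaryIter<'a> {
    pub fn new(values: &'a [u8], num_values: usize) -> Self {
        Self { values, remaining: num_values }
    }
}

impl<'a> Iterator for PlainBinaryIter<'a> {
    type Item = &'a [u8];

    fn next(&mut self) -> Option<Self::Item> {
        let values: &'a [u8] = self.values;
        // Stop when all expected values were read or the length prefix is missing.
        if self.remaining == 0 || values.len() < 4 {
            return None;
        }
        let (prefix, rest) = values.split_at(4);
        let len = u32::from_le_bytes([prefix[0], prefix[1], prefix[2], prefix[3]]) as usize;
        // Stop early on a truncated buffer instead of panicking.
        if rest.len() < len {
            return None;
        }
        let (item, rest) = rest.split_at(len);
        self.values = rest;
        self.remaining -= 1;
        Some(item)
    }

    /// Reports the expected number of remaining values so that `collect`
    /// and `extend` can reserve capacity up front. The hint trusts the
    /// declared value count; a truncated buffer yields fewer items.
    fn size_hint(&self) -> (usize, Option<usize>) {
        (self.remaining, Some(self.remaining))
    }
}
```

With such a hint in place, collecting the iterator into a Vec can allocate once instead of growing repeatedly, which is the point of the suggestion above.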