
Added support for legacy lz4 decompression #151

Merged

Conversation

dantengsky
Contributor

Not really a PR, just testing CI.

@dantengsky
Contributor Author

Sorry, I should have opened this PR in another repo first.

@dantengsky dantengsky closed this Jun 22, 2022
@dantengsky
Contributor Author

PR is ready for review.

Summary

For issue:

A new feature, lz4_with_legacy_decompression, is introduced, which enables the legacy lz4 decompression algorithm (and Lz4Raw as well).
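For illustration, an opt-in Cargo feature like this would typically be enabled by a downstream crate as follows (the crate version and exact feature wiring here are assumptions for illustration, not taken from the PR):

```toml
# Hypothetical downstream Cargo.toml fragment: opting in to the
# lz4_with_legacy_decompression feature (version number is illustrative).
[dependencies]
parquet2 = { version = "0.14", features = ["lz4_with_legacy_decompression"] }
```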

@dantengsky dantengsky reopened this Jun 22, 2022
@codecov-commenter

codecov-commenter commented Jun 22, 2022

Codecov Report

Merging #151 (c128f20) into main (34aac65) will decrease coverage by 0.52%.
The diff coverage is 86.84%.

@@            Coverage Diff             @@
##             main     #151      +/-   ##
==========================================
- Coverage   74.54%   74.02%   -0.53%     
==========================================
  Files          78       78              
  Lines        3646     3753     +107     
==========================================
+ Hits         2718     2778      +60     
- Misses        928      975      +47     
Impacted Files Coverage Δ
src/compression.rs 85.55% <86.84%> (+0.94%) ⬆️
src/encoding/plain_byte_array.rs 87.50% <0.00%> (-12.50%) ⬇️
src/page/mod.rs 70.16% <0.00%> (-11.75%) ⬇️
src/page/page_dict/fixed_len_binary.rs 55.00% <0.00%> (-9.71%) ⬇️
src/read/page/reader.rs 84.68% <0.00%> (-8.18%) ⬇️
src/page/page_dict/binary.rs 75.00% <0.00%> (-6.49%) ⬇️
src/encoding/hybrid_rle/decoder.rs 95.65% <0.00%> (-4.35%) ⬇️
src/deserialize/utils.rs 88.09% <0.00%> (-3.80%) ⬇️
src/schema/io_thrift/from_thrift.rs 82.14% <0.00%> (-1.50%) ⬇️
src/encoding/mod.rs 100.00% <0.00%> (ø)
... and 6 more


Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@jorgecarleitao
Owner

A couple of points:

  • On the read side, would it make sense to infer which format is being used and act based on that?
  • On the write side, I think that this would be better supported via a new variant on the Enum with compression - something like LegacyLz4 and Lz4.
  • We will need to think about how we are going to test this behavior - ideally we should write a compressed file and try to read it from pyarrow or spark - the roundtrip of the algorithm itself is imo insufficient because it may not roundtrip with other implementations.
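The second bullet, about a new enum variant, could be sketched roughly like this (a hedged illustration of the suggestion; the variant set, names, and helper method are assumptions, not the crate's actual definition):

```rust
// Sketch of the reviewer's suggestion: distinguish the two LZ4 encodings
// with separate variants on the compression enum. Illustrative only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Compression {
    Uncompressed,
    Snappy,
    /// Legacy, Hadoop-framed LZ4 (the historical `LZ4` codec in parquet-format).
    Lz4,
    /// Raw LZ4 block format (`LZ4_RAW`).
    Lz4Raw,
}

impl Compression {
    /// Whether the read path needs the legacy decompression handling.
    /// (Hypothetical helper, for illustration.)
    pub fn is_legacy_lz4(self) -> bool {
        matches!(self, Compression::Lz4)
    }
}
```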

@dantengsky dantengsky force-pushed the feat-legacy-lz4 branch 2 times, most recently from c128f20 to c23433e Compare June 26, 2022 16:55
@dantengsky
Contributor Author

dantengsky commented Jun 26, 2022

Apologies for the late response!

Apparently, I failed to describe this PR clearly :(. By legacy, I mean the lz4 decompression behavior before PR

But it seems the Lz4HadoopCodec is of more interest? I'd like to give it a try.

A couple of points:

  • On the read side, would it make sense to infer which format is being used and act based on that?

Hope I've got it right; do you mean something like this?

https://github.com/apache/arrow/blob/bf18e6e4b5bb6180706b1ba0d597a65a4ce5ca48/cpp/src/arrow/util/compression_lz4.cc#L417-L418
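The Arrow C++ code linked above heuristically detects the Hadoop framing before falling back to other LZ4 formats. The Hadoop codec prefixes each block with two big-endian u32 values (uncompressed size, compressed size), so a detection check can test whether those prefixes are consistent with the buffer. A rough sketch of that idea (the function name and exact consistency checks are illustrative, not the Arrow implementation):

```rust
/// Heuristic sketch of Hadoop-LZ4 framing detection, loosely modeled on the
/// Arrow C++ check linked above. The Hadoop codec prefixes each block with
/// two big-endian u32 values: the uncompressed size and the compressed size.
fn looks_like_hadoop_lz4(input: &[u8], expected_uncompressed_size: usize) -> bool {
    if input.len() < 8 {
        // Too short to hold the two length prefixes.
        return false;
    }
    let uncompressed = u32::from_be_bytes(input[0..4].try_into().unwrap()) as usize;
    let compressed = u32::from_be_bytes(input[4..8].try_into().unwrap()) as usize;
    // Both prefixed sizes must be consistent with the buffer we actually hold;
    // otherwise assume the data is in a different LZ4 format.
    uncompressed <= expected_uncompressed_size && compressed <= input.len() - 8
}
```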

  • On the write side, I think that this would be better supported via a new variant on the Enum with compression - something like LegacyLz4 and Lz4.

IMHO, Lz4 compression codecs other than Lz4Raw should not be used; they've brought too many troubles :)

  • We will need to think about how we are going to test this behavior - ideally we should write a compressed file and try to read it from pyarrow or spark - the roundtrip of the algorithm itself is imo insufficient because it may not roundtrip with other implementations.

Yeah, I do not have a better idea for generating legacy lz4 compressed data; it seems the lz4 of the latest pyarrow means Lz4Raw?
Maybe C++/Java could be used in the CI, though that seems too "heavy".

Currently, three files of parquet-testing are used in the testing of decompression:

Definitely need your guidance here

@jorgecarleitao
Owner

I am 100% with you -

  • infer like cpp does
  • let's not write the old format and only support the read path

We are testing in integration against c++:

The roundtrip test over our writer is here: https://github.com/jorgecarleitao/parquet2/blob/main/tests/it/write/mod.rs#L125

Yeah, I do not have a better idea for generating legacy lz4 compressed data; it seems the lz4 of the latest pyarrow means Lz4Raw?

We could use a lower version of pyarrow - a new step in the CI where we install an older version of pyarrow (imo it is a good test to confirm that we support older versions, especially when they change like that).
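Such a CI step could look roughly like this (a GitHub Actions sketch; the pinned pyarrow version and the generator script path are assumptions, and the exact version that still emits the old framing would need checking):

```yaml
# Hypothetical CI step: install an older pyarrow that still writes
# Hadoop-framed LZ4, generate test files, then read them back in the
# integration tests.
- name: Generate legacy-lz4 parquet files
  run: |
    pip install "pyarrow==3.0.0"            # version pin is illustrative
    python tests/generate_legacy_lz4.py      # hypothetical helper script
```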

@dantengsky
Contributor Author

We could use a lower version of pyarrow - a new step in the CI where we install an older version of pyarrow (imo it is a good test to confirm that we support older versions, especially when they change like that).

Thanks, I'll try it

@jorgecarleitao jorgecarleitao merged commit 4319b05 into jorgecarleitao:main Jul 2, 2022
@jorgecarleitao jorgecarleitao added the feature A new feature label Jul 2, 2022
@jorgecarleitao
Owner

This is amazing. Thank you so much, @dantengsky 🙇

@jorgecarleitao jorgecarleitao changed the title Supports legacy lz4 decompression Added support for legacy lz4 decompression Jul 2, 2022
@dantengsky
Contributor Author

@jorgecarleitao appreciate your kindness

@jorgecarleitao
Owner

FYI, I released 0.14.1 with this in, since this is backward compatible and there was nothing blocking us from doing so.
