Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: try abandon internal parquet2 patches #6067

Merged

Conversation

dantengsky
Copy link
Member

@dantengsky dantengsky commented Jun 20, 2022

I hereby agree to the terms of the CLA available at: https://databend.rs/dev/policies/cla/

Summary

  • partially fixes:

    All internal patches except the “legacy lz4”, are abandoned.
    Before Fixed LZ4 jorgecarleitao/parquet2#95, there is an implementation of Lz4, which is not compatible with the cpp / java implementations. IMO, it is not reasonable(even if feasible) to ask upstream to keep it.
    Unfortunately, at least, some stateful test data are compressed by using the non-standard "legacy" lz4.
    Thus this PR still uses a patched version (in repo datafuse-extras). I suggest abandon this non-standard "legacy" lz4 patch in another PR (after we have made sure that no such legacy data exist)

  • arrow2 upgraded to 0.12

    1.) type ArrayRef = Arc<dyn Array> of former arrow2 has been removed, instead Box<dyn Array> is used.
    so, in databend, type ArrayRef is redefined as Box<dyn Array> (and did the refactors by following rustc).
    Not sure if the name is still suitable, any suggestions are welcome.
    2.) replace deprecated null_count with unset_bits
    3.) serialize_batch now returns Result<_> instead of tuple, related codes are adjusted accordingly.

  • parquet2 upgraded to 0.14

Changelog

  • Improvement
  • Not for changelog (changelog entry is not required)

Related Issues

Fixes #6064

@vercel
Copy link

vercel bot commented Jun 20, 2022

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Ignored Deployment
Name Status Preview Updated
databend ⬜️ Ignored (Inspect) Jul 7, 2022 at 1:35AM (UTC)

@mergify
Copy link
Contributor

mergify bot commented Jun 20, 2022

Thanks for the contribution!
I have applied any labels matching special text in your PR Changelog.

Please review the labels and make any necessary changes.

@dantengsky
Copy link
Member Author

this PR will be postponed for a while, trying to port the supports legacy lz4 patch to upstream

@mergify

This comment was marked as off-topic.

@Xuanwo Xuanwo marked this pull request as ready for review June 28, 2022 05:44
@BohuTANG
Copy link
Member

BohuTANG commented Jul 1, 2022

This PR ready for review? @dantengsky

@dantengsky
Copy link
Member Author

This PR ready for review? @dantengsky

not yet. hopefully, it will be ready next week. but if anything is blocked by this, I'd like to split this PR into two

@mergify
Copy link
Contributor

mergify bot commented Jul 1, 2022

This pull request's title is not fulfill the requirements. @dantengsky please update it 🙏.

The title should contain one if the following tags:

  • feat: this PR introduces a new feature to the codebase
  • fix: this PR patches a bug in codebase
  • refactor: this PR changes the code base without new features or bugfix
  • ci|build: this PR changes build/testing/ci steps
  • docs|website: this PR changes the documents or websites
  • chore: this PR only has small changes that no need to record

@BohuTANG BohuTANG marked this pull request as draft July 1, 2022 12:18
@dantengsky
Copy link
Member Author

This PR ready for review? @dantengsky

Oh, sorry just found I mistakenly marked this PR as ready for review.

@dantengsky dantengsky changed the title Improvement: abandon internal parquet2 patches refactor: abandon internal parquet2 patches Jul 4, 2022
@mergify mergify bot added the pr-refactor this PR changes the code base without new features or bugfix label Jul 4, 2022
Pull `ArrayRef` out of namespace `arrow::array`
@dantengsky dantengsky mentioned this pull request Jul 6, 2022
common/arrow/src/lib.rs Outdated Show resolved Hide resolved
common/arrow/src/parquet_write.rs Show resolved Hide resolved
common/datavalues/src/columns/struct_/mod.rs Outdated Show resolved Hide resolved
@Xuanwo
Copy link
Member

Xuanwo commented Jul 6, 2022

This PR seems ready now.

@dantengsky Are there other issues that need addressing?

@dantengsky
Copy link
Member Author

This PR seems ready now.

@dantengsky Are there other issues that need addressing?

stateful test not passed yet :-( (due to silly conditional compilation flags that I set up in our internal parquet2 repo)

@dantengsky dantengsky force-pushed the feat-abandon-internal-parquet2-patches branch from 66a8775 to 662fa16 Compare July 6, 2022 14:03
@dantengsky dantengsky marked this pull request as ready for review July 7, 2022 01:25
@dantengsky dantengsky marked this pull request as draft July 7, 2022 01:26
@dantengsky
Copy link
Member Author

sorry, revert back to draft (there are typos to be corrected)

@dantengsky dantengsky marked this pull request as ready for review July 7, 2022 03:01
@dantengsky dantengsky requested review from Xuanwo and zhang2014 and removed request for zhang2014 July 7, 2022 03:02
@BohuTANG BohuTANG merged commit e267f7a into databendlabs:main Jul 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
need-review pr-refactor this PR changes the code base without new features or bugfix
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improvement: abandon internal patches of parquet2
4 participants