Skip to content
This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Separate flags for parquet2 compression formats? #920

Open
jenanwise opened this issue Mar 19, 2022 · 8 comments
Open

Separate flags for parquet2 compression formats? #920

jenanwise opened this issue Mar 19, 2022 · 8 comments
Labels
enhancement An improvement to an existing feature good first issue Good for newcomers

Comments

@jenanwise
Copy link

Right now, arrow2 exposes the io_parquet_compression feature flag to opt-in to the compression formats for parquet. However, this enables all of lz4, zstd, snappy, gzip, and brotli, each of which can be fairly heavyweight to build. I imagine most folks will only want to use one of the above — e.g., I am only using zstd.

On my laptop in a bare repository with thisCargo.toml:

[dependencies]
arrow2 = { version = "0.10.1", features = ["io_parquet"], default-features = false }

cargo +nightly build --release --timings

Unit Total Codegen
1. arrow2 v0.10.1 15.1s 11.1s (73%)
2. bitpacking v0.8.4 14.9s 7.9s (53%)
3. zstd-sys v1.6.3+zstd.1.5.2 build script (run) 12.5s
4. brotli v3.3.3 10.4s 8.0s (77%)
5. parquet-format-async-temp v0.2.0 8.3s 5.6s (67%)
6. arrow-format v0.4.0 5.8s 4.3s (74%)
7. serde_derive v1.0.136 5.6s
8. parquet2 v0.10.3 4.9s 4.1s (83%)
9. syn v1.0.89 4.1s 1.9s (46%)
10. futures-util v0.3.21 4.0s 0.7s (16%)
11. brotli-decompressor v2.3.2 3.9s 3.3s (85%)
12. serde v1.0.136 3.2s 0.4s (11%)
13. lz4-sys v1.9.3 build script (run) 2.9s  

Switching to this Cargo.toml:

[dependencies]
arrow2 = { version = "0.10.1", features = ["io_parquet"], default-features = false }
parquet2 = { version = "0.10", default_features = false, features = ["zstd"] }
Unit Total Codegen
1. arrow2 v0.10.1 13.5s 11.1s (82%)
2. bitpacking v0.8.4 12.5s 5.0s (40%)
3. zstd-sys v1.6.3+zstd.1.5.2 build script (run) 8.4s
4. parquet-format-async-temp v0.2.0 6.4s 3.9s (61%)
5. serde_derive v1.0.136 5.0s
6. futures-util v0.3.21 4.2s 1.0s (25%)
7. arrow-format v0.4.0 3.7s 2.5s (66%)
8. serde v1.0.136 3.2s 0.5s (16%)
9. syn v1.0.89 2.9s 1.3s (46%)
10. chrono v0.4.19 1.8s 1.1s (62%)
11. futures-macro v0.3.21 1.7s
12. parquet2 v0.10.3 1.6s 1.1s (67%)
13. async-trait v0.1.52 1.6s

It seems like there might be two solutions:

  1. advise folks who want finer-grained control to skip the io_parquet_compression flag, and instead directly put a parquet2 dependency in their Cargo.toml, as I'm doing above. This seems rather fragile.
  2. add individual io_parquet_compression_zstd etc flags.

Thoughts?

@jorgecarleitao
Copy link
Owner

Hi @jenanwise and thank you for your analysis. I think it is a great idea to split the flag in multiple subflags.

We may keep io_parquet_compression = ["io_parquet_compression_zstd", "io_parquet_compression_snappy", ...] since when reading, users usually want to support all compressions.

@jorgecarleitao
Copy link
Owner

@jenanwise , would you like to work on this? It is a good first issue.

@jorgecarleitao jorgecarleitao added the enhancement An improvement to an existing feature label Mar 22, 2022
@jenanwise
Copy link
Author

@jorgecarleitao Happily!

@jorgecarleitao jorgecarleitao added the good first issue Good for newcomers label Apr 22, 2022
@ozgrakkurt
Copy link
Contributor

+1, this is also needed for me to depend on arrow2 and datafusion simultaneously since they both use zstd-sys but they use different versions so it creates a conflict

@kylebarron
Copy link
Contributor

  1. advise folks who want finer-grained control to skip the io_parquet_compression flag, and instead directly put a parquet2 dependency in their Cargo.toml, as I'm doing above. This seems rather fragile

Just a note that this works fine for my needs 🤷‍♂️
https://github.com/kylebarron/parquet-wasm/blob/93c498484f997a85a97c049cf4c1cbacce04fab8/Cargo.toml#L24-L29. I turn on each individual parquet2 dependency as needed.

@ozgrakkurt
Copy link
Contributor

@kylebarron thanks!

@jenanwise
Copy link
Author

Apologies. I took a brief look — the implementation is trivial but I didn't get around to creating a test set. However, I'm no longer using arrow2, so the issue is up for grabs.

@MostlyAmiable
Copy link
Contributor

@jorgecarleitao Btw, it looks like this issue was addressed by #1207

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement An improvement to an existing feature good first issue Good for newcomers
Projects
None yet
Development

No branches or pull requests

5 participants