[C++][Parquet] Using length to optimize bloom filter read #38860

Closed
mapleFU opened this issue Nov 23, 2023 · 1 comment · Fixed by #38863

Comments

@mapleFU
Member

mapleFU commented Nov 23, 2023

Describe the enhancement requested

Parquet supports a bloom_filter_length field as of format 2.10 [1]. We'd like to use this length when reading bloom filters.

The current implementation [2] reads a bloom filter in two steps:

  1. Read the header using a "guessed" header length. The header is likely to be about 40 bytes, but we read a larger amount in case the header grows in future versions.
  2. Get the bloom filter length from the header, then load the bloom filter from the input.
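The two steps above can be sketched as follows. This is only an illustration, not the Arrow code: the real header is Thrift-encoded, while the sketch uses a made-up toy header layout, and `CountingSource`, `ToyHeader`, `ParseToyHeader`, `ReadBitsetTwoStep`, `MakeToyFilter` and the value of `kHeaderSizeGuess` are all hypothetical names chosen for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical in-memory "file" that counts how many reads are issued.
struct CountingSource {
  std::vector<uint8_t> data;
  int reads = 0;

  // Returns up to `n` bytes starting at `offset`, clamped to the end of data.
  std::vector<uint8_t> ReadAt(std::size_t offset, std::size_t n) {
    ++reads;
    offset = std::min(offset, data.size());
    std::size_t end = std::min(data.size(), offset + n);
    return std::vector<uint8_t>(data.begin() + offset, data.begin() + end);
  }
};

// Toy header layout (an assumption for this sketch, NOT the Thrift encoding):
//   byte 0     : header length in bytes (at least 5)
//   bytes 1..4 : bitset length, little-endian uint32
//   remaining  : padding
struct ToyHeader {
  std::size_t header_len;
  std::size_t bitset_len;
};

ToyHeader ParseToyHeader(const std::vector<uint8_t>& buf) {
  assert(buf.size() >= 5);
  std::size_t bitset_len = 0;
  for (int i = 0; i < 4; ++i) {
    bitset_len |= static_cast<std::size_t>(buf[1 + i]) << (8 * i);
  }
  return ToyHeader{buf[0], bitset_len};
}

// Illustrative guess, deliberately larger than the expected ~40-byte header.
constexpr std::size_t kHeaderSizeGuess = 256;

std::vector<uint8_t> ReadBitsetTwoStep(CountingSource& src, std::size_t offset) {
  // Step 1: read a guessed number of bytes and parse the header from them.
  std::vector<uint8_t> head = src.ReadAt(offset, kHeaderSizeGuess);
  ToyHeader h = ParseToyHeader(head);
  // Step 2: the header gives the bitset length, so issue a second read for it.
  return src.ReadAt(offset + h.header_len, h.bitset_len);
}

// Builds a toy serialized filter: header followed by `bitset_len` bytes of 0xAB.
std::vector<uint8_t> MakeToyFilter(uint8_t header_len, uint32_t bitset_len) {
  std::vector<uint8_t> out(header_len, 0);
  out[0] = header_len;
  for (int i = 0; i < 4; ++i) {
    out[1 + i] = static_cast<uint8_t>(bitset_len >> (8 * i));
  }
  out.insert(out.end(), bitset_len, 0xAB);
  return out;
}
```

The sketch always issues the second read to keep the control flow obvious; a real reader could skip it when the guessed read already covers the whole filter.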

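With bloom_filter_length available, we can load the whole bloom filter directly in a single read instead of reading twice. The existing two-step code should not be removed, because files written without bloom_filter_length still need to be readable.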
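A minimal sketch of how a reader might choose between the two paths, assuming C++17 and assuming the length in the metadata covers the serialized header plus the bitset. `ReadPlan`, `PlanBloomFilterRead` and the value of `kHeaderSizeGuess` are hypothetical names for illustration, not the Arrow API.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Illustrative guess used when the file does not record a length.
constexpr int64_t kHeaderSizeGuess = 256;

// Describes the first read to issue for one bloom filter.
struct ReadPlan {
  int64_t first_read_size;    // number of bytes to request up front
  bool may_need_second_read;  // true only on the legacy two-step path
};

// `bloom_filter_length` is empty for files written before the field existed.
ReadPlan PlanBloomFilterRead(std::optional<int64_t> bloom_filter_length) {
  if (bloom_filter_length.has_value() && *bloom_filter_length > 0) {
    // New path: the exact size is known, so one read fetches everything.
    return ReadPlan{*bloom_filter_length, false};
  }
  // Legacy path: read a guessed header size, then possibly read again.
  return ReadPlan{kHeaderSizeGuess, true};
}
```

Treating a missing or non-positive length as "unknown" keeps the old path reachable, which matches the requirement that older files stay readable.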

We also need to generate a new parquet-testing file (I can do this ASAP).

[1] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L824
[2] https://github.com/apache/arrow/blob/main/cpp/src/parquet/bloom_filter.cc#L117

Component(s)

C++, Parquet

@mapleFU
Member Author

mapleFU commented Nov 23, 2023

I've generated test data in apache/parquet-testing#43

pitrou added a commit that referenced this issue Nov 29, 2023
…38863)

### Rationale for this change

Parquet supports a `bloom_filter_length` field as of format 2.10 [1]. We'd like to use this length when reading bloom filters.

The current implementation [2] reads a bloom filter in two steps:

1. Read the header using a "guessed" header length. The header is likely to be about 40 bytes, but we read a larger amount in case the header grows in future versions.
2. Get the bloom filter length from the header, then load the bloom filter from the input.

With `bloom_filter_length` available, we can load the whole bloom filter directly in a single read instead of reading twice. The existing two-step code should not be removed, because files written without `bloom_filter_length` still need to be readable.

We also need to generate a new parquet-testing file (I can do this ASAP).

[1] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L824
[2] https://github.com/apache/arrow/blob/main/cpp/src/parquet/bloom_filter.cc#L117

### What changes are included in this PR?

* [x] Support Basic read with `bloom_filter_length`
* [x] Enhance the JsonPrinter
* [x] testing

### Are these changes tested?

* [x] testing using parquet-testing

### Are there any user-facing changes?

* Closes: #38860

Lead-authored-by: mwish <[email protected]>
Co-authored-by: mwish <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
@pitrou pitrou added this to the 15.0.0 milestone Nov 29, 2023
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…read (apache#38863)