GH-38860: [C++][Parquet] Using length to optimize bloom filter read #38863

Merged
merged 7 commits into from
Nov 29, 2023

@mapleFU mapleFU commented Nov 23, 2023

Rationale for this change

Parquet supports a bloom_filter_length field since format 2.10 [1]. We'd like to use this length when reading.

The current implementation [2] reads a bloom filter in two steps:

  1. Read the header using a "guessed" header length. The header is likely to be 40 B, but we use a larger value in case the header grows in the future.
  2. Take the bloom filter size from the header and load the filter data from the input.

Now we can load the whole bloom filter directly, without reading twice. We shouldn't remove the old code path, because we still need to read older files that don't carry the length.
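
The two read paths described above can be sketched roughly as follows. This is a toy model, not the Arrow implementation: `FakeInput`, `MakeToyFile`, `ParseToyHeader`, `ReadBloomFilter`, the fixed 4-byte little-endian header, and the 256-byte guess are all assumptions made for illustration (the real header is Thrift-encoded and variable-length).

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a random-access input; counts read calls.
struct FakeInput {
  std::vector<uint8_t> bytes;
  int read_calls = 0;

  // Returns up to nbytes starting at offset (fewer if the input ends early).
  std::vector<uint8_t> ReadAt(size_t offset, size_t nbytes) {
    ++read_calls;
    if (offset > bytes.size()) throw std::out_of_range("offset past end of input");
    const size_t n = std::min(nbytes, bytes.size() - offset);
    return std::vector<uint8_t>(bytes.data() + offset, bytes.data() + offset + n);
  }
};

constexpr size_t kToyHeaderSize = 4;      // assumption: toy fixed-size header
constexpr size_t kHeaderSizeGuess = 256;  // assumption: "larger than the header" guess

// Toy header: the filter data size as a 4-byte little-endian integer.
uint32_t ParseToyHeader(const std::vector<uint8_t>& buf) {
  if (buf.size() < kToyHeaderSize) throw std::runtime_error("truncated header");
  return static_cast<uint32_t>(buf[0]) | (static_cast<uint32_t>(buf[1]) << 8) |
         (static_cast<uint32_t>(buf[2]) << 16) | (static_cast<uint32_t>(buf[3]) << 24);
}

// Builds a toy "file": header followed by data_size bytes of filter data.
FakeInput MakeToyFile(uint32_t data_size) {
  FakeInput in;
  for (int shift = 0; shift < 32; shift += 8) {
    in.bytes.push_back(static_cast<uint8_t>((data_size >> shift) & 0xFF));
  }
  for (uint32_t i = 0; i < data_size; ++i) {
    in.bytes.push_back(static_cast<uint8_t>(i & 0xFF));
  }
  return in;
}

// Reads the filter data. With a known length the whole filter comes back in
// one read; without it, the guessed header read may need a follow-up read.
std::vector<uint8_t> ReadBloomFilter(FakeInput& in, size_t offset,
                                     std::optional<size_t> bloom_filter_length) {
  const size_t first_read = bloom_filter_length.value_or(kHeaderSizeGuess);
  std::vector<uint8_t> buf = in.ReadAt(offset, first_read);
  const size_t data_size = ParseToyHeader(buf);
  const size_t total = kToyHeaderSize + data_size;
  if (bloom_filter_length && *bloom_filter_length != total) {
    throw std::runtime_error("bloom_filter_length does not match header + data size");
  }
  if (buf.size() >= total) {
    // Everything is already in the first buffer.
    return std::vector<uint8_t>(buf.data() + kToyHeaderSize, buf.data() + total);
  }
  // Old path: the first read was too short, so fetch the remainder.
  std::vector<uint8_t> data(buf.data() + kToyHeaderSize, buf.data() + buf.size());
  const std::vector<uint8_t> rest = in.ReadAt(offset + buf.size(), total - buf.size());
  data.insert(data.end(), rest.begin(), rest.end());
  if (data.size() != data_size) throw std::runtime_error("truncated bloom filter");
  return data;
}
```

In this model, a 1024-byte filter read without a length takes two `ReadAt` calls, while passing the total length (header plus data) takes one. A filter small enough to fit inside the guessed read needs only one call either way.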

We also need to generate a new parquet-testing file (I can do this ASAP).

[1] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L824
[2] https://github.com/apache/arrow/blob/main/cpp/src/parquet/bloom_filter.cc#L117

What changes are included in this PR?

  • Support Basic read with bloom_filter_length
  • Enhance the JsonPrinter
  • testing

Are these changes tested?

  • testing using parquet-testing

Are there any user-facing changes?

⚠️ GitHub issue #38860 has been automatically assigned in GitHub to PR creator.

@mapleFU mapleFU force-pushed the parquet/support-read-bf-length branch from ea92c3e to 6a73d82 Compare November 23, 2023 12:58
@mapleFU mapleFU marked this pull request as ready for review November 24, 2023 03:54
@mapleFU mapleFU requested a review from wgtmac as a code owner November 24, 2023 03:54
@mapleFU mapleFU requested review from pitrou and wgtmac November 24, 2023 15:18
@wgtmac wgtmac left a comment

LGTM. +1

mapleFU commented Nov 27, 2023

Also cc @emkornfield

This was added in format 2.10.

mapleFU commented Nov 29, 2023

@pitrou I've resolved the comments. Would you mind taking a look?

@@ -136,6 +144,15 @@ BlockSplitBloomFilter BlockSplitBloomFilter::Deserialize(
bloom_filter.Init(header_buf->data() + header_size, bloom_filter_size);
return bloom_filter;
}
if (bloom_filter_length && *bloom_filter_length < bloom_filter_size + header_size) {

Member

The code is getting a bit confusing with bloom_filter_length vs. bloom_filter_size. Perhaps rename the latter to bloom_filter_data_size?

Member

Also, why the inequality? We should have *bloom_filter_length == bloom_filter_data_size + header_size.

Member Author

They should be equal. Let me check them.

Member Author

Checked now
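
To illustrate the point raised in this thread, here is a minimal sketch of the two checks side by side. The function names `LengthOkLoose` and `LengthOkStrict` are hypothetical and not part of the Arrow API. The loose form mirrors the `<` comparison in the hunk above, the strict form mirrors the equality suggested in review. It is not meant to reproduce the code that was finally merged.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical helper: accepts any length that is not too short,
// mirroring the `<` comparison shown in the reviewed hunk.
bool LengthOkLoose(std::optional<int64_t> bloom_filter_length, int64_t header_size,
                   int64_t bloom_filter_data_size) {
  return !(bloom_filter_length &&
           *bloom_filter_length < bloom_filter_data_size + header_size);
}

// Hypothetical helper: accepts only an exact match, as suggested in review.
// A missing length (older files) is always accepted.
bool LengthOkStrict(std::optional<int64_t> bloom_filter_length, int64_t header_size,
                    int64_t bloom_filter_data_size) {
  return !bloom_filter_length ||
         *bloom_filter_length == bloom_filter_data_size + header_size;
}
```

With a 16-byte header and 1024 bytes of data, both checks accept a length of 1040 and reject 100, but only the strict check rejects an oversized length such as 2000.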

};
std::vector<BloomFilterTestFile> files = {
{"data_index_bloom_encoding_stats.parquet", false},
{"data_index_bloom_encoding_with_length.parquet", false},

Member

Why false here?

Member Author

Hmm, this is unused, because it is handled internally by BloomFilterReader.

@mapleFU mapleFU requested a review from pitrou November 29, 2023 11:03
pitrou commented Nov 29, 2023

Thank you @mapleFU !

@pitrou pitrou merged commit be1dcdb into apache:main Nov 29, 2023
29 of 31 checks passed
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit be1dcdb.

There were 3 benchmark results indicating a performance regression:

The full Conbench report has more details. It also includes information about 4 possible false positives for unstable benchmarks that are known to sometimes produce them.

dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…read (apache#38863)

* Closes: apache#38860

Lead-authored-by: mwish <[email protected]>
Co-authored-by: mwish <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>