GH-38860: [C++][Parquet] Using length to optimize bloom filter read #38863

mapleFU · 2023-11-23T12:41:40Z

Rationale for this change

Parquet format 2.10 added a bloom_filter_length field [1]. We'd like to use this length when reading bloom filters.

The current implementation [2] reads a bloom filter in two steps:

1. Read the header using a "guessed" header length. The header is likely to be about 40 bytes, but we read a larger amount in case the header grows in future versions.
2. From the header, get the bloom filter length, and load the bloom filter from the input.

With bloom_filter_length, we can load the whole bloom filter directly, without reading twice. We shouldn't remove the old code path, because we still need to read older files that don't have this field.
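The difference between the two read paths can be sketched as below. This is only an illustration, not the code in this PR: `ReadPlan`, `PlanBloomFilterRead`, `bytes_until_eof`, and the constant `kGuessedHeaderSize` (with its value) are hypothetical names introduced for the example.

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>

// Hypothetical upper bound used when the header size is unknown.
// The value actually used in Arrow may differ.
constexpr int64_t kGuessedHeaderSize = 256;

// Hypothetical description of the first read issued at the bloom filter offset.
struct ReadPlan {
  int64_t first_read_size;    // bytes requested by the first read
  bool may_need_second_read;  // true if the bitset may not be fully covered
};

// Decide how much to read at the bloom filter offset.
// `bloom_filter_length` models the optional field from the column metadata;
// `bytes_until_eof` is how many bytes remain in the file after the offset.
inline ReadPlan PlanBloomFilterRead(std::optional<int64_t> bloom_filter_length,
                                    int64_t bytes_until_eof) {
  if (bloom_filter_length.has_value()) {
    // New path: header and bitset are fetched together in a single read.
    return {*bloom_filter_length, false};
  }
  // Legacy path: read a guessed header first; the bitset may need a second read.
  return {std::min(kGuessedHeaderSize, bytes_until_eof), true};
}
```

With a length of 1064 bytes the sketch plans one 1064-byte read; without a length it falls back to the guessed header size, capped at the end of the file.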

We also need to generate a new parquet-testing file (I can do this ASAP).

[1] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L824
[2] https://github.com/apache/arrow/blob/main/cpp/src/parquet/bloom_filter.cc#L117

What changes are included in this PR?

Support basic read with bloom_filter_length
Enhance the JsonPrinter
Add tests

Are these changes tested?

Tested using parquet-testing files

Are there any user-facing changes?

Closes: [C++][Parquet] Using length to optimize bloom filter read #38860

github-actions · 2023-11-23T12:42:09Z

⚠️ GitHub issue #38860 has been automatically assigned in GitHub to PR creator.

cpp/src/parquet/bloom_filter_reader.cc

cpp/src/parquet/bloom_filter.cc

cpp/src/parquet/bloom_filter.h

cpp/src/parquet/printer.cc

wgtmac

LGTM. +1

mapleFU · 2023-11-27T04:45:15Z

Also cc @emkornfield

This is added in format 2.10

mapleFU · 2023-11-29T05:13:32Z

@pitrou I've resolved the comments, would you mind taking a look?

pitrou · 2023-11-29T10:24:07Z

cpp/src/parquet/bloom_filter.cc

@@ -136,6 +144,15 @@ BlockSplitBloomFilter BlockSplitBloomFilter::Deserialize(
    bloom_filter.Init(header_buf->data() + header_size, bloom_filter_size);
    return bloom_filter;
  }
+  if (bloom_filter_length && *bloom_filter_length < bloom_filter_size + header_size) {


The code is getting a bit confusing with bloom_filter_length vs. bloom_filter_size. Perhaps rename the latter to bloom_filter_data_size?

Also, why the inequality? We should have *bloom_filter_length == bloom_filter_data_size + header_size.

They should be equal. Let me check them.

Checked now
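The equality discussed in this thread can be written as a small check. This is a sketch under assumptions, not the code merged in this PR: `BloomFilterLengthMatches` is a hypothetical helper, and `bloom_filter_data_size` uses the rename suggested above.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical helper: returns true when the length recorded in the metadata
// is consistent with the sizes decoded from the bloom filter header.
// A missing length (older files) is treated as consistent.
inline bool BloomFilterLengthMatches(std::optional<int64_t> bloom_filter_length,
                                     int64_t header_size,
                                     int64_t bloom_filter_data_size) {
  if (!bloom_filter_length.has_value()) {
    return true;  // nothing to validate against
  }
  // The recorded length must cover exactly the header plus the bitset.
  return *bloom_filter_length == header_size + bloom_filter_data_size;
}
```

A reader could reject the file when this returns false, since both a shorter and a longer recorded length indicate inconsistent metadata.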

pitrou · 2023-11-29T10:24:55Z

cpp/src/parquet/bloom_filter_reader_test.cc

+  };
+  std::vector<BloomFilterTestFile> files = {
+      {"data_index_bloom_encoding_stats.parquet", false},
+      {"data_index_bloom_encoding_with_length.parquet", false},


Why false here?

Hmm, this is unused, because it will be handled internally by BloomFilterReader.

cpp/src/parquet/bloom_filter_reader_test.cc

cpp/src/parquet/bloom_filter.cc

Co-authored-by: Antoine Pitrou <[email protected]>

pitrou · 2023-11-29T12:56:43Z

Thank you @mapleFU !

conbench-apache-arrow · 2023-11-30T15:05:21Z

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit be1dcdb.

There were 3 benchmark results indicating a performance regression:

Commit Run on ursa-i9-9960x at 2023-11-30 12:41:02Z
- file-read (R) with compression=lz4, dataset=nyctaxi_2010-01, file_type=feather, language=R, output_type=table
- file-read (R) with compression=lz4, dataset=nyctaxi_2010-01, file_type=feather, language=R, output_type=dataframe
and 1 more (see the report linked below)

The full Conbench report has more details. It also includes information about 4 possible false positives for unstable benchmarks that are known to sometimes produce them.

github-actions bot added the Component: Parquet, Component: C++, and awaiting review labels Nov 23, 2023

mapleFU force-pushed the parquet/support-read-bf-length branch from b08ea45 to ea92c3e Compare November 23, 2023 12:56

Add basic implement for read BloomFilter with length

6a73d82

mapleFU force-pushed the parquet/support-read-bf-length branch from ea92c3e to 6a73d82 Compare November 23, 2023 12:58

pitrou reviewed Nov 23, 2023

cpp/src/parquet/bloom_filter_reader.cc (resolved)

cpp/src/parquet/bloom_filter.cc (resolved)

github-actions bot added the awaiting review and awaiting committer review labels and removed the awaiting review label Nov 23, 2023

update data testing

f7535cc

github-actions bot added the awaiting review label and removed the awaiting review and awaiting committer review labels Nov 24, 2023

add deserialize test

e5875e9

github-actions bot added the awaiting committer review label and removed the awaiting review label Nov 24, 2023

mapleFU marked this pull request as ready for review November 24, 2023 03:54

mapleFU requested a review from wgtmac as a code owner November 24, 2023 03:54

wgtmac requested changes Nov 24, 2023

cpp/src/parquet/bloom_filter.cc (resolved)

cpp/src/parquet/bloom_filter.h (outdated, resolved)

cpp/src/parquet/printer.cc (outdated, resolved)

Resolve comments

3a87383

mapleFU requested review from pitrou and wgtmac November 24, 2023 15:18

wgtmac approved these changes Nov 24, 2023

pitrou requested changes Nov 29, 2023

mapleFU added 2 commits November 29, 2023 18:39

Merge branch 'main' into parquet/support-read-bf-length

bf32a8f

fix comment

6dc0aa5

mapleFU requested a review from pitrou November 29, 2023 11:03

pitrou reviewed Nov 29, 2023

cpp/src/parquet/bloom_filter.cc (outdated, resolved)

Update cpp/src/parquet/bloom_filter.cc

aba5d7c

Co-authored-by: Antoine Pitrou <[email protected]>

pitrou approved these changes Nov 29, 2023

pitrou merged commit be1dcdb into apache:main Nov 29, 2023
29 of 31 checks passed

pitrou removed the awaiting committer review label Nov 29, 2023
