-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-38860: [C++][Parquet] Using length to optimize bloom filter read #38863
Conversation
|
b08ea45
to
ea92c3e
Compare
ea92c3e
to
6a73d82
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. +1
Also cc @emkornfield This is added in format 2.10 |
@pitrou I've resolved the comments, would you mind take a look? |
cpp/src/parquet/bloom_filter.cc
Outdated
@@ -136,6 +144,15 @@ BlockSplitBloomFilter BlockSplitBloomFilter::Deserialize( | |||
bloom_filter.Init(header_buf->data() + header_size, bloom_filter_size); | |||
return bloom_filter; | |||
} | |||
if (bloom_filter_length && *bloom_filter_length < bloom_filter_size + header_size) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code is getting a bit confusing with bloom_filter_length
vs. bloom_filter_size
. Perhaps rename the latter to bloom_filter_data_size
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, why the inequality? We should have *bloom_filter_length == bloom_filter_data_size + header_size
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They should be equal. let me check them
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checked now
}; | ||
std::vector<BloomFilterTestFile> files = { | ||
{"data_index_bloom_encoding_stats.parquet", false}, | ||
{"data_index_bloom_encoding_with_length.parquet", false}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why false
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm this is unused, because it will handled by BloomFilterReader
internal
Co-authored-by: Antoine Pitrou <[email protected]>
Thank you @mapleFU ! |
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit be1dcdb. There were 3 benchmark results indicating a performance regression:
The full Conbench report has more details. It also includes information about 4 possible false positives for unstable benchmarks that are known to sometimes produce them. |
…read (apache#38863) ### Rationale for this change Parquet supports a bloom_filter_length in 2.10[1]. We'd like to using this length for read. The current implemention [2] using the code below: 1. Using a "guessed" header length to read the header. The header is likely to be 40B, but we use a larger value to avoid it evolves 2. From the header, we get the bloom filter length, and load it from input. Now, we can directly load the whole bloom-filter, without reading twice. We shouldn't remove the stale code because we need to read the stale file. We also need to generate a new parquet-testing file ( I can do this ASAP ) [1] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L824 [2] https://github.com/apache/arrow/blob/main/cpp/src/parquet/bloom_filter.cc#L117 ### What changes are included in this PR? * [x] Support Basic read with `bloom_filter_length` * [x] Enhance the JsonPrinter * [x] testing ### Are these changes tested? * [x] testing using parquet-testing ### Are there any user-facing changes? * Closes: apache#38860 Lead-authored-by: mwish <[email protected]> Co-authored-by: mwish <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
Rationale for this change
Parquet supports a bloom_filter_length in 2.10[1]. We'd like to using this length for read.
The current implemention [2] using the code below:
Now, we can directly load the whole bloom-filter, without reading twice. We shouldn't remove the stale code because we need to read the stale file.
We also need to generate a new parquet-testing file ( I can do this ASAP )
[1] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L824
[2] https://github.com/apache/arrow/blob/main/cpp/src/parquet/bloom_filter.cc#L117
What changes are included in this PR?
bloom_filter_length
Are these changes tested?
Are there any user-facing changes?