Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-18420: [C++][Parquet] Introduce ColumnIndex & OffsetIndex #14803

Merged
merged 19 commits into from
Dec 13, 2022

Conversation

wgtmac
Copy link
Member

@wgtmac wgtmac commented Dec 1, 2022

Basically this patch has defined the interface of ColumnIndex and OffsetIndex. Implementation classes are also provided to deserialize byte stream, wrap thrift message and provide access to their attributes.

BTW, the naming style throughout the code base looks confusing to me. I have tried to follow what I have understood from the parquet sub-directory. Please correct me if anything is incorrect.

@github-actions
Copy link

github-actions bot commented Dec 1, 2022

@wgtmac
Copy link
Member Author

wgtmac commented Dec 1, 2022

@pitrou @emkornfield Can you please take a look at the proposed interface and implementation? Once they are finalized, I will add unit tests. Thank you very much!

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments on the API, I didn't look at the implementation.

cpp/src/parquet/page_index.h Outdated Show resolved Hide resolved
cpp/src/parquet/page_index.h Outdated Show resolved Hide resolved
cpp/src/parquet/page_index.h Outdated Show resolved Hide resolved
cpp/src/parquet/page_index.h Outdated Show resolved Hide resolved
cpp/src/parquet/page_index.h Outdated Show resolved Hide resolved
cpp/src/parquet/page_index.h Outdated Show resolved Hide resolved
/// \brief The maximum value of every valid page.
virtual const std::vector<T>& GetMaxValues() const = 0;

/// \brief Returns list of page index of all valid pages.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it useful to have this in addition from null_pages?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes it easy to know which page_id is valid when the return values from line 92 are used.

cpp/src/parquet/page_index.h Outdated Show resolved Hide resolved
cpp/src/parquet/page_index.h Outdated Show resolved Hide resolved
cpp/src/parquet/page_index.h Show resolved Hide resolved
@wgtmac wgtmac requested a review from pitrou December 2, 2022 03:03
@wgtmac
Copy link
Member Author

wgtmac commented Dec 2, 2022

In the last commit, I have removed all the per-page accessors. They are now actually thin wrappers to the corresponding thrift class. How about this? @emkornfield @pitrou

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this @wgtmac . Here are some more comments.

cpp/src/parquet/page_index.h Outdated Show resolved Hide resolved
cpp/src/parquet/page_index.h Outdated Show resolved Hide resolved
cpp/src/parquet/page_index.h Outdated Show resolved Hide resolved
cpp/src/parquet/page_index.h Outdated Show resolved Hide resolved
cpp/src/parquet/page_index.h Outdated Show resolved Hide resolved
cpp/src/parquet/page_index.h Outdated Show resolved Hide resolved
cpp/src/parquet/page_index.cc Outdated Show resolved Hide resolved
cpp/src/parquet/page_index.cc Outdated Show resolved Hide resolved
cpp/src/parquet/page_index.cc Outdated Show resolved Hide resolved
default:
break;
}
DCHECK(false) << "Should not be able to reach this code";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can use arrow::Unreachable() from arrow/util/unreachable.h.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead can we remove the default from the switch? or at least throw an exception, crashing seems like the wrong choice if a new enum is added.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead can we remove the default from the switch? or at least throw an exception, crashing seems like the wrong choice if a new enum is added.

Agreed. In this way, the developer gets noticed there is something to fix at compile time instead of run time.

cpp/src/parquet/page_index.h Outdated Show resolved Hide resolved
cpp/src/parquet/page_index.cc Show resolved Hide resolved
cpp/src/parquet/page_index.cc Outdated Show resolved Hide resolved
cpp/src/parquet/page_index.cc Outdated Show resolved Hide resolved
cpp/src/parquet/page_index.cc Outdated Show resolved Hide resolved
cpp/src/parquet/page_index.cc Outdated Show resolved Hide resolved
@pitrou
Copy link
Member

pitrou commented Dec 6, 2022

I just realized: in this API, encoded_min_values and min_values use different conventions? (one is dense, the other is sparse).

@wgtmac
Copy link
Member Author

wgtmac commented Dec 6, 2022

I just realized: in this API, encoded_min_values and min_values use different conventions? (one is dense, the other is sparse).

Yes, the encoded values are directly populated from the thrift ColumnIndex object, with null pages filled by an empty string. The decoded values have excluded null pages so it is much easier to do binary search if the boundary order is ASC or DESC without considering null pages. This is especially effective to implement predicate push down.

@wgtmac
Copy link
Member Author

wgtmac commented Dec 6, 2022

I just realized: in this API, encoded_min_values and min_values use different conventions? (one is dense, the other is sparse).

After checking the potential use cases of ColumnIndex, I think it makes sense to follow same convention for encoded_min_values() and min_values() to have same number of pages. For predicate push down, the binary search for sorted column index can be simply done by iterating page_ids returned from non_null_page_indices(). The latest commit reflects this change. Please take a look, thanks! @pitrou

@pitrou
Copy link
Member

pitrou commented Dec 6, 2022

Thanks @wgtmac . Is it possible to add some tests for this? One option is to rely on data files from https://github.com/apache/parquet-testing/tree/master/data (you might also upload your own there if desired)

@wgtmac
Copy link
Member Author

wgtmac commented Dec 7, 2022

Thanks @wgtmac . Is it possible to add some tests for this? One option is to rely on data files from https://github.com/apache/parquet-testing/tree/master/data (you might also upload your own there if desired)

I have added page_index_test.cc to test deserialization of page index objects. Please take a look, thanks!

Comment on lines 247 to 253
const std::vector<size_t> page_indices = {0, 1};
const std::vector<bool> null_pages = {true, true};
// It seems that the null_counts are malformed.
const bool has_null_counts = true;
const std::vector<int64_t> null_counts = {-1, -1};
const std::vector<int32_t> min_values = {};
const std::vector<int32_t> max_values = {};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a bad example to be honest. Can you find or generate a test case which has both null and non-null pages in a single column?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There isn't. Most of the files are even created without page index. Sigh :(

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But why not generate a test file yourself? There's probably an implementation that would allow to do it?

const std::vector<int32_t> max_values = {};

TestReadTypedColumnIndex<Int32Type>(
"datapage_v1-uncompressed-checksum.parquet", column_id, num_pages, boundary_order,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you read this file, you'll see that all data is actually non-null. So the column index in this file may be bogus?
(or you're reading it wrong?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the page index is corrupted. The parquet-cli has provided same reading:

➜ parquet-cli column-index datapage_v1-uncompressed-checksum.parquet
row-group 0:
column index for column a:
Boundary order: ASCENDING
                      null count  min                                       max
page-0                        -1  <none>                                    <none>
page-1                        -1  <none>                                    <none>

offset index for column a:
                          offset   compressed size       first row index
page-0                         4             10268                     0
page-1                     10272             10268                  2560

column index for column b:
Boundary order: ASCENDING
                      null count  min                                       max
page-0                        -1  <none>                                    <none>
page-1                        -1  <none>                                    <none>

offset index for column b:
                          offset   compressed size       first row index
page-0                     20540             10268                     0
page-1                     30808             10268                  2560

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed a couple of minor changes, but it seems the tests are still lacking a bit. See comments.

cpp/src/parquet/page_index.cc Outdated Show resolved Hide resolved
@wgtmac wgtmac requested a review from pitrou December 13, 2022 14:41
@pitrou
Copy link
Member

pitrou commented Dec 13, 2022

Thanks a lot! I think this is good to go now, will just wait for CI.

@pitrou pitrou merged commit 56cf063 into apache:master Dec 13, 2022
@emkornfield
Copy link
Contributor

CC @fatemehp

@ursabot
Copy link

ursabot commented Dec 14, 2022

Benchmark runs are scheduled for baseline = fe5c9fe and contender = 56cf063. 56cf063 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.03% ⬆️0.0%] test-mac-arm
[Finished ⬇️2.99% ⬆️5.71%] ursa-i9-9960x
[Finished ⬇️1.33% ⬆️1.57%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 56cf063b ec2-t3-xlarge-us-east-2
[Finished] 56cf063b test-mac-arm
[Finished] 56cf063b ursa-i9-9960x
[Finished] 56cf063b ursa-thinkcentre-m75q
[Finished] fe5c9fe3 ec2-t3-xlarge-us-east-2
[Finished] fe5c9fe3 test-mac-arm
[Finished] fe5c9fe3 ursa-i9-9960x
[Finished] fe5c9fe3 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot
Copy link

ursabot commented Dec 14, 2022

['Python', 'R'] benchmarks have high level of regressions.
ursa-i9-9960x

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants