-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-18420: [C++][Parquet] Introduce ColumnIndex & OffsetIndex #14803
Conversation
@pitrou @emkornfield Can you please take a look at the proposed interface and implementation? Once they are finalized, I will add unit tests. Thank you very much! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments on the API, I didn't look at the implementation.
cpp/src/parquet/page_index.h
Outdated
/// \brief The maximum value of every valid page. | ||
virtual const std::vector<T>& GetMaxValues() const = 0; | ||
|
||
/// \brief Returns list of page index of all valid pages. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it useful to have this in addition from null_pages
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes it easy to know which page_id is valid when the return values from line 92 are used.
In the last commit, I have removed all the per-page accessors. They are now actually thin wrappers to the corresponding thrift class. How about this? @emkornfield @pitrou |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this @wgtmac . Here are some more comments.
cpp/src/parquet/page_index.cc
Outdated
default: | ||
break; | ||
} | ||
DCHECK(false) << "Should not be able to reach this code"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can use arrow::Unreachable()
from arrow/util/unreachable.h
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead can we remove the default from the switch? or at least throw an exception, crashing seems like the wrong choice if a new enum is added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead can we remove the default from the switch? or at least throw an exception, crashing seems like the wrong choice if a new enum is added.
Agreed. In this way, the developer gets noticed there is something to fix at compile time instead of run time.
I just realized: in this API, |
Yes, the encoded values are directly populated from the thrift ColumnIndex object, with null pages filled by an empty string. The decoded values have excluded null pages so it is much easier to do binary search if the boundary order is ASC or DESC without considering null pages. This is especially effective to implement predicate push down. |
After checking the potential use cases of |
Thanks @wgtmac . Is it possible to add some tests for this? One option is to rely on data files from https://github.com/apache/parquet-testing/tree/master/data (you might also upload your own there if desired) |
I have added |
cpp/src/parquet/page_index_test.cc
Outdated
const std::vector<size_t> page_indices = {0, 1}; | ||
const std::vector<bool> null_pages = {true, true}; | ||
// It seems that the null_counts are malformed. | ||
const bool has_null_counts = true; | ||
const std::vector<int64_t> null_counts = {-1, -1}; | ||
const std::vector<int32_t> min_values = {}; | ||
const std::vector<int32_t> max_values = {}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like a bad example to be honest. Can you find or generate a test case which has both null and non-null pages in a single column?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There isn't. Most of the files are even created without page index. Sigh :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But why not generate a test file yourself? There's probably an implementation that would allow to do it?
cpp/src/parquet/page_index_test.cc
Outdated
const std::vector<int32_t> max_values = {}; | ||
|
||
TestReadTypedColumnIndex<Int32Type>( | ||
"datapage_v1-uncompressed-checksum.parquet", column_id, num_pages, boundary_order, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you read this file, you'll see that all data is actually non-null. So the column index in this file may be bogus?
(or you're reading it wrong?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the page index is corrupted. The parquet-cli has provided same reading:
➜ parquet-cli column-index datapage_v1-uncompressed-checksum.parquet
row-group 0:
column index for column a:
Boundary order: ASCENDING
null count min max
page-0 -1 <none> <none>
page-1 -1 <none> <none>
offset index for column a:
offset compressed size first row index
page-0 4 10268 0
page-1 10272 10268 2560
column index for column b:
Boundary order: ASCENDING
null count min max
page-0 -1 <none> <none>
page-1 -1 <none> <none>
offset index for column b:
offset compressed size first row index
page-0 20540 10268 0
page-1 30808 10268 2560
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed a couple of minor changes, but it seems the tests are still lacking a bit. See comments.
Thanks a lot! I think this is good to go now, will just wait for CI. |
CC @fatemehp |
Benchmark runs are scheduled for baseline = fe5c9fe and contender = 56cf063. 56cf063 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
['Python', 'R'] benchmarks have high level of regressions. |
Basically this patch has defined the interface of
ColumnIndex
andOffsetIndex
. Implementation classes are also provided to deserialize byte stream, wrap thrift message and provide access to their attributes.BTW, the naming style throughout the code base looks confusing to me. I have tried to follow what I have understood from the parquet sub-directory. Please correct me if anything is incorrect.