Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incompatibility with ORC files written in version 0.12 due to missing hasNull field in C++ Reader #2079

Open
suxiaogang223 opened this issue Dec 9, 2024 · 3 comments · May be fixed by #2082

Comments

@suxiaogang223
Copy link

Issue Description:

From Hive 1.1.0 onwards, the column statistics will also record if there are any null values within the row group by setting the hasNull flag. The hasNull flag is used by ORC’s predicate pushdown to better answer ‘IS NULL’ queries.

We encountered an issue with the C++ implementation of the ORC reader when handling ORC files written with version 0.12. Specifically, files written in this version do not include the hasNull field in the column statistics metadata. While the Java implementation of the ORC reader handles this gracefully by defaulting hasNull to true when the field is absent, the C++ implementation does not handle this scenario correctly.
This issue prevents predicates like IS NULL from being pushed down to the ORC reader!!! As a result, all rows in the file are filtered out, leading to incorrect query results :(

Steps to Reproduce:

  1. Use an ORC file written with ORC version 0.12 (without hasNull in its column statistics).
  2. Attempt to read the file using the C++ ORC reader.

Expected Behavior:

The C++ ORC reader should default the hasNull field to true when it is absent, ensuring compatibility with older file versions.

Observed Behavior:

The C++ ORC reader default the hasNull field to false, resulting in incorrect metadata interpretation.

Comparison with Java Implementation:

The Java implementation includes the following logic:

if (stats.hasHasNull()) {
    hasNull = stats.getHasNull();
} else {
    hasNull = true;
}

In contrast, the C++ implementation directly uses the has_null value without any fallback logic:

ColumnStatisticsImpl::ColumnStatisticsImpl(const proto::ColumnStatistics& pb) {
    stats_.setNumberOfValues(pb.number_of_values());
    stats_.setHasNull(pb.has_null());
}

Suggested Fix:

Introduce fallback logic in the C++ reader to set hasNull to true when the field is missing, similar to the Java implementation.

@suxiaogang223
Copy link
Author

We are temporarily fixing this problem by applying a patch.
apache/doris-thirdparty#259

@ffacs
Copy link
Contributor

ffacs commented Dec 10, 2024

HI @suxiaogang223 , there is a patch #2055 to disable PPD when pb.has_hasnull() == false, but ColumnStatistics::hasNull still return false when the field in pb is missing. It Seems your patch would make it return a right value, would you like to make a PR? CC @wgtmac

@suxiaogang223
Copy link
Author

HI @suxiaogang223 , there is a patch #2055 to disable PPD when pb.has_hasnull() == false, but ColumnStatistics::hasNull still return false when the field in pb is missing. It Seems your patch would make it return a right value, would you like to make a PR? CC @wgtmac

Done, please review this pr

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants