Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-14844: [Java] Short circuit null checks when comparing non null field types #15106

Merged
merged 4 commits into from
Dec 30, 2022

Conversation

markjschreiber
Copy link
Contributor

@markjschreiber markjschreiber commented Dec 28, 2022

Relates to #14844

To avoid expensive checks for null values when the vectors cannot contain nulls, a flag is set when a vector is attached to indicate if a null check is needed when compare(idx1, idx2) is called. If it isn't then the call will immediately redirect to compareNotNull(index1, index2).

@github-actions
Copy link

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of old issues on JIRA the title also supports:

ARROW-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@markjschreiber markjschreiber changed the title Short circuit null checks when comparing non null field types GH-14844: [Java] Short circuit null checks when comparing non null field types Dec 28, 2022
@github-actions
Copy link

⚠️ GitHub issue #14844 has been automatically assigned in GitHub to PR creator.

Comment on lines 92 to 93
final boolean v1MayHaveNulls = vector1.getField().getFieldType().isNullable();
final boolean v2MayHaveNulls = vector2.getField().getFieldType().isNullable();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be preferable to check for the presence/absence of a validity buffer (I think Java does the same optimization as C++ here) - that'll cover cases where a nullable vector has no nulls, too.

It won't cover cases where a non-nullable vector has a validity buffer of all 1s, but as-is I'd say we may not want to enable this by default since nothing is really verifying that the data matches that part of the schema.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the cases I have tested the java vectors seem to have a validity buffer even when there can be no nulls. Depending on the AllocationManager this is some form of buffer called EMPTY which is not null but cannot allocate memory for items. It's not possible the test if the buffer is equal to one of these because they are private

However, given that someone might change the implementation one day I will include a check for a null validity buffer to catch this case as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that's what I wasn't remembering. In that case, the right check might be to see if the buffer length is 0 (in which case there can't be any nulls)?

Copy link
Contributor Author

@markjschreiber markjschreiber Dec 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Weirdly it seems that the validity buffer does have a capacity set when any values are added to the value buffer. There is a method to find the number of nulls in the vector which is relatively fast, only needs to be checked when attaching the vector and can be skipped for FieldTypes that are nonNullable

Turns out I also need to add a safety check that uses the "null safe" comparison when the vector contains no values (valueCount = 0) as the TreeBasedDictionaryBuilder seems to need this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, thanks for working through it. I suppose Java doesn't optimize the validity buffer in the same way that C++ does.

@cyb70289
Copy link
Contributor

@markjschreiber ping me if you need CI approval. Most maintainers are in vacation.

Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@lidavidm lidavidm merged commit d00f016 into apache:master Dec 30, 2022
@ursabot
Copy link

ursabot commented Dec 30, 2022

Benchmark runs are scheduled for baseline = 47c1ff1 and contender = d00f016. d00f016 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.0% ⬆️0.0%] test-mac-arm
[Finished ⬇️1.79% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.52% ⬆️0.07%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] d00f0163 ec2-t3-xlarge-us-east-2
[Finished] d00f0163 test-mac-arm
[Finished] d00f0163 ursa-i9-9960x
[Finished] d00f0163 ursa-thinkcentre-m75q
[Finished] 47c1ff18 ec2-t3-xlarge-us-east-2
[Finished] 47c1ff18 test-mac-arm
[Finished] 47c1ff18 ursa-i9-9960x
[Finished] 47c1ff18 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot
Copy link

ursabot commented Dec 30, 2022

['Python', 'R'] benchmarks have high level of regressions.
ursa-i9-9960x

@markjschreiber markjschreiber deleted the no-null-check-comparator branch December 31, 2022 15:50
EpsilonPrime pushed a commit to EpsilonPrime/arrow that referenced this pull request Jan 5, 2023
…ull field types (apache#15106)

Relates to apache#14844 

To avoid expensive checks for null values when the vectors cannot contain nulls, a flag is set when a vector is attached to indicate if a null check is needed when `compare(idx1, idx2)` is called. If it isn't then the call will immediately redirect to `compareNotNull(index1, index2)`.
* Closes: apache#14844

Authored-by: Mark Schreiber <[email protected]>
Signed-off-by: David Li <[email protected]>
EpsilonPrime pushed a commit to EpsilonPrime/arrow that referenced this pull request Jan 5, 2023
…ull field types (apache#15106)

Relates to apache#14844 

To avoid expensive checks for null values when the vectors cannot contain nulls, a flag is set when a vector is attached to indicate if a null check is needed when `compare(idx1, idx2)` is called. If it isn't then the call will immediately redirect to `compareNotNull(index1, index2)`.
* Closes: apache#14844

Authored-by: Mark Schreiber <[email protected]>
Signed-off-by: David Li <[email protected]>
Yicong-Huang added a commit to Texera/texera that referenced this pull request May 5, 2023
This PR bumps Apache Arrow version from 10.0.0 to 11.0.0.

Main changes related to PyAmber:

## Java/Scala side:

- Distribute Apple M1 compatible JNI libraries via mavencentral
([#14472](apache/arrow#14472)).
- Improve performance by short-circuiting null checks when comparing non
null field types ([#15106](apache/arrow#15106)).
- Extend Table copy functionality, and support returning copies of
individual vectors
([#14389](apache/arrow#14389)).
- Several enhancements to dictionary encoding
([#14891](apache/arrow#14891),
([#14902](apache/arrow#14902),
([#14874](apache/arrow#14874)).
- Extend Table to support additional vector types
([#14573](apache/arrow#14573)).
- Enhance and simplify handling of allocation management by integrating
C Data into allocator hierarchy
([#14506](apache/arrow#14506)).

## Python side:
- PyArrow now requires pandas >= 1.0
([ARROW-18173](https://issues.apache.org/jira/browse/ARROW-18173)).
- Added support for the [DataFrame Interchange
Protocol](https://data-apis.org/dataframe-protocol/latest/purpose_and_scope.html)
for pyarrow.Table
([GH-33346](apache/arrow#33346)).
- Support for custom metadata of record batches in the IPC read and
write APIs
([ARROW-16430](https://issues.apache.org/jira/browse/ARROW-16430)).
- The Time32Scalar, Time64Scalar, Date32Scalar and Date64Scalar classes
got a .value attribute to access the underlying integer value, similar
to the other date-time related scalars
([ARROW-18264](https://issues.apache.org/jira/browse/ARROW-18264)).
- Casting to string is now supported for duration
([ARROW-15822](https://issues.apache.org/jira/browse/ARROW-15822)) and
decimal
([ARROW-17458](https://issues.apache.org/jira/browse/ARROW-17458))
types, which also means those can now be written to CSV.

## Issues fixed:
- Now Do_action (from Python server back to Java Client) is returning a
stream of results properly, and it alerts when the results are not fully
consumed by the client. Such results will be used to send the flow
control credits back from the Python side. We limit the results to be
exact 1 for now, although it can be a stream.
- Fix a bug in the Python proxy server, when unregistered action is
invoked, it should not parse and return the results.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Java] VectorValueComparator should skip the null test for NonNullable FieldTypes types
4 participants