Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-10330: [Rust][DataFusion] Implement NULLIF() SQL function #8688

Closed

Conversation

velvia
Copy link
Contributor

@velvia velvia commented Nov 17, 2020

This PR implements the NULLIF() SQL function in DataFusion. It is implemented as a BuiltInScalarFunction, with a boolean kernel at the core which creates a new array with a modified null bitmap from the original array, based on the result of a boolean expression. When an input data item is equal to the right side in NULLIF(), then the item's nullity becomes set in the output array.

@velvia
Copy link
Contributor Author

velvia commented Nov 17, 2020

@andygrove @nevi-me would love to hear your feedback on this.... this addition has been useful to us internally.

@github-actions
Copy link

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went carefully through this, and it is really good. Thanks a lot @velvia.

I left some comments that IMO make the PR even more generic.

Would it be possible to update the section SQL Support on the README with this?

rust/arrow/src/compute/kernels/boolean.rs Show resolved Hide resolved
rust/arrow/src/compute/kernels/boolean.rs Outdated Show resolved Hide resolved
rust/datafusion/src/physical_plan/expressions.rs Outdated Show resolved Hide resolved
rust/datafusion/src/physical_plan/expressions.rs Outdated Show resolved Hide resolved
rust/datafusion/src/physical_plan/expressions.rs Outdated Show resolved Hide resolved
rust/datafusion/src/physical_plan/expressions.rs Outdated Show resolved Hide resolved
rust/datafusion/tests/sql.rs Outdated Show resolved Hide resolved
Comment on lines +1448 to +1610
pub static SUPPORTED_NULLIF_TYPES: &'static [DataType] = &[
DataType::Boolean,
DataType::UInt8,
DataType::UInt16,
DataType::UInt32,
DataType::UInt64,
DataType::Int8,
DataType::Int16,
DataType::Int32,
DataType::Int64,
DataType::Float32,
DataType::Float64,
];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to make these trait bounds with a good comment about how these are selected. I didn't understand:

/// The order of these types correspond to the order on which coercion applies

Can you explain it?

Copy link
Member

@jorgecarleitao jorgecarleitao Nov 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to make these trait bounds with a good comment about how these are selected

AFAIK these cannot be trait bounds because logical and physical planning is dynamically typed.

In this case, this is enumerating all valid types that can be (dynamically) passed to the function. If someone tries to call this function with e.g. a ListArray, the logical planner will error with a description that this function does not support that type.

The order here matters because when a function is planned to be called with type X that is not supported by the function, the physical planner will try to (lossless) cast that type to a valid type for that functions, and it does so in the order of this array. In general these should be ordered from fastest to slowest (in the eyes of the implementation), so that the cast chooses the type with the fastest implementation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see, in our private project at work, I have used type algebra definitions to not do these. For now, this can go like how it is, but later I can open a type algebra pr to convert all these to castability.

Copy link
Member

@jorgecarleitao jorgecarleitao Nov 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is interesting. I would be interested in knowing what is the issue with the current implementation and why type algebra definitions should be used instead. Could you first introduce a proposal with the design e.g. on a google docs, before the PR? In DataFusion we have been doing that for larger changes to avoid committing to an implementation before some general agreement.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh definitely will do, give me some time to wrap my head up.

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your patience, @velvia.

I went through this more carefully, and I think that we have two issues:

  1. the null count of the returned array may not match the number of bits
  2. there may be some issue with arrays with offset != 0.

I left some comments in the code trying to describe what they are and how I tested them.

rust/arrow/src/compute/kernels/boolean.rs Outdated Show resolved Hide resolved
rust/arrow/src/compute/kernels/boolean.rs Show resolved Hide resolved
rust/arrow/src/compute/kernels/boolean.rs Show resolved Hide resolved
@velvia velvia force-pushed the evan/rust-datafusion-nullif-func branch from 4cf478c to 174db8e Compare November 23, 2020 21:57
@velvia
Copy link
Contributor Author

velvia commented Nov 23, 2020

@jorgecarleitao and others:

  • All outstanding issues should be resolved now, including the nonzero array offsets
  • Rebased to latest master as of today
  • I did not make the nullif kernel more generic, the reason is that I took @vertexclique 's suggestion to slice the left array data buffer, which allows us to work with the shifted/0-based null combined bitmaps. The result of this is that we must assume that the data buffers are primitive, otherwise shifting/slicing the data buffers might not work for non primitive types or be more complex. Let me know what you think.
    The alternative is to re-shift the combined bitmaps back to left.offset(), which might have space implications, but would be more generic (would not require any change of the left array's data buffers), and would have to wait for the other PR as well most likely.... I personally would prefer not to wait :)

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the effort. LGTM

It is fine to remain for primitive types for now.

@velvia
Copy link
Contributor Author

velvia commented Nov 27, 2020

How does it get merged from here? Sorry unclear on the process, thanks!

@jorgecarleitao
Copy link
Member

@velvia , just needs time from a committer: typically over the weekend I run through all PRs in order of appearenace and merge them if ready. Thanks a for the patience.

@nevi-me
Copy link
Contributor

nevi-me commented Nov 27, 2020

@jorgecarleitao I've just gone through this, and don't have anything else to add. Can I merge it?

@jorgecarleitao
Copy link
Member

@nevi-me , I went through it already. Thanks a lot!

@nevi-me nevi-me closed this in 2f77cc2 Nov 27, 2020
@velvia
Copy link
Contributor Author

velvia commented Nov 27, 2020

Thanks everyone!

nevi-me pushed a commit that referenced this pull request Nov 28, 2020
This fixes the CI.
Was introduced in #8688 , I guess because the CI there ran before the merge.

Closes #8791 from Dandandan/fix_clone

Authored-by: Heres, Daniel <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
This PR implements the NULLIF() SQL function in DataFusion.  It is implemented as a BuiltInScalarFunction, with a boolean kernel at the core which creates a new array with a modified null bitmap from the original array, based on the result of a boolean expression.   When an input data item is equal to the right side in NULLIF(), then the item's nullity becomes set in the output array.

Closes apache#8688 from velvia/evan/rust-datafusion-nullif-func

Lead-authored-by: Evan Chan <[email protected]>
Co-authored-by: Evan Chan <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
This fixes the CI.
Was introduced in apache#8688 , I guess because the CI there ran before the merge.

Closes apache#8791 from Dandandan/fix_clone

Authored-by: Heres, Daniel <[email protected]>
Signed-off-by: Neville Dipale <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants