
bug: Fix NULL handling in array_slice, introduce NullHandling enum to Signature #14289

Merged
merged 13 commits into apache:main on Feb 3, 2025

Conversation

jkosh44
Contributor

@jkosh44 jkosh44 commented Jan 25, 2025

This commit fixes the array_slice function so that if any argument is NULL, the result is NULL. Previously, array_slice returned an internal error if any argument was NULL. The new behavior matches DuckDB's array_slice.

Fixes #10548

Which issue does this PR close?

Closes #10548.

Are these changes tested?

Yes

Are there any user-facing changes?

Yes. Previously, users received an error; now array_slice returns a NULL value for NULL input.

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jan 25, 2025
@jkosh44
Contributor Author

jkosh44 commented Jan 25, 2025

It looks like a lot of the array functions don't properly handle NULLs. For example (I did not exhaustively test them all),

> SELECT array_sort([1,2,3], NULL);
Internal error: could not cast value to arrow_array::array::byte_array::GenericByteArray<arrow_array::types::GenericStringType<i32>>.
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker

> SELECT array_resize([1,2,3], NULL, 0);
Internal error: could not cast value to arrow_array::array::primitive_array::PrimitiveArray<arrow_array::types::Int64Type>.
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker

> SELECT array_replace([1,2,3], NULL, NULL);
thread 'main' panicked at /home/joe/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-data-54.0.0/src/transform/mod.rs:418:13:
assertion `left == right` failed: Arrays with inconsistent types passed to MutableArrayData
  left: Int64
 right: Null
stack backtrace:
   0: rust_begin_unwind
             at /rustc/9fc6b43126469e3858e2fe86cafb4f0fd5068869/library/std/src/panicking.rs:665:5
   1: core::panicking::panic_fmt
             at /rustc/9fc6b43126469e3858e2fe86cafb4f0fd5068869/library/core/src/panicking.rs:76:14
   2: core::panicking::assert_failed_inner
   3: core::panicking::assert_failed
             at /home/joe/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/panicking.rs:373:5
   4: arrow_data::transform::MutableArrayData::with_capacities
             at /home/joe/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-data-54.0.0/src/transform/mod.rs:418:13
   5: datafusion_functions_nested::replace::general_replace
             at /home/joe/Projects/datafusion/datafusion/functions-nested/src/replace.rs:303:23
   6: datafusion_functions_nested::replace::array_replace_inner
             at /home/joe/Projects/datafusion/datafusion/functions-nested/src/replace.rs:394:13
   7: core::ops::function::Fn::call
             at /home/joe/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ops/function.rs:79:5
   8: datafusion_functions_nested::utils::make_scalar_function::{{closure}}
             at /home/joe/Projects/datafusion/datafusion/functions-nested/src/utils.rs:72:22
   9: <datafusion_functions_nested::replace::ArrayReplace as datafusion_expr::udf::ScalarUDFImpl>::invoke_batch
             at /home/joe/Projects/datafusion/datafusion/functions-nested/src/replace.rs:123:9
  10: datafusion_expr::udf::ScalarUDFImpl::invoke_with_args
             at /home/joe/Projects/datafusion/datafusion/expr/src/udf.rs:643:9
  11: datafusion_expr::udf::ScalarUDF::invoke_with_args
             at /home/joe/Projects/datafusion/datafusion/expr/src/udf.rs:237:9
  12: <datafusion_physical_expr::scalar_function::ScalarFunctionExpr as datafusion_physical_expr_common::physical_expr::PhysicalExpr>::evaluate
             at /home/joe/Projects/datafusion/datafusion/physical-expr/src/scalar_function.rs:195:22
  13: datafusion_optimizer::simplify_expressions::expr_simplifier::ConstEvaluator::evaluate_to_scalar
             at /home/joe/Projects/datafusion/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs:639:29
  14: <datafusion_optimizer::simplify_expressions::expr_simplifier::ConstEvaluator as datafusion_common::tree_node::TreeNodeRewriter>::f_up
             at /home/joe/Projects/datafusion/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs:529:30
  15: datafusion_common::tree_node::TreeNode::rewrite::{{closure}}::{{closure}}
             at /home/joe/Projects/datafusion/datafusion/common/src/tree_node.rs:183:13
  16: datafusion_common::tree_node::Transformed<T>::transform_parent
             at /home/joe/Projects/datafusion/datafusion/common/src/tree_node.rs:763:44
  17: datafusion_common::tree_node::TreeNode::rewrite::{{closure}}
             at /home/joe/Projects/datafusion/datafusion/common/src/tree_node.rs:28:9
  18: stacker::maybe_grow
             at /home/joe/.cargo/registry/src/index.crates.io-6f17d22bba15001f/stacker-0.1.17/src/lib.rs:55:9
  19: datafusion_common::tree_node::TreeNode::rewrite
             at /home/joe/Projects/datafusion/datafusion/common/src/tree_node.rs:177:50
  20: datafusion_optimizer::simplify_expressions::expr_simplifier::ExprSimplifier<S>::simplify_with_cycle_count
             at /home/joe/Projects/datafusion/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs:211:17
  21: datafusion_optimizer::simplify_expressions::expr_simplifier::ExprSimplifier<S>::simplify
             at /home/joe/Projects/datafusion/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs:184:12
  22: datafusion_optimizer::simplify_expressions::simplify_exprs::SimplifyExpressions::optimize_internal::{{closure}}
             at /home/joe/Projects/datafusion/datafusion/optimizer/src/simplify_expressions/simplify_exprs.rs:127:25
  23: core::ops::function::impls::<impl core::ops::function::FnMut<A> for &mut F>::call_mut
             at /home/joe/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ops/function.rs:294:13
  24: <datafusion_expr::expr::Expr as datafusion_common::tree_node::TreeNodeContainer<datafusion_expr::expr::Expr>>::map_elements
             at /home/joe/Projects/datafusion/datafusion/expr/src/expr.rs:366:9
  25: <alloc::vec::Vec<C> as datafusion_common::tree_node::TreeNodeContainer<T>>::map_elements::{{closure}}
             at /home/joe/Projects/datafusion/datafusion/common/src/tree_node.rs:873:21
  26: core::iter::adapters::map::map_try_fold::{{closure}}
             at /home/joe/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/iter/adapters/map.rs:95:28
  27: <alloc::vec::into_iter::IntoIter<T,A> as core::iter::traits::iterator::Iterator>::try_fold
             at /home/joe/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/vec/into_iter.rs:346:25
  28: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::try_fold
             at /home/joe/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/iter/adapters/map.rs:121:9
  29: <core::iter::adapters::GenericShunt<I,R> as core::iter::traits::iterator::Iterator>::try_fold
             at /home/joe/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/iter/adapters/mod.rs:191:9
  30: <I as alloc::vec::in_place_collect::SpecInPlaceCollect<T,I>>::collect_in_place
             at /home/joe/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/vec/in_place_collect.rs:380:13
  31: alloc::vec::in_place_collect::from_iter_in_place
             at /home/joe/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/vec/in_place_collect.rs:271:9
  32: alloc::vec::in_place_collect::from_iter_in_place{{reify.shim}}
             at /home/joe/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/vec/in_place_collect.rs:251:1
  33: alloc::vec::in_place_collect::<impl alloc::vec::spec_from_iter::SpecFromIter<T,I> for alloc::vec::Vec<T>>::from_iter
             at /home/joe/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/vec/in_place_collect.rs:246:9
  34: <alloc::vec::Vec<T> as core::iter::traits::collect::FromIterator<T>>::from_iter
             at /home/joe/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/vec/mod.rs:3412:9
  35: core::iter::traits::iterator::Iterator::collect
             at /home/joe/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/iter/traits/iterator.rs:1971:9
  36: <core::result::Result<V,E> as core::iter::traits::collect::FromIterator<core::result::Result<A,E>>>::from_iter::{{closure}}
             at /home/joe/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/result.rs:1980:51
  37: core::iter::adapters::try_process
             at /home/joe/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/iter/adapters/mod.rs:160:17
  38: <core::result::Result<V,E> as core::iter::traits::collect::FromIterator<core::result::Result<A,E>>>::from_iter
             at /home/joe/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/result.rs:1980:9
  39: core::iter::traits::iterator::Iterator::collect
             at /home/joe/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/iter/traits/iterator.rs:1971:9
  40: <alloc::vec::Vec<C> as datafusion_common::tree_node::TreeNodeContainer<T>>::map_elements
             at /home/joe/Projects/datafusion/datafusion/common/src/tree_node.rs:870:9
  41: datafusion_expr::logical_plan::tree_node::<impl datafusion_expr::logical_plan::plan::LogicalPlan>::map_expressions
             at /home/joe/Projects/datafusion/datafusion/expr/src/logical_plan/tree_node.rs:498:19
  42: datafusion_optimizer::simplify_expressions::simplify_exprs::SimplifyExpressions::optimize_internal
             at /home/joe/Projects/datafusion/datafusion/optimizer/src/simplify_expressions/simplify_exprs.rs:125:9
  43: <datafusion_optimizer::simplify_expressions::simplify_exprs::SimplifyExpressions as datafusion_optimizer::optimizer::OptimizerRule>::rewrite
             at /home/joe/Projects/datafusion/datafusion/optimizer/src/simplify_expressions/simplify_exprs.rs:73:9
  44: datafusion_optimizer::optimizer::optimize_plan_node
             at /home/joe/Projects/datafusion/datafusion/optimizer/src/optimizer.rs:332:16
  45: <datafusion_optimizer::optimizer::Rewriter as datafusion_common::tree_node::TreeNodeRewriter>::f_up
             at /home/joe/Projects/datafusion/datafusion/optimizer/src/optimizer.rs:318:13
  46: datafusion_expr::logical_plan::tree_node::<impl datafusion_expr::logical_plan::plan::LogicalPlan>::rewrite_with_subqueries::{{closure}}::{{closure}}
             at /home/joe/Projects/datafusion/datafusion/expr/src/logical_plan/tree_node.rs:698:17
  47: datafusion_common::tree_node::Transformed<T>::transform_parent
             at /home/joe/Projects/datafusion/datafusion/common/src/tree_node.rs:763:44
  48: datafusion_expr::logical_plan::tree_node::<impl datafusion_expr::logical_plan::plan::LogicalPlan>::rewrite_with_subqueries::{{closure}}
             at /home/joe/Projects/datafusion/datafusion/expr/src/logical_plan/tree_node.rs:386:9
  49: stacker::maybe_grow
             at /home/joe/.cargo/registry/src/index.crates.io-6f17d22bba15001f/stacker-0.1.17/src/lib.rs:55:9
  50: datafusion_expr::logical_plan::tree_node::<impl datafusion_expr::logical_plan::plan::LogicalPlan>::rewrite_with_subqueries
             at /home/joe/Projects/datafusion/datafusion/expr/src/logical_plan/tree_node.rs:690:50
  51: datafusion_optimizer::optimizer::Optimizer::optimize
             at /home/joe/Projects/datafusion/datafusion/optimizer/src/optimizer.rs:388:42
  52: datafusion::execution::session_state::SessionState::optimize
             at /home/joe/Projects/datafusion/datafusion/core/src/execution/session_state.rs:714:13
  53: datafusion::execution::session_state::SessionState::create_physical_plan::{{closure}}
             at /home/joe/Projects/datafusion/datafusion/core/src/execution/session_state.rs:732:28
  54: datafusion::dataframe::DataFrame::create_physical_plan::{{closure}}
             at /home/joe/Projects/datafusion/datafusion/core/src/dataframe/mod.rs:228:61
  55: datafusion_cli::exec::exec_and_print::{{closure}}
             at ./src/exec.rs:236:55
  56: datafusion_cli::exec::exec_from_repl::{{closure}}::{{closure}}
             at /home/joe/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.43.0/src/macros/select.rs:557:49
  57: <core::future::poll_fn::PollFn<F> as core::future::future::Future>::poll
             at /home/joe/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/future/poll_fn.rs:151:9
  58: datafusion_cli::exec::exec_from_repl::{{closure}}
             at ./src/exec.rs:176:21
  59: datafusion_cli::main_inner::{{closure}}
             at ./src/main.rs:214:14
  60: datafusion_cli::main::{{closure}}
             at ./src/main.rs:131:34
  61: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /home/joe/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/future/future.rs:124:9
  62: tokio::runtime::park::CachedParkThread::block_on::{{closure}}
             at /home/joe/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.43.0/src/runtime/park.rs:284:63
  63: tokio::runtime::coop::with_budget
             at /home/joe/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.43.0/src/runtime/coop.rs:107:5
  64: tokio::runtime::coop::budget
             at /home/joe/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.43.0/src/runtime/coop.rs:73:5
  65: tokio::runtime::park::CachedParkThread::block_on
             at /home/joe/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.43.0/src/runtime/park.rs:284:31
  66: tokio::runtime::context::blocking::BlockingRegionGuard::block_on
             at /home/joe/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.43.0/src/runtime/context/blocking.rs:66:9
  67: tokio::runtime::scheduler::multi_thread::MultiThread::block_on::{{closure}}
             at /home/joe/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.43.0/src/runtime/scheduler/multi_thread/mod.rs:87:13
  68: tokio::runtime::context::runtime::enter_runtime
             at /home/joe/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.43.0/src/runtime/context/runtime.rs:65:16
  69: tokio::runtime::scheduler::multi_thread::MultiThread::block_on
             at /home/joe/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.43.0/src/runtime/scheduler/multi_thread/mod.rs:86:9
  70: tokio::runtime::runtime::Runtime::block_on_inner
             at /home/joe/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.43.0/src/runtime/runtime.rs:370:45
  71: tokio::runtime::runtime::Runtime::block_on
             at /home/joe/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.43.0/src/runtime/runtime.rs:340:13
  72: datafusion_cli::main
             at ./src/main.rs:131:5
  73: core::ops::function::FnOnce::call_once
             at /home/joe/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
killed

Process finished with exit code 9

@jkosh44
Contributor Author

jkosh44 commented Jan 25, 2025

An alternative approach would be to add a function like the following to the ScalarUDFImpl trait

   /// Returns true if the function should return NULL when any of the arguments are NULL, false
   /// otherwise.
   fn propagates_nulls(&self) -> bool;

Then we could handle this for all functions in the same place at some higher level. Maybe somewhere like make_scalar_function.

Though, that would skip some of the error checking that happens inside the function implementation, which wouldn't be great.

@jayzhan211
Contributor

jayzhan211 commented Jan 25, 2025

I think another approach is to introduce an optimizer rule in ConstEvaluator that rewrites any function call containing a null to null. Is there any function that doesn't return null on null input but has specialized handling logic?

We could handle such null handling in ScalarFunctionExpr::evaluate:

fn evaluate(&self, batch: &RecordBatch) -> Result<ColumnarValue> {
    let args = self
        .args
        .iter()
        .map(|e| e.evaluate(batch))
        .collect::<Result<Vec<_>>>()?;

    let input_empty = args.is_empty();
    let input_all_scalar = args
        .iter()
        .all(|arg| matches!(arg, ColumnarValue::Scalar(_)));

    // evaluate the function
    let output = self.fun.invoke_with_args(ScalarFunctionArgs {
        args,
        number_rows: batch.num_rows(),
        return_type: &self.return_type,
    })?;

    if let ColumnarValue::Array(array) = &output {
        if array.len() != batch.num_rows() {
            // If the arguments are a non-empty slice of scalar values, we can assume that
            // returning a one-element array is equivalent to returning a scalar.
            let preserve_scalar =
                array.len() == 1 && !input_empty && input_all_scalar;
            return if preserve_scalar {
                ScalarValue::try_from_array(array, 0).map(ColumnarValue::Scalar)
            } else {
                internal_err!("UDF {} returned a different number of rows than expected. Expected: {}, Got: {}",
                    self.name, batch.num_rows(), array.len())
            };
        }
    }
    Ok(output)
}

@alamb
Contributor

alamb commented Jan 25, 2025

We could handle such null handling in ScalarFunctionExpr::evaluate

Most SQL functions are "strict" in the sense that if any of their inputs are null they produce NULL output.

There are a few exceptions, like coalesce.

I think it is important that people be able to write user defined functions that are not null strict.

(Edit: updated to say STRICT)

@jayzhan211
Contributor

Maybe we need yet another trait implementation

trait ScalarUDFImpl {
    fn handle_nulls(&self, args: ScalarFunctionArgs) -> Result<Option<ColumnarValue>> {
        // In most cases, if any argument is null, return null.
    }
}

For coalesce:

fn handle_nulls(&self, args: ScalarFunctionArgs) -> Result<Option<ColumnarValue>> {
    // Nulls are handled inside the function itself.
    Ok(None)
}

@alamb
Contributor

alamb commented Jan 25, 2025

Maybe we need yet another trait implementation

I think it could potentially be modeled as a field on Signature https://docs.rs/datafusion/latest/datafusion/logical_expr/struct.Signature.html which would minimize the breakages

Some prior art:

It appears that Spark uses the term non-nullable:
https://spark.apache.org/docs/3.5.3/sql-ref-functions-udf-scalar.html

I think Postgres uses the term STRICT -- from the docs: https://www.postgresql.org/docs/current/sql-createfunction.html

CALLED ON NULL INPUT
RETURNS NULL ON NULL INPUT
STRICT
CALLED ON NULL INPUT (the default) indicates that the function will be called normally when some of its arguments are null. It is then the function author's responsibility to check for null values if necessary and respond appropriately.

RETURNS NULL ON NULL INPUT or STRICT indicates that the function always returns null whenever any of its arguments are null. If this parameter is specified, the function is not executed when there are null arguments; instead a null result is assumed automatically.

@jkosh44
Contributor Author

jkosh44 commented Jan 25, 2025

Maybe we need yet another trait implementation

One complication with this approach is that we might end up skipping over error checks that are inside of the function implementation. For example,

_ => exec_err!("array_slice does not support type: {:?}", array_data_type),

So instead of SELECT array_slice(1.5, NULL, NULL) returning an error for an unsupported type in the first argument, it will return NULL. The DuckDB behavior is to return an error here.

@jkosh44
Contributor Author

jkosh44 commented Jan 25, 2025

I made a couple of changes to this PR in the second commit. Previously, I was getting an optimizer error under certain scenarios, for example,

> select array_slice([1,2,3], NULL, NULL);
Optimizer rule 'optimize_projections' failed
caused by
Check optimizer-specific invariants after optimizer rule: optimize_projections
caused by
Internal error: Failed due to a difference in schemas, original schema: DFSchema { inner: Schema { fields: [Field { name: "make_array(Int64(1),Int64(2),Int64(3))[NULL:NULL]", data_type: List(Field { name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, field_qualifiers: [None], functional_dependencies: FunctionalDependencies { deps: [] } }, new schema: DFSchema { inner: Schema { fields: [Field { name: "make_array(Int64(1),Int64(2),Int64(3))[NULL:NULL]", data_type: Null, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, field_qualifiers: [None], functional_dependencies: FunctionalDependencies { deps: [] } }.
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker

Strangely, select array_slice([1,2,3], NULL, NULL) IS NULL returned true, which is why the tests were passing. The second commit fixed this issue and updated the tests. It turns out there were already tests asserting that NULLs would cause errors, so I switched those to assert NULL results.

@jayzhan211
Contributor

jayzhan211 commented Jan 26, 2025

So instead of SELECT array_slice(1.5, NULL, NULL) returning an error for an unsupported type in the first argument, it will return NULL

This is because the signature for extract doesn't handle type checking correctly (it uses variadic_any), not because of the newly introduced trait method.

@jayzhan211
Contributor

jayzhan211 commented Jan 26, 2025

Maybe we need yet another trait implementation

I think it could potentially be modeled as a field on Signature https://docs.rs/datafusion/latest/datafusion/logical_expr/struct.Signature.html which would minimize the breakages

Some prior art:

It appears that Spark uses the term non-nullable: https://spark.apache.org/docs/3.5.3/sql-ref-functions-udf-scalar.html

I think Postgres uses the term STRICT -- from the docs: https://www.postgresql.org/docs/current/sql-createfunction.html

CALLED ON NULL INPUT
RETURNS NULL ON NULL INPUT
STRICT
CALLED ON NULL INPUT (the default) indicates that the function will be called normally when some of its arguments are null. It is then the function author's responsibility to check for null values if necessary and respond appropriately.

RETURNS NULL ON NULL INPUT or STRICT indicates that the function always returns null whenever any of its arguments are null. If this parameter is specified, the function is not executed when there are null arguments; instead a null result is assumed automatically.

I prefer to have Strict mode by default -- return null if any argument is null.

We can add a field handle_nulls: bool to Signature, false by default; users can easily change it to true and handle nulls themselves.

Also, I think we can rewrite such functions in the optimizer when handle_nulls is false: we can easily check whether an argument is null, given that it should be a Scalar, and rewrite the call to Null. Or we could even rewrite it as early as possible in the planner 🤔 We need to consider the case where we call evaluate on ScalarFunctionExpr.

@github-actions github-actions bot added the physical-expr Physical Expressions label Jan 26, 2025
@jkosh44
Contributor Author

jkosh44 commented Jan 26, 2025

This is because the signature for extract doesn't handle type checking correctly (it uses variadic_any), not because of the newly introduced trait method.

Oh great, then I don't think my concern is actually valid.

I prefer to have Strict mode by default -- return null if any argument is null.

We can add a field handle_nulls: bool to Signature, false by default; users can easily change it to true and handle nulls themselves.

I pushed a PoC for an implementation using strict in the Signature. I called it strict and made the default the opposite of what you suggested because I implemented it before I saw your comment. My thinking on the opposite default was that it would be more backwards compatible; otherwise we'd have to check every function that really does not want to be strict. I can easily change it to the reverse though.

@@ -330,7 +330,8 @@ pub(super) struct ArraySlice {
 impl ArraySlice {
     pub fn new() -> Self {
         Self {
             signature: Signature::variadic_any(Volatility::Immutable),
+            // TODO: This signature should use the actual accepted types, not variadic_any.
Contributor Author

I haven't been able to figure out a way using the current TypeSignature to create a type signature that accepts (any list type, i64, i64). My only idea is to extend ArrayFunctionSignature with a variant like ArrayAndElements(NonZeroUsize).

Contributor Author

Another approach, which probably warrants its own issue/PR, is to add something similar to PostgreSQL pseudo-types (https://www.postgresql.org/docs/current/datatype-pseudo.html). Then I could make a signature like

TypeSignature::OneOf(vec![
    TypeSignature::Exact(vec![AnyArray, Int64, Int64]),
    TypeSignature::Exact(vec![AnyArray, Int64, Int64, Int64]),
])

Contributor

My only idea is to extend ArrayFunctionSignature with a variant like ArrayAndElements(NonZeroUsize)

I think this is better

Contributor Author

Just pushed an update to do this.

@rluvaton
Contributor

I don't like the name strict, as it can mean different things (like failing on parsing invalid strings); I think it should be an enum for null handling.

@jkosh44
Contributor Author

jkosh44 commented Jan 28, 2025

I don't like the name strict, as it can mean different things (like failing on parsing invalid strings); I think it should be an enum for null handling.

How about something like the following?

/// How a function handles Null input.
enum NullHandling {
    /// Pass Null inputs into the function implementation.
    PassThrough,
    /// Any Null input causes the function to return Null.
    Propagate,
}

I'm happy to hear any bikeshedding on the names because I'm not really familiar with the project's existing style.

I prefer to have Strict mode by default -- return null if any argument is null.

I've been thinking about this some more, and I'm more convinced that we should set the default to not strict, i.e. NullHandling::PassThrough. Otherwise, not only would the behavior of all of our functions silently change, but the behavior of all users' UDFs would also change. So setting strict by default feels like too big of a breaking change. Though, this is my first PR and I'm not super familiar with the project's backwards compatibility guarantees, so I'll defer to a reviewer.

@jayzhan211
Contributor

/// How a function handles Null input.
enum NullHandling {
    /// Pass Null inputs into the function implementation.
    PassThrough,
    /// Any Null input causes the function to return Null.
    Propagate,
}

This looks great. NullHandling::PassThrough could be the default; I don't have a strong opinion on the default setting.

@github-actions github-actions bot added the logical-expr Logical plan and expressions label Jan 29, 2025
@jkosh44
Contributor Author

jkosh44 commented Jan 29, 2025

/// Any Null input causes the function to return Null.
    Propagate,

I've updated the PR to use this. I just realized though that window and aggregate functions (and maybe other function types?) completely ignore this field in the Signature. That feels wrong, but I'll have to think about what the right behavior should be.

@@ -186,6 +187,15 @@ impl PhysicalExpr for ScalarFunctionExpr {
             .map(|e| e.evaluate(batch))
             .collect::<Result<Vec<_>>>()?;

+        if self.fun.signature().null_handling == NullHandling::Propagate
+            && args.iter().any(
+                |arg| matches!(arg, ColumnarValue::Scalar(scalar) if scalar.is_null()),
Contributor Author

I'm not super confident about this check; how should ColumnarValue::Array inputs be treated?

Contributor Author

Ok, I think I understand this now. If the function is called with a single set of arguments then each arg will be a ColumnarValue::Scalar. However, if the function is called on a batch of arguments, then each arg will be a ColumnarValue::Array of all the arguments. So this does not work in the batch case.

What we'd really like is to identify all indexes i such that one of the args at index i is Null, then somehow skip all rows at those indexes and immediately return Null for them. That seems a little tricky because it looks like we pass the entire ArrayRef to the function implementation.

Contributor

Ok, I think I understand this now. If the function is called with a single set of arguments then each arg will be a ColumnarValue::Scalar. However, if the function is called on a batch of arguments, then each arg will be a ColumnarValue::Array of all the arguments. So this does not work in the batch case.

What we'd really like is to identify all indexes i such that one of the args at index i is Null, then somehow skip all rows at those indexes and immediately return Null for them. That seems a little tricky because it looks like we pass the entire ArrayRef to the function implementation.

I don't think we need to peek at nulls in the column case; that should be specialized logic handled by each function. For the scalar case, since most scalar functions return null if any one of the args is null, it is beneficial to introduce another null handling method. It is just a convenience method, nice to have but not a must-have solution for null handling, since nulls can be handled in invoke as well.

Contributor Author

It is just a convenience method, nice to have but not a must-have solution for null handling, since nulls can be handled in invoke as well.

If someone forgets to handle nulls in invoke, then don't we run the risk of accidentally returning different results depending on whether the function was called with scalar arguments or with a batch of arguments?

Contributor Author

Also, I'm not sure I understand the point of the check here if the invoke implementation also has to handle nulls, but maybe I'm misunderstanding what you're saying.

Contributor
@jayzhan211 jayzhan211 Feb 1, 2025

Consider scalar_function(null, null, ...) versus scalar_function(column_contains_null, ...). We can only handle the literal nulls, not column_contains_null, because we don't know the data in the column.

Contributor Author

I'm sorry, I don't think I understand your response. So I should leave this code block here?

Contributor

Yes.

@jkosh44
Contributor Author

jkosh44 commented Jan 30, 2025

/// Any Null input causes the function to return Null.
    Propagate,

I've updated the PR to use this. I just realized though that window and aggregate functions (and maybe other function types?) completely ignore this field in the Signature. That feels wrong, but I'll have to think about what the right behavior should be.

In PostgreSQL, table functions honor the STRICT marker by returning 0 rows:

postgres=# CREATE FUNCTION test_table(i1 INT, i2 INT)
RETURNS TABLE(i INT) 
STRICT
LANGUAGE plpgsql AS $$
BEGIN
    RETURN QUERY SELECT i1 UNION ALL SELECT i2;
END;
$$;
CREATE FUNCTION
postgres=# SELECT test_table(42, 43);
 test_table 
------------
         42
         43
(2 rows)

postgres=# SELECT test_table(42, NULL);
 test_table 
------------
(0 rows)

postgres=# SELECT test_table(NULL, 42);
 test_table 
------------
(0 rows)

postgres=# SELECT test_table(NULL, NULL);
 test_table 
------------
(0 rows)

Aggregate functions can't directly be labelled strict:

postgres=# CREATE FUNCTION custom_sum(INT, INT) RETURNS INT AS $$ SELECT $1 + $2; $$ LANGUAGE SQL IMMUTABLE STRICT;
CREATE FUNCTION
postgres=# CREATE AGGREGATE custome_sum_agg(INT) (SFUNC = custom_sum, STYPE = INT, STRICT);
WARNING:  aggregate attribute "strict" not recognized
CREATE AGGREGATE

but the docs say this about strict transition functions:

If the state transition function is declared “strict”, then it cannot be called with null inputs. With such a transition function, aggregate execution behaves as follows. Rows with any null input values are ignored (the function is not called and the previous state value is retained). If the initial state value is null, then at the first row with all-nonnull input values, the first argument value replaces the state value, and the transition function is invoked at each subsequent row with all-nonnull input values. This is handy for implementing aggregates like max. Note that this behavior is only available when state_data_type is the same as the first arg_data_type. When these types are different, you must supply a nonnull initial condition or use a nonstrict transition function.

If the state transition function is not strict, then it will be called unconditionally at each input row, and must deal with null inputs and null state values for itself. This allows the aggregate author to have full control over the aggregate's handling of null values.

If the final function is declared “strict”, then it will not be called when the ending state value is null; instead a null result will be returned automatically. (Of course this is just the normal behavior of strict functions.) In any case the final function has the option of returning a null value. For example, the final function for avg returns null when it sees there were zero input rows.

https://www.postgresql.org/docs/current/sql-createaggregate.html

@alamb alamb added the api change Changes the API exposed to users of the crate label Jan 30, 2025
@alamb alamb changed the title bug: Fix NULL handling in array_slice bug: Fix NULL handling in array_slice, introduce NullHandling enum to Signature Jan 30, 2025
@alamb
Contributor

alamb commented Jan 30, 2025

I marked this PR as an API change and updated the title to reflect it. I suggest we wait to merge until we cut the release.

I hope to make an RC in the next few days.

@jkosh44
Contributor Author

jkosh44 commented Jan 31, 2025

There's still the open question of how window and aggregate functions should treat NullHandling::Propagate. Table functions don't use the Signature struct, so we can ignore them for now. However, the one builtin table function re-implements the same idea:

/// Indicates the arguments used for generating a series.
#[derive(Debug, Clone)]
enum GenSeriesArgs {
    /// ContainsNull signifies that at least one argument(start, end, step) was null, thus no series will be generated.
    ContainsNull,
    /// AllNotNullArgs holds the start, end, and step values for generating the series when all arguments are not null.
    AllNotNullArgs { start: i64, end: i64, step: i64 },
}

I have a couple of proposals; @jayzhan211, let me know if you have any preferences.

Approach 1. Window and aggregate functions ignore NullHandling::Propagate and maybe log a warning if it's set.

Approach 2. Aggregate functions return a final result of Null if any of the input is Null. Window functions emit Null for all rows after encountering the first Null.

Approach 3. Aggregate functions and window functions skip Null input, i.e. they don't update their state. With this approach, the name Propagate may not be the best and we may want to consider changing it to something more general.

Approach 4. Remove null_handling from Signature and add a new method to the ScalarUDFImpl with a default implementation like

pub trait ScalarUDFImpl: Debug + Send + Sync {
    ...

    fn null_handling(&self) -> NullHandling {
        NullHandling::PassThrough
    } 
}

then we wouldn't have to worry about aggregate and window functions. Additionally, if we do ever want to add something similar to aggregate and window functions, then we can use different enums for each function type that makes sense for that function type. For example,

enum ScalarNullHandling  {
    PassThrough,
    Propagate,
}
enum AggregateNullHandling {
    PassThrough,
    Propagate,
    Skip,
}
enum WindowNullHandling {
    PassThrough,
    Skip,
}

Personally, I'm leaning towards approach 4. However, I didn't actually understand above why it would be less of a breaking change to update Signature compared to updating ScalarUDFImpl.

@jayzhan211
Contributor

Implementing it as a method on the trait and calling it inside the invoke methods requires more changes. However, I think it makes more sense now, given that null handling should be logic in invoke rather than in the signature.

@jkosh44
Contributor Author

jkosh44 commented Feb 1, 2025

Implementing it as a method on the trait and calling it inside the invoke methods requires more changes. However, I think it makes more sense now, given that null handling should be logic in invoke rather than in the signature.

I just pushed an update to do this. If I've understood correctly, then I think I've addressed all the feedback. Please let me know if I missed something.

@jkosh44
Contributor Author

jkosh44 commented Feb 1, 2025

If someone forgets to handle nulls in invoke, then don't we run the risk of accidentally returning different results depending on whether the function was called with scalar arguments or with a batch of arguments?

Though I'm still slightly concerned about this. For example, if we run array_slice on a batch, we get the following:

> CREATE TABLE t (a int[], b int, c int);
0 row(s) fetched. 
Elapsed 0.004 seconds.

> INSERT INTO t VALUES ([1,2,3], 1, 2), ([1,2,3], 1, 2), ([1,2,3], 1, 2), ([1,2,3], 1, 2), ([1,2,3], 1, 2), ([1,2,3], 1, 2), ([1,2,3], 1, NULL), ([1,2,3], NULL, 2), (NULL, 1, 2);
+-------+
| count |
+-------+
| 9     |
+-------+
1 row(s) fetched. 
Elapsed 0.009 seconds.

> SELECT array_slice(a, b, c) FROM t;
+--------------+
| t.a[t.b:t.c] |
+--------------+
| [1, 2]       |
| [1, 2]       |
| [1, 2]       |
| [1, 2]       |
| [1, 2]       |
| [1, 2]       |
| [1, 2, 3]    |
| [1, 2]       |
| []           |
+--------------+
9 row(s) fetched. 
Elapsed 0.004 seconds.

but if we just run it on scalars then we get the following:

> SELECT array_slice([1,2,3], 1, NULL);
+-------------------------------------------------------+
| make_array(Int64(1),Int64(2),Int64(3))[Int64(1):NULL] |
+-------------------------------------------------------+
| NULL                                                  |
+-------------------------------------------------------+
1 row(s) fetched. 
Elapsed 0.003 seconds.

> SELECT array_slice([1,2,3], NULL, 2);
+-------------------------------------------------------+
| make_array(Int64(1),Int64(2),Int64(3))[NULL:Int64(2)] |
+-------------------------------------------------------+
| NULL                                                  |
+-------------------------------------------------------+
1 row(s) fetched. 
Elapsed 0.004 seconds.

> SELECT array_slice(arrow_cast(NULL, 'LargeList(Int64)'), 1, 2);
+--------------------------------------------------------------+
| arrow_cast(NULL,Utf8("LargeList(Int64)"))[Int64(1):Int64(2)] |
+--------------------------------------------------------------+
| NULL                                                         |
+--------------------------------------------------------------+
1 row(s) fetched. 
Elapsed 0.005 seconds.

EDIT: That seems to come from here:

// len 0 indicate array is null, return empty array in this row.
if len == O::usize_as(0) {
    offsets.push(offsets[row_index]);
    continue;
}

// If index is null, we consider it as the minimum / maximum index of the array.
let from_index = if from_array.is_null(row_index) {
    Some(O::usize_as(0))
} else {
    adjusted_from_index::<O>(from_array.value(row_index), len)?
};

let to_index = if to_array.is_null(row_index) {
    Some(len - O::usize_as(1))
} else {
    adjusted_to_index::<O>(to_array.value(row_index), len)?
};

So probably those lines should just be updated to be compatible with DuckDB (i.e. return NULL).

@jkosh44
Contributor Author

jkosh44 commented Feb 1, 2025

@jayzhan211 I just pushed another commit that correctly returns null for batch inputs. This also updates the behavior of array_pop_front, array_pop_back, and array slicing (i.e. a[1:5]), since those functions are all implemented using the array_slice implementation. I checked with DuckDB and the new behavior matches DuckDB, but it may be considered a breaking change for DataFusion.

Contributor
@jayzhan211 jayzhan211 left a comment

👍🏻

@jayzhan211
Contributor

We can merge this after the next release is out

@alamb
Contributor

alamb commented Feb 3, 2025

I have made the release branch for 45 so we can merge this to main now.

Thanks a lot @jkosh44 and @jayzhan211

@alamb alamb merged commit 3dfce7d into apache:main Feb 3, 2025
25 checks passed
@jkosh44 jkosh44 deleted the array_slice_null branch February 3, 2025 15:45
@jkosh44
Contributor Author

jkosh44 commented Feb 6, 2025

@jayzhan211 I'm sorry to draw this out more, but I've been thinking about this change recently, and now that I understand the code a little better, I think we should remove NullHandling.

Here's why. NullHandling only handles inputs when they are ColumnarValue::Scalar(_) and does not handle inputs that are ColumnarValue::Array(_). So function implementers have to account for null ColumnarValue::Array(_) values in their implementation. Whatever the implementer does to account for ColumnarValue::Array(_) should also work for ColumnarValue::Scalar(_).

The only reason this doesn't currently work for many of the functions is that their signatures are signature: Signature::variadic_any(Volatility::Immutable). This causes null scalar inputs to be passed in as ColumnarValue::Scalar(ScalarValue::Null) instead of as the null value of the expected type (for example, ColumnarValue::Scalar(ScalarValue::Int32(None))). When the implementation tries to treat ColumnarValue::Scalar(ScalarValue::Null) as the expected type, either an error is returned or a panic happens. However, if we update a function's signature to actually describe the expected types, then a null scalar input is passed in as the null value for that type.

Furthermore, we can't add NullHandling::Propagate to a function without updating its signature, or else we risk accepting invalid argument types when one of the arguments is null. So what ends up happening is that NullHandling::Propagate is either always a no-op and not useful, or it compensates for a non-descriptive signature.

In fact, if we apply the following diff, which comments out NullHandling::Propagate from array_slice, all of the tests still pass. (Note: I also realized that I accidentally added a 2-argument variant to array_slice, which isn't valid, so the diff includes fixing that.)

diff --git a/datafusion/functions-nested/src/extract.rs b/datafusion/functions-nested/src/extract.rs
index 8c120876c..32af568ee 100644
--- a/datafusion/functions-nested/src/extract.rs
+++ b/datafusion/functions-nested/src/extract.rs
@@ -329,11 +329,6 @@ impl ArraySlice {
         Self {
             signature: Signature::one_of(
                 vec![
-                    TypeSignature::ArraySignature(
-                        ArrayFunctionSignature::ArrayAndIndexes(
-                            NonZeroUsize::new(1).expect("1 is non-zero"),
-                        ),
-                    ),
                     TypeSignature::ArraySignature(
                         ArrayFunctionSignature::ArrayAndIndexes(
                             NonZeroUsize::new(2).expect("2 is non-zero"),
@@ -390,9 +385,9 @@ impl ScalarUDFImpl for ArraySlice {
         Ok(arg_types[0].clone())
     }
 
-    fn null_handling(&self) -> NullHandling {
-        NullHandling::Propagate
-    }
+    // fn null_handling(&self) -> NullHandling {
+    //     NullHandling::Propagate
+    // }
 
     fn invoke_batch(
         &self,
diff --git a/datafusion/sqllogictest/test_files/array.slt b/datafusion/sqllogictest/test_files/array.slt
index d804bb424..fc80ec79d 100644
--- a/datafusion/sqllogictest/test_files/array.slt
+++ b/datafusion/sqllogictest/test_files/array.slt
@@ -1850,15 +1850,11 @@ select array_slice(arrow_cast(make_array(1, 2, 3, 4, 5), 'LargeList(Int64)'), 0,
 [] []
 
 # array_slice scalar function #11 (with NULL-NULL)
-query ??
+query error
 select array_slice(make_array(1, 2, 3, 4, 5), NULL), array_slice(make_array('h', 'e', 'l', 'l', 'o'), NULL);
-----
-NULL NULL
 
-query ??
+query error
 select array_slice(arrow_cast(make_array(1, 2, 3, 4, 5), 'LargeList(Int64)'), NULL), array_slice(arrow_cast(make_array('h', 'e', 'l', 'l', 'o'), 'LargeList(Utf8)'), NULL);
-----
-NULL NULL
 
 # array_slice scalar function #12 (with zero and negative number)
 query ??

What do you think? If you agree then I'd like to get a fix in before the next release is cut, whenever that is.

@jkosh44
Contributor Author

jkosh44 commented Feb 6, 2025

(Note: I also realized that I accidentally added a 2-argument variant to array_slice, which isn't valid, so the diff includes fixing that.)

I submitted a PR to fix this specifically: #14527

@jayzhan211
Contributor

@jkosh44 It sounds like nulls can be handled by the signature after all? Sounds great.

Labels
api change (Changes the API exposed to users of the crate) · logical-expr (Logical plan and expressions) · physical-expr (Physical Expressions) · sqllogictest (SQL Logic Tests (.slt))

Successfully merging this pull request may close these issues.

array_slice can't correctly handle NULL parameters or some edge cases