improve null handling for to_char #9689

Merged: 7 commits, Mar 24, 2024
Changes from 1 commit
21 changes: 18 additions & 3 deletions datafusion/functions/src/datetime/to_char.rs
@@ -171,7 +171,11 @@ fn _to_char_scalar(
// of the implementation in arrow-rs we need to convert it to an array
let data_type = &expression.data_type();
let is_scalar_expression = matches!(&expression, ColumnarValue::Scalar(_));
- let array = expression.into_array(1)?;
+ let array_from_expr = expression.into_array(1)?;
Contributor Author

Note for Reviewer:
We can't directly return null when we see that the format is None, because format_options (passed to ArrayFormatter::try_new) allows specifying the string to show for null, and we need to respect this configuration option.

Contributor

While it is true that ArrayFormatter::try_new allows specifying the string to use for null values, I don't think that functionality is exposed via to_char.

Thus I think this should actually simply return a new StringArray of all null values (which, confusingly, is different from NullArray).

So in this case I think that if the format is None, the code should return a null string value (namely ColumnarValue::Scalar(ScalarValue::Utf8(None))).
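
The short-circuit suggested above could look roughly like the following. This is a minimal std-only sketch, not the DataFusion code: `Option<String>` stands in for `ScalarValue::Utf8(Option<String>)`, and `format_seconds` and `to_char_scalar_sketch` are hypothetical names with a toy `%s` placeholder rather than real chrono format specifiers.

```rust
/// Toy stand-in for the real formatter (hypothetical): substitutes the number
/// of seconds for every `%s` in the format string.
fn format_seconds(seconds: i64, format: &str) -> String {
    format.replace("%s", &seconds.to_string())
}

/// Sketch of the suggested scalar path. `Option<String>` plays the role of
/// `ScalarValue::Utf8(Option<String>)`, so `None` is a null string value.
fn to_char_scalar_sketch(value: Option<i64>, format: Option<&str>) -> Option<String> {
    // A null format short-circuits to a null string before any formatting.
    let format = format?;
    // A null input value also yields a null string.
    let seconds = value?;
    Some(format_seconds(seconds, format))
}
```

With this shape, a call with a null format never reaches the formatter, so there is no null display string to configure.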

Contributor Author (@tinfoil-knight, Mar 20, 2024)

Implemented this change. Slight difference in the type though. Please see #9689 (comment).

+ let array = match format {
+     Some(_) => array_from_expr,
+     None => ColumnarValue::create_null_array(array_from_expr.len()).into_array(1)?,
+ };
let format_options = match _build_format_options(data_type, format) {
Ok(value) => value,
Err(value) => return value,
@@ -215,8 +219,19 @@ fn _to_char_array(args: &[ColumnarValue]) -> Result<ColumnarValue> {
};
// this isn't ideal but this can't use ValueFormatter as it isn't independent
// from ArrayFormatter
- let formatter = ArrayFormatter::try_new(arrays[0].as_ref(), &format_options)?;
- let result = formatter.value(idx).try_to_string();
+ let result = match format {
+     Some(_) => {
Contributor

I think in this case we should also be formatting to None; that is, change

let mut results: Vec<String> = vec![];

to

let mut results: Vec<Option<String>> = vec![];

so that it can properly represent nulls.
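
To illustrate why the element type matters, here is a small std-only sketch. The function names and the toy `%v` placeholder are hypothetical, not part of to_char. With `Vec<Option<String>>` a null row stays distinct from a row that formats to an empty string, while `Vec<String>` has to encode null as some string.

```rust
/// Toy row formatter (hypothetical): a null format yields a null result.
fn format_row(value: i64, format: Option<&str>) -> Option<String> {
    format.map(|f| f.replace("%v", &value.to_string()))
}

/// Collects per-row results while keeping nulls distinct from empty strings.
fn format_rows(values: &[i64], formats: &[Option<&str>]) -> Vec<Option<String>> {
    values
        .iter()
        .zip(formats.iter())
        .map(|(v, f)| format_row(*v, *f))
        .collect()
}

/// Lossy variant for comparison: with `Vec<String>` a null must be encoded as
/// some string (here the empty string), so it collides with a real "".
fn format_rows_lossy(values: &[i64], formats: &[Option<&str>]) -> Vec<String> {
    format_rows(values, formats)
        .into_iter()
        .map(|r| r.unwrap_or_default())
        .collect()
}
```

In the lossy variant, a row with a null format and a row with an empty format produce the same output, which is the ambiguity the `Option` wrapper avoids.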

Contributor Author (@tinfoil-knight, Mar 20, 2024)

Continuing to use the Vec<String> type. The reason is the same as in #9689 (comment).

Contributor Author

Update: the type has now been changed to Vec<Option<String>>.

+         let formatter =
+             ArrayFormatter::try_new(arrays[0].as_ref(), &format_options)?;
+         formatter.value(idx).try_to_string()
+     }
+     None => {
+         let null_array = ColumnarValue::create_null_array(1).into_array(1)?;
+         let formatter =
+             ArrayFormatter::try_new(null_array.as_ref(), &format_options)?;
+         formatter.value(0).try_to_string()
+     }
+ };
Contributor

Something else I noticed while looking at this code is that

        ColumnarValue::Array(_) => Ok(ColumnarValue::Array(Arc::new(StringArray::from(
            results,
        )) as ArrayRef)),

effectively means the strings are copied twice (once into results and then again into the array).

As a follow-on PR, we could potentially use StringBuilder to build the final string array directly rather than allocating a bunch of small strings.
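
The idea of building the column directly can be sketched with the standard library alone. The types below are a simplified, hypothetical model of a string column (one contiguous buffer plus offsets and validity flags), not Arrow's StringBuilder API. Each value is formatted straight into the shared buffer, so no per-row `String` is allocated and copied afterwards.

```rust
use std::fmt::{Display, Write};

/// Simplified model of a finished string column (hypothetical type).
struct StringColumn {
    values: String,      // all row contents, back to back
    offsets: Vec<usize>, // rows + 1 entries; row i spans offsets[i]..offsets[i + 1]
    validity: Vec<bool>, // false marks a null row
}

impl StringColumn {
    fn len(&self) -> usize {
        self.validity.len()
    }

    /// Returns `None` for a null row, otherwise the row's string slice.
    fn get(&self, i: usize) -> Option<&str> {
        if self.validity[i] {
            Some(&self.values[self.offsets[i]..self.offsets[i + 1]])
        } else {
            None
        }
    }
}

/// Builder that writes each formatted value directly into one buffer.
struct StringColumnBuilder {
    values: String,
    offsets: Vec<usize>,
    validity: Vec<bool>,
}

impl StringColumnBuilder {
    fn new() -> Self {
        Self { values: String::new(), offsets: vec![0], validity: Vec::new() }
    }

    /// Formats `v` in place at the end of the shared buffer.
    fn append_display<T: Display>(&mut self, v: T) {
        write!(self.values, "{}", v).expect("formatting into a String failed");
        self.offsets.push(self.values.len());
        self.validity.push(true);
    }

    /// Appends a null row: an empty span with its validity flag cleared.
    fn append_null(&mut self) {
        self.offsets.push(self.values.len());
        self.validity.push(false);
    }

    fn finish(self) -> StringColumn {
        StringColumn { values: self.values, offsets: self.offsets, validity: self.validity }
    }
}
```

A builder like this also handles nulls naturally, since a null row and an empty string differ only in the validity flag.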

Contributor

For someone whose role is to find ways to optimize things I seem to be rather poor at it sometimes :) Nice catch.

Contributor

Well, there are different degrees of optimization for sure ;)

match result {
Ok(value) => results.push(value),
Err(e) => return exec_err!("{}", e),
14 changes: 11 additions & 3 deletions datafusion/sqllogictest/test_files/timestamps.slt
@@ -2661,7 +2661,7 @@ PT123456S
query T
select to_char(arrow_cast(123456, 'Duration(Second)'), null);
----
- PT123456S
+ (empty)

query error DataFusion error: Execution error: Cast error: Format error
SELECT to_char(timestamps, '%X%K') from formats;
@@ -2672,14 +2672,22 @@ SELECT to_char('2000-02-03'::date, '%X%K');
query T
SELECT to_char(timestamps, null) from formats;
----
- 2024-01-01T06:00:00Z
- 2025-01-01T23:59:58Z
+ (empty)
+ (empty)

query T
SELECT to_char(null, '%d-%m-%Y');
----
(empty)

+ query T
+ SELECT to_char(column1, column2)
+ FROM
+ (VALUES ('2024-01-01 06:00:00'::timestamp, null), ('2025-01-01 23:59:58'::timestamp, '%d:%m:%Y %H-%M-%S'));
+ ----
+ (empty)
+ 01:01:2025 23-59-58

statement ok
drop table formats;
