Remove element's nullability of array_agg function #11447

jayzhan211 · 2024-07-13T06:17:35Z

Which issue does this PR close?

Closes #.

Rationale for this change

I think tracking the nullability of the element in the array_agg_* functions doesn't help much with optimization, but it adds complexity to the function itself. (For UDAFs in general, elements can be handled in the function's invoke logic whether they are null or non-null.) I propose removing it until a use case shows that the element's nullability is really helpful.

The related changes are #8055, which introduced nullable but put it in the wrong place, and #11093, which moved nullable from the list to the element. This PR cleans up the element's nullable 😫
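To make the list-versus-element distinction concrete, here is a minimal sketch. It uses a simplified, hypothetical `Field`/`DataType` model rather than the real arrow-rs types, and the two helper functions are illustrative names, not DataFusion APIs. It assumes that the #11093 approach lets the element field mirror the input column's nullability, and that after this PR the element is always declared nullable while the list itself stays nullable.

```rust
// Simplified, hypothetical schema model. This is NOT the arrow-rs `Field` API.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    UInt32,
    List(Box<Field>),
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
}

impl Field {
    fn new(name: &str, data_type: DataType, nullable: bool) -> Self {
        Field { name: name.to_string(), data_type, nullable }
    }
}

/// Assumed shape of the #11093 approach: the "item" field inside the
/// returned list copies the nullability of the input column.
fn array_agg_field_tracking_input(name: &str, input: &Field) -> Field {
    let item = Field::new("item", input.data_type.clone(), input.nullable);
    // The list itself is declared nullable in both variants.
    Field::new(name, DataType::List(Box::new(item)), true)
}

/// Assumed shape after this PR: the "item" field is always nullable,
/// so the function no longer needs to know the input's nullability.
fn array_agg_field_simplified(name: &str, input: &Field) -> Field {
    let item = Field::new("item", input.data_type.clone(), true);
    Field::new(name, DataType::List(Box::new(item)), true)
}
```

For a non-nullable input column, the first helper reports a non-nullable `item` and the second reports a nullable one; the outer list field is identical in both.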

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Signed-off-by: jayzhan211 <[email protected]>

alamb

This change makes sense to me, but I feel like we are undergoing some API churn. It may be good to ensure we won't have to change this again.

For example, is it the case that no aggregate (existing or potentially user-defined) would want to know whether its input is nullable? (Or is there some other way to figure that out?)

Perhaps we could update the documentation somehow to make the design choice about null inputs --> output computation clearer

@eejbyfeldt I wonder if you have a chance to review as the original author of #11093 ?

alamb · 2024-07-14T19:08:19Z

datafusion/core/tests/sql/aggregates.rs

@@ -36,7 +36,7 @@ async fn csv_query_array_agg_distinct() -> Result<()> {
        *actual[0].schema(),
        Schema::new(vec![Field::new_list(
            "ARRAY_AGG(DISTINCT aggregate_test_100.c2)",
-            Field::new("item", DataType::UInt32, false),


array_agg is NULL when there are no inputs (I believe since #11299), so this change makes sense to me.

jayzhan211 · 2024-07-15T01:24:56Z

I think in theory a UDAF could have input_nullable, but I'm not sure what the benefit would be. 🤔

Btw, the change here only affects the builtin array_agg functions, which are going to be removed soon in #11448 😎. So this change avoids possible API churn: we don't have to introduce input_nullable just to convert array_agg to a UDAF.

alamb

Makes sense to me -- thanks @jayzhan211

alamb · 2024-07-15T18:52:10Z

Let's plan to merge this tomorrow to allow anyone else who may wish to review

eejbyfeldt · 2024-07-16T04:37:21Z

I don't think I am really following the motivation here. For me, the reason to have the correct nullability of the field inside the returned list is the same as having correct nullability for any other field.

> I propose to remove it until there is any use case that shows the nullability of element is really helpful.

If this argument holds here, then why is it not also an argument for removing the concept of nullability throughout DataFusion? (Which, to me, seems very undesirable.)

jayzhan211 · 2024-07-16T09:41:07Z

@eejbyfeldt
I see. I agree the motivation for this PR is not really a good reason, but it is my tradeoff for how we can move forward with aggregate functions.

I think the tradeoff here is between having ideal nullability for possible optimizations, and having correct but simple nullability for now, with the option to bring in ideal nullability alongside an actual optimization in the future.

> For me, the reason to have the correct nullability of the field inside the returned list is the same as having correct nullability for any other field

The nullability of the field is actually what matters to me. Unlike the nullability of the function's result as a whole, it is not clear to me whether the nullability of the element is worth the additional complexity. If there is an optimization like the one you showed me previously, I would be happy to keep the nullability and find an ideal way to bring it to UDAFs.

What most aggregate functions do is just go through all the rows, so even if we know there is no null in the whole column, we still need to go through all the rows. For this reason, I don't think any optimization could benefit from the nullability of the elements in the column. 🤔
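The "go through all the rows" point can be sketched as follows. This is a hypothetical stand-in, not DataFusion's accumulator: `array_agg_collect` and its `input_nullable` parameter are illustrative names, and a nullable UInt32 column is modelled as a slice of `Option<u32>`. Under those assumptions, the flag never changes how much work the loop does.

```rust
/// Hypothetical stand-in for an array_agg-style accumulator over a UInt32
/// column. Returns the collected elements and the number of rows visited.
fn array_agg_collect(
    rows: &[Option<u32>],
    input_nullable: bool,
) -> (Vec<Option<u32>>, usize) {
    // The flag is deliberately unused: knowing that the input has no nulls
    // does not let this loop skip any rows.
    let _ = input_nullable;

    let mut collected = Vec::with_capacity(rows.len());
    let mut rows_visited = 0;
    for row in rows {
        // Nulls (None) and values (Some) are appended the same way.
        collected.push(*row);
        rows_visited += 1;
    }
    (collected, rows_visited)
}
```

Calling it on a three-row column visits three rows whether or not the column is declared nullable. The sketch only illustrates the cost argument; it says nothing about whether downstream consumers of the schema could use the element's nullability.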

alamb · 2024-07-16T20:36:18Z

I defer to @jayzhan211 on what to do with this PR. I am fine either way (keep the API in case it might be helpful in the future, or remove it and add it again if it is needed).

jayzhan211 · 2024-07-17T05:33:53Z

I would like to merge this. Since the builtin function code will eventually be removed, we can add it for UDAFs later on.

jayzhan211 · 2024-07-17T05:34:47Z

Thanks @alamb and @eejbyfeldt for your suggestions.

* rm null
  Signed-off-by: jayzhan211 <[email protected]>
* fmt
  Signed-off-by: jayzhan211 <[email protected]>
* fix test
  Signed-off-by: jayzhan211 <[email protected]>
---------
Signed-off-by: jayzhan211 <[email protected]>

rm null

35e94de

Signed-off-by: jayzhan211 <[email protected]>

github-actions bot added the physical-expr Physical Expressions label Jul 13, 2024

jayzhan211 added 2 commits July 13, 2024 14:21

fmt

105b6ce

Signed-off-by: jayzhan211 <[email protected]>

fix test

bb9b6e7

Signed-off-by: jayzhan211 <[email protected]>

github-actions bot added the core Core DataFusion crate label Jul 13, 2024

jayzhan211 marked this pull request as ready for review July 13, 2024 07:35

alamb reviewed Jul 14, 2024

alamb approved these changes Jul 15, 2024

jayzhan211 merged commit d67b0fb into apache:main Jul 17, 2024
23 checks passed

jayzhan211 deleted the rm-null-array-agg branch July 17, 2024 05:34

rluvaton mentioned this pull request Dec 12, 2024

Allow to filter null in array_agg #13742

Open
