Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove element's nullability of array_agg function #11447

Merged
merged 3 commits into from
Jul 17, 2024

Conversation

jayzhan211
Copy link
Contributor

@jayzhan211 jayzhan211 commented Jul 13, 2024

Which issue does this PR close?

Closes #.

Rationale for this change

I think the nullability of element in array_agg_* function (in general UDAF, those elements whether null or non-null could be handled in function invoke logic itself) doesn't help much about optimization, but add complexity for the function itself. I propose to remove it until there is any use case that shows the nullability of element is really helpful.

The related change is from #8055, where introduce nullable but place it in a incorrect place. Then, #11093 where change the nullable from list to element. And this PR is going to cleanup nullable of element 😫

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Signed-off-by: jayzhan211 <[email protected]>
@github-actions github-actions bot added the physical-expr Physical Expressions label Jul 13, 2024
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
@github-actions github-actions bot added the core Core DataFusion crate label Jul 13, 2024
@jayzhan211 jayzhan211 marked this pull request as ready for review July 13, 2024 07:35
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change makes sense to me, but I feel like we are undergoing some API churn. It may be good to ensure we won't have to change this again.

For example, is it the case that for all aggregates (both existing and potentially user defined ones) would not want to know if their input is nullable (or is there some other way to figure that out?)

Perhaps we could update the documentation somehow to make the design choice about null inputs --> output computation clearer

@eejbyfeldt I wonder if you have a chance to review as the original author of #11093 ?

@@ -36,7 +36,7 @@ async fn csv_query_array_agg_distinct() -> Result<()> {
*actual[0].schema(),
Schema::new(vec![Field::new_list(
"ARRAY_AGG(DISTINCT aggregate_test_100.c2)",
Field::new("item", DataType::UInt32, false),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

array agg is NULL when there are no inputs, I believe after #11299 so this change makes sense to me

@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Jul 15, 2024

I think in theory UDAF could have input_nullable, but I'm not sure about what could be the benefit from it. 🤔

btw, the change here is only for builtin array agg, and they are going to be removed soon #11448 😎 . Therefore the change is to avoid the possible API churn -- introduce input_nullable for converting array agg to UDAF

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me -- thanks @jayzhan211

@alamb
Copy link
Contributor

alamb commented Jul 15, 2024

Let's plan to merge this tomorrow to allow anyone else who may wish to review

@eejbyfeldt
Copy link
Contributor

I don't think I am really following the motivation here. For me, the reason to have the correct nullability of the field inside the returned list is the same as having correct nullability for any other field.

I propose to remove it until there is any use case that shows the nullability of element is really helpful.

If this argument hold here. Then why is it not also an argument to remove the concept of nullability throughout datafusion? (Which to me, it seems like it would be very undesired)

@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Jul 16, 2024

@eejbyfeldt
I see. I agree the motivation of this PR is not really a good reason, but it is my tradeoff about how we can move on for aggregate function.

I think the tradeoff here is having idea nullability for possible optimization and having correct but simple nullability for now and able to bring in idea nullability with actual optimization in the future.

For me, the reason to have the correct nullability of the field inside the returned list is the same as having correct nullability for any other field

The nullability of field is actually the reason that matters to me. Unlike the nullability for the whole function, the nullability of element is not clear to me whether it is worth to add the additional complexity. If there is any optimization like what you show me previously, I would be happy to keep the nullability and find an idea way to bring it to UDAF.

What most of the aggregate function are doing is just go through all the rows, so even we know there is no null in the whole column, we still need to go through all the rows. For this reason, I don't think there is any optimization could benefit on the nullability of the element in column. 🤔

@alamb
Copy link
Contributor

alamb commented Jul 16, 2024

I defer to @jayzhan211 with what to do with this PR. I am fine either way (keep the API in case it might be helpful in the future or remove it and we can add it again if it is needed)

@jayzhan211
Copy link
Contributor Author

I would like to merge this, since the builtin function code will eventually be removed, we can add it for UDAF later on

@jayzhan211 jayzhan211 merged commit d67b0fb into apache:main Jul 17, 2024
23 checks passed
@jayzhan211 jayzhan211 deleted the rm-null-array-agg branch July 17, 2024 05:34
@jayzhan211
Copy link
Contributor Author

Thanks @alamb and @eejbyfeldt for your suggestion

xinlifoobar pushed a commit to xinlifoobar/datafusion that referenced this pull request Jul 17, 2024
* rm null

Signed-off-by: jayzhan211 <[email protected]>

* fmt

Signed-off-by: jayzhan211 <[email protected]>

* fix test

Signed-off-by: jayzhan211 <[email protected]>

---------

Signed-off-by: jayzhan211 <[email protected]>
xinlifoobar pushed a commit to xinlifoobar/datafusion that referenced this pull request Jul 18, 2024
* rm null

Signed-off-by: jayzhan211 <[email protected]>

* fmt

Signed-off-by: jayzhan211 <[email protected]>

* fix test

Signed-off-by: jayzhan211 <[email protected]>

---------

Signed-off-by: jayzhan211 <[email protected]>
Lordworms pushed a commit to Lordworms/arrow-datafusion that referenced this pull request Jul 23, 2024
* rm null

Signed-off-by: jayzhan211 <[email protected]>

* fmt

Signed-off-by: jayzhan211 <[email protected]>

* fix test

Signed-off-by: jayzhan211 <[email protected]>

---------

Signed-off-by: jayzhan211 <[email protected]>
wiedld pushed a commit to influxdata/arrow-datafusion that referenced this pull request Jul 31, 2024
* rm null

Signed-off-by: jayzhan211 <[email protected]>

* fmt

Signed-off-by: jayzhan211 <[email protected]>

* fix test

Signed-off-by: jayzhan211 <[email protected]>

---------

Signed-off-by: jayzhan211 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate physical-expr Physical Expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants