-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove element's nullability of array_agg function #11447
Conversation
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change makes sense to me, but I feel like we are undergoing some API churn. It may be good to ensure we won't have to change this again.
For example, is it the case that for all aggregates (both existing and potentially user defined ones) would not want to know if their input is nullable (or is there some other way to figure that out?)
Perhaps we could update the documentation somehow to make the design choice about null inputs --> output computation clearer
@eejbyfeldt I wonder if you have a chance to review as the original author of #11093 ?
@@ -36,7 +36,7 @@ async fn csv_query_array_agg_distinct() -> Result<()> { | |||
*actual[0].schema(), | |||
Schema::new(vec![Field::new_list( | |||
"ARRAY_AGG(DISTINCT aggregate_test_100.c2)", | |||
Field::new("item", DataType::UInt32, false), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
array agg is NULL when there are no inputs, I believe after #11299 so this change makes sense to me
I think in theory UDAF could have btw, the change here is only for builtin array agg, and they are going to be removed soon #11448 😎 . Therefore the change is to avoid the possible API churn -- introduce |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me -- thanks @jayzhan211
Let's plan to merge this tomorrow to allow anyone else who may wish to review |
I don't think I am really following the motivation here. For me, the reason to have the correct nullability of the field inside the returned list is the same as having correct nullability for any other field.
If this argument hold here. Then why is it not also an argument to remove the concept of nullability throughout datafusion? (Which to me, it seems like it would be very undesired) |
@eejbyfeldt I think the tradeoff here is having idea nullability for possible optimization and having correct but simple nullability for now and able to bring in idea nullability with actual optimization in the future.
The nullability of field is actually the reason that matters to me. Unlike the nullability for the whole function, the nullability of element is not clear to me whether it is worth to add the additional complexity. If there is any optimization like what you show me previously, I would be happy to keep the nullability and find an idea way to bring it to UDAF. What most of the aggregate function are doing is just go through all the rows, so even we know there is no null in the whole column, we still need to go through all the rows. For this reason, I don't think there is any optimization could benefit on the nullability of the element in column. 🤔 |
I defer to @jayzhan211 with what to do with this PR. I am fine either way (keep the API in case it might be helpful in the future or remove it and we can add it again if it is needed) |
I would like to merge this, since the builtin function code will eventually be removed, we can add it for UDAF later on |
Thanks @alamb and @eejbyfeldt for your suggestion |
* rm null Signed-off-by: jayzhan211 <[email protected]> * fmt Signed-off-by: jayzhan211 <[email protected]> * fix test Signed-off-by: jayzhan211 <[email protected]> --------- Signed-off-by: jayzhan211 <[email protected]>
* rm null Signed-off-by: jayzhan211 <[email protected]> * fmt Signed-off-by: jayzhan211 <[email protected]> * fix test Signed-off-by: jayzhan211 <[email protected]> --------- Signed-off-by: jayzhan211 <[email protected]>
* rm null Signed-off-by: jayzhan211 <[email protected]> * fmt Signed-off-by: jayzhan211 <[email protected]> * fix test Signed-off-by: jayzhan211 <[email protected]> --------- Signed-off-by: jayzhan211 <[email protected]>
* rm null Signed-off-by: jayzhan211 <[email protected]> * fmt Signed-off-by: jayzhan211 <[email protected]> * fix test Signed-off-by: jayzhan211 <[email protected]> --------- Signed-off-by: jayzhan211 <[email protected]>
Which issue does this PR close?
Closes #.
Rationale for this change
I think the nullability of element in array_agg_* function (in general UDAF, those elements whether null or non-null could be handled in function invoke logic itself) doesn't help much about optimization, but add complexity for the function itself. I propose to remove it until there is any use case that shows the nullability of element is really helpful.
The related change is from #8055, where introduce
nullable
but place it in a incorrect place. Then, #11093 where change the nullable from list to element. And this PR is going to cleanup nullable of element 😫What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?