Fix Type Coercion for UDF Arguments #14268
base: branch-45
signature: Signature::one_of(
    vec![
        TypeSignature::String(1),
        TypeSignature::Coercible(vec![TypeSignatureClass::Native(
If we use coercible(string), we don't need string since it is a more strict rule.
That's what I initially did, but after testing on Sail, I discovered new test failures related to coercing input that's all String (e.g. func(Utf8, Utf8View)).
The plan is to port all the relevant tests from Sail into this PR!
@jayzhan211 You can find the test failures here if interested!
lakehq/sail@372e13b#diff-bb36d996163d98235e107f8203c9c24be34ef71c84f8afa4a420a8e483102e3e
That said, it was not my intention to apply this pattern to single-argument functions. I'll get that fixed!
The current design for coercion may still have room for improvement. It would be beneficial to represent the function signature in a simpler and more concise manner, rather than relying on complex combinations of multiple, similar signatures.
Agreed! I'll add that in.
}

#[test]
fn test_ascii_expr() -> Result<()> {
I would have preferred to place the various UDF tests within their respective files, but I couldn't due to circular dependencies.
I ended up putting the tests in the .slt file, but figured we can still leave this test here.
We can remove this test if the purpose of the test is covered already in slt
@shehabgamin In #14440 I came up with a more flexible version, Signature::CoercibleV2 (temporary name). It can replace [...] The main difference is that the [...] I probably won't have time to push it forward in the coming days, so if you are interested in it you can work on it. We can implement a first version for the functions you mentioned; I believe that change makes more sense.
FWIW I think @Omega359 hit the same "int no longer automatically coerces to String" in his application too: #14230 (comment)
I'll work on this tonight @alamb @jayzhan211
I think it makes sense to work on this for DataFusion 46!
@jayzhan211 If I am understanding you correctly, [...] Done! I added [...]
I don't quite understand why we are adding more [...] As we have discussed, we should avoid using the old Coercible signature and also the TypeSignatureClass that is used in Coercible
@jayzhan211 What did I miss here?
As we have discussed, we should avoid using the old Coercible signature and also the TypeSignatureClass that is used in Coercible, because any change might impact downstream projects, although if we add new [...]
I think we can work directly on the CoercibleV2 I mentioned for these functions.
@jayzhan211 I thought you said it was okay to add a new signature if it helps downstream projects. See here: [...] I reverted [...]
IMO [...] If I am understanding you correctly, you are okay with adding a new signature but not with applying it to the UDFs in this PR? Should I apply user-defined coercion as you were mentioning earlier?
This is the point: I don't quite understand why adding a signature for [...] Even if there is such a case that we really need [...]
I think this is the solution to the issue we have: we need a coercible-like signature, but not fixed logic exposed to the user, since that makes any change to it a breaking change, while [...] @shehabgamin First of all, what are the issues we are solving? Are these functions in DataFusion?
Are there other issues? Why [...]
Btw, if your proposed signature isn't used by any functions in DataFusion, you should use [...]
@jayzhan211 The regression is that all these functions before DataFusion 43 would coerce: [...]
But now they no longer do. I am trying to find some middle ground here. I am happy to implement any signature for the functions in this PR, but currently, [...]
I'm not sure that [...]
The point is that [...]
Making [...]
I guess this is the real issue: how do we find a solution that makes those UDFs easier to handle? I don't have the best solution in mind right now, but to list the possible solutions: [...]
@shehabgamin Do you think we can write some utility functions so we can make transferring to [...]
@jayzhan211 To keep it simple I'll just remove AnyNative and use coerce_types so we don't block this PR any longer. We can have a larger discussion and align on goals afterwards!
Done, this should be good to merge now.
Function coercions are not enough for building a tailored system. Relational operators also do coercions (the set operators: union, intersect, except). In any case, for certain system designs (those that take on the responsibility of implementing their particular SQL dialect behavior before handing control over to DF core), it's desirable to opt out of any coercion logic altogether. @shehabgamin @linhr Given the Sail design, you might be interested in #12723.
let arg_type = &arg_types[0];
let current_native_type: NativeType = arg_type.into();
let target_native_type = NativeType::String;
if current_native_type.is_integer()
Should we support integers? It is not consistent with Postgres/DuckDB.
Up to you and @alamb
Let's remove it
@jayzhan211 For sure, let's make sure @alamb is okay with this too before I go ahead and make the change.
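For readers following this thread, the two positions (coerce integers to a string, or reject them as Postgres/DuckDB do) can be sketched in plain Rust. This is only an illustration under stated assumptions: `DataType` is a simplified stand-in rather than Arrow's type, this `coerce_types` is a free function and not the DataFusion trait method, and the `allow_integers` flag is a hypothetical parameter added to show both behaviors side by side.

```rust
/// Simplified stand-in for Arrow's `DataType` (assumption: only a few
/// variants are modeled).
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Int64,
    Float64,
    Utf8,
    LargeUtf8,
    Utf8View,
}

impl DataType {
    fn is_integer(&self) -> bool {
        matches!(self, DataType::Int32 | DataType::Int64)
    }

    fn is_string(&self) -> bool {
        matches!(
            self,
            DataType::Utf8 | DataType::LargeUtf8 | DataType::Utf8View
        )
    }
}

/// Hypothetical coercion for a single-argument string function.
/// `allow_integers` is not part of any real API; it only toggles the
/// behavior debated in the review thread.
fn coerce_types(
    arg_types: &[DataType],
    allow_integers: bool,
) -> Result<Vec<DataType>, String> {
    if arg_types.len() != 1 {
        return Err(format!("expected 1 argument, got {}", arg_types.len()));
    }
    let arg_type = &arg_types[0];
    if arg_type.is_string() {
        // Any string type is accepted unchanged.
        Ok(vec![arg_type.clone()])
    } else if allow_integers && arg_type.is_integer() {
        // Integers are cast to a string only when explicitly allowed.
        Ok(vec![DataType::Utf8])
    } else {
        Err(format!("cannot coerce {:?} to a string type", arg_type))
    }
}
```

With `allow_integers` set to `false`, an `Int64` argument is rejected, which corresponds to the "Let's remove it" option; with `true`, it is cast to `Utf8`.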
let arg_type = &arg_types[0];
let current_native_type: NativeType = arg_type.into();
let target_native_type = NativeType::String;
if current_native_type.is_integer()
same with this
// Numeric
// Integer
Numeric(LogicalTypeRef),
Integer(LogicalTypeRef),
@shehabgamin I found that we might not need LogicalTypeRef.
This is designed to accept all the integer types, so if the given type is an integer, we keep it as it is.
If we want a specific integer type, then we should use Native instead. Does this make sense to you?
Numeric is the same.
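The suggestion above can be sketched in plain Rust. This is an illustrative model, not the DataFusion implementation: `NativeType`, `TypeSignatureClass`, and `coerce` here are simplified, hypothetical stand-ins, and the integer-to-integer cast rule used for `Native` is an assumption made for the example.

```rust
/// Simplified stand-in for DataFusion's `NativeType`.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum NativeType {
    Int32,
    Int64,
    Float64,
    String,
}

impl NativeType {
    fn is_integer(&self) -> bool {
        matches!(self, NativeType::Int32 | NativeType::Int64)
    }

    fn is_numeric(&self) -> bool {
        self.is_integer() || *self == NativeType::Float64
    }
}

/// Hypothetical signature classes: `Integer` and `Numeric` carry no
/// payload, while `Native` names one specific target type.
#[allow(dead_code)]
enum TypeSignatureClass {
    Native(NativeType),
    Integer,
    Numeric,
}

/// Returns the type the argument should have after coercion.
fn coerce(
    class: &TypeSignatureClass,
    current: &NativeType,
) -> Result<NativeType, String> {
    match class {
        // A specific target: the argument is cast to exactly that type
        // (assumed rule: only identical types or integer-to-integer).
        TypeSignatureClass::Native(target) => {
            if current == target || (current.is_integer() && target.is_integer()) {
                Ok(target.clone())
            } else {
                Err(format!("cannot coerce {:?} to {:?}", current, target))
            }
        }
        // Any integer is accepted and kept as it is.
        TypeSignatureClass::Integer if current.is_integer() => Ok(current.clone()),
        // Any numeric type is accepted and kept as it is.
        TypeSignatureClass::Numeric if current.is_numeric() => Ok(current.clone()),
        _ => Err(format!("{:?} does not match the signature class", current)),
    }
}
```

In this model an `Int32` argument stays `Int32` under `Integer`, but becomes `Int64` under `Native(NativeType::Int64)`, which is the distinction the comment draws.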
}
TypeSignatureClass::Integer(native_type) => {
    let target_type = native_type.native();
Suggested change:
- let target_type = native_type.native();
+ Ok(current_type.to_owned())
    return target_type.default_cast_for(current_type);
}
TypeSignatureClass::Numeric(native_type) => {
    let target_type = native_type.native();
Suggested change:
- let target_type = native_type.native();
+ Ok(current_type.to_owned())
- TypeSignatureClass::Native(l) => get_data_types(l.native()),
+ TypeSignatureClass::Native(l)
+ | TypeSignatureClass::Numeric(l)
+ | TypeSignatureClass::Integer(l) => get_data_types(l.native()),
get_data_types is used only in get_possible_types
Which issue does this PR close?
Closes #14230
Rationale for this change
A bug was introduced in DataFusion v43.0.0 that affects type coercion for UDF arguments. Sail's tests uncovered several of these regressions, which required explicit casting in multiple areas as a workaround during the upgrade to DataFusion 43.0.0.
The regressions identified by Sail's tests include the following functions:
ascii
bit_length
contains
ends_with
starts_with
octet_length
Upon digging into the code, I discovered the following:
[...] Signature::coercible.
[...] Signature::coercible was incomplete. Coercion would only happen if logical_type == target_type, logical_type == NativeType::Null, or target_type.is_integer() && logical_type.is_integer().
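The three conditions above can be sketched in plain Rust to show why an integer argument was no longer accepted for a string parameter. This is a minimal illustration, not DataFusion's actual code: the `NativeType` enum is a simplified stand-in and `old_coercible_accepts` is a hypothetical helper name.

```rust
/// Simplified stand-in for DataFusion's `NativeType` (assumption: only a
/// few variants are modeled here).
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum NativeType {
    Null,
    Int32,
    Int64,
    Float64,
    String,
}

impl NativeType {
    fn is_integer(&self) -> bool {
        matches!(self, NativeType::Int32 | NativeType::Int64)
    }
}

/// Hypothetical helper mirroring the three conditions listed in the
/// description: equal types, a Null input, or integer-to-integer.
fn old_coercible_accepts(logical_type: &NativeType, target_type: &NativeType) -> bool {
    logical_type == target_type
        || *logical_type == NativeType::Null
        || (target_type.is_integer() && logical_type.is_integer())
}
```

Under this rule `old_coercible_accepts(&NativeType::Int64, &NativeType::String)` is false, which is consistent with the "int no longer automatically coerces to String" reports for functions such as ascii.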
What changes are included in this PR?
Signature::user_defined
Are these changes tested?
Yes.
Are there any user-facing changes?