Support DictionaryString for Regex matching operators #12768
Thank you for this contribution @blaginin -- I agree this PR seems to have fixed the issue, but I am a little worried about it.
FYI @ozankabak I am a bit worried that the changes to the propagation logic did not cause any test failures 🤔 Maybe we need to increase test coverage in that area.
What would you think about explicitly implementing physical operator support for DictionaryArrays? That would mean basically applying the operation on the That would look something like
let input_array: DictionaryArray<Int32, String> = ...;
let values = input_array.values();
// apply regexp match once on the (distinct) values of the array
let regexp_match_result = regexp_match(values, ..);
// form the output boolean array by looking up the result for each key
let mut bool_builder = BooleanBuilder::new();
for key in input_array.keys() {
    bool_builder.append_value(regexp_match_result.value(key))
}
bool_builder.finish(); // create output boolean array
You would have to find the relevant place to plumb this in to binary.rs too I think
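To make the suggestion above concrete without depending on the arrow crate, here is a minimal standard-library sketch of the same idea: evaluate the predicate once per distinct dictionary value, then build the per-row boolean output by looking up each key. `DictColumn` and `match_dictionary` are hypothetical names used only for illustration, and the predicate is a plain closure standing in for a real regex match; none of this is the actual DataFusion or arrow API.

```rust
/// Hypothetical stand-in for a dictionary-encoded string column:
/// `keys[i]` indexes into `values`, and `None` marks a null row.
pub struct DictColumn {
    pub keys: Vec<Option<usize>>,
    pub values: Vec<String>,
}

/// Evaluate `predicate` once per distinct dictionary value, then build the
/// per-row result by looking up each key. Null keys stay null in the output.
pub fn match_dictionary<F>(column: &DictColumn, predicate: F) -> Vec<Option<bool>>
where
    F: Fn(&str) -> bool,
{
    // One predicate call per dictionary value, not per row.
    let value_matches: Vec<bool> = column
        .values
        .iter()
        .map(|value| predicate(value.as_str()))
        .collect();

    // Expand to row level by indexing the precomputed results with each key.
    column
        .keys
        .iter()
        .map(|&key| key.map(|k| value_matches[k]))
        .collect()
}
```

The point of the design is that the cost of the match scales with the number of distinct values rather than the number of rows, which is where the performance benefit discussed in this thread would come from.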
# Conflicts: # datafusion/sqllogictest/test_files/string/large_string.slt # datafusion/sqllogictest/test_files/string/string_view.slt
Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look
…y_flag_op_scalar`
Thanks for the review and detailed feedback, @alamb! 😍 To give a bit of context, when submitting the PR, I was choosing between removing Regex operators from However, your performance point is really valid!! Unwrapping the dict with bools directly can be much more efficient than unwrapping strings and then constructing booleans... I think I've made a change; could you take a look? A few things to consider on top of this PR:
Thank you @blaginin -- this looks great.
The only thing I think we should do is to deprecate is_comparison_op to ease people's upgrades
/// For example, 'Binary(a, >, b)' would be a comparison expression.
pub fn is_comparison_operator(&self) -> bool {
/// For example, 'Binary(a, >, b)' expression supports propagation.
pub fn supports_propagation(&self) -> bool {
This is technically an API change -- can you please avoid the change and help people upgrade by adding back is_comparison_operator that calls supports_propagation and mark it deprecated?
datafusion/datafusion/core/src/datasource/file_format/parquet.rs
Lines 610 to 619 in 3353c06
#[deprecated(
    since = "40.0.0",
    note = "please use `statistics_from_parquet_meta_calc` instead"
)]
pub async fn statistics_from_parquet_meta(
    metadata: &ParquetMetaData,
    table_schema: SchemaRef,
) -> Result<Statistics> {
    statistics_from_parquet_meta_calc(metadata, table_schema)
}
(We could add the deprecation as a follow on)
/// propagation
///
/// For example, 'Binary(a, >, b)' expression supports propagation.
#[deprecated(since = "43.0.0", note = "please use `supports_propagation` instead")]
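For readers unfamiliar with the pattern requested in this review, a minimal self-contained sketch of a rename with a deprecated forwarding shim might look like the following. The `Operator` enum here is a toy stand-in with only a few variants, and the set of variants treated as supporting propagation is an assumption made for illustration; it is not DataFusion's actual list.

```rust
/// Toy operator enum (hypothetical subset, for illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Operator {
    Gt,
    Lt,
    Eq,
    LikeMatch,
    RegexIMatch,
}

impl Operator {
    /// New name: returns true if the operator supports propagation.
    /// Which variants qualify is an assumption in this sketch.
    pub fn supports_propagation(&self) -> bool {
        matches!(self, Operator::Gt | Operator::Lt | Operator::Eq)
    }

    /// Old name, kept so existing callers still compile. It forwards to the
    /// new method and produces a deprecation warning at each call site.
    #[deprecated(since = "43.0.0", note = "please use `supports_propagation` instead")]
    pub fn is_comparison_operator(&self) -> bool {
        self.supports_propagation()
    }
}
```

Callers of the old name keep compiling and get a compiler warning that points them at the replacement, so the shim can be removed in a later release.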
👍
Thanks @blaginin for working on this. It looks good to me 👍
Thanks @blaginin and @goldmedal for the review |
I just noticed the change at @blaginin do you have a preferred new name for |
hey @berkaysynnada, I found the name is_comparison_operator a bit confusing because it’s not consistent with the documentation. For example:
Since it’s only for |
I won't revert the name change to |
makes a lot of sense!! 👍 |
Which issue does this PR close?
Closes #12618
Rationale for this change
As explained in the original PR, regex comparison operators such as ~* don't support dictionaries.
What changes are included in this PR?
When building a logical query, the query is initially correctly coerced to Utf8 (which is supported in BinaryExpr), but then unwrap_cast_in_comparison removes the column cast.
Interestingly, for the Operator::LikeMatch operator, which is similar to the regex operators, unwrap_cast_in_comparison doesn't rewrite the query. This happens because of the function currently named is_comparison_operator.
According to its usages, it is only used in interval propagation, and the number of operators actually supported there is smaller than the number listed in the method.
Moreover, the name is_comparison_operator seems a bit confusing, as it doesn't match the list of comparison operators in the docs. Therefore, I have renamed the method.
Are these changes tested?
Yes, I uncommented tests in datafusion/sqllogictest/test_files/string.
Are there any user-facing changes?
No