fix: Reduce cast.rs and utils.rs logic from parquet_support.rs for experimental native scans #1387
base: main
Conversation
@@ -596,7 +595,10 @@ fn cast_array(
    parquet_options: &SparkParquetOptions,
) -> DataFusionResult<ArrayRef> {
    use DataType::*;
    let array = array_with_timezone(array, parquet_options.timezone.clone(), Some(to_type))?;
This was causing one of the duplicate time zone changes.
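As an illustrative aside (not code from this PR), here is a minimal chrono-based sketch of why a duplicated time zone conversion corrupts values: applying the session-offset adjustment twice leaves the timestamp off by one UTC offset.

```rust
// Hypothetical, self-contained illustration of a "duplicate time zone change";
// this is not the project's code, just a sketch of the failure mode.
use chrono::{FixedOffset, TimeZone, Utc};

fn main() {
    let offset = FixedOffset::west_opt(8 * 3600).unwrap(); // e.g. a UTC-8 session timezone
    let instant = Utc.with_ymd_and_hms(2024, 1, 1, 12, 0, 0).unwrap();

    // Correct: convert from UTC into the session offset exactly once.
    let once = instant.with_timezone(&offset);

    // Buggy: treat the already-converted wall clock as UTC and convert again.
    let twice = Utc
        .from_local_datetime(&once.naive_local())
        .unwrap()
        .with_timezone(&offset);

    // The double conversion is off by the UTC offset (8 hours here).
    assert_ne!(once.naive_local(), twice.naive_local());
}
```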
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #1387       +/- ##
=============================================
- Coverage    56.12%   39.06%   -17.06%
- Complexity     976     2071     +1095
=============================================
  Files          119      263      +144
  Lines        11743    60746    +49003
  Branches      2251    12909    +10658
=============================================
+ Hits          6591    23733    +17142
- Misses        4012    32530    +28518
- Partials      1140     4483     +3343

☔ View full report in Codecov by Sentry.
@@ -596,7 +595,10 @@ fn cast_array(
    parquet_options: &SparkParquetOptions,
) -> DataFusionResult<ArrayRef> {
    use DataType::*;
    let array = array_with_timezone(array, parquet_options.timezone.clone(), Some(to_type))?;
    let array = match to_type {
        Timestamp(_, None) => array, // array_with_timezone does not support to_type of NTZ.
This will skip all the matches inside array_with_timezone:
match array.data_type() {
    DataType::Timestamp(_, None) => {
        ...
        match to_type {
            ...
            Some(DataType::Timestamp(_, None)) => {
                timestamp_ntz_to_timestamp(array, timezone.as_str(), None)
            }
            ...
Is it intentional?
Yeah, it shouldn't be called with that type:
// | Any other type | Not Supported |
The condition that matches inside array_with_timezone was added by #1348...
timestamp_ntz_to_timestamp(array, timezone.as_str(), None)
...and we likely didn't understand the blast radius at the time.
In fact maybe I'll remove that in this PR if nothing breaks.
The utils.rs timezone functions are intended for use by cast and specifically convert between the Spark representation and the Arrow representation. Changing them is likely to break some timezone-related unit tests.
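For readers following along, here is a minimal sketch (illustrative only, not the utils.rs code) of the two Arrow timestamp representations being discussed. In both cases the physical values are microseconds since the Unix epoch; the optional timezone is type-level metadata, which is what helpers like array_with_timezone and timestamp_ntz_to_timestamp reconcile with Spark's session-timezone semantics.

```rust
// Illustrative only: Arrow's timestamp-with-timezone vs. timestamp-NTZ types.
use arrow::array::{Array, TimestampMicrosecondArray};
use arrow::datatypes::{DataType, TimeUnit};

fn main() {
    // "Wall clock" timestamp with no timezone (analogous to Spark's TimestampNTZ).
    let ntz = TimestampMicrosecondArray::from(vec![1_700_000_000_000_000_i64]);

    // Same physical values, tagged with a timezone (analogous to Spark's
    // session-timezone TimestampType after normalization).
    let with_tz = TimestampMicrosecondArray::from(vec![1_700_000_000_000_000_i64])
        .with_timezone("America/Los_Angeles");

    assert_eq!(ntz.data_type(), &DataType::Timestamp(TimeUnit::Microsecond, None));
    println!("{:?} vs {:?}", ntz.data_type(), with_tz.data_type());
}
```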
Or what about updating the case match in array_with_timezone rather than fixing it at this location?
Yep, I updated array_with_timezone.
    let array = array_with_timezone(array, parquet_options.timezone.clone(), Some(to_type))?;
    let array = match to_type {
        Timestamp(_, None) => array, // array_with_timezone does not support to_type of NTZ.
        _ => array_with_timezone(array, parquet_options.timezone.clone(), Some(to_type))?,
Can you log an issue to review the use of array_with_timezone? The method was written to cast timestamps provided by Spark, while here we are casting timestamps provided by Parquet, and there might be subtle cases where the behavior is different. As long as the unit tests pass, I believe we are safe for all use cases, but it would be a good idea to review.
I'm slowly going through all of the behavior that we hoisted from cast.rs to parquet_support.rs, but yep I can file an issue.
For context: when we were working in the experimental branch and added the […]

This change does not introduce any new failures for the experimental native scans, but that doesn't mean this approach is bug-free. @andygrove is working on some fuzz testing that will help us have more confidence in the experimental native scans in the future. I propose we move forward with this design, and bring over (or call into) cast.rs logic as we discover issues with this design.
Updated the test failures in the original description.
@@ -72,9 +72,6 @@ pub fn array_with_timezone(
    Some(DataType::Timestamp(_, Some(_))) => {
        timestamp_ntz_to_timestamp(array, timezone.as_str(), Some(timezone.as_str()))
    }
    Some(DataType::Timestamp(_, None)) => {
IIRC this was introduced to fix cast issues caught by the fuzz tests, which are not part of the CI. Can we run the fuzz tests again to verify that there are no regressions?
Ideally, parquet_support should not depend on any functions in utils.rs.
I would be surprised if it came in for fuzz tests since the blame has the line coming in with the experimental native scans PR. AFAIK, parquet_support no longer depends on utils.rs after this PR. I was also just removing this errant line.
            matches!(to_type, DataType::Utf8)
        }
        _ => false,
        _ => Ok(cast_with_options(&array, to_type, &PARQUET_OPTIONS)?),
+1
I had the impression that this PR had originally reduced the number of failures in native_iceberg_compat as well, but that is no longer true after the cleanup. Is that correct?
I applied this PR to my working branch: https://github.com/parthchandra/datafusion-comet/commits/complex_test_failures_with_PR1387/

The differences are due to 5 tests related to […]

So it looks like this PR does not have a negative impact on […]
Initially that was the case, then I got more aggressive with pruning parquet_support.
Well that's good news!
More info: with […]

So the difference in behavior is because of a different timezone being applied. Both are using the session timezone, but one of them is wrong. Either way, we cannot pin the failures on this PR.
Which issue does this PR close?
Closes #.
Rationale for this change
Reduces the number of test failures in native_datafusion and native_iceberg_compat modes.
What changes are included in this PR?
After staring at conversions and hoisting code from arrow-rs to debug, it turns out we can just use arrow's cast in this case.
How are these changes tested?
Ran make test-jvm for native_datafusion and native_iceberg_compat modes.