-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: use total ordering in the min & max accumulator for floats #10627
fix: use total ordering in the min & max accumulator for floats #10627
Conversation
I had to change the I wonder if describe should filter out nan values when calculating min/max? |
FWI I checed what postgres does: postgres=# create table foo(x float);
CREATE TABLE
postgres=# insert into foo values (1), (2), ('NaN');
INSERT 0 3
postgres=# select * from foo;
x
-----
1
2
NaN
(3 rows)
postgres=# select min(x) from foo;
min
-----
1
(1 row)
postgres=# select max(x) from foo;
max
-----
NaN
(1 row) So that suggests to me it treats |
BTW here is a test that shows inconsistent Nan handling for f32 and f64 (just to help make the current behavior clearer): #10634 |
If this is the case then there is divergence between postgres and arrow-rs. Which takes priority? If you want to follow the postgres behavior then the change will be more complex. You will need a custom |
This is confirmed in the latest version of the postgres docs:
|
I would personally suggest we do whatever is consistent with arrow (and easier) unless there is a compelling reason to do something different |
@westonpace what is the status / plan with this PR? It has failing CI tests but is not marked as a draft. Are you still planning on working with it? Do you need help to push it along? |
…tch the ordering used by arrow kernels
85767a3
to
b918672
Compare
…ite discrepency but from a null/defined discrepency and we don't want that behavior to change
Thanks for the push. I have now discovered sqllogictests :) Turns out my approach didn't work because partial_cmp propagates nulls (and the previous impl ignored nulls). I think I've got it fixed now. The earlier describe test failure was also for this same reason (it was propagating a null, not a nan) and so I was able to revert that change. |
I suspect this means that the min/max function for intervals is also incorrectly propagating nulls but I tried to make a unit test for intervals and get an error that there is no accumulator for intervals and so I guess we can cross that bridge when we come to it. |
@alamb (See above reply, forgot to ping-reply you) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @westonpace -- this code looks great to me
I think we should add just a few more tests and this will be good to merge 🙏
accumulator.update_batch(&[vals_b]).unwrap(); | ||
let split_batch_result = &accumulator.evaluate().unwrap(); | ||
|
||
assert_eq!(single_batch_result, split_batch_result); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be valuable to also test:
min
- the actul value of
single_batch_result
(to show if it is expected to produce nan or 0)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestions, I have updated the test to check both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great to me -- thank you @westonpace
let zero = 0_f32; | ||
let neg_inf = f32::NEG_INFINITY; | ||
|
||
let check = |acc: &mut dyn Accumulator, values: &[&[f32]], expected: f32| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😍
…he#10627) * fix: use total ordering in the min & max accumulator for floats to match the ordering used by arrow kernels * change unit test to expect min to be nan * changed behavior again since the partial_cmp approach doesn't handle nulls correctly * Revert change to describe test. It was not originating from a nan/finite discrepency but from a null/defined discrepency and we don't want that behavior to change * Update the test to check the min function and also verify the result
Which issue does this PR close?
Closes #8031
Rationale for this change
See the motivating issue.
What changes are included in this PR?
The
MinAccumulator
andMaxAccumlator
now useScalarValue::partial_cmp
to compare two float scalars instead off32::min
andf32::max
(orf64
). This is an approach that was already being taken for intervals.Are these changes tested?
Yes, I added a unit test. I did not see any existing benchmarks for the accumulators. I'm not entirely certain if this approach is slower or not. However, I think a performance regression is unlikely since this only changes how we compare the intermediate results. I.e. given two arrays we will use the arrow kernels to find the min/max of the two arrays. Then we only use this path to compare the result of those two calculations. (e.g. we are in per-batch space and not per-row space)
Are there any user-facing changes?
No.