
fix: use total ordering in the min & max accumulator for floats #10627

Merged

Conversation

westonpace
Member

Which issue does this PR close?

Closes #8031

Rationale for this change

See the motivating issue.

What changes are included in this PR?

The MinAccumulator and MaxAccumulator now use ScalarValue::partial_cmp to compare two float scalars instead of f32::min and f32::max (or their f64 equivalents). This approach was already being taken for intervals.

Are these changes tested?

Yes, I added a unit test. I did not see any existing benchmarks for the accumulators, so I'm not entirely certain whether this approach is slower. However, I think a performance regression is unlikely since this only changes how we compare the intermediate results. That is, given two arrays, we still use the arrow kernels to find the min/max of each array; this path is only used to compare the results of those two calculations (i.e. we are in per-batch space, not per-row space).
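For context, a quick sketch of why the comparison function matters for NaN (this only demonstrates the primitive f32 operations, not DataFusion's ScalarValue itself): Rust's f32::min/f32::max return the non-NaN operand when one side is NaN, while a total ordering ranks (positive) NaN above every other value, and plain partial_cmp cannot order NaN at all.

```rust
use std::cmp::Ordering;

fn main() {
    let nan = f32::NAN;

    // f32::max / f32::min silently drop a NaN operand
    assert_eq!(1.0f32.max(nan), 1.0);
    assert_eq!(nan.min(1.0), 1.0);

    // IEEE 754 totalOrder (f32::total_cmp) ranks positive NaN above +inf
    assert_eq!(nan.total_cmp(&f32::INFINITY), Ordering::Greater);
    assert_eq!(nan.total_cmp(&1.0), Ordering::Greater);

    // PartialOrd on a bare f32 cannot order NaN at all
    assert_eq!(nan.partial_cmp(&1.0), None);
}
```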

Are there any user-facing changes?

No.

@github-actions github-actions bot added physical-expr Physical Expressions core Core DataFusion crate labels May 22, 2024
@westonpace
Member Author

I had to change the describe test because the float / double columns have nans (or maybe -inf?). This means that the min, which used to be a value, is now nan (which doesn't seem to display).

I wonder if describe should filter out nan values when calculating min/max?

@alamb
Contributor

alamb commented May 23, 2024

I had to change the describe test because the float / double columns have nans (or maybe -inf?). This means that the min, which used to be a value, is now nan (which doesn't seem to display).

I wonder if describe should filter out nan values when calculating min/max?

FWIW, I checked what postgres does:

postgres=# create table foo(x float);
CREATE TABLE

postgres=# insert into foo values (1), (2), ('NaN');
INSERT 0 3
postgres=# select * from foo;
  x
-----
   1
   2
 NaN
(3 rows)

postgres=# select min(x) from foo;
 min
-----
   1
(1 row)

postgres=# select max(x) from foo;
 max
-----
 NaN
(1 row)

So that suggests to me it treats NaN as the largest floating point value

@alamb
Contributor

alamb commented May 23, 2024

BTW here is a test that shows inconsistent Nan handling for f32 and f64 (just to help make the current behavior clearer): #10634

@westonpace
Member Author

westonpace commented May 24, 2024

So that suggests to me it treats NaN as the largest floating point value

If this is the case then there is divergence between postgres and arrow-rs. Which takes priority?

If you want to follow the postgres behavior then the change will be more complex. You will need a custom max function for arrays (instead of using arrow_arith::aggregate::max) in addition to changes in the accumulator.
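To illustrate what such a custom array-level function might look like, here is a hypothetical sketch (the name pg_style_max and the implementation are mine, not part of this PR or of arrow_arith): a Postgres-style max that treats NaN as greater than every non-NaN value, so the result is NaN whenever any NaN is present.

```rust
// Hypothetical sketch: a Postgres-style max over a float slice,
// treating NaN as greater than every non-NaN value.
fn pg_style_max(values: &[f32]) -> Option<f32> {
    values.iter().copied().reduce(|a, b| {
        // If either operand is NaN, NaN wins; otherwise take the larger value.
        if a.is_nan() || b.is_nan() {
            f32::NAN
        } else {
            a.max(b)
        }
    })
}

fn main() {
    assert_eq!(pg_style_max(&[1.0, 2.0]), Some(2.0));
    assert!(pg_style_max(&[1.0, f32::NAN, 2.0]).unwrap().is_nan());
    assert_eq!(pg_style_max(&[]), None);
}
```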

@westonpace
Member Author

So that suggests to me it treats NaN as the largest floating point value

This is confirmed in the latest version of the postgres docs:

IEEE 754 specifies that NaN should not compare equal to any other floating-point value (including NaN). In order to allow floating-point values to be sorted and used in tree-based indexes, PostgreSQL treats NaN values as equal, and greater than all non-NaN values.
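For positive NaN, Rust's f32::total_cmp (IEEE 754 totalOrder) happens to sort the same way the PostgreSQL docs describe. A small illustration (note the caveat that totalOrder places a negative NaN below -inf, which diverges from the PostgreSQL rule):

```rust
fn main() {
    // Sorting with total_cmp places (positive) NaN after every non-NaN
    // value, matching the PostgreSQL behavior quoted above.
    let mut v = vec![2.0f32, f32::NAN, 1.0, f32::INFINITY];
    v.sort_by(|a, b| a.total_cmp(b));
    assert_eq!(v[0], 1.0);
    assert_eq!(v[1], 2.0);
    assert_eq!(v[2], f32::INFINITY);
    assert!(v[3].is_nan());
}
```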

@alamb
Contributor

alamb commented May 25, 2024

If this is the case then there is divergence between postgres and arrow-rs. Which takes priority?

I would personally suggest we do whatever is consistent with arrow (and easier) unless there is a compelling reason to do something different

@alamb
Contributor

alamb commented Jun 3, 2024

@westonpace what is the status / plan with this PR? It has failing CI tests but is not marked as a draft. Are you still planning on working with it? Do you need help to push it along?

@westonpace westonpace force-pushed the fix/use-total-ordering-in-min-max-accumulator branch from 85767a3 to b918672 Compare June 5, 2024 18:29
…ite discrepancy but from a null/defined discrepancy and we don't want that behavior to change
@github-actions github-actions bot removed the core Core DataFusion crate label Jun 5, 2024
@westonpace
Member Author

@westonpace what is the status / plan with this PR? It has failing CI tests but is not marked as a draft. Are you still planning on working with it? Do you need help to push it along?

Thanks for the push. I have now discovered sqllogictests :)

Turns out my approach didn't work because partial_cmp propagates nulls (and the previous impl ignored nulls). I think I've got it fixed now. The earlier describe test failure was also for this same reason (it was propagating a null, not a nan) and so I was able to revert that change.
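The fixed semantics can be sketched roughly as follows (update_min is a hypothetical name for illustration, not the actual DataFusion code): a null candidate (None) is ignored rather than propagated, and non-null values are compared with total ordering so NaN is handled consistently.

```rust
// Hypothetical sketch of null-ignoring min logic: None never wins,
// and NaN is compared via total ordering (NaN > all non-NaN values).
fn update_min(current: Option<f32>, candidate: Option<f32>) -> Option<f32> {
    match (current, candidate) {
        (None, c) => c,
        (cur, None) => cur, // ignore nulls instead of propagating them
        (Some(a), Some(b)) => Some(if a.total_cmp(&b).is_le() { a } else { b }),
    }
}

fn main() {
    assert_eq!(update_min(None, Some(1.0)), Some(1.0));
    assert_eq!(update_min(Some(1.0), None), Some(1.0));
    assert_eq!(update_min(Some(2.0), Some(1.0)), Some(1.0));
    // NaN ranks above 1.0 under total ordering, so the min is 1.0
    assert_eq!(update_min(Some(f32::NAN), Some(1.0)), Some(1.0));
}
```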

@westonpace
Member Author

I suspect this means the min/max function for intervals is also incorrectly propagating nulls. However, when I tried to write a unit test for intervals I got an error that there is no accumulator for intervals, so I guess we can cross that bridge when we come to it.

@westonpace
Member Author

@alamb (See above reply, forgot to ping-reply you)

Contributor

@alamb alamb left a comment


Thank you @westonpace -- this code looks great to me

I think we should add just a few more tests and this will be good to merge 🙏

accumulator.update_batch(&[vals_b]).unwrap();
let split_batch_result = &accumulator.evaluate().unwrap();

assert_eq!(single_batch_result, split_batch_result);
Contributor


I think it would be valuable to also test:

  1. min
  2. the actual value of single_batch_result (to show whether it is expected to produce nan or 0)

Member Author


Thanks for the suggestions, I have updated the test to check both.

@westonpace westonpace requested a review from alamb June 7, 2024 13:22
Contributor

@alamb alamb left a comment


Looks great to me -- thank you @westonpace

let zero = 0_f32;
let neg_inf = f32::NEG_INFINITY;

let check = |acc: &mut dyn Accumulator, values: &[&[f32]], expected: f32| {
Contributor


😍

@alamb alamb merged commit 5bb6b35 into apache:main Jun 7, 2024
23 checks passed
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
…he#10627)

* fix: use total ordering in the min & max accumulator for floats to match the ordering used by arrow kernels

* change unit test to expect min to be nan

* changed behavior again since the partial_cmp approach doesn't handle nulls correctly

* Revert change to describe test.  It was not originating from a nan/finite discrepancy but from a null/defined discrepancy and we don't want that behavior to change

* Update the test to check the min function and also verify the result
Successfully merging this pull request may close these issues.

Inconsistent null handling in min/max