
fix: use total ordering in the min & max accumulator for floats #10627

Merged

Conversation

westonpace
Member

Which issue does this PR close?

Closes #8031

Rationale for this change

See the motivating issue.

What changes are included in this PR?

The MinAccumulator and MaxAccumulator now use ScalarValue::partial_cmp to compare two float scalars instead of f32::min and f32::max (or their f64 equivalents). This approach was already being taken for intervals.

Are these changes tested?

Yes, I added a unit test. I did not see any existing benchmarks for the accumulators, so I'm not entirely certain whether this approach is slower. However, I think a performance regression is unlikely since this only changes how we compare the intermediate results. That is, given two arrays, we still use the arrow kernels to find the min/max of each array; this path is only used to compare the results of those two calculations (i.e. we are in per-batch space, not per-row space).
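For context, a quick sketch of why the comparison function matters for NaN (this only demonstrates the primitive f32 operations, not DataFusion's ScalarValue itself): Rust's f32::min/f32::max return the non-NaN operand when one side is NaN, while a total ordering ranks (positive) NaN above every other value, and plain partial_cmp cannot order NaN at all.

```rust
use std::cmp::Ordering;

fn main() {
    let nan = f32::NAN;

    // f32::max / f32::min silently drop a NaN operand
    assert_eq!(1.0f32.max(nan), 1.0);
    assert_eq!(nan.min(1.0), 1.0);

    // IEEE 754 totalOrder (f32::total_cmp) ranks positive NaN above +inf
    assert_eq!(nan.total_cmp(&f32::INFINITY), Ordering::Greater);
    assert_eq!(nan.total_cmp(&1.0), Ordering::Greater);

    // PartialOrd on a bare f32 cannot order NaN at all
    assert_eq!(nan.partial_cmp(&1.0), None);
}
```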

Are there any user-facing changes?

No.

@github-actions github-actions bot added physical-expr Physical Expressions core Core DataFusion crate labels May 22, 2024
@westonpace
Member Author

I had to change the describe test because the float / double columns have nans (or maybe -inf?). This means that the min, which used to be a value, is now nan (which doesn't seem to display).

I wonder if describe should filter out nan values when calculating min/max?

@alamb
Contributor

alamb commented May 23, 2024

I had to change the describe test because the float / double columns have nans (or maybe -inf?). This means that the min, which used to be a value, is now nan (which doesn't seem to display).

I wonder if describe should filter out nan values when calculating min/max?

FWIW, I checked what postgres does:

postgres=# create table foo(x float);
CREATE TABLE

postgres=# insert into foo values (1), (2), ('NaN');
INSERT 0 3
postgres=# select * from foo;
  x
-----
   1
   2
 NaN
(3 rows)

postgres=# select min(x) from foo;
 min
-----
   1
(1 row)

postgres=# select max(x) from foo;
 max
-----
 NaN
(1 row)

So that suggests to me it treats NaN as the largest floating point value

@alamb
Contributor

alamb commented May 23, 2024

BTW here is a test that shows inconsistent Nan handling for f32 and f64 (just to help make the current behavior clearer): #10634

@westonpace
Member Author

westonpace commented May 24, 2024

So that suggests to me it treats NaN as the largest floating point value

If this is the case then there is divergence between postgres and arrow-rs. Which takes priority?

If you want to follow the postgres behavior then the change will be more complex. You will need a custom max function for arrays (instead of using arrow_arith::aggregate::max) in addition to changes in the accumulator.
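To illustrate what such a custom array-level function might look like, here is a hypothetical sketch (the name pg_style_max and the implementation are mine, not part of this PR or of arrow_arith): a Postgres-style max that treats NaN as greater than every non-NaN value, so the result is NaN whenever any NaN is present.

```rust
// Hypothetical sketch: a Postgres-style max over a float slice,
// treating NaN as greater than every non-NaN value.
fn pg_style_max(values: &[f32]) -> Option<f32> {
    values.iter().copied().reduce(|a, b| {
        // If either operand is NaN, NaN wins; otherwise take the larger value.
        if a.is_nan() || b.is_nan() {
            f32::NAN
        } else {
            a.max(b)
        }
    })
}

fn main() {
    assert_eq!(pg_style_max(&[1.0, 2.0]), Some(2.0));
    assert!(pg_style_max(&[1.0, f32::NAN, 2.0]).unwrap().is_nan());
    assert_eq!(pg_style_max(&[]), None);
}
```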

@westonpace
Member Author

So that suggests to me it treats NaN as the largest floating point value

This is confirmed in the latest version of the postgres docs:

IEEE 754 specifies that NaN should not compare equal to any other floating-point value (including NaN). In order to allow floating-point values to be sorted and used in tree-based indexes, PostgreSQL treats NaN values as equal, and greater than all non-NaN values.
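For positive NaN, Rust's f32::total_cmp (IEEE 754 totalOrder) happens to sort the same way the PostgreSQL docs describe. A small illustration (note the caveat that totalOrder places a negative NaN below -inf, which diverges from the PostgreSQL rule):

```rust
fn main() {
    // Sorting with total_cmp places (positive) NaN after every non-NaN
    // value, matching the PostgreSQL behavior quoted above.
    let mut v = vec![2.0f32, f32::NAN, 1.0, f32::INFINITY];
    v.sort_by(|a, b| a.total_cmp(b));
    assert_eq!(v[0], 1.0);
    assert_eq!(v[1], 2.0);
    assert_eq!(v[2], f32::INFINITY);
    assert!(v[3].is_nan());
}
```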

@alamb
Contributor

alamb commented May 25, 2024

If this is the case then there is divergence between postgres and arrow-rs. Which takes priority?

I would personally suggest we do whatever is consistent with arrow (and easier) unless there is a compelling reason to do something different

@alamb
Contributor

alamb commented Jun 3, 2024

@westonpace what is the status / plan with this PR? It has failing CI tests but is not marked as a draft. Are you still planning on working with it? Do you need help to push it along?

@westonpace westonpace force-pushed the fix/use-total-ordering-in-min-max-accumulator branch from 85767a3 to b918672 Compare June 5, 2024 18:29
…ite discrepancy but from a null/defined discrepancy and we don't want that behavior to change
@github-actions github-actions bot removed the core Core DataFusion crate label Jun 5, 2024
@westonpace
Member Author

@westonpace what is the status / plan with this PR? It has failing CI tests but is not marked as a draft. Are you still planning on working with it? Do you need help to push it along?

Thanks for the push. I have now discovered sqllogictests :)

Turns out my approach didn't work because partial_cmp propagates nulls (and the previous impl ignored nulls). I think I've got it fixed now. The earlier describe test failure was also for this same reason (it was propagating a null, not a nan) and so I was able to revert that change.
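The fixed semantics can be sketched roughly as follows (update_min is a hypothetical name for illustration, not the actual DataFusion code): a null candidate (None) is ignored rather than propagated, and non-null values are compared with total ordering so NaN is handled consistently.

```rust
// Hypothetical sketch of null-ignoring min logic: None never wins,
// and NaN is compared via total ordering (NaN > all non-NaN values).
fn update_min(current: Option<f32>, candidate: Option<f32>) -> Option<f32> {
    match (current, candidate) {
        (None, c) => c,
        (cur, None) => cur, // ignore nulls instead of propagating them
        (Some(a), Some(b)) => Some(if a.total_cmp(&b).is_le() { a } else { b }),
    }
}

fn main() {
    assert_eq!(update_min(None, Some(1.0)), Some(1.0));
    assert_eq!(update_min(Some(1.0), None), Some(1.0));
    assert_eq!(update_min(Some(2.0), Some(1.0)), Some(1.0));
    // NaN ranks above 1.0 under total ordering, so the min is 1.0
    assert_eq!(update_min(Some(f32::NAN), Some(1.0)), Some(1.0));
}
```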

@westonpace
Member Author

I suspect this means the min/max function for intervals is also incorrectly propagating nulls. However, when I tried to write a unit test for intervals I got an error that there is no accumulator for intervals, so I guess we can cross that bridge when we come to it.

@westonpace
Member Author

@alamb (See above reply, forgot to ping-reply you)

Contributor

@alamb alamb left a comment


Thank you @westonpace -- this code looks great to me

I think we should add just a few more tests and this will be good to merge 🙏

accumulator.update_batch(&[vals_b]).unwrap();
let split_batch_result = &accumulator.evaluate().unwrap();

assert_eq!(single_batch_result, split_batch_result);
Contributor


I think it would be valuable to also test:

  1. min
  2. the actual value of single_batch_result (to show whether it is expected to produce nan or 0)

Member Author


Thanks for the suggestions, I have updated the test to check both.

@westonpace westonpace requested a review from alamb June 7, 2024 13:22
Contributor

@alamb alamb left a comment


Looks great to me -- thank you @westonpace

let zero = 0_f32;
let neg_inf = f32::NEG_INFINITY;

let check = |acc: &mut dyn Accumulator, values: &[&[f32]], expected: f32| {
Contributor


😍

@alamb alamb merged commit 5bb6b35 into apache:main Jun 7, 2024
23 checks passed
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
…he#10627)

* fix: use total ordering in the min & max accumulator for floats to match the ordering used by arrow kernels

* change unit test to expect min to be nan

* changed behavior again since the partial_cmp approach doesn't handle nulls correctly

* Revert change to describe test.  It was not originating from a nan/finite discrepancy but from a null/defined discrepancy and we don't want that behavior to change

* Update the test to check the min function and also verify the result
Successfully merging this pull request may close these issues.

Inconsistent null handling in min/max