Improve median performance. #6837

vincev · 2023-07-04T08:44:06Z

Which issue does this PR close?

Related to discussion in #4973.

Rationale for this change

This PR improves the median aggregator by reducing the number of allocations.

For this example I am using one year of data from the nyctaxi dataset, grouping by payment_type and computing the median of total_amount. With the current version it takes ~22 seconds:

$ time ./target/release/datafusion-median
+--------------+--------------+----------+
| payment_type | total_amount | n        |
+--------------+--------------+----------+
| 5            | 5.275        | 6        |
| 3            | 7.7          | 194323   |
| 4            | -6.8         | 244364   |
| 0            | 23.0         | 1368303  |
| 2            | 13.3         | 7763339  |
| 1            | 16.56        | 30085763 |
+--------------+--------------+----------+

real    0m22.861s
user    1m54.033s
sys     0m2.872s

With the changes introduced in this PR we get the same result in ~2.5 seconds:

$ time ./target/release/datafusion-median
+--------------+--------------+----------+
| payment_type | total_amount | n        |
+--------------+--------------+----------+
| 5            | 5.275        | 6        |
| 3            | 7.7          | 194323   |
| 4            | -6.8         | 244364   |
| 0            | 23.0         | 1368303  |
| 2            | 13.3         | 7763339  |
| 1            | 16.56        | 30085763 |
+--------------+--------------+----------+

real    0m2.514s
user    0m4.725s
sys     0m1.765s

What changes are included in this PR?

Reduce number of allocations.
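As a rough illustration of where the savings come from, here is a minimal sketch that keeps whole input batches and only flattens them when the result is needed. It is a simplified model using plain `Vec<f64>` batches, not the code in this PR; `BatchMedian` and its methods are hypothetical names, and NaN inputs get no special treatment.

```rust
/// Simplified stand-in for a median accumulator that keeps whole input
/// batches instead of one boxed value per row. Hypothetical; not DataFusion code.
#[derive(Default)]
pub struct BatchMedian {
    /// Each update stores one batch; no per-value allocation happens here.
    batches: Vec<Vec<f64>>,
}

impl BatchMedian {
    /// Record a batch by taking ownership of it (one push, values are not copied).
    pub fn update_batch(&mut self, batch: Vec<f64>) {
        self.batches.push(batch);
    }

    /// Flatten the batches once and compute the median.
    pub fn evaluate(&self) -> Option<f64> {
        let total: usize = self.batches.iter().map(|b| b.len()).sum();
        if total == 0 {
            return None;
        }
        // Single allocation sized for all values.
        let mut values = Vec::with_capacity(total);
        for batch in &self.batches {
            values.extend_from_slice(batch);
        }
        let mid = total / 2;
        // Partition so that the element at `mid` is in its sorted position.
        let (lower, upper_mid, _) =
            values.select_nth_unstable_by(mid, |a, b| a.total_cmp(b));
        let upper = *upper_mid;
        if total % 2 == 1 {
            Some(upper)
        } else {
            // Even count: the lower middle value is the maximum of the left partition.
            let lower_max = lower.iter().copied().fold(f64::NEG_INFINITY, f64::max);
            Some((lower_max + upper) / 2.0)
        }
    }
}
```

In this model, `update_batch` costs one `Vec::push` per batch regardless of batch size, and the only large allocation happens once in `evaluate`.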

Are these changes tested?

I ran the tests locally and they all passed.

Are there any user-facing changes?

Dandandan · 2023-07-04T09:09:25Z

datafusion/physical-expr/src/aggregate/median.rs

-        let state =
-            ScalarValue::new_list(Some(self.all_values.clone()), self.data_type.clone());
+        let all_values = to_scalar_values(&self.batches)?;
+        let state = ScalarValue::new_list(Some(all_values), self.data_type.clone());


I wonder if we can change ScalarValue::List to use ArrayRef instead of Vec<ScalarValue> internally.
This would avoid quite a few expensive conversions from/to ScalarValue.

Yes, that would help a lot. Maybe we can add a ScalarValue::Array variant to ScalarValue.

Maybe @Dandandan was suggesting

impl ScalarValue { ... List(ArrayRef) ... }

In general this would work well with the approach @tustvold is working on upstream in arrow-rs with Datum -- apache/arrow-rs#4393

Yeah that's what I was trying to suggest - this would avoid the need to convert / allocate to individual ScalarValues and convert to Array later on again.

That would work well. I suggested adding an Array variant instead of changing List to avoid changing all the code that depends on List(Option<Vec<ScalarValue>>, FieldRef), but yes, long term that would probably be best.
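To make the two layouts under discussion concrete, here is a small sketch that puts a row-oriented list next to a columnar one. It uses only the standard library: `Scalar`, `ArrayRefLike`, and both helper functions are hypothetical stand-ins, and `Arc<[f64]>` merely plays the role that an Arrow `ArrayRef` would play.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a shared Arrow array of f64 values.
type ArrayRefLike = Arc<[f64]>;

/// Simplified scalar type showing both list layouts side by side.
#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    Float64(Option<f64>),
    /// Row-oriented list: one enum value per element.
    ListOfScalars(Vec<Scalar>),
    /// Columnar list: one shared, contiguous buffer.
    ListOfArray(ArrayRefLike),
}

/// Building the row-oriented list wraps every element individually.
fn to_list_of_scalars(values: &[f64]) -> Scalar {
    Scalar::ListOfScalars(values.iter().map(|v| Scalar::Float64(Some(*v))).collect())
}

/// Building the columnar list copies the buffer once; cloning the result
/// afterwards only increments a reference count.
fn to_list_of_array(values: &[f64]) -> Scalar {
    Scalar::ListOfArray(Arc::from(values))
}
```

Cloning a `ListOfArray` shares the underlying buffer, whereas cloning a `ListOfScalars` copies every element, which is the kind of conversion cost the comments above are trying to avoid.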

After thinking about this some more, I think the most performant thing to do will be to implement a native GroupsAccumulator (aka #6800 ) for median. With sufficient effort we could make median be very fast -- so I think this is a good improvement for now
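The idea of a group-aware accumulator can be sketched as a single state object that holds one value buffer per group index, rather than one accumulator object per group. The sketch below is an assumption-laden illustration with hypothetical names (`GroupedMedian`, `median_in_place`); it does not implement DataFusion's GroupsAccumulator trait.

```rust
/// Hypothetical group-aware median state: one value buffer per group index.
#[derive(Default)]
struct GroupedMedian {
    group_values: Vec<Vec<f64>>,
}

impl GroupedMedian {
    /// `group_indices[i]` names the group that `values[i]` belongs to.
    fn update_batch(&mut self, values: &[f64], group_indices: &[usize]) {
        assert_eq!(values.len(), group_indices.len());
        for (&value, &group) in values.iter().zip(group_indices) {
            // Grow the per-group table lazily when a new group index appears.
            if group >= self.group_values.len() {
                self.group_values.resize_with(group + 1, Vec::new);
            }
            self.group_values[group].push(value);
        }
    }

    /// Median per group; `None` for groups that received no values.
    fn evaluate(&mut self) -> Vec<Option<f64>> {
        self.group_values
            .iter_mut()
            .map(|v| median_in_place(v))
            .collect()
    }
}

/// Sort the slice and return its median, averaging the two middle values
/// when the length is even.
fn median_in_place(values: &mut [f64]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    values.sort_unstable_by(|a, b| a.total_cmp(b));
    let mid = values.len() / 2;
    if values.len() % 2 == 1 {
        Some(values[mid])
    } else {
        Some((values[mid - 1] + values[mid]) / 2.0)
    }
}
```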

Dandandan

LGTM

alamb

Thanks @vincev -- this looks great -- I will try and review this tomorrow

alamb · 2023-07-04T13:05:09Z

Maybe @Dandandan was suggesting

impl ScalarValue { ... List(ArrayRef) ... }

In general this would work well with the approach @tustvold is working on upstream in arrow-rs with Datum -- apache/arrow-rs#4393

alamb

I reviewed the code carefully -- thank you @vincev

I think the code could be merged in as is, but I also left some comments which I think would help.

alamb · 2023-07-05T16:22:20Z

After thinking about this some more, I think the most performant thing to do will be to implement a native GroupsAccumulator (aka #6800 ) for median. With sufficient effort we could make median be very fast -- so I think this is a good improvement for now

alamb · 2023-07-05T16:23:13Z

datafusion/physical-expr/src/aggregate/median.rs

    all_values: Vec<ScalarValue>,
 }

+fn to_scalar_values(arrays: &[ArrayRef]) -> Result<Vec<ScalarValue>> {
+    let num_values: usize = arrays.iter().map(|a| a.len()).sum();


💯 for computing the capacity up front
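The pattern being praised here can be shown in isolation: sum the batch lengths first, then allocate the output vector once. This is a minimal sketch with plain `Vec<f64>` batches; `Scalar` and `to_scalars` are hypothetical stand-ins, not the `ScalarValue` type or the `to_scalar_values` function from the diff.

```rust
/// Hypothetical boxed scalar, standing in for a per-value scalar type.
#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    Float64(Option<f64>),
}

/// Convert buffered batches into one flat vector of scalars, sizing it once.
fn to_scalars(batches: &[Vec<f64>]) -> Vec<Scalar> {
    // Total number of values across all batches, computed before allocating.
    let num_values: usize = batches.iter().map(|b| b.len()).sum();
    let mut out = Vec::with_capacity(num_values);
    for batch in batches {
        out.extend(batch.iter().map(|v| Scalar::Float64(Some(*v))));
    }
    out
}
```

Because the capacity is known up front, the output vector does not need to reallocate and copy as it grows.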

alamb · 2023-07-05T18:48:33Z

Thanks @vincev

vincev · 2023-07-05T18:52:57Z

Thank you @alamb, @Dandandan for your feedback and review.

* Improve median performance.
* Fix formatting.
* Review feedback
* Renamed arrays size.

Improve median performance.

9c49196

github-actions bot added the physical-expr Physical Expressions label Jul 4, 2023

vincev mentioned this pull request Jul 4, 2023

Improve the performance of Aggregator, grouping, aggregation #4973

Closed

4 tasks

Fix formatting.

4f9a540

Dandandan reviewed Jul 4, 2023

Dandandan approved these changes Jul 4, 2023

jackwener self-requested a review July 4, 2023 09:22

alamb reviewed Jul 4, 2023

alamb approved these changes Jul 5, 2023

vincev added 2 commits July 5, 2023 19:00

Review feedback

a308ba8

Renamed arrays size.

3fa6cd2

alamb approved these changes Jul 5, 2023

alamb merged commit c9a6fb8 into apache:main Jul 5, 2023

vincev deleted the median branch July 5, 2023 19:13

2010YOUY01 pushed a commit to 2010YOUY01/arrow-datafusion that referenced this pull request Jul 5, 2023

Improve median performance. (apache#6837)

bd3a20b

* Improve median performance.
* Fix formatting.
* Review feedback
* Renamed arrays size.

alamb pushed a commit to alamb/datafusion that referenced this pull request Jul 6, 2023

Improve median performance. (apache#6837)

e8d5c17

* Improve median performance.
* Fix formatting.
* Review feedback
* Renamed arrays size.

Dandandan mentioned this pull request Aug 21, 2023

Change ScalarValue::List to store ArrayRef #7352

Closed
