-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve median performance. #6837
Conversation
let state = | ||
ScalarValue::new_list(Some(self.all_values.clone()), self.data_type.clone()); | ||
let all_values = to_scalar_values(&self.batches)?; | ||
let state = ScalarValue::new_list(Some(all_values), self.data_type.clone()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we can change ScalarValue::List
to use ArrayRef
instead of Vec<ScalarValue>
internally.
This would avoid quite some expensive conversions from/to ScalarValue
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that would help a lot, maybe we can add a ScalarValue::Array
variant to ScalarValue
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe @Dandandan was suggesting
impl ScalarValue {
...
List(ArrayRef)
...
}
In general this would work well with the approach @tustvold is working on upstream in arrow-rs with Datum
-- apache/arrow-rs#4393
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that's what I was trying to suggest - this would avoid the need to convert / allocate to individual ScalarValue
s and convert to Array
later on again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would work well, the reason I was suggesting adding an Array
variant instead of changing List
was to avoid changing all the code that depends on List(Option<Vec<ScalarValue>>, FieldRef)
, but yes long term that would probably be best.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After thinking about this some more, I think the most performant thing to do will be to implement a native GroupsAccumulator
(aka #6800 ) for median. With sufficient effort we could make median be very fast -- so I think this is a good improvement for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @vincev -- this looks great -- I will try and review this tomorrow
let state = | ||
ScalarValue::new_list(Some(self.all_values.clone()), self.data_type.clone()); | ||
let all_values = to_scalar_values(&self.batches)?; | ||
let state = ScalarValue::new_list(Some(all_values), self.data_type.clone()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe @Dandandan was suggesting
impl ScalarValue {
...
List(ArrayRef)
...
}
In general this would work well with the approach @tustvold is working on upstream in arrow-rs with Datum
-- apache/arrow-rs#4393
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reviewed the code carefully -- thank you @vincev
I think the code could be merged in as is, but I also left some comments which I think would help.
let state = | ||
ScalarValue::new_list(Some(self.all_values.clone()), self.data_type.clone()); | ||
let all_values = to_scalar_values(&self.batches)?; | ||
let state = ScalarValue::new_list(Some(all_values), self.data_type.clone()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After thinking about this some more, I think the most performant thing to do will be to implement a native GroupsAccumulator
(aka #6800 ) for median. With sufficient effort we could make median be very fast -- so I think this is a good improvement for now
all_values: Vec<ScalarValue>, | ||
} | ||
|
||
fn to_scalar_values(arrays: &[ArrayRef]) -> Result<Vec<ScalarValue>> { | ||
let num_values: usize = arrays.iter().map(|a| a.len()).sum(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯 for computing the capacity up front
Thanks @vincev |
Thank you @alamb, @Dandandan for your feedback and review. |
* Improve median performance. * Fix formatting. * Review feedback * Renamed arrays size.
* Improve median performance. * Fix formatting. * Review feedback * Renamed arrays size.
Which issue does this PR close?
Related to discussion in #4973.
Rationale for this change
This PR improves the median aggregator by reducing the number of allocations.
For this example I am using one year of data from the nyctaxi dataset, running a group by
payment_type
and computing thetotal_amount
median, with the current version it takes ~22secs:with the changes introduced in this PR we get the same result in ~2secs:
What changes are included in this PR?
Reduce number of allocations.
Are these changes tested?
Run tests locally they all passed.
Are there any user-facing changes?