Add min_by and max_by aggregate functions #12075

timsaucer · 2024-08-20T10:32:03Z

Is your feature request related to a problem or challenge?

It is a common need to get the value of one column such that another column is a minimum. For example, if I have a column of fruit_name and price_per_pound I might want to get the fruit_name for which price_per_pound is a minimum. This can be done with existing functions, but it should be both more performant and more user friendly to add these functions.

Describe the solution you'd like

Ideally this would take two expressions. The first would be the expression you want to return and the second would be the expression that we are looking for the min/max of. In my example we would do something like

min_by(col("fruit_name"), col("price_per_pound"))

Describe alternatives you've considered

Right now I would probably do a first_value with an order_by. This will introduce an unnecessary sort of the dataframe.

Additional context

Example from

https://docs.databricks.com/en/sql/language-manual/functions/min_by.html

The text was updated successfully, but these errors were encountered:

alamb · 2024-08-22T15:40:39Z

We have similar special functions for this type of function in InfluxDB (needed to get the max by time, for example) in case that helps this

We call them selector functions. Here are the docs https://docs.influxdata.com/influxdb/cloud-serverless/reference/sql/functions/selector/

Here is the code
https://github.com/influxdata/influxdb3_core/blob/main/query_functions/src/selectors/internal.rs

korowa · 2024-08-25T10:50:39Z

There are already aggregation variants of first/last which seem to solve the issue (example), and, at first glance, they do not perform normal sorting, only compare incoming ordering column values with "accumulated" ones.

It also seems that their performance could be improved by implementing GroupsAccumulator for them (or for majority of input types at least), as their state is somehow similar to AVGs state in terms of comlexity.

alamb · 2024-08-25T10:55:28Z

FIRST_VALUE / LAST_VALUE with ORDER BY is a great call. Here is the code if anyone is interested:

https://github.com/apache/datafusion/blob/main/datafusion/functions-aggregate/src/nth_value.rs

korowa · 2024-08-25T11:17:30Z

More like https://github.com/apache/datafusion/blob/main/datafusion/functions-aggregate/src/first_last.rs (I suppose)

korowa · 2024-08-25T11:22:52Z

My bad, it still sorts input data (during e.g. get_first_idx, which is likely a redundant sort, except for the cases when input is already sorted by ordering columns). That is another thing to improve.

timsaucer · 2024-08-25T12:37:45Z

As I'm thinking about it, I'm not sure you can get around doing the sort since you have an arbitrary number of ordering clauses. I think what you've proposed is the best option. One could speed up the operation when there's only a single order clause but that may not be worth the effort involved.

I'm okay closing this issue unless anyone thinks there's value in pursuing it further.

alamb · 2024-08-25T13:40:50Z

As I'm thinking about it, I'm not sure you can get around doing the sort since you have an arbitrary number of ordering clauses. I think what you've proposed is the best option. One could speed up the operation when there's only a single order clause but that may not be worth the effort involved.

FWIW this is what the first_selector / last_selector functions do in InfluxDB -- they basically maintain only the value for the largest ORDER BY column, rather than actually sorting the entire intermediate result

I'm okay closing this issue unless anyone thinks there's value in pursuing it further.

It would be nice to make it clearer how to get the equivalent of min_by and max_by -- maybe we could just document the equivalent first_value(... ORDER BY ...) queries 🤔 or we could automatically rewrite min_by / max_by into first_value() / last_value 🤔

It is probably also worth filing a ticket for the potential optimization

korowa · 2024-08-25T16:33:24Z

I guess user guide will be the best option since naming for this function may vary in different DBs -- min_by / argmin / argmin_agg, and maybe couple of others, so aliasing is not that straightforward.

dharanad · 2024-09-27T15:29:27Z

I think we can close this ticket

alamb · 2024-09-28T16:10:03Z

Thanks everyone

timsaucer added the enhancement New feature or request label Aug 20, 2024

This was referenced Sep 1, 2024

Support max_by in Aggregation function #12252

Closed

Support min_by in Aggregation function #12253

Closed

timsaucer closed this as completed Sep 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add min_by and max_by aggregate functions #12075

Add min_by and max_by aggregate functions #12075

timsaucer commented Aug 20, 2024

alamb commented Aug 22, 2024 •

edited

Loading

korowa commented Aug 25, 2024 •

edited

Loading

alamb commented Aug 25, 2024

korowa commented Aug 25, 2024

korowa commented Aug 25, 2024 •

edited

Loading

timsaucer commented Aug 25, 2024

alamb commented Aug 25, 2024

korowa commented Aug 25, 2024

dharanad commented Sep 27, 2024

alamb commented Sep 28, 2024

Add min_by and max_by aggregate functions #12075

Add min_by and max_by aggregate functions #12075

Comments

timsaucer commented Aug 20, 2024

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

alamb commented Aug 22, 2024 • edited Loading

korowa commented Aug 25, 2024 • edited Loading

alamb commented Aug 25, 2024

korowa commented Aug 25, 2024

korowa commented Aug 25, 2024 • edited Loading

timsaucer commented Aug 25, 2024

alamb commented Aug 25, 2024

korowa commented Aug 25, 2024

dharanad commented Sep 27, 2024

alamb commented Sep 28, 2024

alamb commented Aug 22, 2024 •

edited

Loading

korowa commented Aug 25, 2024 •

edited

Loading

korowa commented Aug 25, 2024 •

edited

Loading