-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add min_by and max_by aggregate functions #12075
Comments
We have similar special functions for this type of function in InfluxDB (needed to get the max by time, for example) in case that helps this We call them selector functions. Here are the docs https://docs.influxdata.com/influxdb/cloud-serverless/reference/sql/functions/selector/ Here is the code |
There are already aggregation variants of first/last which seem to solve the issue (example), and, at first glance, they do not perform normal sorting, only compare incoming ordering column values with "accumulated" ones. It also seems that their performance could be improved by implementing GroupsAccumulator for them (or for majority of input types at least), as their state is somehow similar to AVGs state in terms of comlexity. |
|
My bad, it still sorts input data (during e.g. |
As I'm thinking about it, I'm not sure you can get around doing the sort since you have an arbitrary number of ordering clauses. I think what you've proposed is the best option. One could speed up the operation when there's only a single order clause but that may not be worth the effort involved. I'm okay closing this issue unless anyone thinks there's value in pursuing it further. |
FWIW this is what the
It would be nice to make it clearer how to get the equivalent of It is probably also worth filing a ticket for the potential optimization |
I guess user guide will be the best option since naming for this function may vary in different DBs -- min_by / argmin / argmin_agg, and maybe couple of others, so aliasing is not that straightforward. |
I think we can close this ticket |
Thanks everyone |
Is your feature request related to a problem or challenge?
It is a common need to get the value of one column such that another column is a minimum. For example, if I have a column of fruit_name and price_per_pound I might want to get the fruit_name for which price_per_pound is a minimum. This can be done with existing functions, but it should be both more performant and more user friendly to add these functions.
Describe the solution you'd like
Ideally this would take two expressions. The first would be the expression you want to return and the second would be the expression that we are looking for the min/max of. In my example we would do something like
min_by(col("fruit_name"), col("price_per_pound"))
Describe alternatives you've considered
Right now I would probably do a first_value with an order_by. This will introduce an unnecessary sort of the dataframe.
Additional context
Example from
https://docs.databricks.com/en/sql/language-manual/functions/min_by.html
The text was updated successfully, but these errors were encountered: