Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Array Function coverage for various aggregations #33608

Closed
rcauble opened this issue Oct 24, 2023 · 1 comment
Closed

Array Function coverage for various aggregations #33608

rcauble opened this issue Oct 24, 2023 · 1 comment

Comments

@rcauble
Copy link

rcauble commented Oct 24, 2023

Feature request

Is your feature request related to a problem? Please describe.
For context, we have an optimization whereby we pre-aggregate one table onto another in array form. However, once we have the array, we need to be able to perform various aggregations as array functions. It would be nice if all aggregate functions had an array function with equivalent semantics.

Describe the solution you'd like
Here is a list of cases we've encountered where we are missing the needed functionality:

  1. multi-column array_sortby. This is needed to emulate the functionality of array_agg where we need to specify multiple columns with ascending and descending. Need something that takes a list of columns to sort by and ascending/descending specifiers.
  2. array_percentile_cont and array_percentile_disc: same semantics as percentile_cont and percentild_disc
  3. array_std_samp: should have same semantics as the STDDEV_SAMP aggregation
  4. array_var_samp: should have same semantics as the VAR_SAMP aggregation

We currently have workaround for the following, but it would be nice to have these for efficiency sake:

  1. array_count_distinct that ignores nulls. We implement this as array_length(array_distinct(array_filter(x)->x is not null))
  2. array_bool_or. Should be equivalent to the aggregation in this FR: Support for BOOLOR_AGG, BOOLXOR_AGG, BOOLAND_AGG aggregate function #22949
  3. array_count that ignores nulls. We implement this as array_length(array_filter(x)->x is not null).
  4. array_avg that ignores nulls. We implement this as array_avg(array_filter(x)->x is not null).
  5. array_first that returns the first non-null element of the array. We implement this as (array_filter(x)->x is not null))[1]
  6. array_last that returns the first non-null element of the array. We implement this as (array_filter(x)->x is not null))[array_length(array_filter(x)->x is not null)]

Describe alternatives you've considered
See the second set of functions listed above. We have workarounds for those, but they are complex and sub-optimal since they require materializing an array when not otherwise needed if we had direct support.

The first set of functions doesn't have a reasonable workaround. We would need to unnest, aggregate, and re-nest which is extremely costly.

Additional context

Copy link

We have marked this issue as stale because it has been inactive for 6 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to StarRocks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants