Improve model service autoscaling #3496

achimnol · 2025-01-17T15:01:12Z

Motivation

Let’s follow up on some incomplete corners of https://lablup.atlassian.net/browse/BA-96

Define clearer priority semantics for the case where multiple rules are triggered at the same time. Currently only the first matched rule is evaluated, but if multiple rules observe different metrics, they need to be evaluated in a single iteration and their results must be combined in a well-defined way.
1. We could consider more carefully designed validation of the autoscaling rules for a single endpoint. For instance, at most one pair of increasing/decreasing rules may exist for a single metric. If so, we could group the rules by metric, evaluate all groups in the same iteration, and prioritize their results using a configured order.
Support additional aggregation operators (e.g., min, max) when collecting metrics from multiple replica sessions and kernels, as currently only “average” is available.
1. As with the idle checkers, we should consider time-based, windowed metric smoothing.
2. Users would want a GUI to see the current metrics.
Record explicit, user-queryable audit logs of the scaling decisions.
Consider adding an endpoint-level cool-down, in addition to the cool-down of individual rules.
Allow disabling a specific autoscaling rule without deleting it.
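
To make the first two items more concrete, here is a minimal sketch of grouped rule evaluation with pluggable aggregation operators. Everything in it is an assumption for discussion: `Rule`, `AGGREGATORS`, `evaluate_rule`, `decide`, and the field names are hypothetical and do not reflect the current Backend.AI schema or scheduler code. It assumes each metric has at most one increasing/decreasing rule pair, evaluates every metric group in one iteration, and then picks the result of the highest-priority metric that triggered.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass
from statistics import fmean
from typing import Callable, Mapping, Sequence

# Hypothetical aggregation operators applied across replica samples.
AGGREGATORS: dict[str, Callable[[Sequence[float]], float]] = {
    "avg": fmean,
    "min": min,
    "max": max,
}


@dataclass(frozen=True)
class Rule:
    """Illustrative rule shape; not the actual autoscaling rule model."""

    metric: str
    threshold: float
    comparator: str  # "gt" or "lt"
    step: int  # positive adds replicas, negative removes them
    aggregator: str = "avg"
    enabled: bool = True  # lets a rule be disabled without deleting it


def evaluate_rule(rule: Rule, samples: Sequence[float]) -> int:
    """Return the replica delta for one rule, or 0 if it does not trigger."""
    if not rule.enabled or not samples:
        return 0
    value = AGGREGATORS[rule.aggregator](samples)
    if rule.comparator == "gt":
        triggered = value > rule.threshold
    else:
        triggered = value < rule.threshold
    return rule.step if triggered else 0


def decide(
    rules: Sequence[Rule],
    samples_by_metric: Mapping[str, Sequence[float]],
    metric_order: Sequence[str],
) -> int:
    """Evaluate all metric groups, then combine them by configured order."""
    groups: dict[str, list[Rule]] = defaultdict(list)
    for rule in rules:
        groups[rule.metric].append(rule)

    # Evaluate every group in the same iteration.
    results: dict[str, int] = {}
    for metric, group in groups.items():
        samples = samples_by_metric.get(metric, [])
        deltas = [evaluate_rule(rule, samples) for rule in group]
        results[metric] = next((d for d in deltas if d != 0), 0)

    # The first metric in the configured order with a non-zero result wins.
    for metric in metric_order:
        if results.get(metric, 0) != 0:
            return results[metric]
    return 0
```

Picking the first triggered metric is only one possible way to combine results; taking the maximum delta across groups, or always preferring scale-out over scale-in, would be equally plausible policies. Since every group result is computed before the decision is made, the `results` mapping could also feed the audit log mentioned above.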