
feat(BA-96): Metric-based model service autoscaling #3277

Merged
merged 86 commits into from
Jan 17, 2025

Conversation

kyujin-cho
Member

@kyujin-cho kyujin-cho commented Dec 19, 2024

Resolves #2659 (BA-96)

What's changed

  • This PR adds user-configurable auto-scaling rules associated with each model-service endpoint.
  • The scale_services() function now interprets the configured rules and applies their decisions by adjusting the desired replica count.
    • The original scale_services() logic, which reconciles the current replica count with the desired replica count, remains unchanged.
  • Each rule can either increase or decrease the desired number of replicas via a positive or negative step_size, so a single rule is a one-direction trigger. To auto-scale replicas in both directions, users must define at least two rules.
    • There is no explicit validation of, or warning about, contradictory auto-scaling rules. It is the user's responsibility to configure a consistent set of rules over a single metric or a combination of metrics.
```mermaid
flowchart TD
    A1["Auto-scaling rule 1 (GREATER_THAN...)"] -->|"(+) step_size"| Count(desired replica count)
    A2["Auto-scaling rule 2 (LESS_THAN...)"] -->|"(−) step_size"| Count

    Count -->|apply difference to current replica count| E[Endpoint]

    U[User] -->|manually set| Count
```
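The rule-application step can be sketched as follows. This is a minimal, hypothetical illustration of the behavior described above (the names `Rule` and `desired_replicas` are illustrative, not the actual Backend.AI API): every triggered rule contributes its `step_size` to the desired replica count, and two opposite-signed rules together form a bidirectional policy.

```python
# Hypothetical sketch of how a scale_services() pass could turn triggered
# rules into a new desired replica count. Names are illustrative only.
from dataclasses import dataclass


@dataclass
class Rule:
    step_size: int   # positive: scale out, negative: scale in
    triggered: bool  # whether the live metric crossed the threshold


def desired_replicas(current_desired: int, rules: list[Rule]) -> int:
    """Apply every triggered rule's step to the desired replica count."""
    desired = current_desired
    for rule in rules:
        if rule.triggered:
            desired += rule.step_size
    return max(desired, 0)


# Two single-direction rules form a bidirectional policy:
scale_out = Rule(step_size=+2, triggered=True)
scale_in = Rule(step_size=-1, triggered=False)
print(desired_replicas(3, [scale_out, scale_in]))  # 5
```

The existing reconciliation logic would then apply the difference between this desired count and the current replica count, as before.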

Other changes

  • This PR separates the aiodataloader handler from the bulk-loading logic in both EndpointStatistics and KernelStatistics, so that the bulk loader can be reused outside the dataloader.
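The refactoring pattern behind this change might look roughly like the following stdlib-only sketch (all names here are hypothetical, not the actual EndpointStatistics/KernelStatistics code): the bulk-loading coroutine stands alone, so it can back a DataLoader-style batch handler and also be called directly.

```python
# Sketch of separating a bulk loader from its dataloader handler.
# Stdlib only; names are hypothetical.
import asyncio


async def batch_load_endpoint_stats(endpoint_ids: list[str]) -> list[dict]:
    """Standalone bulk loader: one query for many endpoints at once."""
    # A real implementation would issue a single DB/metric query here.
    return [{"endpoint": eid, "replicas": 1} for eid in endpoint_ids]


class StatsLoader:
    """Minimal DataLoader-like wrapper that delegates to the bulk loader."""

    async def load_many(self, keys: list[str]) -> list[dict]:
        return await batch_load_endpoint_stats(keys)


async def main() -> None:
    # Reuse path 1: through the loader (as GraphQL resolvers would).
    via_loader = await StatsLoader().load_many(["ep-a", "ep-b"])
    # Reuse path 2: calling the bulk loader directly.
    direct = await batch_load_endpoint_stats(["ep-a", "ep-b"])
    assert via_loader == direct


asyncio.run(main())
```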

How it works

  • Every endpoint (model service) can have one or more auto-scaling rules.
  • An auto-scaling rule is defined by:
    • Metric source: inference runtime or kernel
      • inference framework: the average value taken across all replicas. Supported only when the AppProxy reports the inference metrics; check the Backend.AI Enterprise guide for more details.
      • kernel: the average value taken across all kernels backing the endpoint
    • Metric name (e.g. cuda.shares or vllm_avg_prompt_throughput_toks_per_s)
    • Comparator: how the live metric is compared against the threshold value
      • LESS_THAN: the rule triggers when the current metric value goes below the defined threshold
      • LESS_THAN_OR_EQUAL: the rule triggers when the current metric value goes below or equals the defined threshold
      • GREATER_THAN: the rule triggers when the current metric value goes above the defined threshold
      • GREATER_THAN_OR_EQUAL: the rule triggers when the current metric value goes above or equals the defined threshold
    • Step size: the amount by which the replica count changes when the rule triggers. It can be positive or negative; a negative value makes the rule decrease the number of replicas.
    • Cooldown seconds: the duration in seconds during which the rule is not reapplied after it first triggers.
    • Minimum replicas: the lower bound for the endpoint's replica count. The rule does not trigger if the resulting replica count would fall below this value.
    • Maximum replicas: the upper bound for the endpoint's replica count. The rule does not trigger if the resulting replica count would exceed this value.

Checklist: (if applicable)

  • Milestone metadata specifying the target backport version
  • Mention to the original issue

📚 Documentation preview 📚: https://sorna--3277.org.readthedocs.build/en/3277/


📚 Documentation preview 📚: https://sorna-ko--3277.org.readthedocs.build/ko/3277/

@github-actions github-actions bot added area:docs Documentations comp:manager Related to Manager component require:db-migration Automatically set when alembic migrations are added or updated labels Dec 19, 2024
@kyujin-cho kyujin-cho added type:feature Add new features and removed area:docs Documentations comp:manager Related to Manager component require:db-migration Automatically set when alembic migrations are added or updated labels Dec 19, 2024
@github-actions github-actions bot added size:L 100~500 LoC comp:agent Related to Agent component comp:appproxy Related to App Proxy component comp:manager Related to Manager component urgency:5 It is imperative that action be taken right away. labels Dec 19, 2024
@kyujin-cho kyujin-cho added this to the 24.12 milestone Dec 19, 2024
@kyujin-cho kyujin-cho changed the title feature: model service autoscaling feat: model service autoscaling Dec 19, 2024
@kyujin-cho kyujin-cho changed the title feat: model service autoscaling feat: metric based model service autoscaling Dec 19, 2024
@kyujin-cho kyujin-cho marked this pull request as ready for review December 19, 2024 17:08
@kyujin-cho kyujin-cho force-pushed the feature/model-service-autoscale branch from 2e9102e to 9bc0661 Compare December 20, 2024 12:12
@achimnol achimnol added this pull request to the merge queue Jan 17, 2025
@achimnol achimnol modified the milestones: 24.12, 25Q1 Jan 17, 2025
Merged via the queue into main with commit 9be8899 Jan 17, 2025
23 checks passed
@achimnol achimnol deleted the feature/model-service-autoscale branch January 17, 2025 14:38
Labels
area:docs Documentations comp:agent Related to Agent component comp:appproxy Related to App Proxy component comp:cli Related to CLI component comp:client Related to Client component comp:common Related to Common component comp:manager Related to Manager component require:db-migration Automatically set when alembic migrations are added or updated size:XL 500~ LoC type:feature Add new features urgency:blocker IT SHOULD BE RESOLVED BEFORE NEXT RELEASE! urgency:5 It is imperative that action be taken right away.
Development

Successfully merging this pull request may close these issues.

Support auto scaling on Model Service
3 participants