
Global ratelimiter helper: usage-tracking fallback-capable rate.Limiter #6028

Merged
2 commits merged into cadence-workflow:master on May 24, 2024

Conversation

Groxx (Member) commented May 18, 2024

This commit largely exists to break this out for separate review, as it's relatively simple and isolated.

In a later PR, this will be used by the "limiter"-side logic to track and enforce usage of a ratelimit. At a high level it does only a few things:

  1. Tracks calls to `Allow() bool` so we can find out how much a limit is actually used (so we can distribute the allowances fairly)
  2. Updates ratelimit rates when requested
  3. Automatically falls back to a different ratelimit if the system seems unhealthy

^ all of which is done in an atomic-like way so concurrent operations do not block each other, as true precision is not necessary. It only has to be "good enough", and performance is more important than precision.
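
For context, a very rough sketch of the shape such a helper might take, assuming golang.org/x/time/rate and a simple "consecutive failed updates" counter as the health signal; the names, fields, and fallback heuristic here are illustrative only, not the actual contents of limiter.go:

```go
import (
	"sync/atomic"

	"golang.org/x/time/rate"
)

// maxFailedUpdates is an assumed threshold: after this many missed updates,
// treat the global data as unhealthy and use the fallback limit instead.
const maxFailedUpdates = 10

type fallbackLimiter struct {
	primary  *rate.Limiter // rate pushed from the global aggregator
	fallback *rate.Limiter // local/default limit used when unhealthy
	accepted atomic.Int64  // Allow() calls that returned true
	rejected atomic.Int64  // Allow() calls that returned false
	failed   atomic.Int64  // consecutive failed/missed updates
}

// Allow counts usage (so a periodic Collect() can report how much of the
// limit is actually consumed) and switches to the fallback when unhealthy.
func (l *fallbackLimiter) Allow() bool {
	lim := l.primary
	if l.failed.Load() > maxFailedUpdates {
		lim = l.fallback
	}
	allowed := lim.Allow()
	if allowed {
		l.accepted.Add(1)
	} else {
		l.rejected.Add(1)
	}
	return allowed
}

// Update applies a new rate from the aggregator and marks the limiter healthy.
func (l *fallbackLimiter) Update(rps float64) {
	l.failed.Store(0)
	l.primary.SetLimit(rate.Limit(rps))
	l.primary.SetBurst(int(rps)) // limit == burst in essentially all current uses
}
```

Everything in this sketch is intentionally best-effort and racy: the counters and the health check are independent atomics, so concurrent callers never block each other.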

Groxx changed the title from "Global ratelimiter helper: usage-tracking fallback-capable rate.Limit" to "Global ratelimiter helper: usage-tracking fallback-capable rate.Limiter" on May 18, 2024
Groxx marked this pull request as ready for review May 18, 2024 21:21
codecov bot commented May 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 67.32%. Comparing base (02c7efb) to head (6369bc2).
Report is 24 commits behind head on master.

Additional details and impacted files:

Files                                                    Coverage Δ
...ommon/quotas/global/collection/internal/limiter.go    100.00% <100.00%> (ø)

... and 21 files with indirect coverage changes


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Groxx (Member, Author) commented May 21, 2024

This might change if we go with #6030, e.g. it could be implemented as a wrapper, and then it could collect accurate reserve->cancel metrics too.

I'm not sure that's currently necessary: I believe all our reserve->cancel logic uses per-host limiters as the first level, rather than something intended to be global. But that does seem like something we will eventually change, so it'd be risky to leave without at least documenting it.

// - this limit is legitimately unused and no data exists for it
// - ring re-sharding led to losing track of the limit, and it is now owned by a host with insufficient data
//
// To mitigate the impact of the second case, "insufficient data" responses from aggregating hosts
A reviewer (Member) commented:
Perhaps it's worth adding some metrics to track when we fall back.

Groxx (Member, Author) replied:

Yeah, I'm planning on metrics. That'll be outside these objects though, and it's why Collect() returns the "use fallback" bool.

A reviewer (Member) commented:
If there is a transition from non-fallback to fallback or vice versa, the counter will be inaccurate.

Groxx (Member, Author) replied May 24, 2024:

If you mean the limiting behavior will be inaccurate, e.g. it'll allow / deny more requests than it should:

Yes, but we don't have any great way to resolve that. *rate.Limiter doesn't allow directly manipulating its tokens, and there isn't even a true "correct" behavior here, because the situation implies we don't know the correct behavior: we haven't been able to sync with other instances.

It'll only be incorrect for one "time to fill the ratelimiter's burstable tokens" period though, which is currently 1 second (limit == burst in basically all of our uses). Generally speaking that'd mean ~2x overage at worst, assuming all hosts switched at the same time (e.g. with limit == burst == 100, a host could allow up to ~200 requests in the second surrounding the switch).


I could try to empty the tokens when switching, which would be safer for us because it wouldn't allow exceeding either limit, but we don't have an efficient API for that. Ideally it'd read the token count and e.g. binary-search AllowN(n) to empty most or all of them (rough sketch below), since it's racing with other calls. Right now it'd take thousands of single Allow() calls per limiter to drain, which is definitely excessive.

The #6030 wrapper could do this relatively precisely and efficiently, as it can lock everything while draining. If we think 1 second of overage is worth preventing, I suspect it'd be relatively easy to add a "drain the limiter" API for this kind of purpose. I'm... ~50% "yes, worth it"? Can definitely be convinced to build that.
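
For reference, a hedged sketch of the binary-search drain idea mentioned above (not code from this PR, and it assumes golang.org/x/time/rate): since *rate.Limiter has no "set tokens to zero" API, it greedily consumes available tokens with AllowN, shrinking the request size whenever it races with other callers.

```go
import (
	"time"

	"golang.org/x/time/rate"
)

// drainTokens consumes most or all of the limiter's currently-available
// tokens. It is racy with concurrent Allow() calls, which is acceptable for
// a best-effort "empty the bucket before switching limits" purpose.
func drainTokens(lim *rate.Limiter) {
	n := int(lim.Tokens())
	for n > 0 {
		if lim.AllowN(time.Now(), n) {
			n = int(lim.Tokens()) // consumed; re-check in case tokens refilled
		} else {
			n /= 2 // raced with another caller; retry with a smaller chunk
		}
	}
}
```

This needs roughly O(log(burst)) calls when contended, rather than thousands of single-token Allow() calls, but it still can't drain atomically the way a locking wrapper (like #6030) could.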

Groxx (Member, Author) commented May 24, 2024:

tl;dr from an IRL discussion, clarifying some of the surroundings here:

A layer above this, which holds all these individual tracking-and-fallback-able limiters, will be periodically calling Collect(), and it'll emit a counter for "using fallback" so we can monitor how often fallbacks are used (see the sketch after this comment).

but "uses fallback" should only occur in practice:

  • temporarily after startup / when a key is first used (no data, must use fallback)
    • this might be worth tracking separately, the failed-update counter can show when this is happening (negative value)
  • when the global limiter is disabled and rejects all requests (falling back is desired here)
  • when something cataclysmic is happening and global limit data is being lost rapidly, e.g. many host crashes -> ring changes rapidly -> does not return weights before crashing too. (falling back is desirable here too, as the old values are old enough to not trust)

So the sustained existence of fallback-using limiters when we expect the global ratelimiter to be working is a concern, but individual "failed update" events or temporary use are not.

The exact metrics emitted will come in a later diff; that all happens outside of this object.
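
To illustrate the layer described above (not the actual cadence code; the Collect() signature and names here are assumptions for this sketch), the owning collection could periodically do something like:

```go
// trackedLimiter is a stand-in for this PR's limiter type; the real
// Collect() signature may differ.
type trackedLimiter interface {
	Collect() (usage int, usingFallback bool)
}

// countFallbacks returns how many limiters are currently relying on their
// fallback, so the caller can emit it as a "using fallback" counter.
// The usage half of Collect() would feed the weight-distribution logic.
func countFallbacks(limiters map[string]trackedLimiter) int {
	fallbacks := 0
	for _, lim := range limiters {
		if _, usingFallback := lim.Collect(); usingFallback {
			fallbacks++
		}
	}
	return fallbacks
}
```

A sustained non-zero value from that counter, outside of startup or an intentional disable, is the signal worth alerting on.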

Groxx merged commit 0aac302 into cadence-workflow:master on May 24, 2024
20 checks passed
Groxx deleted the ratelimiter-limiter branch May 24, 2024 22:21
timl3136 pushed a commit to timl3136/cadence that referenced this pull request on Jun 6, 2024:
Global ratelimiter helper: usage-tracking fallback-capable rate.Limiter (cadence-workflow#6028)
