Global ratelimiter: improve low-weight low-usage scenarios #6238

Merged: 22 commits into cadence-workflow:master on Aug 27, 2024

Conversation


@Groxx Groxx commented Aug 19, 2024

Part 1 of 2 to address some relatively-expected "infrequent usage behaves sub-optimally" issues.

In this one, I'm tackling an issue where there are some high-usage limiting-hosts, and some occasional-and-low-usage hosts, and e.g. less than half of the target RPS is being used at any time.

In ^ this scenario it's possible for the low-usage hosts to be given a weight so low that they block most of the requests they receive, especially if those requests come in small bursts (which is quite common) rather than evenly distributed. This is partly because their weight is low enough to only allow a couple RPS, and partly because our limiters currently set their burst size == limit, so they also do not allow bursts that would otherwise be within-budget.

This cannot be addressed perfectly with an async system like this (we can't perfectly predict caller behavior), but there are some easy partial improvements worth building:

  1. (this PR) Allow low-weight hosts to over-commit and accept more than their fair share of RPS, if there is free RPS in the cluster.
    • For consistently low-RPS hosts, this should be fair and a moderate improvement, and it performs no worse than a non-global algorithm if the limit is not usually fully used.
  2. Raise the burst size, so intermittent callers can accumulate a small burst budget.
    • Currently the burst size matches the RPS (min 1), which means e.g. a totally innocuous 2-request event every 10 seconds will likely have 50% of its requests rejected (see the sketch after this list). Obviously that's not great.
    • "Fair" values here are fuzzier, and will be figured out in a later PR.

In this PR, I "boost" RPSes that are below the fallback values, in a way that converges towards the target RPS as usage increases. This allows some over-committing, but we already allow that, and the fallback values are assumed to be "safe enough": they'd be used if the caller just waited a bit longer between calls -> the global limiter was garbage-collected.

If global usage increases, the boost-amount will shrink, and the over-commit will shrink as well. So while this allows spiky requests through at a somewhat higher rate than desired, heavy usage should still converge near the target RPS rather than e.g. 2x the target.


A lot of lines have changed, but this PR boils down to:

  • change the IDL to return more data (which was already tracked) as a new type
    • this will break the current system, causing the global limiters to use their fallback
    • ^ This is currently intentional, to serve as a test of that fallback, but should not generally be done. Normally it should handle both types, or shift to a plugin system and just have two stacks in use, switched by type.
  • change the algorithm portion to return a struct rather than 2 values, to make this additional data easier to pass around (a rough sketch of the shape follows this list)
  • boost RPS when updating the ratelimiters, in the collection
  • a lot of minor changes to handle the new type, without changing behavior
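For reference, the extra per-key data amounts to roughly this shape (a hedged sketch; the real rpc.UpdateEntry field names and types may differ slightly):

```go
// Rough sketch of the per-key entry the aggregator now returns; see
// rpc.UpdateEntry in common/quotas/global/rpc for the real definition.
type UpdateEntry struct {
	Weight  float64 // this limiting host's share of the key's target RPS
	UsedRPS float64 // RPS actually accepted for this key (rejected calls are not counted)
}
```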

Visually, for a hopefully more intuitive grasp of the issue, this is how weights/RPS currently get distributed among a number of hosts:
![ratelimiter - orig](https://github.com/user-attachments/assets/0956a507-cd22-4874-843e-7d987d1a32b9)

Low weights get a low ratio of the allowed RPS, so they are also only allowed to grow a little, while high-weight hosts are allowed to grow a lot. If we assume traffic arrives at frontends roughly randomly, this is fine (if they grow, they will all grow by the same %), but in practice we see a lot of "constant high use" to some hosts, and "occasional low use" to others, e.g. due to different services performing different call patterns and being routed differently.

Mathematically this makes sense... but it's relatively easy for something that only sends a couple requests every 10 seconds or so to eventually be allocated ~1 RPS because its weight is legitimately that low. But then nearly all of those low-RPS requests are rejected despite there (possibly) being more than enough free-RPS in the cluster.

This PR essentially changes the behavior to look like this:
![ratelimiter - changed](https://github.com/user-attachments/assets/0193bfdf-687c-4b6e-805b-78d411c65a6f)

If there is unused-RPS in the cluster, low-weight hosts are allowed to increase their RPS up to their fallback-behavior's limit. This is limited by the amount of unused-RPS, so as usage increases it will move towards the un-boosted limit (i.e. weight only), and the overall RPS should converge at the target RPS.
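In code terms, the boost applied in the collection looks roughly like the following. This is a sketch only: the real boostRPS has more edge-case handling, and the parameter types here are assumptions rather than the actual signature.

```go
package collectionsketch // hypothetical package, for illustration only

import "golang.org/x/time/rate"

// Sketch of the boosting rule described above, not the actual implementation.
//   target:         the key's cluster-wide target RPS
//   fallbackTarget: the limit the local fallback limiter would use
//   weight:         this host's share of the target (0..1)
//   usedRPS:        RPS actually accepted across the cluster for this key
func boostRPS(target, fallbackTarget rate.Limit, weight, usedRPS float64) rate.Limit {
	weighted := target * rate.Limit(weight) // the un-boosted, weight-only limit
	if weighted >= fallbackTarget {
		return weighted // already at or above the fallback limit: no boost needed
	}
	unused := target - rate.Limit(usedRPS) // free RPS in the cluster
	if unused <= 0 {
		return weighted // target fully used: behave exactly like the weight-only limit
	}
	// Boost toward the fallback limit, but never by more than the unused RPS,
	// so the total allowed RPS still converges near the target as usage grows.
	boost := fallbackTarget - weighted
	if unused < boost {
		boost = unused
	}
	return weighted + boost
}
```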

Under high usage, it looks more like this (only a small boost added to low weight hosts):
![ratelimiter - changed high usage](https://github.com/user-attachments/assets/d005541f-4f82-4344-924f-a61159805481)

How the unused RPS is best-distributed is... fuzzy. Here I've taken the easy and low-data-needed option: just use it all, independently, in each low-weight host. This is more likely to prevent unfairly-blocked requests, but allows more requests in a multi-host burst (until re-balanced).
Another option considered (and originally planned) was to distribute it between all contributing low-weight hosts, but this needs to either exchange more data or put target-RPS knowledge into both the limiter and the aggregator side of this system. I think the latter will eventually happen and it's basically fine if/when it does, but it's a larger change than I wanted to do right now, and the former just didn't feel worth the extra logical complexity. Simpler is better with this kind of system, because it's more understandable and predictable.

The original idea has a simulated implementation, with a lot of tests that run a simulated traffic pattern and log how weights / individual hosts / etc behave: #6196
That PR will likely not be merged: it served its purpose of showing whether or not my intuition about the system's behavior as a whole was correct, and the large chunk of code needed to do that is probably not worth maintaining.

codecov bot commented Aug 19, 2024

Codecov Report

Attention: Patch coverage is 91.48936% with 4 lines in your changes missing coverage. Please review.

Project coverage is 73.02%. Comparing base (6ba8ad8) to head (86dfe12).
Report is 1 commit behind head on master.

| Files | Patch % | Lines |
|---|---|---|
| common/quotas/global/rpc/client.go | 50.00% | 1 Missing and 1 partial ⚠️ |
| common/quotas/global/collection/collection.go | 94.11% | 0 Missing and 1 partial ⚠️ |
| common/quotas/global/rpc/mapping.go | 93.33% | 1 Missing ⚠️ |
Additional details and impacted files
| Files | Coverage Δ |
|---|---|
| common/quotas/global/algorithm/requestweighted.go | 95.53% <100.00%> (-0.03%) ⬇️ |
| ...mmon/quotas/global/collection/internal/fallback.go | 96.49% <100.00%> (+0.12%) ⬆️ |
| service/history/handler/handler.go | 88.13% <100.00%> (ø) |
| common/quotas/global/collection/collection.go | 85.02% <94.11%> (+1.00%) ⬆️ |
| common/quotas/global/rpc/mapping.go | 83.33% <93.33%> (+0.98%) ⬆️ |
| common/quotas/global/rpc/client.go | 73.84% <50.00%> (-2.35%) ⬇️ |

... and 4 files with indirect coverage changes



Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Comment on lines +474 to +478
target := rate.Limit(c.targetRPS(lkey))
limiter := c.global.Load(lkey)
fallbackTarget := limiter.FallbackLimit()
boosted := boostRPS(target, fallbackTarget, info.Weight, info.UsedRPS)
limiter.Update(boosted)

Groxx (Member, Author):

everything else is in service of this change here: boost limits before applying them.

@davidporter-id-au davidporter-id-au (Member) left a comment

At first glance I can't fault anything here; I'll try to take another look tomorrow when I'm less tired.

I appreciate the effort you've gone to in explaining, visually and mathematically, what you're thinking, and I'm sure future engineers will appreciate it too. This sets a high bar for us.

Groxx added a commit to cadence-workflow/cadence-idl that referenced this pull request Aug 23, 2024
IDL changes needed to build cadence-workflow/cadence#6238

Basically this is just: return the "used RPS" that was already being tracked.

For compatibility purposes: while deploying, this will lead to version-crossing requests failing due to an unknown type, which will eventually put those keys into the "failing" mode.
This is currently _intended_, to serve as a full test of that fallback system, but should not be done in general.

Due to a different test (broken hashring) we have already seen the fallback system kick in correctly, so I'm fairly confident that this will work as intended.  Still, it's probably worth testing, particularly because it's easy.

---

This only changes thrift because this type is serialized _outside_ RPC, just like our database types.  There is no proto equivalent, at best it'd be these-serialized-thrift-bytes sent in a proto wrapper.
	map[shared.GlobalKey]rpc.UpdateEntry{
		"test:domain-user-limit": {
			Weight: 1,
			// re 10 vs 15: used RPS only tracks accepted.

Member:

not sure what 15 refers to here?

Groxx (Member, Author):

	update, err := rpc.TestUpdateToAny(t, "testhost", time.Second, map[shared.GlobalKey]rpc.Calls{
		"test:domain-user-limit": {
			Allowed:  10,
			Rejected: 5,
		},
	})

^ I inserted 10 allowed and 5 rejected calls before querying for weights, so 15 calls were made in total, but only the 10 accepted ones count toward "used RPS".

@davidporter-id-au (Member) commented:

I would suggest putting the PR docs into an MD file describing how to think about the rate-limiter's tradeoffs and algorithm.


Groxx commented Aug 27, 2024

> I would suggest putting the PR docs into an MD file describing how to think about the rate-limiter's tradeoffs and algorithm.

Yea, I think I'll do that once I'm done with the changes and it's fully "on" in production. Better than delaying things at the moment :|

@Groxx Groxx merged commit 2bf77c9 into cadence-workflow:master Aug 27, 2024
20 checks passed
jakobht pushed a commit to jakobht/cadence that referenced this pull request Sep 3, 2024
…orkflow#6238)

Groxx added a commit to Groxx/cadence that referenced this pull request Sep 12, 2024
…r new ratelimiters

Groxx added a commit that referenced this pull request Sep 12, 2024
…r new ratelimiters (#6280)

# Motivation:
The global ratelimiter system was exhibiting some weird request-rejection at very low RPS usage.
On our dashboards it looks like this:
![Screenshot 2024-09-11 at 18 55 09](https://github.com/user-attachments/assets/8236d945-0f8f-45f9-9a9c-5c908f386ae4)

Previously I thought this was just due to undesirably-low weights, and #6238 addressed that (and is still a useful addition).

After that was rolled out, behavior improved, but small numbers still occurred... which should not have happened because the "boosting" logic should have meant that the global limits were *at least* identical, and likely larger.

Which drove me to re-read the details and think harder.  And then I found this PR's issue.

# Issue and fix

What was happening is that the initial `rate.NewLimiter(0,0)` detail was "leaking" into limits after the first update, so a request that occurred immediately after would likely be rejected, regardless of the configured limit.

This happens because `(0, 0)` creates a zero-burst limit on the "primary" limiter, and the shadowed `.Allow()` calls were advancing the limiter's internal "now" value...
... and then when the limit and burst were increased, the limiter would have to fill from zero.

This put it in a worse position than local / fallback limiters, which start from `(local, local)` with a zero "now" value, and then the next `.Allow()` is basically guaranteed to fill the token bucket due to many years "elapsing".
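
For illustration, the failure mode can be reproduced with the stock golang.org/x/time/rate limiter directly (Cadence's wrapper differs in detail, so treat this as a sketch of the mechanism rather than the exact code path):

```go
package main

import (
	"fmt"

	"golang.org/x/time/rate"
)

func main() {
	// Started as (0, 0) and updated later: the bucket must refill from zero,
	// so a request right after the update is rejected despite the generous limit.
	leaky := rate.NewLimiter(0, 0)
	_ = leaky.Allow() // rejected; depending on the library version this may also advance internal time
	leaky.SetLimit(100)
	leaky.SetBurst(100)
	fmt.Println(leaky.Allow()) // false: ~zero tokens have accumulated since the update

	// A fresh limiter (like the local / fallback ones) fills its bucket on first use.
	fresh := rate.NewLimiter(100, 100)
	fmt.Println(fresh.Allow()) // true
}
```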

So the fix has two parts:

1: Avoid advancing the zero-valued limiter's internal time until a reasonable limit/burst has been set.
This is done by simply not calling it while in startup mode.

2: Avoid advancing limiters' time when setting limit and burst.
This means that after an idle period -> `Update()` -> `Allow()`, tokens will fill as if the new values were set all along, and the setters can be called in any order.

The underlying `rate.Limiter` does *not* do the second, it advances time when setting these... but that seems undesirable.
It means old values are preferred (which is reasonable, they were set when that time passed), *and* it means that the order you call to set both burst and limit has a significant impact on the outcome, even with the same values and the same timing: time passes only on the first call, the second has basically zero elapsed and has no immediate effect at all (unless lowering burst).  I can only see that latter part as surprising, and definitely worth avoiding.
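
A minimal sketch of idea 2 on top of the stock limiter: remember the last timestamp the wrapped rate.Limiter has seen and reuse it when changing limit and burst, so the setters consume no elapsed time. The names here are hypothetical; the real wrapper in Cadence is more involved, and fix 1 (not touching the uninitialized limiter during startup) is not shown.

```go
package ratelimitersketch // hypothetical package, for illustration only

import (
	"sync"
	"time"

	"golang.org/x/time/rate"
)

// nonAdvancingLimiter is a hypothetical wrapper: updating limit/burst reuses the
// last timestamp the underlying limiter observed, so the update itself neither
// consumes nor grants elapsed time. Tokens then fill on the next Allow as if the
// new values had been set all along, and the setter order no longer matters.
type nonAdvancingLimiter struct {
	mu   sync.Mutex
	lim  *rate.Limiter
	last time.Time // last timestamp passed to the underlying limiter
}

func newNonAdvancingLimiter(limit rate.Limit, burst int) *nonAdvancingLimiter {
	return &nonAdvancingLimiter{lim: rate.NewLimiter(limit, burst)}
}

func (l *nonAdvancingLimiter) Allow() bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.last = time.Now()
	return l.lim.AllowN(l.last, 1)
}

func (l *nonAdvancingLimiter) Update(limit rate.Limit, burst int) {
	l.mu.Lock()
	defer l.mu.Unlock()
	// Zero elapsed time from the limiter's point of view: no token fill or loss here.
	l.lim.SetLimitAt(l.last, limit)
	l.lim.SetBurstAt(l.last, burst)
}
```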

# Alternative approach

2 seems worth keeping.  But 1 has a relatively clear alternative:
Don't create the "primary" limiter until the first `Update()`.

Because it's currently atomic-oriented, this can't be done safely without adding atomics or locks everywhere... so I didn't do that.
If I were to do this, I would just switch to a mutex, the `rate.Limiter` already uses them so it should be near zero cost.
I'm happy to build that if someone prefers, I just didn't bother this time.