
Global ratelimiter, part 2: Any-typed RPCs, mappers, and stub handler #5817

Merged: 15 commits into cadence-workflow:master from ratelimit-rpc on Apr 26, 2024

Conversation

@Groxx Groxx commented Mar 26, 2024

Docs are broadly the same as cadence-workflow/cadence-idl#168, but are repeated here for visibility with some additions at the end for stuff unique to this repo:

# Any type:

This API is purely for inter-Cadence-service use, and the Thrift equivalent is in the github.com/uber/cadence-idl repository: cadence-workflow/cadence-idl#168

Design-wise: this has gone through several rounds of adjustment and experimentation, and nearly all top-level keys I can come up with are either totally insufficient or totally unnecessary for one ratelimit design or another, so I've leaned more and more towards "define nothing, allow passing anything".

In the extreme, this means: just one `Any` field. We can add more later if needed / if we discover some kind of truly universal data that is always worth including.
Obviously this is not super ergonomic, but as the ratelimiter is intended to be pluggable by outside implementations it cannot *wholly* be defined in internal protocol definitions. There needs to be an extendable type of some kind, and arbitrarily requiring e.g. a `list<?>` or `map<string,?>` doesn't actually lend any beneficial semantics to the system, nor reduce space on the wire.

Implementers will need to maintain in-process definitions on both ends of the protocol, and check type IDs before decoding. Generally speaking this is probably best done with either a shared protobuf-or-something definition (anywhere) or something schema-free like JSON, so cross-version communication can be done safely e.g. during server upgrades.
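As a rough sketch of that check-the-type-ID-then-decode pattern (the container's field names and the type-ID string below are made up for illustration, not the real generated types):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Any stands in for the wire-level container: an identifying type-ID string
// plus opaque encoded bytes. Field names here are assumptions, not the real type.
type Any struct {
	ValueType string
	Value     []byte
}

// Weights is a hypothetical payload that both ends agree on out-of-band.
type Weights struct {
	RPS map[string]float64 `json:"rps"`
}

const weightsTypeID = "example/ratelimit-weights-v1" // hypothetical type ID

func decodeWeights(a Any) (Weights, error) {
	var w Weights
	// check the type ID before touching the bytes, so data from a newer or
	// differently-configured peer fails loudly instead of half-decoding
	if a.ValueType != weightsTypeID {
		return w, fmt.Errorf("unrecognized value type: %q", a.ValueType)
	}
	return w, json.Unmarshal(a.Value, &w)
}

func main() {
	a := Any{ValueType: weightsTypeID, Value: []byte(`{"rps":{"some-key":12.5}}`)}
	w, err := decodeWeights(a)
	fmt.Println(w, err)
}
```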

# Intended use

Eventually this will be part of a pluggable global-ratelimit-data-exchange system, for anything that can make use of the high level "periodically check in to sharded hosts, share data, update limits" pattern that github.com/uber/cadence/common/quotas/global is building up.

The first version will end up using inside-`Any` structures like this:

```go
package ratelimitexample // illustrative only

import "time"

type Request struct {
	Caller  string        `json:"caller"`
	Elapsed time.Duration `json:"elapsed"`
	// a compact map[ratelimit key][allowed, rejected] structure.
	// this could be a regular proto/thrift map<string, struct> instead,
	// but that'd have lots of repeated field names in JSON
	Data map[string][2]int `json:"data"`
}

type Response struct {
	// keyed by ratelimit key
	RPS map[string]float64 `json:"rps"`
}
```

which will probably just use JSON for simplicity, at least for starters.
Or I'll make a new proto struct and shove that in there instead, using proto-encoded data for both thrift and proto transports.
It doesn't particularly matter.
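For concreteness, the JSON-for-starters encode side might look roughly like this sketch. It reuses the `Request` struct above and an `Any` container shaped like the one sketched earlier (a type-ID string plus raw bytes); the type ID itself is made up:

```go
// wrapRequest JSON-encodes a Request and wraps it in the Any container,
// tagged with an agreed-upon type ID so the receiving side knows how to decode it.
func wrapRequest(r Request) (Any, error) {
	data, err := json.Marshal(r)
	if err != nil {
		return Any{}, err
	}
	return Any{
		ValueType: "example/ratelimit-request-v1", // hypothetical type ID
		Value:     data,
	}, nil
}
```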

These request/response structures match the API defined in github.com/uber/cadence/common/quotas/global/algorithm/requestweighted.go, and there will be a fairly simple mapper to convert to/from the `Any` types defined here to make things work.

# Why `Any` instead of strict types?

Because:

  • Types we define in these internally-controlled APIs cannot be modified by anyone running Cadence without maintaining a full fork of multiple repos.
  • Rate-limiting details are assumed to be relatively unique to internal networking setups, and are therefore not something we intend to fix centrally in every possible variation that server-hosts encounter.

So instead this is loosely typed, and the API will allow Cadence-wrapping developers to add wholly externally defined types as needed, and use DynamicConfig to shadow/enable/disable/etc between any registered algorithms. If your company needs to e.g. share custom info from Kubernetes to decide how many requests host X can allow (say it's based on number of cores allocated), you will have a relatively simple way to build that and safely try it out.

# Why not `google.protobuf.Any`?

Covered in code comments, but mostly: because that would _require_ using both protobuf _and_ the same protobuf code-generator we use, so the types can be registered and decoded automatically.
That auto-decoding isn't necessary (nor even particularly useful) though; it can be deferred to the ratelimiter plugin, and then it can be entirely encoding-agnostic.

So that's what this does. It's not a `google.protobuf.Any` because it doesn't _have_ to be, and you're free to put a `google.protobuf.Any` inside your `types.Any` if ~~yo dawg, I heard you like loose typing~~ you want to do so.

# Mappers

As the RPC specs and the internal type do essentially no interpreting of their contents, the mappers are both trivial and complete.

To actually use an `Any` requires custom code per `ValueType`, which will come in a later PR.

I've written far more mapping-test code than these mappers really deserve, to try to figure out what some fuzzed round-trip tests might look like for this and other types.
As part of that, I discovered that https://github.com/google/gofuzz has some rather severe issues when generating protobuf types (e.g. it unavoidably panics if it finds any interface fields, which `oneof` fields use), and was recently archived, so I think we might need to come up with something else before going too much further. Maybe just a fork?
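For reference, the basic shape of such a round-trip test looks something like the sketch below; the types and mapper functions here are stand-ins, not the actual ones in common/types and common/types/mapper:

```go
package mapper_test

import (
	"testing"

	"github.com/stretchr/testify/assert"
)

// stand-in internal and wire-level representations of the Any container
type internalAny struct {
	ValueType string
	Value     []byte
}
type wireAny struct {
	ValueType string
	Value     []byte
}

// identity stand-ins for the real mappers
func fromAny(a internalAny) wireAny { return wireAny(a) }
func toAny(a wireAny) internalAny   { return internalAny(a) }

func TestAnyRoundTrip(t *testing.T) {
	item := internalAny{
		ValueType: "test-type-id",
		Value:     []byte(`{"rps":{"some-key":1.5}}`),
	}
	wire := fromAny(item)
	// assert on the wire values too, not only the round trip, so a pair of
	// "bug-for-bug compatible" mappers still fails the test
	assert.Equal(t, item.ValueType, wire.ValueType)
	assert.Equal(t, item.Value, wire.Value)
	assert.Equal(t, item, toAny(wire))
}
```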

# Handler

Fully wiring this up will come later, and at a very high level will require:

  • a common/quota wrapper to collect usage metrics in the "limiting" hosts (frontend), which will:
    • periodically, send bulk RatelimitUpdateRequests to "aggregating" hosts (history)
    • use the responses to update quotas as needed
  • dynamicconfig to enable/disable/shadow/roll-back all this
  • relatively simple handler/encoding/decoding code to convert between the "real" types

All of which is essentially out of scope for this PR.
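Still, to make the eventual flow a bit more concrete, here's a rough sketch of the "limiting host" loop those bullets describe. Every name below is a hypothetical stand-in; only the collect-send-apply shape comes from the description above:

```go
package quotasketch

import (
	"context"
	"time"
)

// hypothetical stand-ins for the eventual request/response payloads
type usageReport struct {
	Elapsed time.Duration
	Data    map[string][2]int // [allowed, rejected] per ratelimit key
}
type weightUpdate struct {
	RPS map[string]float64 // new per-key RPS limits
}

type sendFunc func(ctx context.Context, req usageReport) (weightUpdate, error)

// runUpdateLoop periodically ships local usage to an aggregating host and
// applies whatever per-key limits come back; on error it keeps the current
// limits and retries on the next tick.
func runUpdateLoop(ctx context.Context, every time.Duration, collect func() usageReport, send sendFunc, apply func(map[string]float64)) {
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			resp, err := send(ctx, collect())
			if err != nil {
				continue
			}
			apply(resp.RPS)
		}
	}
}
```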

@coveralls

coveralls commented Mar 26, 2024

Pull Request Test Coverage Report for Build 018f177a-a26a-4e7b-9e03-fba9286cf461

Details

  • 65 of 65 (100.0%) changed or added relevant lines in 5 files are covered.
  • 9 unchanged lines in 3 files lost coverage.
  • Overall coverage increased (+0.04%) to 67.737%

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| service/matching/taskListManager.go | 2 | 80.65% |
| service/history/task/fetcher.go | 3 | 85.57% |
| service/history/task/transfer_standby_task_executor.go | 4 | 86.21% |

Totals Coverage Status
  • Change from base Build 018f1774-0e6f-4579-bdda-cd9b823057df: 0.04%
  • Covered Lines: 99403
  • Relevant Lines: 146748

💛 - Coveralls


codecov bot commented Mar 26, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 62.61%. Comparing base (3a50a81) to head (92b5a91).

Additional details and impacted files
| Files | Coverage | Δ |
|---|---|---|
| common/types/history.go | 0.00% <ø> | (ø) |
| common/types/mapper/proto/history.go | 99.22% <100.00%> | (+<0.01%) ⬆️ |
| common/types/mapper/proto/shared.go | 86.06% <100.00%> | (+0.14%) ⬆️ |
| common/types/mapper/thrift/history.go | 99.49% <100.00%> | (+<0.01%) ⬆️ |
| common/types/mapper/thrift/shared.go | 97.62% <100.00%> | (+<0.01%) ⬆️ |
| service/history/handler/handler.go | 23.34% <100.00%> | (+0.12%) ⬆️ |

... and 4 files with indirect coverage changes


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@Groxx Groxx changed the title Global ratelimiter, part 2: any-typed RPCs, mappers, and stub handler Global ratelimiter, part 2: Any-typed RPCs, mappers, and stub handler Mar 26, 2024
Groxx added a commit to cadence-workflow/cadence-idl that referenced this pull request Mar 26, 2024
…rift (#168)

```diff
@@ -363,3 +365,22 @@ func TestHistoryGetTaskInfoResponse(t *testing.T) {
 		assert.Equal(t, item, ToHistoryGetFailoverInfoResponse(FromHistoryGetFailoverInfoResponse(item)))
 	}
 }
 
 func TestRatelimitUpdate(t *testing.T) {
```
Member

Any specific reason not to follow the table-based approach used in other tests in this file? They're much shorter, and it's easier to add more cases.

Member Author

it'd be moderately larger after accounting for the data in the other files.
but that's not really relevant here, this type probably shouldn't ever gain other fields.

those tests are also simpler because they're only round-trip checks, not "input X produces output Y". that won't catch bug-for-bug compatibility between mappers, nor missed fields (which is why I spent the time to figure out a fuzzing pattern too).

```go
	request *types.RatelimitUpdateRequest,
) (*types.RatelimitUpdateResponse, error) {
	// TODO: wire up to real global-ratelimit aggregator
	return nil, nil
```
Member

nit: return unimplemented err

go.work.sum (outdated review thread, resolved)
```go
networkBytes := encode(thriftAny) // yarpc does this
// ^ this is what's sent over the network.

// the final data is not double-encoded, so this "encode -> wrap -> encode" process is reasonably efficient.
```
Member

why is it not double-encoded? I may have missed that part

Member Author

(also via chat but for public purposes...)

semantically: it doesn't matter, no behavior depends on it.

practically: because thrift (and I'm pretty sure proto) store binary blobs as size, data... so there's no need to e.g. escape \0 as \\0 or base64-encode it or other costly stuff. it's just a memcpy.

my main goal here is to point out that anything you'd do with normal thrift is reasonable to do when wrapped by this Any-container because it doesn't meaningfully increase any of the runtime costs. push gigabytes / million-field things through it, it's entirely fine and no different than using the non-wrapped type.

Member Author

I explained this more in the test comments now btw

Comment on lines -22 to +23

```
// DO NOT include, tools dependencies are intentionally separate.
// ./internal/tools
// DO NOT include, tools dependencies are intentionally separate.
// ./internal/tools
```
Member Author

`go work sync` on go 1.20 did this for some reason.

no idea why, but it wants it, and it seems harmless.

@davidporter-id-au davidporter-id-au left a comment

lgtm, can't fault anything there

@Groxx Groxx merged commit 9cbff9c into cadence-workflow:master Apr 26, 2024
20 checks passed
@Groxx Groxx deleted the ratelimit-rpc branch April 26, 2024 03:07