Handle timeouts possible in Khepri minority in rabbit_db_exchange #11785

Merged: 8 commits, Jul 24, 2024


the-mikedavis (Member)

Continuation of the changes like in #11685 and #11706 but for exchanges.

The changes for exchanges are a bit more involved than prior PRs. Here's a summary of the changes:

  • rabbit_exchange:declare/7's spec is updated to return {ok, #exchange{}} | {error, timeout} where it returned just #exchange{} before
    • this is safe to change since the function is only ever called through RPC in the CLI's test helpers
  • Callers of rabbit_exchange:delete/3 need some small changes to handle an {error, timeout} return
    • I've added a rabbit_exchange:ensure_deleted/3 wrapper that covers the common case for callers where ok | {error, not_found} are handled the same way.
  • Set timeout => infinity for the calls that update the exchange serial
    • Serial updates are always made after another database operation (for example adding or deleting an exchange)
    • Ideally we would update the serial in a transaction but this would take a larger change, so I've left it for future work.

A common case for exchange deletion is that callers want the deletion
to be idempotent: they treat the `ok` and `{error, not_found}` returns
from `rabbit_exchange:delete/3` the same way. To simplify these
callsites we add a `rabbit_exchange:ensure_deleted/3` that wraps
`rabbit_exchange:delete/3` and returns `ok` when the exchange did not
exist. Part of this commit is to update callsites to use this helper.
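
A minimal sketch of the wrapper's contract follows. It is a language-neutral model in Python, not the Erlang implementation: atoms are modeled as strings and tagged tuples as Python tuples (so `{error, not_found}` becomes `("error", "not_found")`), and `fake_delete` is a hypothetical in-memory stand-in for `rabbit_exchange:delete/3`. The argument order (name, if-unused flag, username) is an assumption made for illustration.

```python
from typing import Callable, Set, Tuple, Union

# Assumption: Erlang atoms are modeled as str, tagged tuples as tuple.
Result = Union[str, Tuple[str, str]]
DeleteFun = Callable[[str, bool, str], Result]


def ensure_deleted(delete: DeleteFun, name: str, if_unused: bool,
                   username: str) -> Result:
    """Idempotent delete: a missing exchange counts as already deleted."""
    result = delete(name, if_unused, username)
    if result == ("error", "not_found"):
        return "ok"
    # "ok" passes through unchanged, and so does every other error,
    # notably ("error", "timeout") when Khepri is in a minority.
    return result


def fake_delete(existing: Set[str]) -> DeleteFun:
    """Hypothetical in-memory stand-in for rabbit_exchange:delete/3."""
    def delete(name: str, if_unused: bool, username: str) -> Result:
        if name not in existing:
            return ("error", "not_found")
        existing.discard(name)
        return "ok"
    return delete


if __name__ == "__main__":
    delete = fake_delete({"my.exchange"})
    print(ensure_deleted(delete, "my.exchange", False, "guest"))  # ok
    print(ensure_deleted(delete, "my.exchange", False, "guest"))  # ok again
```

Only `not_found` is folded into success, so a timeout still reaches the caller and has to be handled there.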

The other part is to handle the `rabbit_khepri:timeout()` error that is
possible when Khepri is in a minority. For most callsites this is just a
matter of adding a branch to their `case` expressions that returns an
appropriate error and message.
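
To make the extra branch concrete, the sketch below models a callsite's `case` expression as an `if` chain in Python. The function name, the messages and the catch-all clause are hypothetical; the catch-all mirrors the `case_clause` error Erlang raises when no clause matches, which is what an unhandled `{error, timeout}` would run into.

```python
from typing import Tuple, Union

# Assumption: atoms as str, tagged tuples as tuple (e.g. {error, timeout}).
Result = Union[str, Tuple[str, str]]


def report_delete(name: str, result: Result) -> Tuple[str, str]:
    """Hypothetical callsite turning a delete result into (status, message)."""
    if result == "ok":
        return ("ok", f"deleted exchange '{name}'")
    if result == ("error", "not_found"):
        return ("error", f"exchange '{name}' not found")
    if result == ("error", "timeout"):
        # The new branch: the write could not be committed while Khepri
        # is in a minority.
        return ("error",
                f"timed out deleting exchange '{name}'; the metadata store "
                "may be unavailable")
    # Like an Erlang case expression with no matching clause (case_clause).
    raise ValueError(f"case_clause: {result!r}")
```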

The exchange serial updates are unlikely to time out, since the serial
number is always updated after some other transaction, for example
adding or deleting an exchange.

In the future we could consider moving the serial updates into those
transactions. In the meantime we can remove the possibility of timeouts
by giving the serial update unlimited time to finish.
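
The trade-off behind `timeout => infinity` can be shown with an analogy in Python (this is not RabbitMQ or Khepri code). A worker thread stands in for the command that updates the exchange serial, a `threading.Event` stands in for the cluster regaining a majority, and `timeout=None` on `Future.result` plays the role of an infinite timeout.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Optional, Tuple, Union

Result = Union[str, Tuple[str, str]]


def wait_for_update(future, timeout: Optional[float]) -> Result:
    """Wait for the update; report ("error", "timeout") if the wait expires."""
    try:
        return future.result(timeout=timeout)
    except FutureTimeout:
        return ("error", "timeout")


def demo() -> Tuple[Result, Result]:
    # Set once the (simulated) cluster can commit the update again.
    committed = threading.Event()

    def update_serial() -> str:
        committed.wait()  # blocks like a command issued during a minority
        return "ok"

    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(update_serial)
        bounded = wait_for_update(future, timeout=0.05)  # expires first
        committed.set()  # the majority is restored
        unbounded = wait_for_update(future, timeout=None)  # waits until done
    return bounded, unbounded
```

The bounded wait reports a timeout even though the update eventually succeeds; the unbounded wait cannot time out, at the cost of blocking for as long as the update takes.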

The spec of `rabbit_exchange:declare/7` needs to be updated to return
`{ok, Exchange} | {error, Reason}` instead of the old return value of
`rabbit_types:exchange()`. This is safe to do since `declare/7` is not
called by RPC (from the CLI or otherwise) outside of test suites, and
in test suites it is called only through the CLI's
`TestHelper.declare_exchange/7`. Callers of this helper are updated in
this commit.

Otherwise this commit updates callers to unwrap the `{ok, Exchange}`
and bubble up errors.
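
A rough model of "unwrap and bubble up", again in Python with atoms as strings and tagged tuples as tuples. `Exchange` is a hypothetical stand-in for the `#exchange{}` record, and `declare_then` with its `next_step` callback is an illustrative name, not a RabbitMQ function.

```python
from dataclasses import dataclass
from typing import Callable, Tuple, Union


@dataclass(frozen=True)
class Exchange:
    """Hypothetical stand-in for the #exchange{} record."""
    name: str
    type: str


# Either ("ok", Exchange) or ("error", Reason).
DeclareResult = Union[Tuple[str, Exchange], Tuple[str, str]]


def declare_then(declare: Callable[[str, str], DeclareResult],
                 name: str, type_: str,
                 next_step: Callable[[Exchange], object]) -> object:
    """Unwrap {ok, Exchange} and continue; return {error, Reason} unchanged."""
    result = declare(name, type_)
    tag, payload = result
    if tag == "ok":
        return next_step(payload)
    # Bubble the error up, e.g. ("error", "timeout") in a Khepri minority.
    return result
```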
the-mikedavis self-assigned this on Jul 22, 2024
the-mikedavis marked this pull request as ready for review on July 24, 2024 at 14:42
`rabbit_amqp_management` returns HTTP status codes to the client. 503
means that a service is unavailable (which Khepri is while it is in a
minority) so it's a more appropriate code than the generic 500
internal server error.
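
A sketch of the status-code choice, using Python's standard `http.HTTPStatus`. The helper name `error_response` and its string reason argument are hypothetical; only the timeout-to-503 mapping comes from the commit message above, with everything else falling back to 500.

```python
from http import HTTPStatus
from typing import Tuple


def error_response(reason: str) -> Tuple[int, str]:
    """Hypothetical mapping from an error reason to (status code, phrase)."""
    if reason == "timeout":
        # Khepri in a minority: the service is temporarily unavailable.
        status = HTTPStatus.SERVICE_UNAVAILABLE
    else:
        # Anything unexpected stays a generic internal server error.
        status = HTTPStatus.INTERNAL_SERVER_ERROR
    return (status.value, status.phrase)
```

A 503 also tells well-behaved clients that retrying later may succeed, which a generic 500 does not.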
the-mikedavis force-pushed the md/khepri-minority-errors/rabbit_db_exchange branch 2 times, most recently from 2bf9298 to 2259f84 on July 24, 2024 at 15:27
This fixes a potential crash in `rabbit_amqp_management` where we tried
to format the exchange resource as a string (`~ts`). The other changes
are cosmetic.
the-mikedavis force-pushed the md/khepri-minority-errors/rabbit_db_exchange branch from 2259f84 to b56abee on July 24, 2024 at 15:32
the-mikedavis merged commit 4207faf into main on Jul 24, 2024
191 checks passed
the-mikedavis deleted the md/khepri-minority-errors/rabbit_db_exchange branch on July 24, 2024 at 17:11
michaelklishin added a commit that referenced this pull request Jul 24, 2024
Handle timeouts possible in Khepri minority in `rabbit_db_exchange` (backport #11785)
3 participants