
[Producer] Discovery: Sad path for event production #54

Closed
2 tasks
robrap opened this issue Aug 10, 2022 · 8 comments
Labels: event-bus (Work related to the Event Bus)


robrap commented Aug 10, 2022

Discovery for error handling of event production may result in an implementation (or POC branches), and/or documentation and further ticketing.

  • Document answers to questions, or note which questions we will defer until future events are implemented with other requirements. Some questions may be fully answered; others may be deferred (and left in the doc and/or ticketed as appropriate, so we don't build a backlog of an infinite number of tickets we'll never get to).
  • Discuss the implications of decisions with TNL, who owns the Studio event publish. This may result in future tickets. We want to ensure the work is in a place where everyone feels comfortable with a hand-off at some point.

We want to explore the space of 'things that could go wrong with event production', and this ticket is for enumerating the ones we know about.

  • Add more questions and answers.
  • Consider distributed nature questions, and not just error handling.
  • CAP Theorem

Questions:

  • Are there Kafka vs application errors, and will these be handled differently?
  • What do we do if we can't publish an event that allows us to recover later? How do we recover?
  • see [Producer] Make sure we don't lose events on producer shutdown openedx/event-bus-kafka#11: what do we do to ensure events are handled on server shutdown?
  • What kind of reliability guarantees do we want?
  • How will we discover consistency failures and/or unacceptably high latency of some events?
    • Might encounter an exception when trying to produce an event after a transaction completes (in on-commit)
    • An event might fail to send, but the failure is later rectified because a newer version of the data is successfully sent
    • Production failures could be intermittent (random 1 in 1000) or come in big groups (e.g. Kafka is down or unreachable for 20 minutes)
    • It is also possible to produce counterfactual events that need to be undone/overwritten (a successful send inside a transaction that is then rolled back), although we really should be using on-commit for all of this so that we err on the side of missing events.
  • If we discover a consistency failure, how will we recover? Will each producer need a management command to re-produce events for a given set of keys? (Will our initial startup sync/migration/cutover process use the same mechanism?)
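The on-commit exception concern above can be sketched as follows. This is a hypothetical illustration, not the actual openedx implementation: the `produce_event` and `flaky_send` names and the in-memory `failed_events` list are made up. The point is that once the transaction has committed, a producer exception can no longer roll anything back, so the only safe move is to catch it and capture enough information to recover later:

```python
# Sketch (hypothetical names): catch producer exceptions raised after the
# transaction has already committed, and record the event for later recovery
# instead of letting the failure propagate.
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("event_producer")

failed_events = []  # stand-in for whatever persistent capture we choose


def produce_event(topic, key, value, send):
    """Try to send; on failure, record enough to recover later."""
    try:
        send(topic, key, value)
    except Exception:
        log.exception("Failed to produce event to %s (key=%s)", topic, key)
        failed_events.append({"topic": topic, "key": key, "value": value})


def flaky_send(topic, key, value):
    # Simulates the broker being unreachable at on-commit time.
    raise ConnectionError("broker unreachable")


produce_event("course-events", "course-v1:edX+Demo",
              {"action": "published"}, flaky_send)
```

An in-memory list obviously doesn't survive a restart; it only marks the seam where a durable store (see the discussion below about persistent storage) would plug in.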
@robrap robrap changed the title Discovery: Error handing for even production Discovery: Error handling for event production Aug 10, 2022
@robrap robrap added this to the [Event Bus] Implement sad-path milestone Aug 10, 2022
@robrap robrap moved this to Todo in Arch-BOM Aug 10, 2022
dianakhuang commented:

Some discussion of handling shutdown of producer can be found: openedx/event-bus-kafka#11

@dianakhuang dianakhuang changed the title Discovery: Error handling for event production Discovery: Sad path for event production Aug 25, 2022
@robrap robrap changed the title Discovery: Sad path for event production [Producer] Discovery: Sad path for event production Aug 25, 2022
@rgraber rgraber moved this from Todo to In Progress in Arch-BOM Sep 6, 2022
@rgraber rgraber self-assigned this Sep 6, 2022
rgraber commented Sep 6, 2022

At the very least, we're probably going to have to consider three different types of issues:

  1. We can't send anything to any topic until we restart the server (we misconfigured something important or otherwise have busted producer code)
  2. We can't send anything to any topic until some unspecified time (Confluent is Just Down or someone needs to go into the Confluent UI and fix a permission or something). There's a risk the server will be restarted within this time.
  3. We can't send some kinds of messages or messages to certain topics but we can send others (a serializer is messed up, or we get an unexpected value, or a topic permission specifically is missing).

1/2 are probably harder, since really the only safe thing to do is to have some sort of persistent storage (like a DB table) that keeps track of events that happened but never made it to the event bus at all. How this is implemented would probably vary wildly between services.
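One shape that per-service persistent storage could take is a simple table of events that never made it out, as a sketch (table and column names here are invented for illustration, shown with sqlite for self-containment; a real service would use its own DB):

```python
# Sketch of a "failed event" table: one row per event that never reached
# the event bus, with a flag to track which rows have since been dealt with.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE failed_event (
        id INTEGER PRIMARY KEY,
        topic TEXT NOT NULL,
        event_key TEXT NOT NULL,
        event_value TEXT NOT NULL,       -- JSON-serialized payload
        failed_at TEXT DEFAULT CURRENT_TIMESTAMP,
        resolved INTEGER DEFAULT 0       -- set once re-sent or handled
    )
""")


def record_failure(topic, key, value):
    conn.execute(
        "INSERT INTO failed_event (topic, event_key, event_value)"
        " VALUES (?, ?, ?)",
        (topic, key, json.dumps(value)),
    )


record_failure("course-events", "course-v1:edX+Demo", {"action": "published"})
unresolved = conn.execute(
    "SELECT COUNT(*) FROM failed_event WHERE resolved = 0").fetchone()[0]
```

A management command could then re-produce the unresolved rows, which connects to the recovery question in the issue description.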

3 is probably where we can actually be the most helpful, and where things like retry and DLQ topics come in. This could also be an iterative process that could be changed as we get more concrete use cases.

For example, one very simple first shot would be to create a DLQ topic alongside every actual topic. The DLQ topic would have to have very open permissions and a super loose schema (maybe something like {'source': <where I'm coming from>, 'event_key_as_string': a_long_string, 'event_value_as_string': a_very_long_string}).
Any time we get a serialization error or a missing-permission-for-topic error, we smash the event into the simple serialized form above and send it to the DLQ. It would then be up to any client that cares about the event to write a process to consume from the DLQ and do whatever it needs to do.
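The "smash the event into the simple serialized form" step might look like this sketch (the `to_dlq_record` helper is hypothetical, not an existing function): stringify every field so the loose DLQ schema can never reject a record, no matter what broke the original serializer:

```python
# Sketch: fallback serialization into the loose DLQ schema described above.
# Everything becomes a string, so this can't fail the way the real
# serializer did.
import json


def to_dlq_record(source, key, value):
    return {
        "source": source,
        "event_key_as_string": str(key),
        # default=str stringifies anything json can't handle natively,
        # which is exactly the case that landed us in the DLQ path.
        "event_value_as_string": json.dumps(value, default=str),
    }


record = to_dlq_record(
    "course-discovery",
    "course-v1:edX+Demo",
    {"action": "published", "actor": object()},  # non-serializable field
)
```

A consumer of the DLQ would then have to parse these strings back out itself, which is the trade-off for never losing the event entirely.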

rgraber commented Sep 7, 2022

For 1/2, we might be able to help out with a standardized log format that would at least allow anyone to grep for these kinds of events and be confident that they found all of them in Splunk. It's not great but it's a step.
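A standardized format along those lines could be as small as a fixed marker string plus a structured payload, as a sketch (the marker value and helper name are made up here): one unique token makes every dropped event findable with a single Splunk search.

```python
# Sketch of a greppable log line for dropped events: a fixed marker token
# followed by a JSON payload with the recoverable details.
import json
import logging

log = logging.getLogger("event_bus")

EVENT_SEND_FAILED = "EVENT_BUS_SEND_FAILED"  # search for exactly this token


def format_send_failure(topic, event_key, error):
    return "%s %s" % (EVENT_SEND_FAILED, json.dumps({
        "topic": topic,
        "event_key": str(event_key),
        "error": str(error),
    }))


msg = format_send_failure("course-events", "course-v1:edX+Demo",
                          ConnectionError("broker timeout"))
log.error(msg)
```

Keeping the payload as JSON means a search tool can extract fields from it, not just match the marker.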

rgraber commented Sep 8, 2022

Example of a first shot at 3: openedx/event-bus-kafka#43

robrap commented Sep 28, 2022

I will do more of a review of these notes and PRs, but just wanted to write some notes that have been bouncing around in my brain.

Capturing (potentially premature) assorted thoughts:

  1. For any solution, can we answer whether or not any events were dropped? Can we monitor for dropped events, or only investigate if there are suspicions?
  2. At what point can we have an id for each event that can be carried through on the event, and added to logging or monitoring? Note that an OEP requires a CloudEvent id for the event.
  3. We could note the Outbox pattern as a possibility: https://medium.com/contino-engineering/publishing-events-to-kafka-using-a-outbox-pattern-867a48e29d35. (This may not be the best article, but it is an article. This is also covered in the book we partially read for the book club on eventing.)
  4. Imagining that event production and event consumption could (and likely should) be owned by separate teams, how will this all relate to any guarantees or documentation that the event producer makes for its consumers?
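The outbox pattern mentioned in point 3 can be sketched as follows (all names here are illustrative, shown with sqlite for self-containment): the business write and the outbox insert share one transaction, so an event row exists if and only if the state change actually committed, and a separate relay delivers rows to the bus afterwards.

```python
# Sketch of the outbox pattern: events are written to a DB table in the
# same transaction as the business change, and a relay publishes them later.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE course (key TEXT PRIMARY KEY, status TEXT)")
conn.execute("""
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY,
        topic TEXT,
        payload TEXT,
        published INTEGER DEFAULT 0
    )
""")


def publish_course(course_key):
    # One transaction covers both writes: no committed change without an
    # event row, and no event row without a committed change.
    with conn:
        conn.execute("INSERT OR REPLACE INTO course VALUES (?, 'published')",
                     (course_key,))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("course-events",
             json.dumps({"key": course_key, "action": "published"})),
        )


def relay_outbox(send):
    # A separate process polls unpublished rows and marks them after send;
    # a crash between send and update yields a duplicate, not a loss.
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0"
    ).fetchall()
    for row_id, topic, payload in rows:
        send(topic, json.loads(payload))
        conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))


publish_course("course-v1:edX+Demo")
sent = []
relay_outbox(lambda topic, payload: sent.append((topic, payload)))
```

Note the failure mode this trades into: at-least-once delivery (possible duplicates) rather than dropped events, which matches the "err on the side of missing events" discussion only if consumers are idempotent.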

robrap commented Sep 30, 2022

More thoughts: :)

  1. I assume that for most of our business events, we will not want any dropped events.
  2. If we owned the publishing of this event, what would be our runbook for dropped events? (Would we try to resend an old event late? Would we try to duplicate the change to send an updated event? Other?)
  3. Options for capturing dropped events include log file, DB, Kafka, monitoring tools, other? We may want multiple (e.g. log + DB). How will we track what failed events have been dealt with? What is the minimum number of tools to use/watch to make this all work?

@rgraber rgraber moved this from In Progress to Done in Arch-BOM Oct 11, 2022
rgraber commented Oct 25, 2022

Closing this, since we're going with logging for now.

@rgraber rgraber closed this as completed Oct 25, 2022