
[Producer] Discovery: Sad path for event production #54

Closed
2 tasks
robrap opened this issue Aug 10, 2022 · 8 comments
Labels: event-bus (Work related to the Event Bus)


robrap commented Aug 10, 2022

Discovery for error handling of event production may result in an implementation (or POC branches), and/or documentation and further ticketing.

  • Document answers to questions, or note which questions we will defer until future events are implemented with other requirements. Some questions may be fully answered; others may be deferred (and left in the doc and/or ticketed as appropriate, so we don't build a backlog of an infinite number of tickets we'll never get to).
  • Discuss the implications of decisions with TNL, who owns the Studio event publish. This may result in future tickets. We want to ensure the work is in a place where everyone feels comfortable with a hand-off at some point.

We want to explore the space of 'things that could go wrong with event production', and this ticket is for enumerating the ones we know about.

  • Add more questions and answers.
  • Consider distributed nature questions, and not just error handling.
  • CAP Theorem

Questions:

  • Are there Kafka vs application errors, and will these be handled differently?
  • What do we do if we can't publish an event that allows us to recover later? How do we recover?
  • see [Producer] Make sure we don't lose events on producer shutdown openedx/event-bus-kafka#11: what do we do to ensure events are handled on server shutdown?
  • What kind of reliability guarantees do we want?
  • How will we discover consistency failures and/or unacceptably high latency of some events?
    • Might encounter an exception when trying to produce an event after a transaction completes (in on-commit)
    • An event might fail to send, but the failure is later rectified because a newer version of the data is successfully sent
    • Production failures could be intermittent (random 1 in 1000) or come in big groups (e.g. Kafka is down or unreachable for 20 minutes)
    • It is also possible to produce counterfactual events that need to be undone/overwritten (a successful send inside a transaction that is then rolled back), although we really should be using on-commit for all of this so that we err on the side of missing events.
  • If we discover a consistency failure, how will we recover? Will each producer need a management command to re-produce events for a given set of keys? (Will our initial startup sync/migration/cutover process use the same mechanism?)
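The on-commit exception concern above can be sketched as follows. This is a hypothetical illustration, not the actual openedx implementation: the `produce_event` and `flaky_send` names and the in-memory `failed_events` list are made up. The point is that once the transaction has committed, a producer exception can no longer roll anything back, so the only safe move is to catch it and capture enough information to recover later:

```python
# Sketch (hypothetical names): catch producer exceptions raised after the
# transaction has already committed, and record the event for later recovery
# instead of letting the failure propagate.
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("event_producer")

failed_events = []  # stand-in for whatever persistent capture we choose


def produce_event(topic, key, value, send):
    """Try to send; on failure, record enough to recover later."""
    try:
        send(topic, key, value)
    except Exception:
        log.exception("Failed to produce event to %s (key=%s)", topic, key)
        failed_events.append({"topic": topic, "key": key, "value": value})


def flaky_send(topic, key, value):
    # Simulates the broker being unreachable at on-commit time.
    raise ConnectionError("broker unreachable")


produce_event("course-events", "course-v1:edX+Demo",
              {"action": "published"}, flaky_send)
```

An in-memory list obviously doesn't survive a restart; it only marks the seam where a durable store (see the discussion below about persistent storage) would plug in.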
@robrap robrap changed the title Discovery: Error handing for even production Discovery: Error handling for event production Aug 10, 2022
@robrap robrap added this to the [Event Bus] Implement sad-path milestone Aug 10, 2022
@robrap robrap moved this to Todo in Arch-BOM Aug 10, 2022
dianakhuang commented:

Some discussion of handling shutdown of producer can be found: openedx/event-bus-kafka#11

@dianakhuang dianakhuang changed the title Discovery: Error handling for event production Discovery: Sad path for event production Aug 25, 2022
@robrap robrap changed the title Discovery: Sad path for event production [Producer] Discovery: Sad path for event production Aug 25, 2022
@rgraber rgraber moved this from Todo to In Progress in Arch-BOM Sep 6, 2022
@rgraber rgraber self-assigned this Sep 6, 2022
rgraber commented Sep 6, 2022

At the very least, we're probably going to have to consider three different types of issues:

  1. We can't send anything to any topic until we restart the server (we misconfigured something important or otherwise have busted producer code)
  2. We can't send anything to any topic until some unspecified time (Confluent is Just Down or someone needs to go into the Confluent UI and fix a permission or something). There's a risk the server will be restarted within this time.
  3. We can't send some kinds of messages or messages to certain topics but we can send others (a serializer is messed up, or we get an unexpected value, or a topic permission specifically is missing).

1/2 are probably harder, since really the only safe thing to do is to have some sort of persistent storage (like a DB table) that keeps track of events that happened but never made it to the event bus at all. How this is implemented would probably vary wildly between services.
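One shape that per-service persistent storage could take is a simple table of events that never made it out, as a sketch (table and column names here are invented for illustration, shown with sqlite for self-containment; a real service would use its own DB):

```python
# Sketch of a "failed event" table: one row per event that never reached
# the event bus, with a flag to track which rows have since been dealt with.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE failed_event (
        id INTEGER PRIMARY KEY,
        topic TEXT NOT NULL,
        event_key TEXT NOT NULL,
        event_value TEXT NOT NULL,       -- JSON-serialized payload
        failed_at TEXT DEFAULT CURRENT_TIMESTAMP,
        resolved INTEGER DEFAULT 0       -- set once re-sent or handled
    )
""")


def record_failure(topic, key, value):
    conn.execute(
        "INSERT INTO failed_event (topic, event_key, event_value)"
        " VALUES (?, ?, ?)",
        (topic, key, json.dumps(value)),
    )


record_failure("course-events", "course-v1:edX+Demo", {"action": "published"})
unresolved = conn.execute(
    "SELECT COUNT(*) FROM failed_event WHERE resolved = 0").fetchone()[0]
```

A management command could then re-produce the unresolved rows, which connects to the recovery question in the issue description.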

3 is probably where we can actually be the most helpful, and where things like retry and DLQ topics come in. This could also be an iterative process that could be changed as we get more concrete use cases.

For example, one very simple first shot would be to create a DLQ topic alongside every actual topic. The DLQ topic would have to have very open permissions and a super loose schema (maybe something like {'source': <where I'm coming from>, 'event_key_as_string': a_long_string, 'event_value_as_string': a_very_long_string}).
Any time we get a serialization error or a missing-permission-for-topic error, we smash the event into the simple serialized form above and send it to the DLQ. It would then be up to any client that cares about the event to write a process to consume from the DLQ and do whatever it needs to do.
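The "smash the event into the simple serialized form" step might look like this sketch (the `to_dlq_record` helper is hypothetical, not an existing function): stringify every field so the loose DLQ schema can never reject a record, no matter what broke the original serializer:

```python
# Sketch: fallback serialization into the loose DLQ schema described above.
# Everything becomes a string, so this can't fail the way the real
# serializer did.
import json


def to_dlq_record(source, key, value):
    return {
        "source": source,
        "event_key_as_string": str(key),
        # default=str stringifies anything json can't handle natively,
        # which is exactly the case that landed us in the DLQ path.
        "event_value_as_string": json.dumps(value, default=str),
    }


record = to_dlq_record(
    "course-discovery",
    "course-v1:edX+Demo",
    {"action": "published", "actor": object()},  # non-serializable field
)
```

A consumer of the DLQ would then have to parse these strings back out itself, which is the trade-off for never losing the event entirely.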

rgraber commented Sep 7, 2022

For 1/2, we might be able to help out with a standardized log format that would at least allow anyone to grep for these kinds of events and be confident that they found all of them in Splunk. It's not great but it's a step.
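A standardized format along those lines could be as small as a fixed marker string plus a structured payload, as a sketch (the marker value and helper name are made up here): one unique token makes every dropped event findable with a single Splunk search.

```python
# Sketch of a greppable log line for dropped events: a fixed marker token
# followed by a JSON payload with the recoverable details.
import json
import logging

log = logging.getLogger("event_bus")

EVENT_SEND_FAILED = "EVENT_BUS_SEND_FAILED"  # search for exactly this token


def format_send_failure(topic, event_key, error):
    return "%s %s" % (EVENT_SEND_FAILED, json.dumps({
        "topic": topic,
        "event_key": str(event_key),
        "error": str(error),
    }))


msg = format_send_failure("course-events", "course-v1:edX+Demo",
                          ConnectionError("broker timeout"))
log.error(msg)
```

Keeping the payload as JSON means a search tool can extract fields from it, not just match the marker.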

rgraber commented Sep 8, 2022

Example of a first shot at 3: openedx/event-bus-kafka#43

robrap commented Sep 28, 2022

I will do more of a review of these notes and PRs, but just wanted to write some notes that have been bouncing around in my brain.

Capturing (potentially premature) assorted thoughts:

  1. For any solution, can we answer whether or not any events were dropped? Can we monitor for dropped events, or only investigate if there are suspicions?
  2. At what point can we have an id for each event that can be carried through on the event, and added to logging or monitoring? Note that an OEP requires a CloudEvent id for the event.
  3. We could note the Outbox pattern as a possibility: https://medium.com/contino-engineering/publishing-events-to-kafka-using-a-outbox-pattern-867a48e29d35. (This may not be the best article, but it is an article. This is also covered in the book we partially read for the book club on eventing.)
  4. Imagining that event production and event consumption could (and likely should) be owned by separate teams, how will this all relate to any guarantees or documentation that the event producer makes for its consumers?
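The outbox pattern mentioned in point 3 can be sketched as follows (all names here are illustrative, shown with sqlite for self-containment): the business write and the outbox insert share one transaction, so an event row exists if and only if the state change actually committed, and a separate relay delivers rows to the bus afterwards.

```python
# Sketch of the outbox pattern: events are written to a DB table in the
# same transaction as the business change, and a relay publishes them later.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE course (key TEXT PRIMARY KEY, status TEXT)")
conn.execute("""
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY,
        topic TEXT,
        payload TEXT,
        published INTEGER DEFAULT 0
    )
""")


def publish_course(course_key):
    # One transaction covers both writes: no committed change without an
    # event row, and no event row without a committed change.
    with conn:
        conn.execute("INSERT OR REPLACE INTO course VALUES (?, 'published')",
                     (course_key,))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("course-events",
             json.dumps({"key": course_key, "action": "published"})),
        )


def relay_outbox(send):
    # A separate process polls unpublished rows and marks them after send;
    # a crash between send and update yields a duplicate, not a loss.
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0"
    ).fetchall()
    for row_id, topic, payload in rows:
        send(topic, json.loads(payload))
        conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))


publish_course("course-v1:edX+Demo")
sent = []
relay_outbox(lambda topic, payload: sent.append((topic, payload)))
```

Note the failure mode this trades into: at-least-once delivery (possible duplicates) rather than dropped events, which matches the "err on the side of missing events" discussion only if consumers are idempotent.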

robrap commented Sep 30, 2022

More thoughts: :)

  1. I assume that for most of our business events, we will not want any dropped events.
  2. If we owned the publishing of this event, what would be our runbook for dropped events? (Would we try to resend an old event late? Would we try to duplicate the change to send an updated event? Other?)
  3. Options for capturing dropped events include log file, DB, Kafka, monitoring tools, other? We may want multiple (e.g. log + DB). How will we track what failed events have been dealt with? What is the minimum number of tools to use/watch to make this all work?

@rgraber rgraber moved this from In Progress to Done in Arch-BOM Oct 11, 2022
rgraber commented Oct 25, 2022

Closing this, since we're going with logging for now.

@rgraber rgraber closed this as completed Oct 25, 2022