Increase resiliency of logging alerts #389

Closed
3 tasks done
robrap opened this issue Aug 10, 2023 · 3 comments
@robrap
Contributor

robrap commented Aug 10, 2023

We had an issue where an event bus alert used the following query: SELECT * FROM Log WHERE message RLIKE r'Error producing event to event bus.*' LIMIT MAX. The error message had presumably changed to: Error delivering message to Kafka event bus. Because the pattern no longer matched the logged message, the alert stopped firing.

  1. Ensuring there is a custom attribute to look for would be more resilient to these types of changes.
  2. However, logging alerts are preferred because errors would not be missed due to transaction sampling.
  3. Could a combo of alerts help? Maybe this would result in redundant alerts, or maybe the custom attribute alert would only fire if it detected more errors than the logs?

A/C:

  • The previous event production alerting on New Relic which uses TransactionError should be queried off of an error.class that is custom for producing errors.
  • A new alert on New Relic for consuming TransactionError should be queried off of an error.class that is custom for event consumption errors.
  • Communication with the Open edX community about possible suggestions to match this pattern.
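
As a rough illustration of the A/C above, an alert condition keyed on a custom exception class might query TransactionError by error.class instead of matching log text. This is only a sketch: the helper name error_class_query is hypothetical, the class names are the placeholders proposed in this ticket, and the trailing-match LIKE pattern is an assumption made because the exact error.class value (for example, whether it carries a module prefix) depends on how the agent reports it.

```python
def error_class_query(class_name):
    """Build an NRQL query string counting TransactionError events for one exception class.

    Hypothetical helper for illustration only. The leading % wildcard matches any
    prefix in front of the class name, since the exact error.class format is assumed
    rather than confirmed here.
    """
    return (
        "SELECT count(*) FROM TransactionError "
        f"WHERE error.class LIKE '%{class_name}'"
    )


# Placeholder class names taken from the implementation details in this ticket.
producing_query = error_class_query("ProducingException")
consuming_query = error_class_query("ConsumingException")
print(producing_query)
print(consuming_query)
```

Unlike the RLIKE query on the log message, a query of this shape keeps matching if the wording of the log message changes, as long as the exception class name stays the same.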

Implementation Details:

  • Create a new ProducingException (or better name) exception class.
  • Update record_producing_error to reraise the previous error as a ProducingException exception.
  • poll_indefinitely has a call to record_exception that should be refactored to use record_producing_error instead.
  • Create a new ConsumingException (or better name) exception class.
  • Update record_event_consuming_error to reraise the previous error as a ConsumingException error.
  • Update the alert conditions on New Relic once these changes have been deployed.
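
The implementation details above could be sketched roughly as follows. This is not the actual event-bus-kafka code: the function signatures are assumptions, the class names are the placeholders from this ticket, and record_exception here is a local stand-in that only captures the active exception so the example can run without a monitoring agent.

```python
import logging
import sys

logger = logging.getLogger(__name__)

# Stand-in storage for what a monitoring tool would receive (illustration only).
RECORDED = []


class ProducingException(Exception):
    """Placeholder name from the ticket: marks any error raised while producing an event."""


class ConsumingException(Exception):
    """Placeholder name from the ticket: marks any error raised while consuming an event."""


def record_exception():
    """Stand-in for the monitoring wrapper; reads the exception currently being handled."""
    exc = sys.exc_info()[1]
    RECORDED.append((type(exc).__name__, exc.__cause__))


def record_producing_error(error, context):
    """Re-raise the original error as ProducingException so it is recorded under that class."""
    try:
        raise ProducingException(context) from error
    except ProducingException:
        record_exception()
        logger.exception("Error delivering message to Kafka event bus. context=%s", context)


def record_event_consuming_error(error, context):
    """Same pattern for the consumer side, using ConsumingException."""
    try:
        raise ConsumingException(context) from error
    except ConsumingException:
        record_exception()
        logger.exception("Error consuming event from Kafka. context=%s", context)


# Usage: the original error is kept as the cause, while the recorded class is the custom one.
record_producing_error(ValueError("broker unavailable"), "topic=demo")
print(RECORDED[-1][0])
```

Chaining with "raise ... from error" keeps the original exception available as the cause, so the custom class can drive alerting without losing the underlying error details. The consumer log message in this sketch is invented for the example.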
@robrap robrap added this to Arch-BOM Aug 10, 2023
@robrap robrap converted this from a draft issue Aug 10, 2023
@robrap robrap added the event-bus Work related to the Event Bus. label Aug 10, 2023
@robrap
Contributor Author

robrap commented Aug 10, 2023

Labeled with event-bus, but might affect other types of log alerting if we have any.

@rgraber rgraber moved this to Prioritized in Arch-BOM Aug 11, 2023
@rgraber rgraber moved this from Prioritized to On-Call in Arch-BOM Aug 28, 2023
@dianakhuang dianakhuang self-assigned this Aug 28, 2023
@dianakhuang dianakhuang moved this from On-Call to In Progress in Arch-BOM Aug 28, 2023
dianakhuang added a commit to openedx/event-bus-kafka that referenced this issue Aug 28, 2023
In the past, we used generic exceptions to record errors
to New Relic. This creates custom exceptions for producing
and consuming so that we can query New Relic on custom
exceptions instead of relying on log messages.

edx/edx-arch-experiments#389
@dianakhuang
Member

I decided against changing the exception handling in poll_indefinitely, because it (a) doesn't have the same context information that the other call has and would therefore be confusing, and (b) since it doesn't reraise the exception, it will still send across the original exception information, which is what we want.

@dianakhuang dianakhuang moved this from In Progress to In Code Review in Arch-BOM Aug 29, 2023
@dianakhuang dianakhuang moved this from In Code Review to Done in Arch-BOM Sep 12, 2023
@robrap
Contributor Author

robrap commented Sep 26, 2023

We did complete this ticket as written (I think), but it turns out that we are still using the logging alerts. Those alerts should be made more resilient to change by matching on a tag or something other than the error message text. We may want to create a new ticket for this.

Note: I'm also closing this ticket, which was marked as Done, but not closed.

@robrap robrap closed this as completed Sep 26, 2023
@jristau1984 jristau1984 moved this from Done to Done - Long Term Storage in Arch-BOM Sep 30, 2024
Projects
Status: Done - Long Term Storage