old critical alerts in icinga do not go away after upgrade of openshift #26

wahabz · 2020-05-15T21:16:22Z

First of all, great product: signalilo.
I recently set this up for our OpenShift clusters.

We had the following scenerio:
For our OpenShift Cluster A, we had bunch of critical alerts that showed up in Icinga.
Those alerts were not resolved (as in from OpenShift side).
We did an upgrade on our OpenShift Cluster, and after that re-added in the webhook config in alertmanager.
So from alertmanager perspective, it is now brand new. So old alerts in icinga were not resolved (they never got the resolved notification from alertmanager via signalilo).
Now in Icinga, we have this OpenShift Cluster set up as a Host "Test Host", and although new alerts are coming in and are resolved, the old alerts from previous version of OpenShift are still there.

I understand that there is a SIGNALILO_ICINGA_KEEP_FOR setting, but that is for OK and or resolved alerts.

I think that there should be a criteria such that if the alert is no longer firing from AlertManager, and if there are some lingering critical services in Icinga which did not receive any resolved status, then those should be garbage collected as well.

simu · 2020-05-18T10:01:03Z

Thanks for the feedback.

We have considered different options for handling stale alerts in Icinga, but it's hard to implement a solution that's correct for arbitrary resend intervals in Alertmanager, since Signalilo cannot really distinguish between a critical alert with a high repeat interval and a stale alert, especially since Signalilo does not keep any local state.

One possibility would be to adjust the Icinga checks to be active with a recheck interval that's somehow derived from the repeat interval of the alert in Alertmanager. However the value of the resend interval would have to be provided to Signalilo as an extra configuration value, as it's not available in the received alerts (Side-note: potentially the endsAt field of the received alert can be used, needs further investigation).

In the meantime, what you can to do clean up stale alerts, is to click "check now" in Icinga, which sets the alert status to OK (as dummy_state is set to 0 for Icinga checks created by Signalilo), and makes the check eligible for garbage collection according to the value of SIGNALILO_ICINGA_KEEP_FOR.

wahabz · 2020-05-26T22:17:06Z

I got some time to review your comment regarding this.
So I was thinking if it is possible to Garbage Collect all the alerts in Icinga and only leave out the one's that are firing?
This way, say alert A, B and C fires and signalilo captures them, reports them in icinga. Now, say after an hour or alert A is still firing, whereas alert B and C have stopped firing but missed sending resolved (could be that alertmanager is upgraded or something else). In this case, when signalilo does garbage collection, it should first look at the firing alerts (A in this case), and collect all the alerts/services in icinga that are not OK.

simu · 2020-06-29T16:25:55Z

At the time Signalilo performs garbage collection, we do not know which alerts are firing, since we do not keep any local state about alerts in Signalilo. Therefore we cannot just look at the firing alerts and GC all alerts which are not firing anymore, as we simply don't have information to determine which alerts are still firing when GC runs.

I'm leaning towards the solution of using the Alertmanager resend interval, as an additional configuration value that needs to be provided to Signalilo, with a reasonably high default, to create active Icinga2 services. Those services should be checked with roughly the same frequency as Alertmanager resends the alerts. Note that the check interval in Icinga should be a bit higher than the resend interval to allow for some network latency.

Since we already implement active checks for "heartbeat" alerts, this change should be doable.

Xavier-0965 · 2022-10-11T06:32:49Z

We have also the problem sometimes. But it seems difficult to reproduce:
This morning we have updated one Openshift cluster (to version 4.10.35: prometheus version 2.32.1, Alertmanager Version 0.23.0).
I had an alert before the update, to try to reproduce the problem.
But after the update, that alert is correctly bound to one service in icinga.
So I didn't reproduce the problem. But I document it, maybe this helps for the analysis.

What I have seen, is that as soon a firing alert is seen, Signalilo computes a serviceName (see

signalilo/webhook/icinga.go

Line 52 in 1ebf0f3

func computeServiceName(

) using the UUID and the sorted labels of the alert. It checks in Icinga if that service is found.
If it is not the case, a new service is created in Icinga.
Otherwise the service is updated.

Maybe there are cases, where the labels are changed?

wahabz changed the title ~~critical alerts in icinga do not go away~~ old critical alerts in icinga do not go away after upgrade of openshift May 15, 2020

mc-meta mentioned this issue Sep 3, 2021

Expose additional Icinga service parameters as tunables #90

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

old critical alerts in icinga do not go away after upgrade of openshift #26

old critical alerts in icinga do not go away after upgrade of openshift #26

wahabz commented May 15, 2020 •

edited

Loading

simu commented May 18, 2020

wahabz commented May 26, 2020

simu commented Jun 29, 2020

Xavier-0965 commented Oct 11, 2022

old critical alerts in icinga do not go away after upgrade of openshift #26

old critical alerts in icinga do not go away after upgrade of openshift #26

Comments

wahabz commented May 15, 2020 • edited Loading

simu commented May 18, 2020

wahabz commented May 26, 2020

simu commented Jun 29, 2020

Xavier-0965 commented Oct 11, 2022

wahabz commented May 15, 2020 •

edited

Loading