
Signalilo Heartbeat Implementation Question #123

Open
dragoangel opened this issue Feb 2, 2023 · 3 comments

Comments

@dragoangel
Contributor

dragoangel commented Feb 2, 2023

This is more of a question to discuss, but from what I see:

  1. README.md states the alert will be in UNKNOWN state when the heartbeat is triggered, but it will actually be in CRITICAL state. I think the heartbeat service example was changed at some point and someone forgot to update the description.
  2. README.md states: "On startup, Signalilo checks if the matching heartbeat service is available in Icinga, otherwise it exits with a fatal error." I read this as: if the heartbeat service doesn't exist (404), or any other failure like a 4xx/5xx occurs, Signalilo will die. But I don't see this behavior right now. Maybe it got broken at some point?
  3. In my view, Signalilo should report in some way when it has issues writing alerts received from Alertmanager to Icinga. Right now it can break silently, and if nobody checks the Signalilo logs, nobody will know about it. The cause could be Icinga downtime, or somebody breaking something on the Icinga side, even simply dropping the host or the API user. There are a couple of options that could resolve this:
    • Update the heartbeat service status when we face errors from the Icinga API in certain places. I don't like this approach, as it would be confusing.
    • Work like a proxy between Alertmanager and Icinga: do not reply to Alertmanager with a 200 status code until we get such a status code from Icinga. This way we would know on the Alertmanager side that the Icinga integration currently looks dead, and a "fallback" route could be used to notify about AlertmanagerFailedToSendAlerts|AlertmanagerClusterFailedToSendAlerts. The problem is that it creates delays (see the sketch after this list).
    • Last option, and I think the most preferable: have a separate mandatory service IcingaApiErrors for such error handling, created in the same way as Heartbeat, which will show whether there were any errors in the last minute. A small minus: with multiple Signalilo replicas it could start flapping. In case even updates of the IcingaApiErrors service fail, Signalilo can instantly reply to Alertmanager about the failure. After 1m there will be no errors from the Icinga API, as no requests were made, and we try to update IcingaApiErrors again: if it fails, wait another minute and reply 500 to Alertmanager; if it passes, start accepting alerts from Alertmanager again.
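
A rough sketch of how the proxy-style handling in option 2 could look. The handler shape, forwardToIcinga and icingaURL are illustrative placeholders, not Signalilo's actual code:

```go
package main

// Sketch of option 2: act as a pass-through proxy and only answer 200 to
// Alertmanager once Icinga has accepted the alert. All names here
// (forwardToIcinga, icingaURL, the /webhook path) are illustrative
// assumptions, not Signalilo's actual code.

import (
	"bytes"
	"io"
	"log"
	"net/http"
)

const icingaURL = "https://icinga.example.com:5665/v1/actions/process-check-result"

func forwardToIcinga(payload []byte) (*http.Response, error) {
	// Placeholder for Signalilo's Icinga API client call.
	return http.Post(icingaURL, "application/json", bytes.NewReader(payload))
}

func webhookHandler(w http.ResponseWriter, r *http.Request) {
	payload, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	resp, err := forwardToIcinga(payload)
	if err != nil {
		// Icinga unreachable: surface the failure to Alertmanager so that
		// AlertmanagerFailedToSendAlerts fires and a fallback route can be used.
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()
	// Mirror Icinga's status code instead of unconditionally replying 200.
	w.WriteHeader(resp.StatusCode)
}

func main() {
	http.HandleFunc("/webhook", webhookHandler)
	log.Fatal(http.ListenAndServe(":8888", nil))
}
```

The key point is the last line of the handler: Signalilo mirrors Icinga's status code instead of always replying 200, so Alertmanager's retry and fallback routing can kick in.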

What do you think?

@simu
Member

simu commented Feb 3, 2023

  1. README.md states the alert will be in UNKNOWN state when the heartbeat is triggered, but it will actually be in CRITICAL state. I think the heartbeat service example was changed at some point and someone forgot to update the description.

That seems to be a mismatch between the example heartbeat service in the README.md and the description. We should either update the heartbeat service example to have

  /* Set the state to UNKNOWN (3) if freshness checks fail. */
  vars.dummy_state = 3

or update the README.md to state that the heartbeat will go into CRITICAL if no heartbeats are received.

README.md states: "On startup, Signalilo checks if the matching heartbeat service is available in Icinga, otherwise it exits with a fatal error." I read this as: if the heartbeat service doesn't exist (404), or any other failure like a 4xx/5xx occurs, Signalilo will die. But I don't see this behavior right now. Maybe it got broken at some point?

The internal heartbeat used to work like this in the beginning, but it looks like this got accidentally changed in 3a863dd. It should be fairly straightforward to change

signalilo/serve.go, lines 102 to 104 in d27c2c8:

  if err != nil {
      s.logger.Errorf("Unable to send initial heartbeat: %v", err)
  }

to instead use a Go channel to return the error through startHeartbeat().
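
A rough sketch of that idea, using placeholder names (heartbeater, send) rather than the actual Signalilo types:

```go
package main

// Sketch only: heartbeater, send and startHeartbeat are illustrative names,
// not the actual Signalilo code. The idea is to hand the result of the
// initial heartbeat back to the caller so it can exit fatally again.

import (
	"errors"
	"log"
	"time"
)

type heartbeater struct {
	interval time.Duration
	send     func() error // wraps the Icinga API call that updates the heartbeat service
}

// startHeartbeat returns a channel carrying the result of the initial
// heartbeat; subsequent heartbeats keep running in the background.
func (h *heartbeater) startHeartbeat() <-chan error {
	initial := make(chan error, 1)
	go func() {
		err := h.send()
		initial <- err
		if err != nil {
			return
		}
		for range time.Tick(h.interval) {
			if err := h.send(); err != nil {
				log.Printf("unable to send heartbeat: %v", err)
			}
		}
	}()
	return initial
}

func main() {
	h := &heartbeater{
		interval: time.Minute,
		// Placeholder that always fails, standing in for a 404 from Icinga.
		send: func() error { return errors.New("heartbeat service not found (404)") },
	}
	// Restore the documented behavior: die if the initial heartbeat fails.
	if err := <-h.startHeartbeat(); err != nil {
		log.Fatalf("unable to send initial heartbeat: %v", err)
	}
}
```

The caller blocks on the first value from the channel and exits fatally on error, which restores the documented fail-fast behavior while the periodic heartbeats continue in the background on success.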

In my view, Signalilo should report in some way when it has issues writing alerts received from Alertmanager to Icinga. Right now it can break silently, and if nobody checks the Signalilo logs, nobody will know about it. The cause could be Icinga downtime, or somebody breaking something on the Icinga side, even simply dropping the host or the API user. There are a couple of options that could resolve this:

* Update the heartbeat service status when we face errors from the Icinga API in certain places. I don't like this approach, as it would be confusing.

* Work like a proxy between Alertmanager and Icinga: do not reply to Alertmanager with a 200 status code until we get such a status code from Icinga. This way we would know on the Alertmanager side that the Icinga integration currently looks dead, and a "fallback" route could be used to notify about `AlertmanagerFailedToSendAlerts|AlertmanagerClusterFailedToSendAlerts`. The problem is that it creates delays.

* Last option, and I think the most preferable: have a separate mandatory service `IcingaApiErrors` for such error handling, created in the same way as `Heartbeat`, which will show whether there were any errors in the last minute. A small minus: with multiple Signalilo replicas it could start flapping. In case even updates of the `IcingaApiErrors` service fail, Signalilo can instantly reply to Alertmanager about the failure. After 1m there will be no errors from the Icinga API, as no requests were made, and we try to update `IcingaApiErrors` again: if it fails, wait another minute and reply 500 to Alertmanager; if it passes, start accepting alerts from Alertmanager again.

Having such a facility would be nice in general, but at this time we're only maintaining Signalilo and are reviewing external contributions on a best effort basis, since we're moving away from Icinga2 internally, and the need for Signalilo is no longer present for us.

If we want to make a major change to how we handle Signalilo/Icinga liveness, I'd probably go for the second approach you outlined, since both the first and last options suffer from the same issue as the current heartbeat alert: they don't work as you'd want them to if the whole service host or the API user is dropped.

Also, the current heartbeat service should already inform you about most Icinga-side misconfigurations (the one case it won't catch is the dropped Service Host, but that one is hard to catch with anything that's sent to Icinga), since the heartbeat should go critical if the Signalilo API user is missing and Signalilo can't update the heartbeat.

Additionally, we could extend the current heartbeat implementation to ensure that we're replying with HTTP 500 to Alertmanager if we couldn't update the heartbeat service. This would then generate an AlertmanagerFailedToSendAlerts alert which may or may not arrive in Icinga, but you should notice that something is off since the corresponding heartbeat will be CRITICAL or UNKNOWN depending on your configuration.
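
A minimal sketch of that last idea. The heartbeatHealthy flag and the /webhook path are assumptions for illustration, not the actual Signalilo implementation:

```go
package main

// Sketch of the "reply 500 to Alertmanager when the heartbeat update failed"
// idea. heartbeatHealthy and the /webhook path are illustrative assumptions,
// not the actual Signalilo implementation.

import (
	"log"
	"net/http"
	"sync/atomic"
)

// Flipped by the heartbeat loop after every attempt to update the heartbeat service.
var heartbeatHealthy atomic.Bool

func webhookHandler(w http.ResponseWriter, r *http.Request) {
	if !heartbeatHealthy.Load() {
		// The last heartbeat update to Icinga failed, so refuse the alert; on
		// the Alertmanager side this surfaces as AlertmanagerFailedToSendAlerts.
		http.Error(w, "heartbeat update to Icinga failing", http.StatusInternalServerError)
		return
	}
	// ... forward the Alertmanager payload to Icinga as usual ...
	w.WriteHeader(http.StatusOK)
}

func main() {
	heartbeatHealthy.Store(true)
	http.HandleFunc("/webhook", webhookHandler)
	log.Fatal(http.ListenAndServe(":8888", nil))
}
```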

@dragoangel
Contributor Author

Hi @simu, thank you for your reply.

Regarding:

the current heartbeat service should already inform you about most Icinga-side misconfigurations (the one case it won't catch is the dropped Service Host, but that one is hard to catch with anything that's sent to Icinga), since the heartbeat should go critical if the Signalilo API user is missing and Signalilo can't update the heartbeat

Yeah, I think that's true if 3a863dd is reverted. And I'm up for:

replying with HTTP 500 to Alertmanager if we couldn't update the heartbeat service

Regarding CRITICAL or UNKNOWN, I think CRITICAL fits the README.md better :)

@dragoangel
Contributor Author

dragoangel commented Feb 3, 2023

Also regarding:

I'd probably go for the second approach you outlined, since both the first and last options suffer from the same issue as the current heartbeat alert: they don't work as you'd want them to if the whole service host or the API user is dropped.

The third option has two upsides: it copes with a fully down Icinga service, and it does not create delayed requests between Alertmanager, Signalilo and Icinga. The point is that it should use the same service template as any other alert, so if it fails, Signalilo just reports 500 to Alertmanager asynchronously. Basically the 3rd option is a mix of the 1st and 2nd options, but async 😁. The heartbeat generally reports on the Icinga side that Signalilo has network and basic connectivity, and failures there should restart the process before it accepts alerts (so Alertmanager will not get a 200), while the IcingaApiErrors service covers the general alert-pushing functionality without adding delays to the responses to Alertmanager.

P.S. Sad to hear that you're moving away from Icinga, this middleware is nice 👍.
