-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Reconnecting Issues by Fixing Nick Ident and IRC PONGs #17
base: master
Are you sure you want to change the base?
Conversation
Thank you very much for your contribution and for fixing the bugs. I'd like to make sure we do not accidentally reintroduce these issues in the future. In order to do so, may I ask you to write a couple of tests in irc_test.go ? Thank you! |
Thanks for your kind review. I have tried created tests for both use cases and just force-pushed them:
To be able to alter the PING interval, I have refactored the previously constant Please feel free to take another look. Thanks. |
Just force-pushed the second PONG-commit again to fix the |
I reworked the closing of the IRC connection and re-run the tests successfully within the GitHub Actions, https://github.com/oxzi/alertmanager-irc-relay/actions/runs/3403900480. However, I did some bulk tests - running multiple @shammash: Would you like to take a look? Especially as, at least some of those test failures, are not related to my proposed changes. On another note, I am running those changes on my systems since I opened the PR and it successfully survived outage on both sides, including one downtime and an OS upgrade on the IRC server's end. |
This PR is also relevant to me. Thanks a lot to @oxzi; please consider merging. |
@shammash is there something still missing? Read through the PR and it seems fine to me. Looks mergable. |
After a connection loss on an IRC session with a ngIRCd, the alertmanager-irc-relay was unable to reconnect. After some debugging, the error's origin was the state tracking within the used goirc library. When using an unidentified session, ngIRCd prefixes the user's ident with a `~`. The state tracking registers this and keeps `~${NICK}` as the current and the new ident for future reconnects. However, `~` is not a valid char for the `<user>` part in the `USER` command, at least not for ngIRCd. To clarify this behaviour, take a look at the following log. First, the initial connection is begin established correctly. Keep an eye on the `USER` command being sent to the server. > http.go:132: INFO Starting HTTP server > irc.go:308: INFO Connected to IRC server, waiting to establish session > connection.go:543: DEBUG -> NICK alertbot > connection.go:543: DEBUG -> USER alertbot 12 * :Alertmanager IRC Relay > connection.go:474: DEBUG <- :__SERVER__ 001 alertbot :Welcome to the Internet Relay Network alertbot!~alertbot@__IP__ Now, there was a network incident and the session needs to be recreated. > connection.go:466: ERROR irc.recv(): read tcp __REDACTED__: read: connection timed out > connection.go:577: INFO irc.Close(): Disconnected from server. > irc.go:150: INFO Disconnected from IRC > reconciler.go:129: INFO Channel #alerts monitor: context canceled while monitoring > irc.go:300: INFO Connecting to IRC __SERVER__ > backoff.go:111: INFO Backoff for 0s starts > backoff.go:114: INFO Backoff for 0s ends > connection.go:390: INFO irc.Connect(): Connecting to __SERVER__. > irc.go:308: INFO Connected to IRC server, waiting to establish session > connection.go:543: DEBUG -> NICK alertbot > connection.go:543: DEBUG -> USER ~alertbot 12 * :Alertmanager IRC Relay > connection.go:474: DEBUG <- ERROR :Invalid user name > connection.go:577: INFO irc.Close(): Disconnected from server. > irc.go:150: INFO Disconnected from IRC > irc.go:319: WARN Receiving a session down before the session is up, this is odd This time, the used `user` part of the `USER` command has the prefixed `~` and fails. However, without using `-debug` and taking a very close look, this error can be missed very easy. As the new ident is invalid, the alertmanager-irc-relay is now stuck in an endless reconnection loop. This fix is kind of straight forward and just checks if the ident has changed before trying to reconnect. It might not be the prettiest solution, but recreating the whole *irc.Config resulted in other bugs as it was still referenced - even after being `Close`d.
In the current state, the alertmanager-irc-relay already sends minutely IRC PINGs. This allows to check the IRC connection's health in protocol without having to deal with specific TCP settings. However, even when we are sending those PINGs, we don't process the server's PONGs or their absence. On one of my alertmanager-irc-relay instances, the time between the last received PONG and the TCP read to fail was round about fifteen minutes. All this time, the connection was already dead, but there was no attempt to reestablish it. The introduces changes keep book on the last received PONG and fails if there was no new PONG within twice the pingFrequencySecs time. When establishing a new connection during the SetupPhase, the current time will be set as the last PONG's time to reset the time comparison.
Previously, I have experienced reconnecting issues when running the alertmanager-irc-relay as an unauthorized IRC user against a ngIRCd as goirc's state tracking learns an invalid nick ident. This has resulted in an endless reconnection loop. Furthermore, it took round about fifteen minutes to detect a connection loss, as IRC PONGs were not counted. As my "production server" has a somewhat challenging uplink, the alertmanager-irc-relay's connection dies on a weekly basis.
This PR introduces both support to restore the original nick ident before attempting to reconnect and to establish an earlier reconnection based on the absence of PONGs. I had those changes running in "production" for round about a week and could now verify successfully reconnects after a short time. Furthermore, I have tested those changes on my own machine against the same IRC server with a simulated packet or connection loss.
More details are available in the commit messages:
I have rebased my changes against the latest development in the
master
branch and have successfully verified that they are still working and not breaking anything else.