Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Reconnecting Issues by Fixing Nick Ident and IRC PONGs #17

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

oxzi
Copy link

@oxzi oxzi commented Oct 20, 2022

Previously, I have experienced reconnecting issues when running the alertmanager-irc-relay as an unauthorized IRC user against a ngIRCd as goirc's state tracking learns an invalid nick ident. This has resulted in an endless reconnection loop. Furthermore, it took round about fifteen minutes to detect a connection loss, as IRC PONGs were not counted. As my "production server" has a somewhat challenging uplink, the alertmanager-irc-relay's connection dies on a weekly basis.

This PR introduces both support to restore the original nick ident before attempting to reconnect and to establish an earlier reconnection based on the absence of PONGs. I had those changes running in "production" for round about a week and could now verify successfully reconnects after a short time. Furthermore, I have tested those changes on my own machine against the same IRC server with a simulated packet or connection loss.

More details are available in the commit messages:

   Restore IRC Ident on Reconnect

   After a connection loss on an IRC session with a ngIRCd, the
   alertmanager-irc-relay was unable to reconnect. After some debugging,
   the error's origin was the state tracking within the used goirc library.
   When using an unidentified session, ngIRCd prefixes the user's ident
   with a `~`. The state tracking registers this and keeps `~${NICK}` as
   the current and the new ident for future reconnects. However, `~` is not
   a valid char for the `<user>` part in the `USER` command, at least not
   for ngIRCd.

   To clarify this behaviour, take a look at the following log. First, the
   initial connection is begin established correctly. Keep an eye on the
   `USER` command being sent to the server.

   > http.go:132: INFO Starting HTTP server
   > irc.go:308: INFO Connected to IRC server, waiting to establish session
   > connection.go:543: DEBUG -> NICK alertbot
   > connection.go:543: DEBUG -> USER alertbot 12 * :Alertmanager IRC Relay
   > connection.go:474: DEBUG <- :__SERVER__ 001 alertbot :Welcome to the Internet Relay Network alertbot!~alertbot@__IP__

   Now, there was a network incident and the session needs to be recreated.

   > connection.go:466: ERROR irc.recv(): read tcp __REDACTED__: read: connection timed out
   > connection.go:577: INFO irc.Close(): Disconnected from server.
   > irc.go:150: INFO Disconnected from IRC
   > reconciler.go:129: INFO Channel #alerts monitor: context canceled while monitoring
   > irc.go:300: INFO Connecting to IRC __SERVER__
   > backoff.go:111: INFO Backoff for 0s starts
   > backoff.go:114: INFO Backoff for 0s ends
   > connection.go:390: INFO irc.Connect(): Connecting to __SERVER__.
   > irc.go:308: INFO Connected to IRC server, waiting to establish session
   > connection.go:543: DEBUG -> NICK alertbot
   > connection.go:543: DEBUG -> USER ~alertbot 12 * :Alertmanager IRC Relay
   > connection.go:474: DEBUG <- ERROR :Invalid user name
   > connection.go:577: INFO irc.Close(): Disconnected from server.
   > irc.go:150: INFO Disconnected from IRC
   > irc.go:319: WARN Receiving a session down before the session is up, this is odd

   This time, the used `user` part of the `USER` command has the prefixed
   `~` and fails. However, without using `-debug` and taking a very close
   look, this error can be missed very easy.

   As the new ident is invalid, the alertmanager-irc-relay is now stuck in
   an endless reconnection loop.

   This fix is kind of straight forward and just checks if the ident has
   changed before trying to reconnect. It might not be the prettiest
   solution, but recreating the whole *irc.Config resulted in other bugs as
   it was still referenced - even after being `Close`d.
    Detect connection loss through IRC PONG

    In the current state, the alertmanager-irc-relay already sends minutely
    IRC PINGs. This allows to check the IRC connection's health in protocol
    without having to deal with specific TCP settings. However, even when
    we are sending those PINGs, we don't process the server's PONGs or their
    absence.

    On one of my alertmanager-irc-relay instances, the time between the last
    received PONG and the TCP read to fail was round about fifteen minutes.
    All this time, the connection was already dead, but there was no attempt
    to reestablish it.

    The introduces changes keep book on the last received PONG and fails if
    there was no new PONG within twice the pingFrequencySecs time. When
    establishing a new connection during the SetupPhase, the current time
    will be set as the last PONG's time to reset the time comparison.

I have rebased my changes against the latest development in the master branch and have successfully verified that they are still working and not breaking anything else.

@shammash
Copy link
Collaborator

Thank you very much for your contribution and for fixing the bugs.

I'd like to make sure we do not accidentally reintroduce these issues in the future.

In order to do so, may I ask you to write a couple of tests in irc_test.go ?

Thank you!

@oxzi
Copy link
Author

oxzi commented Oct 24, 2022

Thanks for your kind review.

I have tried created tests for both use cases and just force-pushed them:

  • TestReconnectNickIdentChange sends back a changed nick ident and checks how the client reacts.
  • TestReconnectMissingPong runs with a minimal PING interval of a second and just waits for the client to reconnect.
  • TestPingPong also runs with a minimal PING interval but supports PING and let the client and server send three PING/PONGs back and forth.

To be able to alter the PING interval, I have refactored the previously constant pingFrequencySecs into a field of the Config struct, as otherwise more in-depth changes were necessary just for those two tests.

Please feel free to take another look. Thanks.

@oxzi
Copy link
Author

oxzi commented Oct 24, 2022

Just force-pushed the second PONG-commit again to fix the TestReconnectMissingPong against a race between goirc's next PING and the PONG timeout when using an interval of only one second. Furthermore, replaced time.Now().Sub(…) by time.Since(…).

@oxzi
Copy link
Author

oxzi commented Nov 6, 2022

I reworked the closing of the IRC connection and re-run the tests successfully within the GitHub Actions, https://github.com/oxzi/alertmanager-irc-relay/actions/runs/3403900480.

However, I did some bulk tests - running multiple go test -v -count=32 and also -count=128 both on Go 1.15 and Go 1.17 - and detected some rare random failures on different tests. For example, on TestConnectErrorRetry, TestKickRejoin, and TestSendAlertDisconnected I experienced timeouts of ten minutes and on other tests sometimes the final QUIT :see ya message was missing. For reference, I also experienced the same timeouts on the current master branch without these changes.

@shammash: Would you like to take a look? Especially as, at least some of those test failures, are not related to my proposed changes.

On another note, I am running those changes on my systems since I opened the PR and it successfully survived outage on both sides, including one downtime and an OS upgrade on the IRC server's end.

@riotbib
Copy link

riotbib commented Dec 22, 2022

This PR is also relevant to me. Thanks a lot to @oxzi; please consider merging.

@flxai
Copy link

flxai commented Jan 24, 2023

@shammash is there something still missing? Read through the PR and it seems fine to me. Looks mergable.

oxzi added 2 commits June 7, 2023 15:39
After a connection loss on an IRC session with a ngIRCd, the
alertmanager-irc-relay was unable to reconnect. After some debugging,
the error's origin was the state tracking within the used goirc library.
When using an unidentified session, ngIRCd prefixes the user's ident
with a `~`. The state tracking registers this and keeps `~${NICK}` as
the current and the new ident for future reconnects. However, `~` is not
a valid char for the `<user>` part in the `USER` command, at least not
for ngIRCd.

To clarify this behaviour, take a look at the following log. First, the
initial connection is begin established correctly. Keep an eye on the
`USER` command being sent to the server.

> http.go:132: INFO Starting HTTP server
> irc.go:308: INFO Connected to IRC server, waiting to establish session
> connection.go:543: DEBUG -> NICK alertbot
> connection.go:543: DEBUG -> USER alertbot 12 * :Alertmanager IRC Relay
> connection.go:474: DEBUG <- :__SERVER__ 001 alertbot :Welcome to the Internet Relay Network alertbot!~alertbot@__IP__

Now, there was a network incident and the session needs to be recreated.

> connection.go:466: ERROR irc.recv(): read tcp __REDACTED__: read: connection timed out
> connection.go:577: INFO irc.Close(): Disconnected from server.
> irc.go:150: INFO Disconnected from IRC
> reconciler.go:129: INFO Channel #alerts monitor: context canceled while monitoring
> irc.go:300: INFO Connecting to IRC __SERVER__
> backoff.go:111: INFO Backoff for 0s starts
> backoff.go:114: INFO Backoff for 0s ends
> connection.go:390: INFO irc.Connect(): Connecting to __SERVER__.
> irc.go:308: INFO Connected to IRC server, waiting to establish session
> connection.go:543: DEBUG -> NICK alertbot
> connection.go:543: DEBUG -> USER ~alertbot 12 * :Alertmanager IRC Relay
> connection.go:474: DEBUG <- ERROR :Invalid user name
> connection.go:577: INFO irc.Close(): Disconnected from server.
> irc.go:150: INFO Disconnected from IRC
> irc.go:319: WARN Receiving a session down before the session is up, this is odd

This time, the used `user` part of the `USER` command has the prefixed
`~` and fails. However, without using `-debug` and taking a very close
look, this error can be missed very easy.

As the new ident is invalid, the alertmanager-irc-relay is now stuck in
an endless reconnection loop.

This fix is kind of straight forward and just checks if the ident has
changed before trying to reconnect. It might not be the prettiest
solution, but recreating the whole *irc.Config resulted in other bugs as
it was still referenced - even after being `Close`d.
In the current state, the alertmanager-irc-relay already sends minutely
IRC PINGs. This allows to check the IRC connection's health in protocol
without having to deal with specific TCP settings. However, even when
we are sending those PINGs, we don't process the server's PONGs or their
absence.

On one of my alertmanager-irc-relay instances, the time between the last
received PONG and the TCP read to fail was round about fifteen minutes.
All this time, the connection was already dead, but there was no attempt
to reestablish it.

The introduces changes keep book on the last received PONG and fails if
there was no new PONG within twice the pingFrequencySecs time. When
establishing a new connection during the SetupPhase, the current time
will be set as the last PONG's time to reset the time comparison.
@oxzi oxzi force-pushed the reconnect-fix branch from 3b3ccb9 to 5aed959 Compare June 7, 2023 13:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants