Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix logging on config watcher setup failure #3074

Merged
merged 1 commit into from
Apr 5, 2024

Conversation

skrobul
Copy link
Contributor

@skrobul skrobul commented Mar 19, 2024

We have experienced an odd behaviour where argo-events would attempt to start the eventsource pod, then exit just a few seconds later without producing any meaningful error message. The process exit code was set to 1.

After some investigation and tracing the system calls, I found that the code responsible for this is related to usage of the viper config parsing library. Specifically, argo-events guards against bunch of possible errors while initially reading in the configuration, but does not have necessary error checking for setting up the file watchers. An example of such code can be found here:

v.OnConfigChange(func(e fsnotify.Event) {

viper's WatchConfig() method will attempt to setup a watcher and if it's unsuccesful, will log the error message and exit the process with code 1. The trouble is that, by default the viper uses a discard logger so effectively log message is never actually produced.

ref: https://github.com/spf13/viper/blob/8ac644165cf967d7d5be0cb149eb321c4c8ecfcf/viper.go#L446

An example of such execution in the Pod log files is not particularly easy to troubleshoot.

Before the change:

$ ./argo-events-linux-arm64 eventsource-service
{"level":"info","ts":1710844334.7253304,"logger":"argo-events.eventsource","caller":"cmd/start.go:63","msg":"starting eventsource server","eventSourceName":"nautobot-webhook","version":"latest+78d47a2.dirty"}
{"level":"info","ts":1710844334.725548,"logger":"argo-events.eventsource","caller":"eventsources/eventing.go:454","msg":"Starting event source server...","eventSourceName":"nautobot-webhook"}
$
$ echo $?
1
$

After the change:

$ ./argo-events-linux-arm64 eventsource-service
{"level":"info","ts":1710844214.6256192,"logger":"argo-events.eventsource","caller":"cmd/start.go:63","msg":"starting eventsource server","eventSourceName":"nautobot-webhook","version":"latest+78d47a2.dirty"}
{"level":"info","ts":1710844214.6260495,"logger":"argo-events.eventsource","caller":"eventsources/eventing.go:454","msg":"Starting event source server...","eventSourceName":"nautobot-webhook"}
...
{"time":"2024-03-19T10:30:14.626883973Z","level":"ERROR","msg":"failed to create watcher: too many open files"}
$

This bug can be easily reproduced, ideally in separate VM by artificially lowering the number of allowed inotify instances:

$ sudo sysctl fs.inotify.max_user_instances=0
$ ./argo-events-linux-arm64 eventsource-service
...

@skrobul skrobul requested a review from whynowy as a code owner March 19, 2024 11:40
@skrobul skrobul force-pushed the viper-watcher-bug branch from 5735055 to e443d29 Compare March 19, 2024 12:08
@skrobul skrobul marked this pull request as draft March 19, 2024 12:15
@skrobul
Copy link
Contributor Author

skrobul commented Mar 19, 2024

Just realised that slog is not available in 1.20. I'm going to rework this PR with some alternative solution.

@skrobul skrobul force-pushed the viper-watcher-bug branch 4 times, most recently from 27c4056 to 0c08e74 Compare March 19, 2024 12:47
@skrobul skrobul marked this pull request as ready for review March 19, 2024 13:01
@skrobul skrobul force-pushed the viper-watcher-bug branch from 0c08e74 to 128cb18 Compare March 19, 2024 13:36
Copy link
Member

@whynowy whynowy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you fix the conflict?

"golang.org/x/exp/slog"
)

func ViperWithLogging() *viper.Viper {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, thanks!

@skrobul skrobul force-pushed the viper-watcher-bug branch from 128cb18 to eeed829 Compare April 1, 2024 13:49
We have experienced an odd behaviour where argo-events would attempt to start
the eventsource pod, then exit just a few seconds later without producing any
meaningful error message. The process exit code was set to 1.

After some investigation and tracing the system calls, I found that the code
responsible for this is related to usage of the `viper` config parsing library.
Specifically, argo-events guards against bunch of possible errors while
initially reading in the configuration, but does not have necessary error
checking for setting up the file watchers. An example of such code can be found
here:

https://github.com/argoproj/argo-events/blob/78d47a2b6e948b9a3fa3572f0c95d8dcf5d7d8ff/eventbus/driver.go#L151

viper's `WatchConfig()` method will attempt to setup a watcher and if it's
unsuccesful, will log the error message and exit the process with code 1. The
trouble is that, by default the viper uses a discard logger so effectively log
message is never actually produced.

ref: https://github.com/spf13/viper/blob/8ac644165cf967d7d5be0cb149eb321c4c8ecfcf/viper.go#L446

An example of such execution in the Pod log files is not particularly easy to
troubleshoot.

Before the change:
```
$ ./argo-events-linux-arm64 eventsource-service
{"level":"info","ts":1710844334.7253304,"logger":"argo-events.eventsource","caller":"cmd/start.go:63","msg":"starting eventsource server","eventSourceName":"nautobot-webhook","version":"latest+78d47a2.dirty"}
{"level":"info","ts":1710844334.725548,"logger":"argo-events.eventsource","caller":"eventsources/eventing.go:454","msg":"Starting event source server...","eventSourceName":"nautobot-webhook"}
$
$ echo $?
1
$
```

After the change:

```
$ ./argo-events-linux-arm64 eventsource-service
{"level":"info","ts":1710844214.6256192,"logger":"argo-events.eventsource","caller":"cmd/start.go:63","msg":"starting eventsource server","eventSourceName":"nautobot-webhook","version":"latest+78d47a2.dirty"}
{"level":"info","ts":1710844214.6260495,"logger":"argo-events.eventsource","caller":"eventsources/eventing.go:454","msg":"Starting event source server...","eventSourceName":"nautobot-webhook"}
...
{"time":"2024-03-19T10:30:14.626883973Z","level":"ERROR","msg":"failed to create watcher: too many open files"}
$
```

This bug can be easily reproduced, ideally in separate VM by artificially
lowering the number of allowed `inotify` instances:

```
$ sudo sysctl fs.inotify.max_user_instances=0
$ ./argo-events-linux-arm64 eventsource-service
...
```

Signed-off-by: Marek Skrobacki <[email protected]>
Signed-off-by: Marek Skrobacki <[email protected]>
@skrobul skrobul force-pushed the viper-watcher-bug branch from eeed829 to d8d25c3 Compare April 1, 2024 13:50
@skrobul
Copy link
Contributor Author

skrobul commented Apr 1, 2024

@whynowy Could you fix the conflict?

All done, thank you!

@skrobul skrobul requested a review from whynowy April 1, 2024 13:57
@whynowy whynowy merged commit 5627811 into argoproj:master Apr 5, 2024
8 checks passed
whynowy pushed a commit that referenced this pull request Jun 14, 2024
Signed-off-by: Marek Skrobacki <[email protected]>
Signed-off-by: Marek Skrobacki <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants