fix logging on config watcher setup failure #3074

skrobul · 2024-03-19T11:40:34Z

We have experienced an odd behaviour where argo-events would attempt to start the eventsource pod, then exit just a few seconds later without producing any meaningful error message. The process exit code was set to 1.

After some investigation and tracing the system calls, I found that the cause lies in how the viper config parsing library is used. Specifically, argo-events guards against a number of possible errors while initially reading in the configuration, but has no error checking for setting up the file watchers. An example of such code can be found here:

argo-events/eventbus/driver.go

Line 151 in 78d47a2

v.OnConfigChange(func(e fsnotify.Event) {

viper's WatchConfig() method attempts to set up a watcher and, if that is unsuccessful, logs the error message and exits the process with code 1. The trouble is that viper uses a discard logger by default, so the log message is never actually produced.

ref: https://github.com/spf13/viper/blob/8ac644165cf967d7d5be0cb149eb321c4c8ecfcf/viper.go#L446

The resulting Pod logs give very little to troubleshoot with, as the example below shows.

Before the change:

$ ./argo-events-linux-arm64 eventsource-service
{"level":"info","ts":1710844334.7253304,"logger":"argo-events.eventsource","caller":"cmd/start.go:63","msg":"starting eventsource server","eventSourceName":"nautobot-webhook","version":"latest+78d47a2.dirty"}
{"level":"info","ts":1710844334.725548,"logger":"argo-events.eventsource","caller":"eventsources/eventing.go:454","msg":"Starting event source server...","eventSourceName":"nautobot-webhook"}
$
$ echo $?
1
$

After the change:

$ ./argo-events-linux-arm64 eventsource-service
{"level":"info","ts":1710844214.6256192,"logger":"argo-events.eventsource","caller":"cmd/start.go:63","msg":"starting eventsource server","eventSourceName":"nautobot-webhook","version":"latest+78d47a2.dirty"}
{"level":"info","ts":1710844214.6260495,"logger":"argo-events.eventsource","caller":"eventsources/eventing.go:454","msg":"Starting event source server...","eventSourceName":"nautobot-webhook"}
...
{"time":"2024-03-19T10:30:14.626883973Z","level":"ERROR","msg":"failed to create watcher: too many open files"}
$

This bug can be easily reproduced, ideally in a separate VM, by artificially lowering the number of allowed inotify instances:

$ sudo sysctl fs.inotify.max_user_instances=0
$ ./argo-events-linux-arm64 eventsource-service
...

skrobul · 2024-03-19T12:30:33Z

Just realised that slog is not available in Go 1.20. I'm going to rework this PR with an alternative solution.

whynowy

Could you fix the conflict?

whynowy · 2024-03-31T17:41:55Z

common/viper.go

+	"golang.org/x/exp/slog"
+)
+
+func ViperWithLogging() *viper.Viper {


This is great, thanks!

skrobul · 2024-04-01T13:57:45Z

@whynowy Could you fix the conflict?

All done, thank you!

skrobul requested a review from whynowy as a code owner March 19, 2024 11:40

skrobul force-pushed the viper-watcher-bug branch from 5735055 to e443d29 Compare March 19, 2024 12:08

skrobul marked this pull request as draft March 19, 2024 12:15

skrobul force-pushed the viper-watcher-bug branch 4 times, most recently from 27c4056 to 0c08e74 Compare March 19, 2024 12:47

skrobul marked this pull request as ready for review March 19, 2024 13:01

skrobul force-pushed the viper-watcher-bug branch from 0c08e74 to 128cb18 Compare March 19, 2024 13:36

whynowy approved these changes Mar 31, 2024

skrobul force-pushed the viper-watcher-bug branch from 128cb18 to eeed829 Compare April 1, 2024 13:49

skrobul force-pushed the viper-watcher-bug branch from eeed829 to d8d25c3 Compare April 1, 2024 13:50

skrobul requested a review from whynowy April 1, 2024 13:57

whynowy merged commit 5627811 into argoproj:master Apr 5, 2024
8 checks passed

whynowy pushed a commit that referenced this pull request Jun 14, 2024

fix logging on config watcher setup failure (#3074)

cf8736e

Signed-off-by: Marek Skrobacki <[email protected]> Signed-off-by: Marek Skrobacki <[email protected]>
