Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Argo Sensors pods die when the EventBus leader pod is killed #2376

Closed
vgazzola opened this issue Dec 23, 2022 · 6 comments · Fixed by #2839
Closed

Argo Sensors pods die when the EventBus leader pod is killed #2376

vgazzola opened this issue Dec 23, 2022 · 6 comments · Fixed by #2839
Labels

Comments

@vgazzola
Copy link

We encounter some instability with Argo Events.

Versions and setup

  • Argo events helm chart v.2.0.9
  • 3 Eventbus pods, using JetStream version 2.8.2

When the EventBus leader is killed, our sensor dies with errors like:
2022-12-22T15:30:37.963Z INFO argo-events.sensor base/jetstream.go:102 Connected to NATS Jetstream server. {"sensorName": "nats-sensor-infra-present"} 2022-12-22T15:30:42.940Z ERROR argo-events.sensor sensor/trigger_conn.go:78 failed to get K/V store for sensor nats-sensor-infra-present: context deadline exceeded {"sensorName": "nats-sensor-infra-present", "triggerName": "app-trigger"} github.com/argoproj/argo-events/eventbus/jetstream/sensor.NewJetstreamTriggerConn /home/runner/work/argo-events/argo-events/eventbus/jetstream/sensor/trigger_conn.go:78 github.com/argoproj/argo-events/eventbus/jetstream/sensor.(*SensorJetstream).Connect /home/runner/work/argo-events/argo-events/eventbus/jetstream/sensor/sensor_jetstream.go:77 github.com/argoproj/argo-events/sensors.(*SensorContext).listenEvents.func2 /home/runner/work/argo-events/argo-events/sensors/listener.go:292 2022-12-22T15:30:42.940Z ERROR argo-events.sensor sensors/listener.go:294 failed to reconnect to eventbus {"sensorName": "nats-sensor-infra-present", "triggerName": "app-trigger", "connection": "", "error": "failed to get K/V store for sensor nats-sensor-infra-present: context deadline exceeded"} github.com/argoproj/argo-events/sensors.(*SensorContext).listenEvents.func2 /home/runner/work/argo-events/argo-events/sensors/listener.go:294 panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1a94360]

I saw the somewhat similar issue
#2086
and the related change, but I'm not certain it will fix our issue.

@whynowy
Copy link
Member

whynowy commented Jan 4, 2023

What is the version of Argo Events (sorry the helm chart is maintained by community users)?

@vgazzola
Copy link
Author

vgazzola commented Jan 9, 2023

The latest I've tried is 1.7.4
We had the same behaviour with 1.7.1 and 1.7.3

@whynowy
Copy link
Member

whynowy commented Jan 11, 2023

r.go:294

Did you use persistence for the EventBus?
After the eventbus pod is killed, did it come back successfully?

@vgazzola
Copy link
Author

vgazzola commented Jan 13, 2023

Did you use persistence for the EventBus?

Yes we do

After the eventbus pod is killed, did it come back successfully?

Yes it does.
The sensor usually also comes back, but sometimes doesn't

@github-actions
Copy link
Contributor

This issue has been automatically marked as stale because it has not had
any activity in the last 60 days. It will be closed if no further activity
occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale label Mar 15, 2023
@whynowy whynowy added pinned and removed stale labels Mar 15, 2023
@gokulav137
Copy link
Contributor

gokulav137 commented Sep 24, 2023

Looks like the bug is in this line: https://github.com/argoproj/argo-events/blob/master/sensors/listener.go#L301

if conn == nil || conn.IsClosed() {

here conn is of type eventbuscommon.TriggerConnection which is an interface so nil comparison won't check if the interface value is nil. So conn.IsClosed() could be called on a nil value. Which is what is causing the error.

2023-09-23T15:04:39.436Z        ERROR   argo-events.sensor      sensor/trigger_conn.go:78       failed to get K/V store for sensor webhook: nats: stream is offline  {"sensorName": "webhook", "triggerName": "webhook-email-trigger", "sensorName": "webhook"}
github.com/argoproj/argo-events/eventbus/jetstream/sensor.NewJetstreamTriggerConn
        /home/runner/work/argo-events/argo-events/eventbus/jetstream/sensor/trigger_conn.go:78
github.com/argoproj/argo-events/eventbus/jetstream/sensor.(*SensorJetstream).Connect
        /home/runner/work/argo-events/argo-events/eventbus/jetstream/sensor/sensor_jetstream.go:85
github.com/argoproj/argo-events/sensors.(*SensorContext).listenEvents.func2
        /home/runner/work/argo-events/argo-events/sensors/listener.go:303
2023-09-23T15:04:39.436Z        ERROR   argo-events.sensor      sensors/listener.go:305 failed to reconnect to eventbus {"sensorName": "webhook", "triggerName": "webhook-email-trigger", "connection": "", "error": "failed to get K/V store for sensor webhook: nats: stream is offline"}
github.com/argoproj/argo-events/sensors.(*SensorContext).listenEvents.func2
        /home/runner/work/argo-events/argo-events/sensors/listener.go:305
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1b8b75d]

goroutine 71 [running]:
github.com/argoproj/argo-events/eventbus/jetstream/sensor.(*JetstreamTriggerConn).IsClosed(0xc000a19700?)
        <autogenerated>:1 +0x1d
github.com/argoproj/argo-events/sensors.(*SensorContext).listenEvents.func2({0xc0008a1d60, {0x0, 0x0, 0x0}, 0x0, 0x0, 0x0, 0x0})
        /home/runner/work/argo-events/argo-events/sensors/listener.go:301 +0x1552
created by github.com/argoproj/argo-events/sensors.(*SensorContext).listenEvents
        /home/runner/work/argo-events/argo-events/sensors/listener.go:130 +0x39e

Normally methods are nil safe but the TriggerConnections are composing a pointer to a connection struct which is actually implimenting methods IsClosed and Close of the interface. The connection struct has nil safe implimentation of the interface but the composed struct TriggerConnections isn't nil safe.

Eg: Jetstream Trigger Connection
https://github.com/argoproj/argo-events/blob/master/eventbus/jetstream/base/jetstream_conn.go#L10-L31

//IsClosed and Close are nil safe for JetstreamConnection
type JetstreamConnection struct {
	NATSConn  *nats.Conn
	JSContext nats.JetStreamContext

	NATSConnected bool

	Logger *zap.SugaredLogger
}

func (jsc *JetstreamConnection) Close() error {
	if jsc == nil {
		return fmt.Errorf("can't close Jetstream connection, JetstreamConnection is nil")
	}
	if jsc.NATSConn != nil && jsc.NATSConn.IsConnected() {
		jsc.NATSConn.Close()
	}
	return nil
}

func (jsc *JetstreamConnection) IsClosed() bool {
	return jsc == nil || jsc.NATSConn == nil || !jsc.NATSConnected || jsc.NATSConn.IsClosed()
}

https://github.com/argoproj/argo-events/blob/master/eventbus/jetstream/sensor/trigger_conn.go#L20-L32

//JetstreamTriggerConn will panic on IsClosed() and Close()
type JetstreamTriggerConn struct {
	*jetstreambase.JetstreamConnection
	sensorName           string
	triggerName          string
	keyValueStore        nats.KeyValue
	dependencyExpression string
	requiresANDLogic     bool
	evaluableExpression  *govaluate.EvaluableExpression
	deps                 []eventbuscommon.Dependency
	sourceDepMap         map[string][]string // maps EventSource and EventName to dependency name
	recentMsgsByID       map[string]*msg     // prevent re-processing the same message as before (map of msg ID to time)
	recentMsgsByTime     []*msg
}

To fix this issue, I think there are three options.

  • Make the methods nil safe
  • Use reflect package for nil check.
  • Have a type switch to check for nil. But this means all the implimentations have to be referenced.

I think the nil safe option is the better choice.

func (conn *JetstreamTriggerConn) IsClosed() bool {
	return conn == nil || conn.JetstreamConnection.IsClosed()
}

func (conn *JetstreamTriggerConn) Close() error {
	if conn == nil {
		return fmt.Errorf("can't close Jetstream trigger connection, JetstreamTriggerConn is nil")
	}
	return conn.JetstreamConnection.Close()
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants