feat: make agent reconnect in case of connection dropping #3342
@@ -9,11 +9,15 @@ import (

func (c *Client) startHearthBeat(ctx context.Context) error {
	client := proto.NewOrchestratorClient(c.conn)
-	ticker := time.NewTicker(2 * time.Minute)
+	ticker := time.NewTicker(c.config.PingPeriod)
Changed the ping to 30s
agent/client/client.go
Outdated
// connection is not working. We need to reconnect
err := retry.Do(func() error {
	return c.connect(context.Background())
}, retry.Attempts(3), retry.Delay(1*time.Second))
This will do 3 attempts with 1 second between them, right? Maybe we want to make it longer, at least 1 minute in total?
The delay doubles on every attempt, so that would be 7 seconds total. If we set attempts to 6, we would get a bit longer than 1 minute.
	return c.reconnect()
})
if err == nil {
	// everything was reconnect, so we can exist this goroutine
what does this comment mean? I don't understand it :S
When we call reconnect, it needs to restart the client, which calls this function again in another goroutine.
agent/client/client.go
Outdated
// connection is not working. We need to reconnect
err := retry.Do(func() error {
	return c.connect(context.Background())
}, retry.Attempts(3), retry.Delay(1*time.Second))
Can we add these numbers as constants in the code? Something like:
// somewhere
const reconnectRetryAttempts = 3
const reconnectRetryAttemptDelay = 1 * time.Second

// here
err := retry.Do(func() error {
	return c.connect(context.Background())
}, retry.Attempts(reconnectRetryAttempts), retry.Delay(reconnectRetryAttemptDelay))
agent/client/connector.go
Outdated
if err != nil {
	return nil, err
config := Config{
	PingPeriod: 30 * time.Second,
Same thing here. Can we add this config as a constant?
if err != nil && isConnectionError(err) {
	err = retry.Do(func() error {
		return c.reconnect()
	})
	if err == nil {
		// everything was reconnect, so we can exist this goroutine
		// as there's another one running in parallel
		return
	}

	log.Fatal(err)
}
All these functions are copy/pasted. Can't they be generalized? Something like:
func (c *Client) handleConnectionError(inputErr error) error {
	if !isConnectionError(inputErr) {
		// if it's nil or any error other than the one we care about, return it and let the caller handle it
		return inputErr
	}
	return retry.Do(func() error {
		return c.reconnect()
	})
}
and then the call site changes like this:
- if err != nil && isConnectionError(err) {
- 	err = retry.Do(func() error {
- 		return c.reconnect()
- 	})
- 	if err == nil {
- 		// everything was reconnect, so we can exist this goroutine
- 		// as there's another one running in parallel
- 		return
- 	}
- 	log.Fatal(err)
- }
+ err = c.handleConnectionError(err)
+ if err != nil {
+ 	log.Fatal(err)
+ }
I used the core idea of your suggestion.
I didn't follow it 100% for two reasons:
- I cannot exit the process if any error happens, only if it's a connection error and we can't reconnect.
- I still need to exit the goroutine, because when c.reconnect() is called, a new goroutine will spawn with a new stream object. When reconnecting to a gRPC server, we also need to get a new stream using the new connection. So I need a way of ensuring that the old goroutine will not run again. That's why we have this:
if err == nil {
	return
}
It means that the reconnection was successful and now we have a new grpc.Conn and new gRPC streams as well.
This PR makes the agent reconnect to the server in case of a grpc failure (server disconnects or any other error). If not possible to reconnect, the agent will exit.