
feat: make agent reconnect in case of connection dropping #3342

Merged
mathnogueira merged 13 commits into main on Nov 9, 2023

Conversation

mathnogueira (Contributor) commented Nov 8, 2023

This PR makes the agent reconnect to the server in case of a gRPC failure (the server disconnects or any other connection error occurs). If it is not possible to reconnect, the agent exits.
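For orientation, the retry in the diff below is gated on an isConnectionError helper. A minimal sketch of how such a check is commonly written with grpc-go follows; the status codes chosen here are an assumption, not necessarily what the PR's helper uses:

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// isConnectionError reports whether err looks like a dropped gRPC connection.
// Hypothetical sketch; the PR's real helper may treat different codes as
// connection failures.
func isConnectionError(err error) bool {
	s, ok := status.FromError(err)
	if !ok {
		return false // not a gRPC status error
	}
	return s.Code() == codes.Unavailable || s.Code() == codes.Canceled
}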

Checklist

  • tested locally
  • added new dependencies
  • updated the docs
  • added a test

@@ -9,11 +9,15 @@ import (

func (c *Client) startHearthBeat(ctx context.Context) error {
client := proto.NewOrchestratorClient(c.conn)
-	ticker := time.NewTicker(2 * time.Minute)
+	ticker := time.NewTicker(c.config.PingPeriod)
mathnogueira (Contributor, Author):

Changed the ping to 30s
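For context, a hypothetical shape of the heartbeat loop around that ticker; only the function signature, the client construction, and the ticker line come from the diff, while sendPing and the loop structure are assumptions:

// Hypothetical shape of the heartbeat loop; sendPing is a placeholder for the
// actual ping RPC used by the agent.
func (c *Client) startHearthBeat(ctx context.Context) error {
	client := proto.NewOrchestratorClient(c.conn)
	ticker := time.NewTicker(c.config.PingPeriod)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			if err := sendPing(ctx, client); err != nil {
				return err
			}
		}
	}
}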

// connection is not working. We need to reconnect
err := retry.Do(func() error {
return c.connect(context.Background())
}, retry.Attempts(3), retry.Delay(1*time.Second))
Contributor:

This will do 3 attempts with 1 second between them, right? Is that correct? Maybe we want to make it longer, like at least 1 minute total?

mathnogueira (Contributor, Author) Nov 9, 2023:

The delay doubles on every attempt, so that would be 7 seconds total. If we set attempts to 6 we would get a bit longer than 1 minute.
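For reference, with a delay that doubles per attempt the cumulative wait over n delays starting at base is base * (2^n - 1). A tiny helper, not part of the PR, makes that arithmetic concrete:

// totalBackoff returns the cumulative wait for n doubling delays starting at
// base: base * (2^n - 1). Illustration only, not part of the PR.
func totalBackoff(base time.Duration, n uint) time.Duration {
	return base * time.Duration((1<<n)-1)
}

// totalBackoff(time.Second, 3) == 7s, totalBackoff(time.Second, 6) == 63s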

return c.reconnect()
})
if err == nil {
// everything was reconnect, so we can exist this goroutine
Contributor:

what does this comment mean? I don't understand it :S

mathnogueira (Contributor, Author):

When we call reconnect, it needs to restart the client, which calls this function again in another goroutine.

// connection is not working. We need to reconnect
err := retry.Do(func() error {
return c.connect(context.Background())
}, retry.Attempts(3), retry.Delay(1*time.Second))
Contributor:

Can we add these numbers as constants in the code? Something like:

// somewhere
const reconnectRetryAttempts = 3
const reconnectRetryAttemptDelay = 1 * time.Second

// here 
	err := retry.Do(func() error {
		return c.connect(context.Background())
	}, retry.Attempts(reconnectRetryAttempts), retry.Delay(reconnectRetryAttemptDelay))

	if err != nil {
		return nil, err
	}

	config := Config{
		PingPeriod: 30 * time.Second,
Contributor:

Same thing here. Can we add this config as a constant?
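For example, a sketch of that suggestion with a placeholder name (defaultPingPeriod is not a name from the PR):

// somewhere
const defaultPingPeriod = 30 * time.Second

// here
config := Config{
	PingPeriod: defaultPingPeriod,
}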

Comment on lines 28 to 40
	if err != nil && isConnectionError(err) {
		err = retry.Do(func() error {
			return c.reconnect()
		})
		if err == nil {
			// everything was reconnect, so we can exist this goroutine
			// as there's another one running in parallel
			return
		}

		log.Fatal(err)
	}

schoren (Contributor) Nov 9, 2023:

All these functions are copy/pasted. Can't they be generalized? Like:

func (c Client) handleDisconnectionError(inputErr error) error {
	if !isConnectionError(inputErr) {
		// if it's nil or any error other than the one we care about,
		// return it and let the caller handle it
		return inputErr
	}
	return retry.Do(func() error {
		return c.reconnect()
	})
}

and then:

Suggested change
-if err != nil && isConnectionError(err) {
-	err = retry.Do(func() error {
-		return c.reconnect()
-	})
-	if err == nil {
-		// everything was reconnect, so we can exist this goroutine
-		// as there's another one running in parallel
-		return
-	}
-	log.Fatal(err)
-}
+err = c.handleConnectionError(err)
+if err != nil {
+	log.Fatal(err)
+}

mathnogueira (Contributor, Author):

I used the core idea of your suggestion.

I didn't follow it 100% for two reasons:

  1. I cannot exit the process on just any error, only when it's a connection error and we can't reconnect.
  2. I still need to exit the goroutine because when c.reconnect() is called, a new goroutine is spawned with a new stream object. When reconnecting to a gRPC server, we also need to get a new stream using the new connection. So I need a way of ensuring that the old goroutine will not run again. That's why we have this:
if err == nil {
  return
}

It means that the reconnection was successful and now we have a new grpc.Conn and new gRPC streams as well.
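A rough sketch of the lifecycle described above, assuming placeholder names for everything except reconnect, isConnectionError, retry.Do, and log.Fatal, which appear in the diff:

// Hypothetical sketch of the goroutine lifecycle being described; startListener
// and receiveFromStream are placeholders, not names from the PR.
func (c *Client) startListener(ctx context.Context) {
	go func() {
		for {
			err := c.receiveFromStream(ctx) // placeholder for reading the gRPC stream
			if err == nil || !isConnectionError(err) {
				continue // non-connection errors are handled elsewhere, not fatal
			}
			// reconnect dials a new grpc.Conn and restarts this listener with a
			// new stream, so on success this (old) goroutine must return and
			// let the new one take over.
			if retryErr := retry.Do(func() error { return c.reconnect() }); retryErr == nil {
				return
			}
			// could not reconnect: exit the agent
			log.Fatal(err)
		}
	}()
}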

@mathnogueira mathnogueira requested a review from schoren November 9, 2023 22:05
@mathnogueira mathnogueira merged commit 3c24e6e into main Nov 9, 2023
38 checks passed
@mathnogueira mathnogueira deleted the feat/agent-reconnection branch November 9, 2023 22:18