Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(agent): configure retry strategy #3317

Merged
merged 4 commits into from
Oct 30, 2023
Merged

feat(agent): configure retry strategy #3317

merged 4 commits into from
Oct 30, 2023

Conversation

schoren
Copy link
Contributor

@schoren schoren commented Oct 30, 2023

This PR configures the gRPC client retry strategy to allow better handling of connection issues. In the case of connection interruption between the server and the agent for whatever reason (server restarted, networking issues, etc), the client will retry the operation using an exponential backoff strategy. With the currently configured values, this is the retry timeline:

Attempt 1 at 0.1 seconds, Backoff Time: 0.1 seconds
Attempt 2 at 0.3 seconds, Backoff Time: 0.2 seconds
Attempt 3 at 0.7 seconds, Backoff Time: 0.4 seconds
Attempt 4 at 1.5 seconds, Backoff Time: 0.8 seconds
Attempt 5 at 3.1 seconds, Backoff Time: 1.6 seconds
Attempt 6 at 6.3 seconds, Backoff Time: 3.2 seconds
Attempt 7 at 11.3 seconds, Backoff Time: 5.0 seconds
Attempt 8 at 16.3 seconds, Backoff Time: 5.0 seconds
Attempt 9 at 21.3 seconds, Backoff Time: 5.0 seconds
Attempt 10 at 26.3 seconds, Backoff Time: 5.0 seconds
Attempt 11 at 31.3 seconds, Backoff Time: 5.0 seconds
Attempt 12 at 36.3 seconds, Backoff Time: 5.0 seconds
Attempt 13 at 41.3 seconds, Backoff Time: 5.0 seconds
Attempt 14 at 46.3 seconds, Backoff Time: 5.0 seconds
Attempt 15 at 51.3 seconds, Backoff Time: 5.0 seconds
Attempt 16 at 56.3 seconds, Backoff Time: 5.0 seconds
Attempt 17 at 61.3 seconds, Backoff Time: 5.0 seconds
Attempt 18 at 66.3 seconds, Backoff Time: 5.0 seconds
Attempt 19 at 71.3 seconds, Backoff Time: 5.0 seconds
Attempt 20 at 76.3 seconds, Backoff Time: 5.0 seconds
Attempt 21 at 81.3 seconds, Backoff Time: 5.0 seconds
Attempt 22 at 86.3 seconds, Backoff Time: 5.0 seconds
Attempt 23 at 91.3 seconds, Backoff Time: 5.0 seconds
Attempt 24 at 96.3 seconds, Backoff Time: 5.0 seconds
Attempt 25 at 101.3 seconds, Backoff Time: 5.0 seconds
Attempt 26 at 106.3 seconds, Backoff Time: 5.0 seconds
Attempt 27 at 111.3 seconds, Backoff Time: 5.0 seconds
Attempt 28 at 116.3 seconds, Backoff Time: 5.0 seconds
Attempt 29 at 121.3 seconds, Backoff Time: 5.0 seconds

If  the operation was not succesful after the last attempt, it will fail, causing the agent to exit with an erro message.

This behaviour is specially useful on docker/k8s environments, since it kills the container and makes the orchestrator restart it, allowing the process to be restarted until a succesful connection is established

@schoren schoren requested review from mathnogueira, xoscar and danielbdias and removed request for mathnogueira, xoscar and danielbdias October 30, 2023 13:53
@@ -17,52 +17,52 @@ import (
var ErrOtlpServerStart = errors.New("OTLP server start error")

func NewClient(ctx context.Context, config config.Config, traceCache collector.TraceCache) (*client.Client, error) {
client, err := client.Connect(ctx, config.ServerURL,
c, err := client.Connect(ctx, config.ServerURL,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you renaming it to c? why not leave it as client?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was leftover from a previous attempt, but using client as the name overrides the client package import, and it's not a great practice

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be ok with c if it was an anchor variable. But as we use it for some config, I would like a more meaningful name like grpcClient

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we use a verbose name that specifies what this client is? Something like tracetestClient or controlPlaneClient?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated to controlPlaneClient, feels like the most meaningful to me

@schoren schoren merged commit e289cae into main Oct 30, 2023
37 checks passed
@schoren schoren deleted the agent-reconnect branch October 30, 2023 16:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants