Update retry strategy for manifest pulls to help setups that depend on network pause container for repo calls #4289

Merged 6 commits on Aug 21, 2024

@amogh09 amogh09 commented Aug 16, 2024

Summary

For certain setups, all image repository calls are routed through the task's network pause container. This requires the pause container not only to have started but also to have completed its initialization and be ready to accept requests. However, Agent currently has no reliable way to detect that the pause container is ready, because it only checks that the container has started. On these setups we have observed that the pause container is not ready when Agent makes its first image repository call (an image pull before manifest pull was introduced, and a manifest pull since its addition in #4177).

Although Agent's retries eventually succeed, they add latency to task provisioning. Because the manifest pull is a relatively new, additional call introduced in #4177, it has slightly increased task provisioning latency, exacerbating the existing latency on the setups described above. To alleviate this, this change updates the retry backoff settings for the manifest pull call so that the first few retries are quicker (starting from 10 ms) but the backoff grows faster (with a multiplier of 3), eventually capping at 5 seconds. The changes are summarized below. The overall backoff time before giving up is roughly the same as before.

| Setting | Before | After |
| --- | --- | --- |
| Minimum backoff | 1.1 seconds | 10 milliseconds |
| Multiplier | 2 | 3 |
| Maximum backoff | 5 seconds | 5 seconds |
| Retry attempts | 5 | 9 |
| Total backoff | 0 + 1.1 + 2.2 + 4.4 + 5 = 12.7 seconds | 0 + 10 + 30 + 90 + 270 + 810 + 2430 + 5000 + 5000 ms = 13.6 seconds |
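
The totals in the table can be reproduced with a short calculation. The following is a minimal Python sketch for illustration only, not Agent's actual code: the function name `backoff_schedule` and its parameters are hypothetical, and it ignores any jitter the real retry logic may apply. It assumes the first attempt happens immediately and that each later wait is the previous one times the multiplier, capped at the maximum.

```python
def backoff_schedule(min_ms, max_ms, multiplier, attempts):
    """Return the wait in milliseconds before each attempt.

    Hypothetical helper: the first attempt waits 0 ms, and every
    later wait grows by `multiplier` until it is capped at `max_ms`.
    """
    waits = [0]
    delay = min_ms
    for _ in range(attempts - 1):
        waits.append(min(delay, max_ms))
        delay *= multiplier
    return waits


# Settings taken from the table: (minimum, maximum, multiplier, attempts).
before = backoff_schedule(1100, 5000, 2, 5)
after = backoff_schedule(10, 5000, 3, 9)

print(before, sum(before) / 1000)  # [0, 1100, 2200, 4400, 5000] 12.7
print(after, sum(after) / 1000)    # [0, 10, 30, 90, 270, 810, 2430, 5000, 5000] 13.64
```

Under these assumptions, the new schedule spends its first six retries within about 3.6 seconds in total, where the old schedule needed 1.1 seconds for its first retry alone, while the worst-case total stays close to the previous one.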

Testing

We tested the changes on an environment where image repository calls are routed through the network pause container and observed a satisfactory reduction in task provisioning latency.

New tests cover the changes: no

Description for the changelog

Enhancement: Update the manifest pull retry strategy so that the first few retries are quicker, to help setups on which image repository calls depend on the network pause container being initialized

Additional Information

Does this PR include breaking model changes? If so, have you added transformation functions?

Does this PR include the addition of new environment variables in the README?

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@amogh09 amogh09 changed the base branch from master to dev August 16, 2024 18:58
@amogh09 amogh09 force-pushed the manifest-retry-update branch from 73616f4 to 6e4ecd9 Compare August 16, 2024 21:14
@amogh09 amogh09 marked this pull request as ready for review August 16, 2024 21:14
@amogh09 amogh09 requested a review from a team as a code owner August 16, 2024 21:14
@amogh09 amogh09 force-pushed the manifest-retry-update branch from 6e4ecd9 to 5e6f73d Compare August 19, 2024 20:33
@amogh09 amogh09 force-pushed the manifest-retry-update branch from 5e6f73d to 037f184 Compare August 20, 2024 18:57
@amogh09 amogh09 merged commit 713ccbd into aws:dev Aug 21, 2024
40 checks passed
@danehlim danehlim mentioned this pull request Aug 26, 2024