Update retry strategy for manifest pulls to help setups that depend on network pause container for repo calls #4289
Summary
For certain setups, all image repository calls are routed through the task's network pause container. This requires the pause container to have not only started but also completed its initialization and be ready to accept requests. However, Agent currently has no reliable way to detect that the pause container is ready, because it only waits for the container to be started. On these setups we have observed that the pause container is not ready when Agent makes the first image repository call (this was an image pull call before manifest pull was introduced, and is now a manifest pull call following its addition in #4177).
Although Agent's retries of the calls eventually succeed, they add latency to task provisioning. Since the manifest pull call is a relatively new, additional call that we introduced in #4177, task provisioning latency has slightly increased, exacerbating the existing latency on the above-mentioned setups. To alleviate this, this change updates the retry backoff settings for the manifest pull call so that the first few retries are quicker (starting from 10 ms) but the backoff increases faster (with a multiplier of 3), eventually capping at 5 seconds. The changes are summarized below. The overall backoff time before giving up is roughly the same as before.
Testing
We tested the changes on an environment where image repository calls are routed through the network pause container and observed a satisfactory reduction in task provisioning latency.
New tests cover the changes: no
Description for the changelog
Enhancement: Update the manifest pull retry strategy so that the first few retries are quicker, to help setups on which image repository calls depend on the network pause container being initialized
Additional Information
Does this PR include breaking model changes? If so, have you added transformation functions?
Does this PR include the addition of new environment variables in the README?
Licensing
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.