Summary
There is an edge case that can prevent the agent from correctly stopping containers if the Docker daemon crashes in the middle of stopping a task. This PR addresses this scenario.
Implementation details
Current behavior:
2. A StopContainer request to Docker is sent.
4. The StopContainer request initiated in #2 is aborted because Docker crashes.
5. #4 is treated as a non-retryable error and the container is marked as STOPPED, but the container is still in Docker's internal state (i.e. not stopped as far as Docker is concerned).
6. The task is marked as STOPPED (because of #5).

In the end, the agent thinks that the task is STOPPED, but Docker sometimes doesn't actually finish cleaning up the container from its internal state, since it crashes before being able to. Aside from being a resource leak, this situation can create further complications, such as network port conflicts.
New behavior:
2. A StopContainer request to Docker is sent.
4. The StopContainer request initiated in #2 is aborted because Docker crashes.
5. #2 is retryable, retry up to 5 times.
6. The container is marked as STOPPED *only* if Docker is responsive (otherwise it stays RUNNING).
7. … STOPPED
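As a rough illustration of the new behavior listed above, here is a minimal Go sketch. It is not the agent's actual code: the DockerClient interface, fakeDocker, stopWithRetries, errConnAborted and the status constants are hypothetical names invented for this example. Retrying every StopContainer error immediately, with no backoff, is a simplifying assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// ContainerStatus is a simplified stand-in for the agent's container status.
type ContainerStatus string

const (
	StatusRunning ContainerStatus = "RUNNING"
	StatusStopped ContainerStatus = "STOPPED"
)

// maxStopAttempts mirrors the "up to 5 times" mentioned in the description.
const maxStopAttempts = 5

// errConnAborted imitates the kind of error seen when Docker crashes mid-request.
var errConnAborted = errors.New("EOF")

// DockerClient is a hypothetical, minimal interface used only by this sketch.
type DockerClient interface {
	StopContainer(id string) error
	SystemPing() error
}

// stopWithRetries retries StopContainer, then pings Docker if every attempt
// failed. The container is reported as STOPPED only if the stop succeeded or
// Docker still answers the ping; otherwise it is left as RUNNING.
func stopWithRetries(c DockerClient, id string) (ContainerStatus, error) {
	var lastErr error
	for attempt := 1; attempt <= maxStopAttempts; attempt++ {
		lastErr = c.StopContainer(id)
		if lastErr == nil {
			return StatusStopped, nil // happy path
		}
	}
	if pingErr := c.SystemPing(); pingErr != nil {
		// Docker is unresponsive, so the stop error cannot be trusted as terminal.
		return StatusRunning, fmt.Errorf("docker unresponsive after %d stop attempts: %w",
			maxStopAttempts, lastErr)
	}
	// Docker is responsive, so the error is treated as terminal, as before.
	return StatusStopped, lastErr
}

// fakeDocker is a test double that returns the same errors on every call.
type fakeDocker struct {
	stopErr   error
	pingErr   error
	stopCalls int
}

func (f *fakeDocker) StopContainer(id string) error {
	f.stopCalls++
	return f.stopErr
}

func (f *fakeDocker) SystemPing() error { return f.pingErr }
```

With a fakeDocker whose StopContainer and SystemPing both fail, stopWithRetries makes five attempts and leaves the container as RUNNING. If only StopContainer fails, the container ends up STOPPED after the same five attempts.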
Essentially, the flaw was to mark the container as STOPPED when the StopContainer request was aborted due to Docker crashing. When this happens, the error will be something like "connection reset by peer" or "EOF". The reason we are doing SystemPing is that the mentioned errors can happen legitimately for reasons other than Docker crashing. We only want to ignore StopContainer errors when Docker is unresponsive (even after all the retries were exhausted).
Testing
- The container is not marked as STOPPED after 5 max retries when Docker is unresponsive
- The container is marked as STOPPED after 5 max attempts, only if Docker is also responding (i.e. SystemPing is successful)
- The container is marked as STOPPED when there are no StopContainer errors (happy path)

The above tests should guarantee backwards compatibility and only alter the case when a task is stopping and Docker has crashed.
New tests cover the changes: yes
Description for the changelog
Fix containers being marked as STOPPED when Docker crashed while stopping a container
Licensing
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.