Stop container bug fix #2885

angelcar · 2021-05-26T21:27:20Z

Summary

There is an edge that can't prevent to correctly stop containers if the docker daemon crashes in the middle of stopping a task. This PR addresses this scenario.

Implementation details

Current behavior:

Stop request is received from ACS
Agent begins the task stop process, sending a StopContainer request to Docker
Docker panics while trying to stop the container due to a bug in fluentd log driver (to be fixed upstream).
The ongoing StopContainer request initiated in #2 is aborted because Docker crashes
The agent interprets the error in #4 as a non-retryable error and marks the container as STOPPED, but the container is still in Docker's internal state (i.e. not stopped as far as Docker concerns)
ECS Agent restarts because Docker has crashed (notice ECS Agent doesn't stop at the same time than Docker, but a few seconds afterwards)
Upon ECS Agent restart, the task is marked as STOPPED (because #5)

In the end, the agent thinks that the task is STOPPED, but Docker sometimes doesn't actually finish cleaning up the container from its internal state, since it crashes before being able to.
Aside of being a resource leak, this situation can create further complications, such as network port conflicts.

New Behavior:

Stop request is received from ACS
Agent begins the task stop process, sending a StopContainer request to Docker
Docker panics while trying to stop the container due to a bug in fluentd log driver (to be fixed upstream)
The ongoing StopContainer request initiated in #2 is aborted because Docker crashes
If the error from #2 is retryable, retry up to 5 times
If all retries are exhausted, mark the container as STOPPED *only* if Docker is responsive
ECS Agent restarts because Docker has crashed (notice ECS Agent doesn't stop at the same time than Docker, but a few seconds afterwards)
Upon ECS Agent restart, the task engine syncs its state with Docker and continues the stop process (because the container's known status is still RUNNING)
ECS Agent picks up where it left off and the container is marked as STOPPED

Essentially, the flaw was to mark the container as STOPPED when the StopContainer was aborted due to Docker crashing. When this happens, the error will be something like connection rest by peer, or EOF. The reason we are doing SystemPing is that the mentioned errors can happen legitimately for reasons other than Docker crashing. We only want to ignore StopContainer errors when Docker is unresponsive (even after all the retries were exhausted).

Testing

Modified/Added relevant UTs
Tested on my dev account that:
- Agent DOESN'T mark container as STOPPED after 5 max retries when Docker us unresponsive
- Agent marks the container as STOPPED after 5 max attempts, only if Docker is also responding (i.e. SystemPing is successful)
- Container transitions to STOPPED when there are no StopContainer errors (happy path)

The above tests should guarantee backwards compatibility and only alter the case when task is stopping and Docker has crashed.

New tests cover the changes: yes

Description for the changelog

Fix bug that could incorrectly mark a task as STOPPED when Docker crashed while stopping a container

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Velazquez added 2 commits May 26, 2021 12:55

Add systemPing to docker client

9b77269

Make CannotInspectContainerError when stopping container

d30da28

angelcar force-pushed the angelcar_stop_container_buf-fix branch 4 times, most recently from ba817f4 to b7259b5 Compare May 26, 2021 22:25

angelcar added the bot/test label May 26, 2021

amazon-ecs-bot removed the bot/test label May 26, 2021

angelcar force-pushed the angelcar_stop_container_buf-fix branch from b7259b5 to b52d309 Compare May 27, 2021 22:57

angelcar changed the title ~~[WIP] Stop container buf fix~~ Stop container bug fix May 27, 2021

angelcar force-pushed the angelcar_stop_container_buf-fix branch from b52d309 to 4fa504d Compare May 27, 2021 23:51

angelcar marked this pull request as ready for review May 27, 2021 23:52

Retry StopContainer for a limited number of times

952f401

angelcar force-pushed the angelcar_stop_container_buf-fix branch from 4fa504d to 952f401 Compare May 27, 2021 23:59

angelcar added the bot/test label May 28, 2021

amazon-ecs-bot removed the bot/test label May 28, 2021

mssrivas approved these changes Jun 8, 2021

View reviewed changes

fenxiong approved these changes Jun 8, 2021

View reviewed changes

fierlion approved these changes Jun 8, 2021

View reviewed changes

sharanyad approved these changes Jun 8, 2021

View reviewed changes

angelcar merged commit 2d0d8aa into aws:dev Jun 8, 2021

angelcar mentioned this pull request Jun 9, 2021

V1.53.0 stage #2894

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Stop container bug fix #2885

Stop container bug fix #2885

angelcar commented May 26, 2021 •

edited

Loading

Stop container bug fix #2885

Stop container bug fix #2885

Conversation

angelcar commented May 26, 2021 • edited Loading

Summary

Implementation details

Testing

Description for the changelog

Licensing

angelcar commented May 26, 2021 •

edited

Loading