Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop container bug fix #2885

Merged
merged 3 commits into from
Jun 8, 2021
Merged

Conversation

angelcar
Copy link
Contributor

@angelcar angelcar commented May 26, 2021

Summary

There is an edge that can't prevent to correctly stop containers if the docker daemon crashes in the middle of stopping a task. This PR addresses this scenario.

Implementation details

Current behavior:

  1. Stop request is received from ACS
  2. Agent begins the task stop process, sending a StopContainer request to Docker
  3. Docker panics while trying to stop the container due to a bug in fluentd log driver (to be fixed upstream).
  4. The ongoing StopContainer request initiated in #2 is aborted because Docker crashes
  5. The agent interprets the error in #4 as a non-retryable error and marks the container as STOPPED, but the container is still in Docker's internal state (i.e. not stopped as far as Docker concerns)
  6. ECS Agent restarts because Docker has crashed (notice ECS Agent doesn't stop at the same time than Docker, but a few seconds afterwards)
  7. Upon ECS Agent restart, the task is marked as STOPPED (because #5)

In the end, the agent thinks that the task is STOPPED, but Docker sometimes doesn't actually finish cleaning up the container from its internal state, since it crashes before being able to.
Aside of being a resource leak, this situation can create further complications, such as network port conflicts.

New Behavior:

  1. Stop request is received from ACS
  2. Agent begins the task stop process, sending a StopContainer request to Docker
  3. Docker panics while trying to stop the container due to a bug in fluentd log driver (to be fixed upstream)
  4. The ongoing StopContainer request initiated in #2 is aborted because Docker crashes
  5. If the error from #2 is retryable, retry up to 5 times
  6. If all retries are exhausted, mark the container as STOPPED *only* if Docker is responsive
  7. ECS Agent restarts because Docker has crashed (notice ECS Agent doesn't stop at the same time than Docker, but a few seconds afterwards)
  8. Upon ECS Agent restart, the task engine syncs its state with Docker and continues the stop process (because the container's known status is still RUNNING)
  9. ECS Agent picks up where it left off and the container is marked as STOPPED

Essentially, the flaw was to mark the container as STOPPED when the StopContainer was aborted due to Docker crashing. When this happens, the error will be something like connection rest by peer, or EOF. The reason we are doing SystemPing is that the mentioned errors can happen legitimately for reasons other than Docker crashing. We only want to ignore StopContainer errors when Docker is unresponsive (even after all the retries were exhausted).

Testing

  • Modified/Added relevant UTs
  • Tested on my dev account that:
    • Agent DOESN'T mark container as STOPPED after 5 max retries when Docker us unresponsive
    • Agent marks the container as STOPPED after 5 max attempts, only if Docker is also responding (i.e. SystemPing is successful)
    • Container transitions to STOPPED when there are no StopContainer errors (happy path)

The above tests should guarantee backwards compatibility and only alter the case when task is stopping and Docker has crashed.

New tests cover the changes: yes

Description for the changelog

  • Fix bug that could incorrectly mark a task as STOPPED when Docker crashed while stopping a container

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@angelcar angelcar force-pushed the angelcar_stop_container_buf-fix branch 4 times, most recently from ba817f4 to b7259b5 Compare May 26, 2021 22:25
@angelcar angelcar force-pushed the angelcar_stop_container_buf-fix branch from b7259b5 to b52d309 Compare May 27, 2021 22:57
@angelcar angelcar changed the title [WIP] Stop container buf fix Stop container bug fix May 27, 2021
@angelcar angelcar force-pushed the angelcar_stop_container_buf-fix branch from b52d309 to 4fa504d Compare May 27, 2021 23:51
@angelcar angelcar marked this pull request as ready for review May 27, 2021 23:52
@angelcar angelcar force-pushed the angelcar_stop_container_buf-fix branch from 4fa504d to 952f401 Compare May 27, 2021 23:59
@angelcar angelcar merged commit 2d0d8aa into aws:dev Jun 8, 2021
@angelcar angelcar mentioned this pull request Jun 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants