regression in 1.14, increased amount of 'DockerTimeoutError: Could not transition to started' #682
Comments
I've tried to verify that the behaviour changed since 1.13.1, but it seems that the exited container being marked as RUNNING in the UI (ERROR1) occurs there as well (our prod account uses 1.13.1).
@mkleint Thanks for reporting this. Can you please provide some additional information to help us debug this issue?
Thanks,
We keep the main disk on the instance intact but mount another, significantly bigger one at /var/lib/docker. Since we build software in the Docker tasks, the containers need a fair amount of disk space. We might be over-provisioning a bit, but we expect each task to have about 100G at its disposal, mainly for backward compatibility with the EC2 Bamboo agents we have. We've seen a performance improvement with overlay in the area of volumes-from consumption (we have a sidekick image of about 400M that only contains data used by the main container).
I've sent you the logs by email.
mklient@ Thank you for sending those logs. I looked at your container instance As far errors that you're seeing, the 2nd error, where the Got timeout from
Container was started by Docker and ECS Agent stops it:
Container was killed:
Can you tell us what your previous setup was? AMI ID, Agent Version and To help debug the first error, can you please provide the Task ARN for which you Thanks,
Thanks for looking into this. Our prod account, which doesn't exhibit the problem, uses amzn-ami-2016.09.d-amazon-ecs-optimized, which should be using Docker 1.12.6 + ecs-agent 1.13.1. We've been getting DockerTimeout errors on ECS task startup ever since we started with ECS; they've been rare enough for us to just silently retry and ignore them. We do monitor them, and I'll send that data to you privately again. Even when tasks don't time out, they are sometimes stuck in PENDING for a long time. We have a Lambda that receives the task's stopped state transition from CloudWatch and generates a Datadog count metric with the number of seconds between started and running. It seems to be tens of minutes in some cases.
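For readers who want to reproduce that measurement, here is a minimal Python sketch of the delay calculation such a Lambda might perform. It is not the code used in this setup: the function names are hypothetical, and it assumes the task state change event carries ISO 8601 `createdAt` and `startedAt` timestamps under `detail`. Emitting the metric to Datadog is left out.

```python
from datetime import datetime


def parse_ts(value):
    """Parse an ISO 8601 timestamp such as '2017-01-27T13:00:00Z'."""
    # Assumption: timestamps end in 'Z' (UTC); fromisoformat needs an explicit offset.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def startup_delay_seconds(event):
    """Return seconds between task creation and start, or None if unknown.

    Assumption: the event stores both timestamps under event['detail'].
    """
    detail = event.get("detail", {})
    created = detail.get("createdAt")
    started = detail.get("startedAt")
    if not created or not started:
        # The task never reached RUNNING, or the fields are missing.
        return None
    return (parse_ts(started) - parse_ts(created)).total_seconds()


# Hypothetical event, trimmed to the fields used above.
sample_event = {
    "detail": {
        "lastStatus": "STOPPED",
        "createdAt": "2017-01-27T13:00:00Z",
        "startedAt": "2017-01-27T13:12:30Z",
    }
}
print(startup_delay_seconds(sample_event))
```

Returning `None` for tasks that never started keeps timeouts out of the delay metric, so they can be counted separately.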
@mkleint Thank you for sending the logs. I took a look at what you sent and found two things that are interesting:
For the first one, I think we'll get some mitigation by bumping the timeout up to about 3m. I'm not sure why the event stream would be delayed though; that's still worth looking into and might be impacting the time you see tasks in PENDING.
We've upgraded to version 1.14 of the ECS agent to get the performance improvements from parallel Docker image downloads, but we are seeing a significant increase in errors coming from our ECS cluster. We are using the official amzn-ami-2016.09.e-amazon-ecs-optimized with small additions (overlay fs used). We have large m4.10xlarge instances in the ASG/cluster, starting up to 40 tasks on a single instance at the same time.
The structure of our task definition could be important here, so let me describe it a bit.
The 'bamboo-agent' container defines volumes-from the 'bamboo-agent-sidekick'.
So the expectation is that the sidekick container is started first and exits immediately, and then the bamboo-agent container is started. That one needs the volume/binaries from the sidekick to start.
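As a rough illustration of that layout (not the actual task definition), the relationship can be sketched as a Python dict in the shape of an ECS task definition. The family name, image names, and `essential` values are assumptions made for the example; only the two container names and the volumes-from link come from the description above.

```python
# Hypothetical task definition: family and image names are placeholders.
task_definition = {
    "family": "bamboo-agent-example",
    "containerDefinitions": [
        {
            "name": "bamboo-agent-sidekick",
            "image": "example/bamboo-agent-sidekick",
            # Assumption: non-essential, since it is expected to exit right away.
            "essential": False,
        },
        {
            "name": "bamboo-agent",
            "image": "example/bamboo-agent",
            # The error message in this issue calls bamboo-agent essential.
            "essential": True,
            # Mounts the volumes exposed by the sidekick container.
            "volumesFrom": [{"sourceContainer": "bamboo-agent-sidekick"}],
        },
    ],
}


def volume_sources(task_def):
    """Map each container name to the containers it takes volumes from."""
    return {
        container["name"]: [
            entry["sourceContainer"] for entry in container.get("volumesFrom", [])
        ]
        for container in task_def["containerDefinitions"]
    }


print(volume_sources(task_definition))
```

The helper only makes the dependency explicit: bamboo-agent cannot start until the sidekick's volumes exist.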
ERROR1: with 1.14 I'm seeing weird behaviour in the ECS console, where the bamboo-agent-sidekick container is listed in RUNNING state (despite showing an exit code of 0).
![2017-01-27_1401](https://cloud.githubusercontent.com/assets/178549/22359200/22d172b0-e499-11e6-9791-6daaddc992f4.png)
ERROR2: likely related to ERROR1, the bamboo-agent container exits with this error:
Essential container in task exited:bamboo-agent[DockerTimeoutError: Could not transition to started; timed out after waiting 1m30s]
However, the container actually was running (I can see the container's awslogs output), and the error only seems to happen at the end of the bamboo-agent container's life.