
regression in 1.14, increased amount of 'DockerTimeoutError: Could not transition to started' #682

Closed
mkleint opened this issue Jan 27, 2017 · 8 comments

mkleint commented Jan 27, 2017

We've upgraded to version 1.14 of the ECS agent to get the performance improvements from parallel docker image downloads, but we are seeing a significant increase in errors coming from our ECS cluster. We are using the official amzn-ami-2016.09.e-amazon-ecs-optimized AMI with small additions (overlay fs is used). We have large m4.10xlarge instances in the ASG/cluster, starting up to 40 tasks on a single instance at the same time.

The structure of our task definition could be important here, so let me describe it a bit.

  1. We have a 'bamboo-agent' container that is marked 'essential'; it starts up an Atlassian Bamboo remote agent, connects to Bamboo, builds a job and exits.
  2. The Bamboo Agent binaries + JRE are in a separate image, 'bamboo-agent-sidekick', that is very simple:
FROM tianon/true
ADD output/volume/ /buildeng/
ADD output/bamboo-agent-home /buildeng/bamboo-agent-home
VOLUME /buildeng/

The 'bamboo-agent' container defines volumes-from the 'bamboo-agent-sidekick'.
So the expectation is that the sidekick container is started first, immediately exits, and then the bamboo-agent container is started. That one needs the volume/binaries from the sidekick to start.
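
For reference, a minimal sketch of how a task definition with this shape could be registered via boto3 (container names match the ones above; the family, images, memory values and region are placeholders, not our actual definition):

```python
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")  # region is a placeholder

# Sketch only: the sidekick populates /buildeng/ and exits; the main container
# mounts that volume via volumesFrom and is the only essential container.
ecs.register_task_definition(
    family="bamboo-agent-example",  # placeholder family name
    containerDefinitions=[
        {
            "name": "bamboo-agent-sidekick",
            "image": "example/bamboo-agent-sidekick:latest",  # placeholder image
            "essential": False,  # exits immediately after providing the volume
            "memory": 128,
        },
        {
            "name": "bamboo-agent",
            "image": "example/bamboo-agent:latest",  # placeholder image
            "essential": True,  # the task stops when this container exits
            "memory": 4096,
            "volumesFrom": [
                {"sourceContainer": "bamboo-agent-sidekick", "readOnly": False}
            ],
        },
    ],
)
```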

ERROR1: with 1.14 I'm seeing weird behaviour in the ECS console where the bamboo-agent-sidekick container is listed in the RUNNING state (despite showing an exit code of 0).
[screenshot: 2017-01-27_1401]

ERROR2: likely related to ERROR1, we get the bamboo-agent container exiting with this error:
Essential container in task exited:bamboo-agent[DockerTimeoutError: Could not transition to started; timed out after waiting 1m30s]
However, the container actually was running (I can see the awslogs output of the container), and this only seems to happen at the end of the bamboo-agent container's life.

[screenshots: 2017-01-27_1330, 2017-01-27_1329, 2017-01-27_1328]


mkleint commented Jan 27, 2017

I've tried to verify that the behaviour changed since 1.13.1, but it seems that the case of the exited container being marked as RUNNING in the UI (ERROR1) is present there as well (our prod account uses 1.13.1).
However, we are not seeing ERROR2 on the prod account (with the older version), while we see it frequently in the staging account.


aaithal commented Jan 27, 2017

@mkleint Thanks for reporting this. Can you please provide some additional information to help us debug this issue?

  1. Docker Storage Setup: By default, we set up the devicemapper storage driver with ext4. When you say that you've configured it with overlayfs, are you overriding /etc/sysconfig/docker-storage-setup using a cloud boot hook to do that? We chose ext4 over overlay because overlay does not behave like a POSIX-compliant filesystem, which breaks things like yum. Can you provide more details about the use-case that prevents you from using the ext4 setup?
  2. EBS Volumes and IOPS Burst Balance: Have you configured the secondary volume to be of a larger size or with higher IOPS? By default, the ECS-optimized AMI uses a 22 GiB gp2 volume, which has a burst performance threshold of 3000 IOPS and a baseline performance of 100 IOPS. We've seen cases where IO-intensive workloads end up consuming this, resulting in slowness with docker operations. Can you please check your IOPS burst balance when you run into the error scenarios that you've mentioned and ensure that you have enough credits to not cause any slowness (see the sketch after this list)? There's a blog post that explains how to do this, and you can find more documentation regarding gp2 volumes here.
  3. ECS Agent and Docker logs: Can you please share the Agent and the Docker logs when you start seeing this error? You can send them to aithal at amazon dot com. You can also use the ECS Logs Collector for this.
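
For item 2, a hedged sketch of checking the gp2 volume's BurstBalance via CloudWatch with boto3 (the region and volume ID below are placeholders):

```python
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # region is a placeholder

# BurstBalance is the percentage of gp2 burst credits remaining on the volume.
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/EBS",
    MetricName="BurstBalance",
    Dimensions=[{"Name": "VolumeId", "Value": "vol-0123456789abcdef0"}],  # placeholder volume ID
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,  # 5-minute datapoints
    Statistics=["Average"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], f"{point['Average']:.1f}%")
```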

Thanks,
Anirudh


mkleint commented Jan 29, 2017

We keep the main disk on the instance intact but mount another, significantly bigger volume at /var/lib/docker. Since we are building software in the docker tasks, the containers need a fair amount of disk space. We might be over-provisioning a bit there, but we expect each task to have about 100 GB at its disposal, mainly for backward compatibility with the EC2 bamboo agents we have.

We've seen a performance improvement with overlay in the area of volumes-from consumption (we have a roughly 400 MB sidekick image that only contains data used by the main container).

bootcmd:
 # switch Docker to the overlay storage driver
 - modprobe overlay
 - [ cloud-init-per, once, enable_overlay, sh, -xc, "echo 'DOCKER_STORAGE_OPTIONS=\"--storage-driver overlay\"' > /etc/sysconfig/docker-storage" ]
 # turn the secondary EBS volume (/dev/xvdcz) into a single ext4 logical volume and mount it at /var/lib/docker
 - [ cloud-init-per, once, docker_fs_1, pvcreate, -y, /dev/xvdcz ]
 - [ cloud-init-per, once, docker_fs_2, vgcreate, docker, /dev/xvdcz ]
 - [ cloud-init-per, once, docker_fs_3, lvcreate, -n, docker-pool, -l, 100%FREE, docker]
 - [ cloud-init-per, once, docker_fs_4, mkfs.ext4, /dev/mapper/docker-docker--pool ]
 - [ cloud-init-per, once, docker_fs_5, sh, -xc, "echo '/dev/mapper/docker-docker--pool /var/lib/docker ext4 defaults 0 0' >> /etc/fstab" ]
 - [ cloud-init-per, once, docker_fs_6, mount, /var/lib/docker ]

[screenshots: 2017-01-30_0929, 2017-01-30_0928, 2017-01-30_0924, 2017-01-30_0944, 2017-01-30_0943]


mkleint commented Jan 29, 2017

I've sent you the logs by email.


aaithal commented Jan 30, 2017

@mkleint Thank you for sending those logs. I looked at your container instance and it certainly has enough IO burst balance to do its job, so we can rule that out. Also, when you say you upgraded to the 1.14.0 Agent, what were you running before? I ask because you might also have upgraded the version of your Docker daemon, which might have resulted in different performance characteristics.

As for the errors you're seeing, the second error, where the container starts even though it's marked as STOPPED by ECS, also points to performance issues with the docker setup. We wrap all calls to the Docker daemon with timeouts. The StartContainer call to Docker times out at 90 seconds. If the call does time out, we consider that an error and mark the container as STOPPED. You can see this in the log as well: even if the container eventually gets started by the Docker daemon, it's killed because its expected state is STOPPED.

Got timeout from StartContainer:

2017-01-29T21:48:02Z [INFO] Error transitioning container module="TaskEngine" task="xxx:yy arn:aws:ecs:us-east-1:xxxxxx:task/8e9f685e-xxxx, Status: (CREATED->RUNNING) Containers: [bamboo-agent-sidekick (RUNNING->RUNNING),bamboo-agent (RUNNING->RUNNING),bamboo-agent-metadata (CREATED->RUNNING),~internal~ecs-emptyvolume-source (STOPPED->STOPPED),]" container="bamboo-agent-metadata(docker.atlassian.io/buildeng/ecs-docker-metadata) (CREATED->RUNNING)" state="RUNNING" error="Could not transition to started; timed out after waiting 1m30s"

Container was started by Docker and ECS Agent stops it:

2017-01-29T21:48:32Z [WARN] Container that we thought was stopped came back; re-stopping it once module="TaskEngine" task="xxx:yy arn:aws:ecs:us-east-1:xxxxxx:task/8e9f685e-xxxx, Status: (RUNNING->RUNNING) Containers: [bamboo-agent-sidekick (STOPPED->RUNNING),bamboo-agent (RUNNING->RUNNING),bamboo-agent-metadata (STOPPED->STOPPED),~internal~ecs-emptyvolume-source (STOPPED->STOPPED),]"

Container was killed:

2017-01-29T21:48:32Z [DEBUG] Got event from docker daemon module="TaskEngine" event="&{Action:kill Type:container Actor:{ID:0f3c83bcb0c60f161c1438ae219e7c0fca6c2f14f9df2d71620cf4ae0f78ee33 Attributes:map[signal:15 com.amazonaws.ecs.cluster:staging_bamboo com.amazonaws.ecs.container-name:bamboo-agent-metadata com.amazonaws.ecs.task-arn:arn:aws:ecs:us-east-1:xxxxxx:task/8e9f685e-xxxx com.amazonaws.ecs.task-definition-family:xxx com.amazonaws.ecs.task-definition-version:yy image:docker.atlassian.io/buildeng/ecs-docker-metadata name:ecs-pxxx]} Status:kill ID:0f3c83bcb0c60f161c1438ae219e7c0fca6c2f14f9df2d71620cf4ae0f78ee33 From:docker.atlassian.io/buildeng/ecs-docker-metadata Time:1485726512 TimeNano:1485726512087629051}"
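
Purely to illustrate the timeout-then-mark-STOPPED behaviour described above (the agent itself is written in Go; this is not its actual code), a minimal Python sketch with made-up names might look like this:

```python
import concurrent.futures

START_CONTAINER_TIMEOUT = 90  # seconds, matching the 1m30s in the error above


def start_with_timeout(start_container, container_id):
    """Treat a StartContainer call that exceeds the timeout as a failure.

    Docker may still finish the start later; the desired state is then
    STOPPED, so a container that comes back gets stopped again, which is
    the sequence visible in the logs above.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(start_container, container_id)
    try:
        result = future.result(timeout=START_CONTAINER_TIMEOUT)
        return "RUNNING", result
    except concurrent.futures.TimeoutError:
        return "STOPPED", None
    finally:
        pool.shutdown(wait=False)  # don't block on the still-running call
```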

Can you tell us what your previous setup was? The AMI ID, Agent version and Docker version would be a good start and would help us determine whether updating the Docker version could be what's leading to this. Also, is this happening across multiple instances in your cluster, or just on this one instance?

To help debug the first error, can you please provide the Task ARN for which you were seeing it? Again, if you're uncomfortable with posting that on GitHub, you can send it to me via email.

Thanks,
Anirudh


mkleint commented Jan 30, 2017

Thanks for looking into this. Our prod account, which doesn't exhibit the problem, is using amzn-ami-2016.09.d-amazon-ecs-optimized, which should be running docker 1.12.6 + ecs-agent 1.13.1.

We've been getting DockerTimeout errors on ECS task startup ever since we started with ECS, but they've been rare enough for us to just silently retry and ignore them. We do keep monitoring of those; I'll send that to you privately again.

Even when they don't time out, the tasks are sometimes stuck in PENDING for a long time. We have a lambda that is sent the task's stopped state transition from CloudWatch; it generates a Datadog count metric with the number of seconds between started and running. It seems to be tens of minutes in some cases.
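
For context, the lambda does roughly the following; this is a simplified sketch, not our exact code, and it assumes the ECS task state change event carries createdAt/startedAt timestamps (my reading of "between started and running" as the time spent in PENDING):

```python
from datetime import datetime


def _parse(ts):
    # ECS task state change events use ISO 8601 timestamps, e.g. "2017-01-30T09:28:00.123Z";
    # drop the fractional seconds and the trailing Z before parsing.
    return datetime.strptime(ts.split(".")[0].rstrip("Z"), "%Y-%m-%dT%H:%M:%S")


def handler(event, context):
    """Rough sketch of the lambda described above (not our exact code)."""
    detail = event["detail"]
    if detail.get("lastStatus") != "STOPPED":
        return

    # Seconds the task spent before reaching RUNNING, i.e. its time in PENDING.
    pending_seconds = (
        _parse(detail["startedAt"]) - _parse(detail["createdAt"])
    ).total_seconds()

    # The Datadog submission is elided in this sketch; we just emit the value.
    print({"metric": "ecs.task.pending_seconds", "value": pending_seconds,
           "taskArn": detail["taskArn"]})
```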

[screenshots: 2017-01-31_0906, 2017-01-31_0905]

@samuelkarp

@mkleint Thank you for sending the logs. I took a look at what you sent and found two interesting things:

  1. Not every container is timing out, but some of the containers that are not timing out are still pretty close to the 90s timeout.
  2. Events from the Docker event stream are delayed by 30s-120s.

For the first one, I think we'll get some mitigation by bumping the timeout up to about 3m. I'm not sure why the event stream would be delayed though; that's still worth looking into and might be impacting the time you see tasks in PENDING.
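
A quick way to check whether the Docker event stream on an instance is lagging; a sketch using the docker Python SDK (not what the agent itself uses, and the 5-second threshold is arbitrary):

```python
import time

import docker  # pip install docker

client = docker.from_env()

# Compare each event's daemon-side timestamp to the local wall clock;
# a consistently large gap suggests the event stream itself is delayed.
for event in client.events(decode=True):
    lag = time.time() - event["time"]  # "time" is a Unix timestamp set by the daemon
    if lag > 5:  # arbitrary threshold for this sketch
        print(f"{event.get('Type')}/{event.get('Action')} delayed by {lag:.0f}s")
```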

@samuelkarp

Released
