Skip to content
This repository has been archived by the owner on Oct 16, 2020. It is now read-only.

docker ps hangs #1654

Closed
crawford opened this issue Nov 9, 2016 · 42 comments
Closed

docker ps hangs #1654

crawford opened this issue Nov 9, 2016 · 42 comments

Comments

@crawford
Copy link
Contributor

crawford commented Nov 9, 2016

Issue Report

Bug

CoreOS Version

1185.3.0

Environment

AWS and GCE confirmed. Likely all environments.

Expected Behavior

docker ps properly returns the list of running containers.

Actual Behavior

docker ps eventually hangs (not sure which conditions cause it yet).

Reproduction Steps

In @matti's case:

$ docker ps -n 17
# works

$ docker ps -n 18
# hangs

Other Information

strace:

$ strace -s  128 docker ps -n 18
clock_gettime(CLOCK_MONOTONIC, {484793, 504192514}) = 0
futex(0xc820064108, FUTEX_WAKE, 1)      = 1
read(5, 0xc820400000, 4096)             = -1 EAGAIN (Resource temporarily unavailable)
write(5, "GET /v1.23/containers/json?limit=18 HTTP/1.1\r\nHost: \r\nUser-Agent: Docker-Client/1.11.2 (linux)\r\n\r\n", 98) = 98
futex(0xc820064108, FUTEX_WAKE, 1)      = 1
futex(0x22509c8, FUTEX_WAIT, 0, NULL
@crawford
Copy link
Contributor Author

crawford commented Nov 9, 2016

@chancez pointed out that moby/moby#13885 and moby/libnetwork#1507 might be related.

@matti
Copy link

matti commented Nov 10, 2016

We run 5 machines and get this hang ~weekly (for the last 3 months). Is there anything we could run on the machines when the hang happens again?

@matti
Copy link

matti commented Nov 10, 2016

machines are 8CPU/30gb.

During the last hang we were able to start new containers while docker ps was hanging.

@Jwpe
Copy link

Jwpe commented Nov 10, 2016

I am also experiencing this issue with the following setup:

CoreOS Version: 1185.3.0
Docker Server Version: 1.11.2

Not only does docker ps hang, but docker-compose commands fail with the following error message:

Nov 10 10:59:20 <aws_dns>.eu-west-1.compute.internal docker-compose[5872]: An HTTP request took too long to complete. Retry with --verbose to obtain debug information.
Nov 10 10:59:20 <aws_dns>.eu-west-1.compute.internal docker-compose[5872]: If you encounter this issue regularly because of slow network conditions, consider setting COMPOSE_HTTP_TIMEOUT to a higher value (current value: 60).

In the Docker service logs these are the only errors I can see:

Nov 10 09:38:36 <aws_dns>.eu-west-1.compute.internal dockerd[972]: time="2016-11-10T09:38:36.701843769Z" level=error msg="attach: stdout: write unix /var/run/docker.sock->@: write: broken pipe"
Nov 10 09:38:36 <aws_dns>.eu-west-1.compute.internal dockerd[972]: time="2016-11-10T09:38:36.704819504Z" level=error msg="attach: stderr: write unix /var/run/docker.sock->@: write: broken pipe"

I am able to temporarily resolve the issue by manually restarting docker.service.

@matti
Copy link

matti commented Nov 10, 2016

@Jwpe when this happens again, try docker ps -n 1, docker ps -n 2 ... until it hangs just to verify that the symptoms are the same.

docker-compose uses the same HTTP api so it will hang as well. So just to make clear the title for this bug is somewhat misleading -- docker ps is just good command to detect this bug because it iterates through all the containers in the API.

@Jwpe
Copy link

Jwpe commented Nov 10, 2016

@matti thanks for the tip, will do!

@Jwpe
Copy link

Jwpe commented Nov 10, 2016

@matti for me docker ps -n 3 works correctly, but docker ps -n 4 hangs...

@matti
Copy link

matti commented Nov 10, 2016

@Jwpe (or anyone else reading this) -- please send SIGUSR1 to the docker daemon process and paste in the logs.

@Jwpe
Copy link

Jwpe commented Nov 10, 2016

@matti I get a huge goroutine stack dump on sending that signal:

ps -C docker
  PID TTY          TIME CMD
20791 ?        00:00:01 docker
sudo kill -s USR1 20791
journalctl -u docker -n 100
Nov 10 12:32:08 <aws_dns>.eu-west-1.compute.internal dockerd[20791]: time="2016-11-10T12:32:08.436895136Z" level=info msg="=== BEGIN goroutine stack dump ===\ngoroutine 20 [running]:\n
...

@matti
Copy link

matti commented Nov 10, 2016

@Jwpe yes... please paste the rest of it...

@juhazi over here formatted our stacktrace here: https://gist.github.com/juhazi/fbf22602561b719e9480f8be4f8a4740

@matti
Copy link

matti commented Nov 10, 2016

and for the record: this time all daemon API's were jammed:

getpeername(5, {sa_family=AF_LOCAL, sun_path="/var/run/docker.sock"}, [23]) = 0
futex(0xc820024d08, FUTEX_WAKE, 1)      = 1
read(5, 0xc8203d6000, 4096)             = -1 EAGAIN (Resource temporarily unavailable)
write(5, "GET /v1.23/containers/json HTTP/1.1\r\nHost: \r\nUser-Agent: Docker-Client/1.11.2 (linux)\r\n\r\n", 89) = 89
futex(0xc820024d08, FUTEX_WAKE, 1)      = 1
futex(0x22509c8, FUTEX_WAIT, 0, NULL

@basvdlei
Copy link

basvdlei commented Nov 10, 2016

We are also seeing this behavior on some 1185.3.0 nodes (on VMWare ESXi) where we are using systemd units with the docker kill, docker rm, docker pull ExecStartPre's as described here.

Here are some goroutine stack dump's of 3 different machines that had the issue that docker ps didn't respond unless given a lower -n value.

In our case the units also created the following error messages:

Handler for POST /v1.23/containers/create returned error: Conflict. The name \"/xxxxxxxxxxx\" is already in use by container 0bf826dafe929711c98d65fa812ed75c4086dc3075e1c148fd1ebfd5c28b0544. You have to remove (or rename) that container to be able to reuse that name."
Handler for POST /v1.23/containers/xxxxxxxxxxx/stop returned error: No such container: xxxxxxxxxxx"
Handler for POST /v1.23/containers/xxxxxxxxxxx/kill returned error: Cannot kill container xxxxxxxxxxx: No such container: xxxxxxxxxxx"
Handler for DELETE /v1.23/containers/xxxxxxxxxxx returned error: No such container: xxxxxxxxxxx"

Even though the stop/kill/rm said the container was not there. Which looks a lot like moby/moby#21198 which might be related

Just to be complete, we are also running cadvisor and a custom container that calls list containers about every 5min.

@matti
Copy link

matti commented Nov 10, 2016

We are also running cadvisor (as the cluster runs http://www.kontena.io)

@tom-pryor
Copy link

tom-pryor commented Nov 11, 2016

Having same issue since upgrading to 1185.3.0, here are the logs: https://gist.github.com/anonymous/0ad15cacc028c38ff7759abba7ace198

@bkleef
Copy link

bkleef commented Nov 14, 2016

Same here at Digital Ocean. Let me know if you need more logs, happy to help.

$ docker version
Client:
 Version:      1.11.2
 API version:  1.23
 Go version:   go1.6.3
 Git commit:   bac3bae
 Built:
 OS/Arch:      linux/amd64

Server:
 Version:      1.11.2
 API version:  1.23
 Go version:   go1.6.3
 Git commit:   bac3bae
 Built:
 OS/Arch:      linux/amd64
$ uname -a
Linux 4.7.3-coreos-r2 #1 SMP Tue Nov 1 01:38:43 UTC 2016 x86_64 Intel(R) Xeon(R) CPU E5-2630L 0 @ 2.00GHz GenuineIntel GNU/Linux
$ cat /etc/lsb-release
DISTRIB_ID=CoreOS
DISTRIB_RELEASE=1185.3.0
DISTRIB_CODENAME="MoreOS"
DISTRIB_DESCRIPTION="CoreOS 1185.3.0 (MoreOS)"

@bkleef
Copy link

bkleef commented Nov 14, 2016

I've updated CoreOS 1185.3.0 to 1192.2.0 (Docker 1.11.2 to 1.12.1). Will keep you guys posted!

@Jwpe
Copy link

Jwpe commented Nov 16, 2016

Has anyone had any joy here? I'm seeing this every two or three restarts of my containers, and it's currently limiting my ability to repeatably deploy.

@victorgp
Copy link

We had to roll back to 1122.3.0 and it is working

Version 1185.3.0 it is not usable, every ~24hours the docker daemon gets unresponsive.

@matti
Copy link

matti commented Nov 21, 2016

Daemon jammed again, this time not even docker ps -n 1 worked.

@Jwpe
Copy link

Jwpe commented Nov 23, 2016

@bkleef are you seeing this issue still on 1192.2.0? Thanks!

@bkleef
Copy link

bkleef commented Nov 23, 2016

@Jwpe yes it is fixed at 1192.2.0!

@Raffo
Copy link

Raffo commented Nov 29, 2016

@Jwpe It's not fixed for us in 1192.2.0.

@victorgp
Copy link

I don't think that bug is related, it was opened in Jun 2015 and this started to happen recently with the latests CoreOS versions. If that's the bug, then it should've happened in older CoreOS versions

@wallneradam
Copy link

wallneradam commented Nov 30, 2016

@victorgp It is almost sure that it is related if you read all the comments or at least the latest ones. Just it is more often under Docker v1.11.0 and up. Which is only in the latest CoreOS (stable) not the previous one.
Not only docker ps hangs, but you cannot start or stop containers as well.

Anyway my suggestion is working for me for about a week. And it was not too hard to rewrite my services, which are needed to run processes or new containers, to ctr, which is integrated into CoreOS...
(The ctr utility is the CLI interface for containerd, which Docker uses as well behind the scene from v1.11.0).

@Jwpe
Copy link

Jwpe commented Nov 30, 2016

@Raffo I haven't encountered the issue since upgrading to 1192.2.0, but I might just not have recreated the scenario where it occurs yet.

@philk
Copy link

philk commented Dec 1, 2016

We're also seeing this on 1185.3.0. At least on the one system it's currently happening on -n9 works but -n10 fails. The ctr command can communicate with things but this failing docker daemon breaks the kubelet making this a completely failed node. Specifically pods get stuck in Pending/ContainerCreating/Terminating.

@crawford crawford added this to the CoreOS Alpha 1263.0.0 milestone Dec 2, 2016
@hjacobs
Copy link

hjacobs commented Dec 5, 2016

We had a hanging docker ps as well (moby/moby#28889), we could reproduce it by heavily using docker exec. We fixed the issue for us by manually upgrading to Docker 1.13 (zalando-incubator/kubernetes-on-aws#164).

We are now waiting for Docker 1.13 release and inclusion in CoreOS: zalando-incubator/kubernetes-on-aws#167

@seanknox
Copy link

seanknox commented Dec 6, 2016

I was also able to reproduce this issue using https://github.com/crosbymichael/docker-stress on a Kubernetes worker node running CoreOS Stable 1185.3.0.

Running docker_stress_linux_amd64 -k 3s -c 5 --containers 1000: 5 concurrent workers creating/deleting containers, max lifetime of containers = 3s, create up to 1000 containers on an m4.large instance on AWS would leave the Docker daemon unresponsive after about three minutes.

Upgraded to CoreOS Beta 1235.1.0 and I haven't been able to reproduce. Whereas running 5 concurrent docker_stress workers would kill CoreOS Stable after a few minutes, I was able to run with 10 and 15 concurrent workers until test completion using CoreOS Beta.

CoreOS Stable 1185.3.0

kernel: 4.7.3

docker: 1.11.2

CoreOS Beta 1235.1.0

kernel: 4.8.6

docker: 1.12.3

@mward29
Copy link

mward29 commented Dec 7, 2016

Just as an FYI, I'm not entirely sure this is CoreOS specific. We are running into the same issues on CentOS. Its feels more relates to Docker atm but I've been unable to get a dump out of Docker on failure to validate.

@hjacobs
Copy link

hjacobs commented Dec 7, 2016

@mward29 yes, it's a Docker bug and fixed in 1.13 (and will be backported to 1.12 AFAIK, see moby/moby#28889) --- still it should be fixed in CoreOS (by upgrading to fixed Docker as soon as it's released).

@dm0-
Copy link

dm0- commented Dec 14, 2016

We are updating to Docker 1.12.4, which contains the upstream fixes for the docker daemon deadlocks (moby/moby#29095, moby/moby#29141). It will be available in the alpha later this week. You can reopen this if problems persist.

@Bekt
Copy link

Bekt commented Jan 10, 2017

This kept happening to me at about 10% of the time. For non-alpha releases, I solved this by adding a unit that constantly checks docker version and restarts if it hangs.

- name: ping-docker.service
  command: start
  content: |
    [Unit]
    Description=Make sure docker is always running.
    [Service]
    ExecStart=/usr/bin/bash -c "(timeout 2s docker ps > /dev/null) || (systemctl restart docker)"
    Restart=always
    RestartSec=10

Edit: s/version/ps

@saschaarthur

This comment has been minimized.

@MartinPyka
Copy link

MartinPyka commented Mar 6, 2017

How do you stop the stdout of docker logs?
Would love to have reliable method to reproduce the error

@krmayankk
Copy link

krmayankk commented Jul 28, 2017

@hjacobs i am seeing this on 1.12.6 Is this fixed, if yes in which version ?
@dm0- you seem to suggest that its fixed in 1.12.4 , is that right ? since i am seeing in 1.12.6.

docker version
Client:
Version: 1.12.6
API version: 1.24
Go version: go1.6.4
Git commit: 78d1802
Built: Tue Jan 10 20:20:01 2017
OS/Arch: linux/amd64

Server:
Version: 1.12.6
API version: 1.24
Go version: go1.6.4
Git commit: 78d1802
Built: Tue Jan 10 20:20:01 2017
OS/Arch: linux/amd64

strace for docker ps

futex(0xc820062908, FUTEX_WAKE, 1)      = 1
futex(0xc820062908, FUTEX_WAKE, 1)      = 1
socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 4
setsockopt(4, SOL_SOCKET, SO_BROADCAST, [1], 4) = 0
connect(4, {sa_family=AF_LOCAL, sun_path="/var/run/docker.sock"}, 23) = 0
epoll_create1(EPOLL_CLOEXEC)            = 5
epoll_ctl(5, EPOLL_CTL_ADD, 4, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=3597498464, u64=140156970301536}}) = 0
getsockname(4, {sa_family=AF_LOCAL, NULL}, [2]) = 0
getpeername(4, {sa_family=AF_LOCAL, sun_path="/var/run/docker.sock"}, [23]) = 0
futex(0xc820062908, FUTEX_WAKE, 1)      = 1
read(4, 0xc820349000, 4096)             = -1 EAGAIN (Resource temporarily unavailable)
write(4, "GET /v1.24/containers/json HTTP/"..., 95) = 95
futex(0xc820062d08, FUTEX_WAKE, 1)      = 1
futex(0x132ccc8, FUTEX_WAIT, 0, NULL^CProcess 119740 detached

@piyushkv1
Copy link

I'm also seeing this in docker server version 1.12.6

@deryadorian
Copy link

deryadorian commented Aug 21, 2017

sudo systemctl status docker.service
● docker.service - Docker Application Container Engine
Loaded: loaded (/run/systemd/system/docker.service; enabled; vendor preset: disabled)
Active: inactive (dead)
Docs:

after docker ps:

systemctl status docker.service
● docker.service - Docker Application Container Engine
Loaded: loaded (/run/systemd/system/docker.service; enabled; vendor preset: disabled)
Active: active (running) since Mon 2017-08-21 13:20:50 UTC; 2s ago
Docs: http://docs.docker.com
Main PID: 25815 (dockerd)

@squeed
Copy link

squeed commented Aug 21, 2017

That seems reasonable; docker is a socket-activated service.

@osman-masood

This comment has been minimized.

@ilatypov

This comment has been minimized.

@lucab
Copy link

lucab commented Dec 13, 2019

This ticket is unfortunately starting to attract spurious follow-ups not related to ContainerLinux. I'm thus going to lock this to prevent further unrelated comments.

@coreos coreos locked as off-topic and limited conversation to collaborators Dec 13, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.