
Lingering processes (and containers) when writing to stdout (was: Unable to launch 1024th instance: bridge 'docker0' : Exchange full) #1320

Closed
AtnNn opened this issue Jul 27, 2013 · 31 comments

@AtnNn

AtnNn commented Jul 27, 2013

$ docker ps | wc -l
1024
$ brctl show docker0 | wc -l
1024
$ docker run base true
lxc-start: failed to attach 'vethHjfSzW' to the bridge 'docker0' : Exchange full
lxc-start: failed to create netdev
lxc-start: failed to create the network
lxc-start: failed to spawn 'aae42e176fa8369f1e327b752eb1e136963274f273053c72599b361b7ffc3a63'
lxc-start: No such file or directory - failed to remove cgroup '/sys/fs/cgroup//lxc/aae42e176fa8369f1e327b752eb1e136963274f273053c72599b361b7ffc3a63'

The kernel seems to define this limit in net/bridge/br_private.h

#define BR_PORT_BITS 10
#define BR_MAX_PORTS (1<<BR_PORT_BITS)
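With BR_PORT_BITS set to 10, that works out to 1<<10 = 1024 ports per bridge. As a quick sanity check of how close a host is to that limit (a sketch, not from the original report; assumes bridge-utils is installed and docker0 is the bridge in question):

```shell
# The kernel's per-bridge port limit: BR_MAX_PORTS = 1 << BR_PORT_BITS
echo "limit: $((1 << 10))"

# Interfaces currently attached to docker0 (skip brctl's header line).
if command -v brctl >/dev/null 2>&1; then
  brctl show docker0 | tail -n +2 | wc -l
fi
```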
@keeb-zz
Contributor

keeb-zz commented Jul 27, 2013

@AtnNn you can set the mask even higher for docker0 by running ifconfig docker0 netmask 255.0.0.0

@AtnNn
Author

AtnNn commented Jul 27, 2013

Thanks for the suggestion @keeb, but it did not work.

The netmask was already 255.255.0.0 (thanks to #1265), which would imply a limit of over 65,000 IP addresses.

@ykumar6

ykumar6 commented Aug 8, 2013

We are experiencing the same issue at roughly 1000 containers.
Our netmask is set to 255.255.0.0

Full log is here (the same four-line sequence repeats, with a new veth name, every few seconds):

{"log":"lxc-start: failed to attach 'vethfjTmTa' to the bridge 'docker0' : Exchange full\n","stream":"stderr","time":"2013-08-08T18:28:02.898739266Z"}
{"log":"lxc-start: failed to create netdev\n","stream":"stderr","time":"2013-08-08T18:28:02.924695451Z"}
{"log":"lxc-start: failed to create the network\n","stream":"stderr","time":"2013-08-08T18:28:02.924776414Z"}
{"log":"lxc-start: failed to spawn '5ecb2aa71e1067e8a15969409c2ea9a3a99dca446fc1705f8dc4b0f77da003ba'\n","stream":"stderr","time":"2013-08-08T18:28:02.92479133Z"}

@creack
Contributor

creack commented Aug 8, 2013

/cc @jpetazzo

@ykumar6

ykumar6 commented Aug 8, 2013

A little more about our scenario: we are spinning up about 2000 containers per day per server. The containers run only for a short duration and then are stopped.

@jpetazzo
Contributor

jpetazzo commented Aug 8, 2013

This is because Linux bridges allow a maximum of 1024 ports.

See bridge port allocation code, referencing BR_MAX_PORTS.

If you don't use STP, you can tweak BR_PORT_BITS and recompile your kernel. You can go up to 16 (even though I wonder what happens when there are zero bits left for STP priority), which would translate to 64K ports.

If you don't want to recompile, you could also:

  • use openvswitch instead (IIRC, it already allows 64K ports);
  • not allocate a network interface when you don't need it (but I guess that your containers need network connectivity);
  • run multiple instances of docker in parallel, each using a different bridge, API endpoint, and possibly graph directory (this is not supported, but due to your particular use case, it could be worth investigating).

However, it's weird that you hit the 1024 ports limit if your containers are short-lived. Could you attach the output of brctl show and ip link ls?

@ykumar6

ykumar6 commented Aug 8, 2013

The problem fixes itself after restarting Docker, so it's not happening now; we have 10 containers open.
The logs are massive (I am assuming there is a leak somewhere), so I emailed them to you instead, as GitHub won't let me attach them.

@jpetazzo
Contributor

jpetazzo commented Aug 8, 2013

Indeed, it looks like the interfaces are not garbage-collected as they should.

A couple of extra questions:

  • which kernel version are you running? (I doubt it's related, but well, let's gather as much info as we can!)
  • can you also attach (or send me) the output of docker ps, and maybe ps faux (I expect that the latter will be even bigger)? I would like to be sure that the containers are completely gone...

@jpetazzo
Contributor

jpetazzo commented Aug 8, 2013

(For the record: both brctl show and ip link ls were showing 990-1000 interfaces)

@ykumar6

ykumar6 commented Aug 8, 2013

Linux ip-10-0-2-232 3.8.0-19-generic #30-Ubuntu SMP Wed May 1 16:35:23 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

I sent you docker ps and ps faux.

We have a lot of zombie processes that get created as a result of the starts/stops, but we are assuming that isn't related to this problem.

@jpetazzo
Contributor

jpetazzo commented Aug 8, 2013

Thanks for the log files; this is extremely helpful!

The ps faux output lists a handful of lxc-start processes, and 991 node processes in defunct state. I believe that this is the most significant clue! For some reason, those processes are not reaped and remain in zombie state, and therefore, their network resources are not garbage collected.

Can you give us some details about the lifecycle? Specifically: how do you terminate containers? Do the processes end "normally", or do you docker kill the containers?
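As a side note, the defunct count in a ps dump like that one can be tallied quickly; this is a generic one-liner, not something from the report:

```shell
# Count processes in zombie state: column 8 of `ps aux` output is STAT,
# and zombies show up as "Z" (often "Z+" or "Zs").
ps aux | awk '$8 ~ /^Z/ { n++ } END { print n+0 }'
```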

@ykumar6

ykumar6 commented Aug 9, 2013

Thanks Jerome for the direction. We're doing more research and will get back to you when we have more info.


@ykumar6

ykumar6 commented Aug 16, 2013

So far, here's what I've uncovered.

When we stop a container, sometimes it does not shut down cleanly. First a SIGINT is sent to lxc, then a SIGKILL. When both of these fail, docker sends a SIGKILL to the main process.

This causes the process to become a zombie and not release its ports. It happens sporadically.
We can replicate the issue easily if we docker start and then immediately docker stop the container.

Our run command is a node process which also launches subprocesses such as mysql, mongodb, and/or apache.
Are there additional logs I can look at? containerid-json.log is empty.

@ykumar6

ykumar6 commented Aug 16, 2013

Whenever I do see SIGKILL being sent to the process, I also see this in the logs:
couldn't run auplink before unmount: exec: "auplink": executable file not found in $PATH

We are using a read-only bind mount in all containers; could it be failing to unmount and freezing?

@jpetazzo
Contributor

The missing auplink shouldn't cause too much havoc; but if you want to be sure, you can apt-get install aufs-tools, and this error will go away.

The read-only bind shouldn't be a problem either.

Suggestions of things to try:

  • does the problem also happen if you use the kernel in kernel-image-3.8.0-27-generic?
  • does it also happen if you try a 3.10 kernel?
  • when it happens, can you "echo w | sudo tee /proc/sysrq-trigger" and attach the kernel log?

The latter will indicate the list of blocked processes. Since the processes seem to be zombie processes, I don't know if it will be helpful, but who knows.

Also, are you familiar with moving processes between cgroups?

@ykumar6

ykumar6 commented Aug 16, 2013

I tried upgrading to 3.8.0-27-generic, but it didn't help. I'm going to try 3.10 next.
In the meantime, I've emailed you the kern.log

@ykumar6

ykumar6 commented Aug 17, 2013

Ok, looks like I found the problem.

We run a daemon program inside each docker container, which launches multiple child processes. We stream the output of those child processes to a log file inside the container.

When doing a docker stop, it would try killing our daemon, but for some reason the daemon can't close its write stream to this log file, and the process hangs.

We changed the behavior of our daemon so that the child processes directly stream to log files, and that seems to have fixed the problem.
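In other words, the hang came from the daemon sitting in the middle of its children's output streams. The before/after can be sketched in shell (the names and file paths here are made up for illustration, not taken from the actual daemon):

```shell
# Before (problematic): children pipe output through the daemon, which
# appends to the log; the daemon cannot exit while the pipe is open:
#   child | daemon >> app.log
#
# After (the fix): each child's stdout/stderr is redirected straight to
# the log file, so the daemon holds no open write stream of its own.
child() { echo "child output"; }
child >> app.log 2>&1 &   # the child owns the file descriptor, not the daemon
wait
cat app.log
```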

@jpetazzo
Contributor

Extremely interesting.
Just to make sure that I understand:

  • if you write to a "normal" log file (on AUFS, I suppose), the problem appears;
  • if you write to stdout, everything works properly.

Is that correct?


@ykumar6

ykumar6 commented Aug 19, 2013

Here's the commit that fixed the problem
Runnable/dockworker@5c59428

We're writing the daemon in node.js

@creack
Contributor

creack commented Dec 6, 2013

@jpetazzo can we close this issue?

@crosbymichael
Contributor

Is this still an issue for people with the new versions of docker?

@anandkumarpatel
Contributor

Yes, this is still an issue; we are still seeing the above. Around 55 out of 350 have defunct node processes. More info here:
http://stackoverflow.com/questions/22413563/docker-container-refuses-to-get-killed-after-run-command-turns-into-a-zombie

@anandkumarpatel
Contributor

repro steps:

#!/bin/bash
# Reproduce the lingering defunct processes: start a container, schedule a
# `docker stop` 60 seconds later in the background, and repeat 50 times.
CNT=0
while true
do
  echo "$CNT"
  DOCK=$(sudo docker run -d -t anandkumarpatel/zombie_bug ./node index.js)
  sleep 60 && sudo docker stop "$DOCK" > out.log &
  sleep 1
  CNT=$((CNT + 1))
  if [[ "$CNT" == "50" ]]; then
    exit
  fi
done

@jpetazzo jpetazzo added the bug label Mar 24, 2014
@crosbymichael crosbymichael added this to the 1.0 milestone May 16, 2014
@crosbymichael crosbymichael self-assigned this May 27, 2014
@vieux
Contributor

vieux commented May 27, 2014

@anandkumarpatel
Contributor

Oh yes, forgot to say: close this!

@vieux vieux closed this as completed May 28, 2014
@vincentwoo
Contributor

@anandkumarpatel I'm on Ubuntu with

vwoo@tty:~$ uname -r
3.11.0-19-generic

and seeing zombie processes. Is there a way for me to know that the cause illustrated in your SO question is the root cause?

@ianblenke

This just bit me on a CoreOS 607 host running Docker 1.5.0.

@acr92

acr92 commented Apr 8, 2015

This bug seems to be an issue on CoreOS 607 with Docker 1.5.0.

It seems the veth interfaces aren't removed from the bridge when the containers shut down. Restarting docker is a temporary fix, but please get this fixed.

@thaJeztah
Member

@ianblenke @Blystad could you create a new issue for that (possibly referring to this issue)?

When reporting the issue, please also provide the information as described in https://github.com/docker/docker/blob/master/CONTRIBUTING.md#reporting-other-issues.

Also, please check for existing issues; it's possible that there's an existing (open) issue tracking this.

If possible, could you also test on a docker-1.6 release candidate, to see if the problem has been resolved since the 1.5 release?

You can find the current release candidates here: #11635 (comment)

@bv-vijay

@jpetazzo can you help me understand why I can only ping 1001 of the 1023 interfaces that are connected to a Linux bridge? https://stackoverflow.com/questions/45066139/why-i-can-only-ping-1001-interface-out-of-1023-that-are-connected-to-a-linux-bri

@jpetazzo
Contributor

@vijay-rs the Stack Overflow question that you linked has been removed. Furthermore, please do not add an unrelated question to an issue that was closed 2 years ago! Your issue is probably totally unrelated, and this issue has been closed. Please open a new issue if needed. Thank you very much!
