Lingering processes (and containers) when writing to stdout (was: Unable to launch 1024th instance: bridge 'docker0' : Exchange full) #1320
@AtnNn you can set the mask even higher for docker0 by running |
We are experiencing the same issue at roughly 1000 containers. Full log is here:
{"log":"lxc-start: failed to attach 'vethfjTmTa' to the bridge 'docker0' : Exchange full\n","stream":"stderr","time":"2013-08-08T18:28:02.898739266Z"}
{"log":"lxc-start: failed to create netdev\n","stream":"stderr","time":"2013-08-08T18:28:02.924695451Z"}
{"log":"lxc-start: failed to create the network\n","stream":"stderr","time":"2013-08-08T18:28:02.924776414Z"}
{"log":"lxc-start: failed to spawn '5ecb2aa71e1067e8a15969409c2ea9a3a99dca446fc1705f8dc4b0f77da003ba'\n","stream":"stderr","time":"2013-08-08T18:28:02.92479133Z"}
{"log":"lxc-start: failed to attach 'veth2wVEtK' to the bridge 'docker0' : Exchange full\n","stream":"stderr","time":"2013-08-08T18:28:11.231451083Z"}
{"log":"lxc-start: failed to create netdev\n","stream":"stderr","time":"2013-08-08T18:28:11.260742429Z"}
{"log":"lxc-start: failed to create the network\n","stream":"stderr","time":"2013-08-08T18:28:11.26081298Z"}
{"log":"lxc-start: failed to spawn '5ecb2aa71e1067e8a15969409c2ea9a3a99dca446fc1705f8dc4b0f77da003ba'\n","stream":"stderr","time":"2013-08-08T18:28:11.260842453Z"}
{"log":"lxc-start: failed to attach 'veth0GELCb' to the bridge 'docker0' : Exchange full\n","stream":"stderr","time":"2013-08-08T18:28:20.26200723Z"}
{"log":"lxc-start: failed to create netdev\n","stream":"stderr","time":"2013-08-08T18:28:20.288725026Z"}
{"log":"lxc-start: failed to create the network\n","stream":"stderr","time":"2013-08-08T18:28:20.288829812Z"}
{"log":"lxc-start: failed to spawn '5ecb2aa71e1067e8a15969409c2ea9a3a99dca446fc1705f8dc4b0f77da003ba'\n","stream":"stderr","time":"2013-08-08T18:28:20.288847775Z"}
{"log":"lxc-start: failed to attach 'vethFCZccx' to the bridge 'docker0' : Exchange full\n","stream":"stderr","time":"2013-08-08T18:28:29.177668698Z"}
{"log":"lxc-start: failed to create netdev\n","stream":"stderr","time":"2013-08-08T18:28:29.20109484Z"}
{"log":"lxc-start: failed to create the network\n","stream":"stderr","time":"2013-08-08T18:28:29.201169778Z"}
{"log":"lxc-start: failed to spawn '5ecb2aa71e1067e8a15969409c2ea9a3a99dca446fc1705f8dc4b0f77da003ba'\n","stream":"stderr","time":"2013-08-08T18:28:29.20118698Z"}
{"log":"lxc-start: failed to attach 'veth1fiB9t' to the bridge 'docker0' : Exchange full\n","stream":"stderr","time":"2013-08-08T18:28:33.853148464Z"}
{"log":"lxc-start: failed to create netdev\n","stream":"stderr","time":"2013-08-08T18:28:33.869600611Z"}
{"log":"lxc-start: failed to create the network\n","stream":"stderr","time":"2013-08-08T18:28:33.869707684Z"}
{"log":"lxc-start: failed to spawn '5ecb2aa71e1067e8a15969409c2ea9a3a99dca446fc1705f8dc4b0f77da003ba'\n","stream":"stderr","time":"2013-08-08T18:28:33.869725395Z"}
{"log":"lxc-start: failed to attach 'vethTWelYW' to the bridge 'docker0' : Exchange full\n","stream":"stderr","time":"2013-08-08T18:28:38.17659738Z"}
{"log":"lxc-start: failed to create netdev\n","stream":"stderr","time":"2013-08-08T18:28:42.723254416Z"}
{"log":"lxc-start: failed to create the network\n","stream":"stderr","time":"2013-08-08T18:28:42.723324767Z"}
{"log":"lxc-start: failed to spawn '5ecb2aa71e1067e8a15969409c2ea9a3a99dca446fc1705f8dc4b0f77da003ba'\n","stream":"stderr","time":"2013-08-08T18:28:42.723355021Z"}
{"log":"lxc-start: failed to attach 'vethjnSD2m' to the bridge 'docker0' : Exchange full\n","stream":"stderr","time":"2013-08-08T18:28:47.182822431Z"}
{"log":"lxc-start: failed to create netdev\n","stream":"stderr","time":"2013-08-08T18:28:47.208869656Z"}
{"log":"lxc-start: failed to create the network\n","stream":"stderr","time":"2013-08-08T18:28:47.208945839Z"}
{"log":"lxc-start: failed to spawn '5ecb2aa71e1067e8a15969409c2ea9a3a99dca446fc1705f8dc4b0f77da003ba'\n","stream":"stderr","time":"2013-08-08T18:28:47.208964383Z"}
{"log":"lxc-start: failed to attach 'vethzeGhZI' to the bridge 'docker0' : Exchange full\n","stream":"stderr","time":"2013-08-08T18:28:56.166600333Z"}
{"log":"lxc-start: failed to create netdev\n","stream":"stderr","time":"2013-08-08T18:28:56.184881084Z"}
{"log":"lxc-start: failed to create the network\n","stream":"stderr","time":"2013-08-08T18:28:56.184938622Z"}
{"log":"lxc-start: failed to spawn '5ecb2aa71e1067e8a15969409c2ea9a3a99dca446fc1705f8dc4b0f77da003ba'\n","stream":"stderr","time":"2013-08-08T18:28:56.185038674Z"}
{"log":"lxc-start: failed to attach 'vethYzki9g' to the bridge 'docker0' : Exchange full\n","stream":"stderr","time":"2013-08-08T18:29:05.249763803Z"}
{"log":"lxc-start: failed to create netdev\n","stream":"stderr","time":"2013-08-08T18:29:05.284705579Z"}
{"log":"lxc-start: failed to create the network\n","stream":"stderr","time":"2013-08-08T18:29:05.284776108Z"}
{"log":"lxc-start: failed to spawn '5ecb2aa71e1067e8a15969409c2ea9a3a99dca446fc1705f8dc4b0f77da003ba'\n","stream":"stderr","time":"2013-08-08T18:29:05.284830179Z"}
{"log":"lxc-start: failed to attach 'vethosWjFx' to the bridge 'docker0' : Exchange full\n","stream":"stderr","time":"2013-08-08T18:29:14.169209858Z"}
{"log":"lxc-start: failed to create netdev\n","stream":"stderr","time":"2013-08-08T18:29:14.184981241Z"}
{"log":"lxc-start: failed to create the network\n","stream":"stderr","time":"2013-08-08T18:29:14.185063693Z"}
{"log":"lxc-start: failed to spawn '5ecb2aa71e1067e8a15969409c2ea9a3a99dca446fc1705f8dc4b0f77da003ba'\n","stream":"stderr","time":"2013-08-08T18:29:14.18508547Z"}
{"log":"lxc-start: failed to attach 'vethu2XuVW' to the bridge 'docker0' : Exchange full\n","stream":"stderr","time":"2013-08-08T18:29:23.161974191Z"}
{"log":"lxc-start: failed to create netdev\n","stream":"stderr","time":"2013-08-08T18:29:23.172984832Z"}
{"log":"lxc-start: failed to create the network\n","stream":"stderr","time":"2013-08-08T18:29:23.173061274Z"}
{"log":"lxc-start: failed to spawn '5ecb2aa71e1067e8a15969409c2ea9a3a99dca446fc1705f8dc4b0f77da003ba'\n","stream":"stderr","time":"2013-08-08T18:29:23.173078736Z"}
{"log":"lxc-start: failed to attach 'vethuUhW6l' to the bridge 'docker0' : Exchange full\n","stream":"stderr","time":"2013-08-08T18:29:32.164502755Z"}
{"log":"lxc-start: failed to create netdev\n","stream":"stderr","time":"2013-08-08T18:29:32.18240471Z"}
{"log":"lxc-start: failed to create the network\n","stream":"stderr","time":"2013-08-08T18:29:32.182459505Z"}
{"log":"lxc-start: failed to spawn '5ecb2aa71e1067e8a15969409c2ea9a3a99dca446fc1705f8dc4b0f77da003ba'\n","stream":"stderr","time":"2013-08-08T18:29:32.182515763Z"}
{"log":"lxc-start: failed to attach 'vethunYNAL' to the bridge 'docker0' : Exchange full\n","stream":"stderr","time":"2013-08-08T18:29:41.164979142Z"}
{"log":"lxc-start: failed to create netdev\n","stream":"stderr","time":"2013-08-08T18:29:41.188897185Z"}
{"log":"lxc-start: failed to create the network\n","stream":"stderr","time":"2013-08-08T18:29:41.188983003Z"}
{"log":"lxc-start: failed to spawn '5ecb2aa71e1067e8a15969409c2ea9a3a99dca446fc1705f8dc4b0f77da003ba'\n","stream":"stderr","time":"2013-08-08T18:29:41.18902099Z"} |
/cc @jpetazzo |
A little more about our scenario: we are spinning up about 2000 containers per day per server. The containers run only for a short duration and then are stopped. |
This is because Linux bridges allow a maximum of 1024 ports. See the bridge port allocation code, which references BR_MAX_PORTS. If you don't use STP, you can tweak If you don't want to recompile, you could also:
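To see why the failure lands on the 1024th instance, here is a small Python model of bridge port numbering, based on our reading of find_portno() in net/bridge/br_if.c: port number 0 is reserved, each new port gets the lowest free number below BR_MAX_PORTS (1 << BR_PORT_BITS, with BR_PORT_BITS = 10), and -EXFULL ("Exchange full") is returned once none is left. That leaves 1023 usable ports, which would match the title if each container adds exactly one veth to docker0. BridgePortModel is a hypothetical name; this is a sketch, not kernel code, so check the source of the kernel you actually run.

```python
import errno

# Values as defined in net/bridge/br_private.h: 1 << 10 = 1024.
BR_PORT_BITS = 10
BR_MAX_PORTS = 1 << BR_PORT_BITS

# On Linux, errno 54 is EXFULL ("Exchange full"); fall back to the number
# on platforms whose errno module does not define it.
EXFULL = getattr(errno, "EXFULL", 54)


class BridgePortModel:
    """Hypothetical toy model of how a bridge hands out port numbers."""

    def __init__(self):
        # Port number 0 is reserved, so it starts out marked as in use.
        self.in_use = {0}

    def add_port(self):
        """Return the lowest free port number, like find_portno()."""
        for port_no in range(BR_MAX_PORTS):
            if port_no not in self.in_use:
                self.in_use.add(port_no)
                return port_no
        # No free number left: the kernel reports -EXFULL at this point.
        raise OSError(EXFULL, "Exchange full")

    def del_port(self, port_no):
        """Release a number when its interface leaves the bridge."""
        self.in_use.discard(port_no)


if __name__ == "__main__":
    bridge = BridgePortModel()
    for _ in range(BR_MAX_PORTS - 1):  # 1023 usable port numbers
        bridge.add_port()
    try:
        bridge.add_port()  # the 1024th veth
    except OSError as exc:
        print("1024th port:", exc)
```

The model also shows why leaked ports matter: a number only becomes available again after del_port(), so veth interfaces that are never removed from the bridge eventually exhaust the space even if few containers are running.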
However, it's weird that you hit the 1024 ports limit if your containers are short-lived. Could you attach the output of |
The problem goes away after restarting Docker, so it's not happening right now; we have 10 containers running. |
Indeed, it looks like the interfaces are not garbage-collected as they should be. A couple of extra questions:
|
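One way to check whether veth ports are piling up on the bridge is to count the entries the kernel lists under /sys/class/net/docker0/brif/ (one per bridge port). This is not the command requested in the thread; it is a sketch that assumes the bridge is named docker0 and sysfs is mounted at /sys, and bridge_ports is a hypothetical helper. A count that keeps growing while docker ps shows only a few containers would suggest interfaces are not being cleaned up.

```python
import os


def bridge_ports(bridge="docker0", sysfs="/sys/class/net"):
    """Return the names of interfaces attached to a Linux bridge.

    The kernel exposes one entry per bridge port in
    /sys/class/net/<bridge>/brif/.
    """
    return sorted(os.listdir(os.path.join(sysfs, bridge, "brif")))


if __name__ == "__main__":
    try:
        ports = bridge_ports()
    except FileNotFoundError:
        print("no docker0 bridge found under /sys/class/net")
    else:
        veths = [name for name in ports if name.startswith("veth")]
        print(f"docker0 has {len(ports)} ports, {len(veths)} of them veth")
```

The sysfs root is a parameter only so the function can be pointed at a different directory, for example in a test.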
(For the record: both |
Linux ip-10-0-2-232 3.8.0-19-generic #30-Ubuntu SMP Wed May 1 16:35:23 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
I sent you the output of docker ps and ps faux. We have a lot of zombie processes that get created by the starts/stops, but we are assuming that isn't related to this problem. |
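Zombie processes (shown as <defunct> by ps) have state Z in /proc/<pid>/stat. A Linux-only sketch for listing them follows; zombie_pids is a hypothetical helper. A zombie has already exited and only disappears once its parent reaps it with wait().

```python
import os


def zombie_pids(proc="/proc"):
    """Return (pid, command) pairs for processes in state 'Z' (zombie)."""
    zombies = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                data = f.read()
        except OSError:
            continue  # the process went away between listdir() and open()
        # The line looks like "pid (comm) state ...". comm may itself
        # contain spaces or parentheses, so locate the last ')'.
        rparen = data.rfind(")")
        comm = data[data.find("(") + 1:rparen]
        state = data[rparen + 2:rparen + 3]
        if state == "Z":
            zombies.append((int(entry), comm))
    return zombies


if __name__ == "__main__":
    for pid, comm in zombie_pids():
        print(pid, comm)
```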
Thanks for the log files; this is extremely helpful! The Can you give us some details about the lifecycle? Specifically: how do you terminate containers? Do the processes end "normally", or do you |
Thanks Jerome for the direction. We're doing more research and will get On Thu, Aug 8, 2013 at 2:23 PM, Jérôme Petazzoni
|
So far, here's what I've uncovered. When we stop a container, it sometimes does not shut down cleanly. First a SIGINT is sent to lxc, then a SIGKILL. When both of these fail, Docker sends a SIGKILL to the main process. This turns it into a zombie that does not release its ports. It happens sporadically. Our run command is a node process which also launches subprocesses such as mysql, mongodb, and/or apache. |
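The stop sequence described above (a polite signal first, SIGKILL if that fails) can be sketched with Python's subprocess module. The point of the sketch is that the parent must still wait() on the child after SIGKILL: a killed child that is never reaped stays a zombie. stop_process and STUBBORN_CHILD are hypothetical names, and this is not Docker's or LXC's actual stop logic.

```python
import signal
import subprocess
import sys

# Hypothetical child that ignores SIGINT, standing in for a process that
# does not shut down cleanly. It prints "ready" once the handler is set.
STUBBORN_CHILD = (
    "import signal, time\n"
    "signal.signal(signal.SIGINT, signal.SIG_IGN)\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)


def stop_process(proc, grace=2.0):
    """Send SIGINT, escalate to SIGKILL after `grace` seconds, then reap.

    Returns the child's return code; subprocess reports death by a
    signal as the negative signal number (e.g. -9 for SIGKILL).
    """
    proc.send_signal(signal.SIGINT)
    try:
        # wait() both waits for and reaps the child.
        return proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL cannot be caught or ignored
        # Reap the killed child; skipping this leaves a zombie behind.
        return proc.wait()


if __name__ == "__main__":
    child = subprocess.Popen([sys.executable, "-c", STUBBORN_CHILD],
                             stdout=subprocess.PIPE, text=True)
    child.stdout.readline()  # block until SIGINT is being ignored
    print("return code:", stop_process(child, grace=0.5))
    child.stdout.close()
```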
Whenever I see SIGKILL being sent to the process, I also see this in the logs.
We are using a read-only bind in all containers; could it be failing to unmount and freezing? |
The missing auplink shouldn't cause too much havoc, but if you want to be sure, you can The read-only bind shouldn't be a problem either. Suggestions of things to try:
The latter will indicate the list of blocked processes. Since the processes seem to be zombie processes, I don't know if it will be helpful, but who knows. Also, are you familiar with moving processes between cgroups? |
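On moving processes between cgroups: with the cgroup filesystem, writing a PID to a cgroup's cgroup.procs file moves that process into the cgroup, and reading the same file lists the processes still in it (useful for spotting leftovers in a stopped container's cgroup). A minimal sketch, assuming root privileges and a cgroup directory whose exact path depends on the host's cgroup layout; cgroup_pids and move_to_cgroup are hypothetical helpers.

```python
import os


def cgroup_pids(cgroup_dir):
    """Return the PIDs listed in <cgroup_dir>/cgroup.procs."""
    with open(os.path.join(cgroup_dir, "cgroup.procs")) as f:
        # The list is not guaranteed to be sorted and, on cgroup v1,
        # may contain duplicates, so normalise it.
        return sorted({int(line) for line in f if line.strip()})


def move_to_cgroup(pid, cgroup_dir):
    """Move process `pid` into the cgroup at `cgroup_dir`."""
    # Write one PID at a time; the kernel migrates the whole process.
    with open(os.path.join(cgroup_dir, "cgroup.procs"), "w") as f:
        f.write(f"{pid}\n")
```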
I tried upgrading to 3.8.0.27-generic, it didn't help. I'm going to try 3.10 next. |
Ok, looks like I found the problem. We run a daemon program inside each Docker container, which launches multiple child processes. We stream the output of those child processes to a log file inside the container. When we ran docker stop, Docker would try to kill our daemon, but for some reason the daemon couldn't close its write stream to this log file, and the process hung. We changed our daemon so that the child processes stream directly to log files, and that seems to have fixed the problem. |
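The fix described above (child processes writing to log files themselves instead of streaming through the daemon) can be illustrated in Python. The original daemon was written in Node.js and the actual commit is not shown here, so this only sketches the idea; spawn_with_log is a hypothetical helper.

```python
import os
import subprocess
import sys
import tempfile


def spawn_with_log(cmd, log_path):
    """Start cmd with its stdout and stderr going straight to log_path."""
    with open(log_path, "ab") as log:
        # Popen duplicates the descriptor into the child; leaving the
        # with-block closes only the parent's copy, so the child is the
        # sole writer and the parent holds no pipe it must drain or close.
        return subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT)


if __name__ == "__main__":
    log_path = os.path.join(tempfile.mkdtemp(), "child.log")
    child = spawn_with_log(
        [sys.executable, "-c", "print('hello from child')"], log_path)
    child.wait()
    with open(log_path) as f:
        print(f.read(), end="")
```

Compared with stdout=subprocess.PIPE, where the parent has to keep reading and eventually close the pipe, the parent here never sits in the data path, so its own shutdown does not depend on the children's output.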
Extremely interesting.
On Fri, Aug 16, 2013 at 6:32 PM, Yash Kumar wrote:
|
Here's the commit that fixed the problem. We're writing the daemon in Node.js. |
@jpetazzo can we close this issue? |
Is this still an issue for people with the new versions of docker? |
Yes, this is still an issue; we are still seeing the above. Around 55 out of 350 have a defunct node process. More info here: |
repro steps:
|
@anandkumarpatel can this issue be closed, according to http://stackoverflow.com/questions/22413563/docker-container-refuses-to-get-killed-after-run-command-turns-into-a-zombie |
Oh yes, I forgot to say: close this! |
@anandkumarpatel I'm on Ubuntu with
and seeing zombie processes. Is there a way for me to know that the cause illustrated in your SO question is the root cause? |
This just bit me on a CoreOS 607 host running Docker 1.5.0 |
This bug seems to be an issue on CoreOS 607 with Docker 1.5.0. It seems bridges aren't collected when the containers shut down. Restarting Docker seems to be a temporary fix, but please get this fixed. |
@ianblenke @Blystad could you create a new issue for that (possibly referring to this issue)? When reporting the issue, please also provide the information as described in https://github.com/docker/docker/blob/master/CONTRIBUTING.md#reporting-other-issues. Also, please check for existing issues; it's possible that there's an existing (open) issue handling this. If possible, could you also test on a docker-1.6 release candidate, to see if the problem has been resolved since the 1.5 release? You can find the current release candidates here: #11635 (comment) |
@jpetazzo can you help me understand why I can only ping 1001 of the 1023 interfaces connected to a Linux bridge? https://stackoverflow.com/questions/45066139/why-i-can-only-ping-1001-interface-out-of-1023-that-are-connected-to-a-linux-bri |
@vijay-rs the Stack Overflow question that you linked has been removed. Furthermore, please do not add an unrelated question to an issue that was closed 2 years ago! Your issue is probably totally unrelated, and this issue has been closed. Please open a new issue if needed. Thank you very much! |
The kernel seems to define this limit in net/bridge/br_private.h.