Docker 1.12 RC3: network connectivity problem between the containers in a service #24496
ping @mavenugo |
@mavenugo could you please help take a look? My work is being blocked by this issue, and my colleague is running into this same problem. Thank you. |
@ligc sure. Will take a look. |
In my cluster, the docker hosts use the subnet 10.0.0.0/8, which overlaps with the docker swarm ingress network 10.255.0.0/16; could this be a problem at all? |
@ligc overlapping subnets could be a problem for kernels < 3.16, but you are using a 4.x kernel (I assume all 5 of your nodes run the same kernel). Can you please let us know the exact commands you used to set up the swarm? |
Hi, the 5 nodes are all running Ubuntu 16.04 with kernel 4.4.0-22. The command I used to init the swarm:
Then I ran the following join command on all 4 worker nodes to join the swarm:
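In the usual form (the address and token below are placeholders, not the actual values used on this cluster):

```
# on the manager node
docker swarm init --advertise-addr <manager-ip>

# on each of the 4 worker nodes, using the token printed by "swarm init"
docker swarm join --token <worker-token> <manager-ip>:2377
```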
|
@ligc thanks. Just curious: the built-in IPAM dishes out IP addresses in sequence. Do you know the reason for the missing .8 & .9 addresses? Trying to understand this scenario to make sense of the reproduction steps (such as scaling and/or introducing more services in the network). |
Yes, I did run scale out and scale in several times, and there was another service in the swarm; I think the missing .8 and .9 were used by the other service. |
I just deleted the service and re-deployed it with the command "docker service create --replicas 5 --publish 22 --name httpclient liguangcheng/ubuntu-16.04-x86_64-apache2-benchmark-new", but the containers are still assigned the same addresses 10.255.0.4 10.255.0.5 10.255.0.6 10.255.0.7 10.255.0.10. Could this be an indication of the cause of the problem? |
@ligc reusing the same ip-address (after deleting the existing service) is by design. I think the .8 & .9 addresses are being used by other services in the network. Do you still see the connectivity issues even after removing and re-deploying the service? |
@mavenugo after removing and re-deploying the service, there are still 2 IP addresses that are not reachable, but they are different addresses than before.
|
I have a similar issue. I've created 3 nodes with |
@mavenugo Thanks, could you please show us detailed commands to run on which nodes to check and open up the ports? |
@mavenugo Hi, I am not using the Amazon cloud; the nodes are KVM guests on my local server, so I think the security groups do not apply to my configuration? |
I opened up the security groups for any kind of traffic coming from the other hosts and it still doesn't work. Also, I found this, which I guess is related: #24377 |
@ligc yes. the security-groups config above is for AWS users (which @ghoranyi is).
|
@ghoranyi I missed your point on pinging the VIP. Yes, pinging the VIP will not work (IPVS doesn't support ICMP). Can you try pinging the individual container IPs directly instead? |
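To make the distinction concrete, a quick check from inside one of the containers could look like this (a sketch; it assumes the containers share a user-defined overlay network, that the image ships nslookup, and `<service>` / `<container-ip>` are placeholders):

```
nslookup <service>         # the service name resolves to the VIP (no ICMP reply via IPVS)
nslookup tasks.<service>   # tasks.<name> resolves to the individual container IPs
ping <container-ip>        # pinging a container IP directly should work if the overlay is healthy
```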
@mavenugo I destroyed and re-created the swarm, deployed the services httpclient and httpserver, and the network connectivity between the containers worked perfectly; scaling up and down did not break the network. That held until I rebooted all the nodes in the swarm (including the swarm manager); the overlay network seems to be messed up after the reboot. Note: get-ips is a shell script I wrote to get the virtual IP and container IPs of a service through distributed shell (equivalent standalone commands are sketched after the output below).
Before reboot:
After reboot:
We can see that there are several duplicate IP addresses after the reboot, and the service VIP has changed. |
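This is not the actual get-ips script, but the same information can be gathered with commands along these lines (run the second one on each node for each task of the service; the service name is taken from this thread):

```
# virtual IP(s) of the service, as reported by a manager
docker service inspect \
  --format '{{range .Endpoint.VirtualIPs}}{{.Addr}} {{end}}' httpclient

# container IP(s) of a task running on this node
docker inspect \
  --format '{{range .NetworkSettings.Networks}}{{.IPAddress}} {{end}}' <container-id>
```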
I was able to reproduce this "reboot" problem on both of my docker swarm clusters. I am thinking that the original problem in this bug might have been caused by a reboot; I did reboot some nodes. |
Experiencing something similar in rc4 |
@bitsofinfo are you able to test if the current master resolves this for you? You can download static binaries from https://master.dockerproject.org |
I have the same issue. I used docker-machine with the virtualbox driver. How to recreate (a rough sketch of the corresponding commands follows the steps):
2 create swarm cluster with 2 nodes
on docker2
3 create overlay network
5 "docker exec -it bash" into containers and test connectivity. both containers can resolve each other’s ip and connect to each other. everything is good.
7 test connectivity again. DNS resolution no longer works across nodes. TCP is also not working now. Recreating services doesn't help. Recreating the overlay network doesn't help (also tried a different subnet). Only a full recreation of the whole swarm helps. I also tested on different VMs and bare metal with the same results.
Please tell me if any other information is needed. |
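The steps above correspond roughly to the following commands (a sketch with placeholder names; it assumes the swarm itself was created with the usual docker swarm init / docker swarm join on docker1 and docker2):

```
# overlay network for the test services
docker network create -d overlay testnet

# one service pinned to each node
docker service create --name svc1 --network testnet \
  --constraint 'node.hostname==docker1' alpine sleep 1d
docker service create --name svc2 --network testnet \
  --constraint 'node.hostname==docker2' alpine sleep 1d

# exec into the local task on each node and test cross-node resolution/connectivity
docker exec -it <container> sh
ping svc2    # from the svc1 container, and vice versa
```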
Also experiencing the exact same issue in the 1.12.1 release. Only recreating the entire cluster solves the issue. |
An additional note: on a 3-manager Swarm (still 1.12.1) the issue manifests itself a little differently (maybe another issue altogether). If one of the manager machines fails and is later recovered without user action, it will automatically rejoin the Swarm. Other managers will list the recovered node as a manager, and that node will be able to schedule tasks. However, DNS resolution and published services will frequently be broken on the recovered node. DNS resolution only works for tasks that have at least one scheduled container on the node itself. Published ports stop working altogether, meaning that published services can no longer be accessed on that node. We have found that the easiest way to solve this issue at the moment is to manually exclude the offending node from the Swarm and add it again. This means demoting the manager, removing the node, forcing the node itself to leave the Swarm, and finally rejoining it (roughly the command sequence sketched below). Hope this last part helps working around the issue. |
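The sequence described above maps roughly to these commands (node name, token and address are placeholders; run each on the host indicated):

```
# on a healthy manager: demote the recovered node
docker node demote <node-name>

# on the affected node: leave the swarm
docker swarm leave --force

# on a healthy manager: remove the stale node entry and print a join token
docker node rm <node-name>
docker swarm join-token worker

# on the affected node: rejoin
docker swarm join --token <worker-token> <manager-ip>:2377
```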
I'm experiencing the same issue, also using the virtualbox driver and docker 1.12.1. |
We're experiencing the same issue. Upon upgrading the rails service, nginx can no longer reach it, and removing and re-creating the nginx service now makes nginx unreachable at any IP except the machine the container is running on. Restarting daemons, removing all services and recreating networks, or rebooting machines doesn't help. So far I've had to manually recreate the swarm from scratch to resolve it. It's quite easy for me to replicate; what can I provide to help diagnose? Edit: Looks like the same issue as #26563 |
I can re-create the problem (see #26563). Did anyone find an easier workaround than the one described by @eb-dteixeira? Reboots and server failures may happen at any time, and at the moment the result is that some services are unavailable after a reboot; manual fixing cannot be a good solution... |
@bryanrite in your case, is nginx using individual containers as its "upstream" (i.e. connecting to the containers, and not the virtual-ip address of the service)? If so, then that's expected. The free version of nginx does a DNS lookup when the server is started, and caches the results until the next restart. If the IP address of a container changes (or, in your case, the container is replaced with a new container when doing a service update), nginx keeps using the stale address. Possible solutions for that are:
|
@thaJeztah Thanks for responding so quickly! No, the upstream is just the swarm service name.
It works fine when I scale up and down as the containers move around the swarm. The problems start only when I update the service. |
Thanks for the extra info @bryanrite. FWIW, I know the networking team is still looking into some issues in this area, and it's likely there are some duplicate issues, but sometimes it's difficult to determine whether it's a duplicate or something in a specific setup, as they all may result in the same behavior. |
@thaJeztah No problem. Let me know if there is anything else I can provide to help diagnose as we're currently blocked by this. Thanks! |
@thaJeztah @mrjana In a similar (or identical?) issue I just added some outputs of my iptables and netstat on a cluster. Maybe this can help? #26563 (comment) |
@bryanrite @thaJeztah Regarding nginx: I think it's really a source of confusion for many. The free nginx does indeed resolve DNS on startup (and exits if a specified host name is not available). But as I understand it, the internal DNS always returns the same IP for a service no matter how many instances there are, so nginx continues to have the correct IP for some time. My guess is that e.g. updating a service may cause a renewal of the IP address, leaving nginx with an old IP address, and it won't resolve it again without being restarted. The solution would be to use a resolver in nginx's config. |
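For example, something along these lines (a sketch only; the service name `rails`, the port and the file path are placeholders): point nginx at Docker's embedded DNS on 127.0.0.11 and use a variable in proxy_pass so the name is re-resolved at request time instead of being frozen at startup.

```
# write an nginx vhost that re-resolves the swarm service name at runtime
cat > /etc/nginx/conf.d/rails.conf <<'EOF'
server {
    listen 80;
    resolver 127.0.0.11 valid=10s;      # Docker's embedded DNS
    location / {
        set $backend http://rails:3000; # using a variable forces re-resolution
        proxy_pass $backend;
    }
}
EOF
```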
@michaelkrog correct, services have a VIP (Virtual IP) by default, and that shouldn't change. However, if a service is created with --endpoint-mode dnsrr, DNS returns the individual task IPs instead of a single VIP, and those do change as containers are replaced. |
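For comparison, the two modes are selected at service-creation time (names are placeholders):

```
# default VIP mode: the service name resolves to one stable virtual IP
docker service create --name web --network appnet nginx

# DNS round-robin mode: the name resolves to the individual task IPs,
# which change whenever tasks are replaced or rescheduled
docker service create --name web-rr --endpoint-mode dnsrr --network appnet nginx
```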
Perhaps we need more documentation about that (also an interesting read is https://github.com/stevvooe/sillyproxy) |
@michaelkrog @thaJeztah Yes, the default nginx using round-robin DNS directly against the swarm's hosts would be a problem. But to confirm: since I am using the swarm's service name, it should be binding to a VIP that doesn't change, with load balancing handled within swarm itself, so from my understanding this should not cause the above issue with nginx? Restarting the nginx container does not resolve the issue, and updating it will actually cause nginx to time out itself (except on the actual host it is running on, as explained above)... so I suspect it is a swarm issue with ingress routing rather than a misconfiguration of nginx. Let me know if there is any more information I can provide! Thanks! |
@bryanrite there are some issues still being investigated with swarm networking, and I'm really no networking "expert", so I'm not sure I can personally help further triage the issue, but @mrjana or @mavenugo should be able to if needed (and will know better what to ask for). Thanks for your offer to help! |
@bryanrite You are most likely having a gossip failure problem. You can confirm this by checking your daemon logs. When gossip fails there is a bug which makes the recovery problematic; moby/libnetwork#1446 has a fix for this. But in general, if you are on DO, I would avoid using the public IP to form the swarm. If raft and gossip are running on the public network, there is either network congestion or some kind of network throttling going on which causes this flapping. Instead you can enable a private network and form the swarm using the private network. This may help, although I can't say for sure. |
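On a systemd-based host, checking the daemon logs for gossip trouble could look like this (a sketch; the exact log strings differ between versions, but the gossip layer is built on memberlist):

```
journalctl -u docker.service --since "2 hours ago" | grep -iE 'memberlist|gossip|heartbeat'
```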
@mrjana Thanks for the update! I have been following your discussion on #26563 ... I have tried running on both private and public networks but get the same result. Unlike @ascora, I am not load testing or hammering the swarm in any way, so maybe DO is throttling the gossip traffic for some reason? I don't remember offhand if the docker daemon logs had any gossip issues, but I will take a look again and see if I can run it on the latest master with your libnetwork fix. Thanks! |
@bryanrite @mrjana Just to clarify: I can reproduce the problem quickly with load testing, but it originally appeared - and continues to appear - on our normal production servers as well. It only takes longer to appear if I don't do load testing. Also, for the load testing I used DO, but our production servers are not at DO, so I don't think that DO is throttling the traffic. |
@bryanrite Thanks, and please let us know what you find from the logs. @ascora If the problem is still happening on your normal production servers, is it possible to get daemon logs from there, just to make sure you are seeing the same problem? Originally we both thought it was because of some node restart, and if that is what is still triggering the issue on your normal production servers, that is a different issue (which is already fixed in docker/docker master, btw). Just want to make sure we are not missing any other issues in the mix. |
@mrjana Would it be useful if I gave you access to a small cluster on DO that is exhibiting the problem? I've been able to reproduce it pretty consistently -- although like @ascora mentioned, sometimes it takes a few attempts. If so, I can set up something for you in a day or so -- I have a deadline I have to meet first 😄 |
@mrjana I re-created the swarm on them since the 15th, but I have attached the (anonymized) logs from the 16th. I created a new service on the 16th and noticed yesterday that it was not reachable from all IPs. However, I don't see the ping problem in the logs from the 16th. I also have the logs from the 15th and can send them as well; indeed, they contain the ping message. But on the 15th I tested so many things that I don't know whether this was related to a reboot or not. |
@bryanrite Sure, please provide access, as that would confirm (or not) whether this issue is related to gossip failure, so we can tag this issue accordingly. |
@ascora If what you provided here are the logs from the 16th, these also show gossip failure, and that is still your problem, although in this case it failed faster with a UDP ping retry before TCP pings could time out. But in the end the effect is the same. You can also see from the logs (from x1.manager.txt) that at the same time gossip is failing, raft also underwent leader re-election and the GRPC session towards workers also timed out. All of these processes inside docker are completely independent, but they all see the same problem, which indicates some congestion at some level below these processes. |
I am going to close this as a dupe of #26563 since the symptoms are similar. Feel free to reopen if you think this is a different bug. |
Output of docker version:

Output of docker info:

Additional environment details (AWS, VirtualBox, physical, etc.):
Docker swarm nodes are 5 KVM guests that connect to the Linux bridge
Docker service tasks:
Steps to reproduce the issue:
1 Set up the swarm with 5 nodes
2 Run docker service create --replicas 5 --publish 22 --name httpclient liguangcheng/ubuntu-16.04-x86_64-apache2-benchmark-new to deploy the service with 5 containers
3 Attach to the containers using docker exec -it <container_id> /bin/bash
4 Run fping on one container to test the network connectivity to the other containers
The IP addresses of these containers are 10.255.0.4, 10.255.0.5, 10.255.0.6, 10.255.0.7 and 10.255.0.10; the fping results show that 10.255.0.4 and 10.255.0.7 cannot reach any other container.
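The fping invocation is of this form (run inside one of the containers; the image used here apparently ships fping):

```
fping 10.255.0.4 10.255.0.5 10.255.0.6 10.255.0.7 10.255.0.10
```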
Describe the results you received:
Some containers on the overlay network could not connect to the other containers that are on the same overlay network.
Describe the results you expected:
All the containers on the overlay network should be reachable.
Additional information you deem important (e.g. issue happens only occasionally):