
Docker 1.12 RC3: network connectivity problem between the containers in a service #24496

Closed
ligc opened this issue Jul 11, 2016 · 49 comments
Labels: area/networking, kind/bug, priority/P1, version/1.12

Comments

@ligc

ligc commented Jul 11, 2016

Output of docker version:

root@c910f04x19k03:~# docker version
Client:
 Version:      1.12.0-rc3
 API version:  1.24
 Go version:   go1.6.2
 Git commit:   91e29e8
 Built:        Sat Jul  2 00:38:44 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.0-rc3
 API version:  1.24
 Go version:   go1.6.2
 Git commit:   91e29e8
 Built:        Sat Jul  2 00:38:44 2016
 OS/Arch:      linux/amd64
root@c910f04x19k03:~# 

Output of docker info:

root@c910f04x19k03:~# docker info
Containers: 1
 Running: 1
 Paused: 0
 Stopped: 0
Images: 5
Server Version: 1.12.0-rc3
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 21
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: null host overlay bridge
Swarm: active
 NodeID: 07wntcacagsx697lbxwglgabu
 IsManager: Yes
 Managers: 1
 Nodes: 5
 CACertHash: sha256:e6e343bf771b0d6ee561deea4effa386d3c4f73208a091f782dd6bc526fbd0fe
Runtimes: runc
Default Runtime: runc
Security Options: apparmor seccomp
Kernel Version: 4.4.0-22-generic
Operating System: Ubuntu 16.04 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 3.859 GiB
Name: c910f04x19k03
ID: HF6O:3RRR:UUWB:WRJE:ZEB2:E7JN:WFRE:MEGC:UKNQ:YA55:L5LY:XG75
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Insecure Registries:
 127.0.0.0/8
root@c910f04x19k03:~# 

Additional environment details (AWS, VirtualBox, physical, etc.):
The Docker swarm nodes are 5 KVM guests connected to a Linux bridge.

root@c910f04x19k03:~# docker node ls
ID                           HOSTNAME       MEMBERSHIP  STATUS  AVAILABILITY  MANAGER STATUS
07wntcacagsx697lbxwglgabu *  c910f04x19k03  Accepted    Ready   Active        Leader
4l4154pxnax251zxhq168ma9t    c910f04x19k06  Accepted    Ready   Active        
4mf6v47xxg5k1ife7j45tx4bx    c910f04x19k07  Accepted    Ready   Active        
8z8ex1v57urkh0jsmbbmlhvgw    c910f04x19k04  Accepted    Ready   Active        
ewv3w07kjncrbi6zmgk1ywa4b    c910f04x19k05  Accepted    Ready   Active        
root@c910f04x19k03:~# 

Docker service tasks:

root@c910f04x19k03:~# docker service tasks httpclient
ID                         NAME          SERVICE     IMAGE                                                   LAST STATE          DESIRED STATE  NODE
5b5aj5a4mtklxodmyje0moypj  httpclient.1  httpclient  liguangcheng/ubuntu-16.04-x86_64-apache2-benchmark-new  Running 12 minutes  Running        c910f04x19k03
7ngcg71fi4tji0tuyia150o19  httpclient.2  httpclient  liguangcheng/ubuntu-16.04-x86_64-apache2-benchmark-new  Running 12 minutes  Running        c910f04x19k07
am57ovhrxatzdbvesbldvtdym  httpclient.3  httpclient  liguangcheng/ubuntu-16.04-x86_64-apache2-benchmark-new  Running 12 minutes  Running        c910f04x19k05
1ov8zznmytamghkg88bm9a6e2  httpclient.4  httpclient  liguangcheng/ubuntu-16.04-x86_64-apache2-benchmark-new  Running 12 minutes  Running        c910f04x19k06
44rag1o41yeyj877auf6y5y30  httpclient.5  httpclient  liguangcheng/ubuntu-16.04-x86_64-apache2-benchmark-new  Running 12 minutes  Running        c910f04x19k04
root@c910f04x19k03:~# 

Steps to reproduce the issue:

1 Set up the swarm with 5 nodes

root@c910f04x19k03:~# docker node ls
ID                           HOSTNAME       MEMBERSHIP  STATUS  AVAILABILITY  MANAGER STATUS
07wntcacagsx697lbxwglgabu *  c910f04x19k03  Accepted    Ready   Active        Leader
4l4154pxnax251zxhq168ma9t    c910f04x19k06  Accepted    Ready   Active        
4mf6v47xxg5k1ife7j45tx4bx    c910f04x19k07  Accepted    Ready   Active        
8z8ex1v57urkh0jsmbbmlhvgw    c910f04x19k04  Accepted    Ready   Active        
ewv3w07kjncrbi6zmgk1ywa4b    c910f04x19k05  Accepted    Ready   Active        
root@c910f04x19k03:~# 

2 Run docker service create --replicas 5 --publish 22 --name httpclient liguangcheng/ubuntu-16.04-x86_64-apache2-benchmark-new to deploy the service with 5 containers

root@c910f04x19k03:~# docker service create --replicas 5  --publish 22 --name httpclient liguangcheng/ubuntu-16.04-x86_64-apache2-benchmark-new
40b52xd74ds9emyginbusuq9r
root@c910f04x19k03:~#
root@c910f04x19k03:~# docker service tasks httpclient
ID                         NAME          SERVICE     IMAGE                                                   LAST STATE          DESIRED STATE  NODE
5b5aj5a4mtklxodmyje0moypj  httpclient.1  httpclient  liguangcheng/ubuntu-16.04-x86_64-apache2-benchmark-new  Running 25 minutes  Running        c910f04x19k03
7ngcg71fi4tji0tuyia150o19  httpclient.2  httpclient  liguangcheng/ubuntu-16.04-x86_64-apache2-benchmark-new  Running 25 minutes  Running        c910f04x19k07
am57ovhrxatzdbvesbldvtdym  httpclient.3  httpclient  liguangcheng/ubuntu-16.04-x86_64-apache2-benchmark-new  Running 25 minutes  Running        c910f04x19k05
1ov8zznmytamghkg88bm9a6e2  httpclient.4  httpclient  liguangcheng/ubuntu-16.04-x86_64-apache2-benchmark-new  Running 25 minutes  Running        c910f04x19k06
44rag1o41yeyj877auf6y5y30  httpclient.5  httpclient  liguangcheng/ubuntu-16.04-x86_64-apache2-benchmark-new  Running 25 minutes  Running        c910f04x19k04
root@c910f04x19k03:~# 

3 Attach to the containers using docker exec -it <container_id> /bin/bash

4 Run fping on one container to test the network connectivity to the other containers
The IP addresses of these containers are 10.255.0.4, 10.255.0.5, 10.255.0.6, 10.255.0.7, and 10.255.0.10; the fping results show that 10.255.0.4 and 10.255.0.7 cannot be reached from any other container.

root@0ba8a9bd2ce1:/# fping 10.255.0.4 10.255.0.5 10.255.0.6 10.255.0.7 10.255.0.10
10.255.0.5 is alive
10.255.0.6 is alive
10.255.0.10 is alive
10.255.0.4 is unreachable
10.255.0.7 is unreachable
root@0ba8a9bd2ce1:/# 

Describe the results you received:
Some containers on the overlay network could not connect to the other containers that are on the same overlay network.

Describe the results you expected:
All the containers on the overlay network should be reachable.

Additional information you deem important (e.g. issue happens only occasionally):

@cpuguy83 cpuguy83 added this to the 1.12.0 milestone Jul 11, 2016
@cpuguy83
Member

ping @mavenugo

@ligc
Author

ligc commented Jul 12, 2016

@mavenugo could you please help take a look? My work is being blocked by this issue, and my colleague is running into this same problem. Thank you.

@mavenugo
Contributor

@ligc sure. Will take a look.

@ligc
Author

ligc commented Jul 12, 2016

In my cluster, the docker hosts use the subnet 10.0.0.0/8, which overlaps the docker swarm ingress network 10.255.0.0/16. Could this be a problem at all?
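
For reference, one way to compare the ingress subnet with the host routes is something like the following (a rough sketch; the format string assumes the usual IPAM fields in docker network inspect):

# show the subnet(s) of the swarm ingress network (run on a manager)
docker network inspect ingress -f '{{range .IPAM.Config}}{{.Subnet}} {{end}}'

# compare with the host's own routes to spot an overlap
ip route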

@mavenugo
Contributor

@ligc an overlapping subnet could be a problem for kernels < 3.16, but you are using kernel 4.x (I assume all 5 of your nodes run the same kernel), so that should not be the cause.

Can you please let us know how (with the exact command) you initialized your swarm?

@ligc
Author

ligc commented Jul 12, 2016

Hi,

The 5 nodes are all running Ubuntu 16.04 with kernel 4.4.0-22

The command I used to init the swarm:

docker swarm init --listen-addr 10.4.19.3:2377

Then I ran the following join command on each of the 4 worker nodes to join the swarm:

docker swarm join 10.4.19.3:2377

@mavenugo
Contributor

@ligc thanks, just curious: the built-in IPAM hands out IP addresses in sequence. Do you know the reason for the missing .8 and .9 addresses? I'm trying to understand this scenario to make sense of the reproduction steps (such as scaling and/or introducing more services in the network).

@ligc
Author

ligc commented Jul 12, 2016

Yes, I did run scale-out and scale-in several times, and there was another service in the swarm; I think the missing .8 and .9 addresses were used by that other service.

@ligc
Author

ligc commented Jul 12, 2016

I just deleted the service and re-deployed it with the command "docker service create --replicas 5 --publish 22 --name httpclient liguangcheng/ubuntu-16.04-x86_64-apache2-benchmark-new", but the containers are still assigned the same addresses 10.255.0.4, 10.255.0.5, 10.255.0.6, 10.255.0.7, and 10.255.0.10. Could this be an indication of the cause of the problem?
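
For what it's worth, a quick way to see which container holds each ingress address on a given node is something like this (a sketch; the template fields assume the standard docker network inspect output):

# run on each node: list the containers attached to the ingress network and their addresses
docker network inspect ingress -f '{{range .Containers}}{{.Name}} {{.IPv4Address}}{{printf "\n"}}{{end}}'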

@mavenugo
Contributor

@ligc reusing the same IP addresses (after deleting the existing service) is by design. I think .8 and .9 are used by containers of other services in the network.

Do you still see the connectivity issues even after removing and re-deploying the service ?

@ligc
Author

ligc commented Jul 12, 2016

@mavenugo after removing and re-deploying the service, there are still 2 IP addresses that are not reachable, but they are different ones this time.

root@f85e48d0a79b:/# fping 10.255.0.4 10.255.0.5 10.255.0.6 10.255.0.7 10.255.0.10
10.255.0.4 is alive
10.255.0.7 is alive
10.255.0.10 is alive
10.255.0.5 is unreachable
10.255.0.6 is unreachable
root@f85e48d0a79b:/# 

@ghoranyi

I have a similar issue. I've created 3 nodes with docker-machine on AWS. Enabled swarm mode with docker swarm init and docker swarm join .... I've created an overlay network and a service, which is connected to that network. If I try to ping one container from another one (in the same service, but on a different node) with the VIP address on the overlay network, I get Destination Host Unreachable.

@mavenugo
Contributor

@ghoranyi @ligc I cannot think of anything obvious other than security groups affecting the control/data plane. Can you please open up the appropriate ports (4789/udp, 7946/udp, 7946/tcp)?
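
On a plain Linux host, opening those ports with iptables would look roughly like this (a sketch; adjust to whatever firewall tooling the hosts actually use):

# run on every node: allow swarm control- and data-plane traffic
iptables -A INPUT -p tcp --dport 7946 -j ACCEPT   # cluster gossip (TCP)
iptables -A INPUT -p udp --dport 7946 -j ACCEPT   # cluster gossip (UDP)
iptables -A INPUT -p udp --dport 4789 -j ACCEPT   # VXLAN overlay data plane
iptables -A INPUT -p tcp --dport 2377 -j ACCEPT   # swarm management (managers only)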

@ligc
Author

ligc commented Jul 12, 2016

@mavenugo Thanks. Could you please show us the exact commands to run, and on which nodes, to check and open up those ports?

@mavenugo
Contributor

@ligc
Author

ligc commented Jul 12, 2016

@mavenugo Hi, I am not using the Amazon cloud; the nodes are KVM guests on my local server, so I think security groups do not apply to my configuration?

@ghoranyi

ghoranyi commented Jul 12, 2016

I opened up the security groups to allow any kind of traffic coming from the other hosts, and it still doesn't work. Also, I found this issue, which I guess is related: #24377

@mavenugo
Contributor

@ligc yes, the security-groups config above is for AWS users (which @ghoranyi is).
In your case, can you investigate further and narrow down whether the issue is seen only on particular nodes, or pin down the exact reproduction steps that made you notice it? For example (a command sketch follows this list):

  • create a new overlay network (docker network create -d overlay test)
  • launch services (with appropriate replicas)
  • capture the docker network inspect test output on each of the nodes
  • perform the ping operation between these containers
  • scale up / scale down and repeat steps 3 and 4 until you hit the issue
    (and confirm whether the issue is seen on particular nodes and identify what is different about those nodes)
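
A rough command sketch of the steps above (network and service names are just examples):

# create a test overlay network and a service on it
docker network create -d overlay test
docker service create --name web --replicas 5 --network test liguangcheng/ubuntu-16.04-x86_64-apache2-benchmark-new

# on each node: capture the network state
docker network inspect test

# from inside one of the containers: ping the other task IPs
docker exec -it <container_id> fping <ip1> <ip2> <ip3> <ip4>

# scale and repeat the inspect/ping steps
docker service scale web=10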

@mavenugo
Contributor

@ghoranyi I missed your point about pinging the VIP. Yes, pinging the VIP will not work (IPVS doesn't support ICMP). Can you try pinging either tasks.{service-name} or an individual container IP?
Or, if your service exposes a TCP or UDP port, you can try to access the VIP via any L4+ client such as nc, or tools like https://github.com/tyru/srvtools
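
For example, from inside a container attached to the same network (a sketch; it assumes nslookup and nc are installed in the image, and uses the httpclient service from above):

# the VIP drops ICMP, so resolve and ping the individual task IPs instead
nslookup tasks.httpclient
ping -c 3 tasks.httpclient

# or exercise the VIP over TCP on a port the service exposes
nc -zv httpclient 22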

@ligc
Author

ligc commented Jul 13, 2016

@mavenugo I destroyed and re-created the swarm, and deployed the httpclient and httpserver services; the network connectivity between the containers worked perfectly, and scaling up/down did not break the network. Then I rebooted all the nodes in the swarm (including the swarm manager), and the overlay network appears to be messed up after the reboot.

Note: get-ips is a shell script I wrote to get the virtual IP and the container IPs of a service via distributed shell (xdsh).

[root@c910f02c01p02 ipvs]# cat get-ips 
#!/bin/bash
# Usage: ./get-ips <service-name>
# Prints the service's virtual IP, then one line per task:
#   <task-name>.<task-id>  <node>  <container IP on the 10.255.0.0/16 ingress network>

service=$1

# Ask the manager (c910f04x19k03) for the service's virtual IP.
# xdsh prefixes each output line with "hostname:", so the address is field 3.
vip=`xdsh c910f04x19k03 docker service inspect $service | grep "Addr" | awk '{print $3}'`
echo "Virtual IP is: " $vip

# List the service's tasks; with the xdsh hostname prefix, $2 is the task ID,
# $3 is the task name, and $NF is the node the task runs on.
xdsh c910f04x19k03 docker service tasks $service | sed 1d | awk '{print $2,$3,$NF}' > aa.txt

# For each task, exec into its container on the owning node and grab its 10.255.x.x address.
# The container name is "<task-name>.<task-id>".
cat aa.txt | while read line; do
    node=`echo $line | awk '{print $3}'`
    cid=`echo $line | awk '{print $2}'`.`echo $line | awk '{print $1}'`
    echo -n "$cid    "
    echo -n "$node    "
    ip=`xdsh $node docker exec $cid ifconfig | grep "10.255" | awk '{print $3}' | awk -F':' '{print $2}'`
    echo $ip
done
[root@c910f02c01p02 ipvs]#

Before reboot:

[root@c910f02c01p02 ipvs]# ./get-ips httpclient
Virtual IP is:  "10.255.0.8/16"
httpclient.1.eeazk63xx9ifugi0mdvqg5pdh    c910f04x19k03    10.255.0.9
httpclient.2.1v9skx9ex7dztl1brf9et094n    c910f04x19k06    10.255.0.10
httpclient.3.duftc0uu289r3v93ohaosf5r4    c910f04x19k07    10.255.0.11
httpclient.4.ehoky1m7x0fuy7jaxhjlcs7ow    c910f04x19k05    10.255.0.12
httpclient.5.3oq78akvwyi04boh7ml4ht8ut    c910f04x19k04    10.255.0.13
[root@c910f02c01p02 ipvs]# ./get-ips httpserver
Virtual IP is:  "10.255.0.14/16"
httpserver.1.7pcxtsu2fgm3zp1klf4prync1    c910f04x19k06    10.255.0.18
httpserver.2.507jzs3wkrcgrkb7q4y8rn8z5    c910f04x19k07    10.255.0.19
httpserver.3.4q3nnyr28af9ur2hw0mz88spe    c910f04x19k04    10.255.0.15
httpserver.4.axo7fmw004aqsady92ze7ehop    c910f04x19k03    10.255.0.16
httpserver.5.4xba5lkvxidfn0oxk7f90elus    c910f04x19k05    10.255.0.17
[root@c910f02c01p02 ipvs]#

After reboot:

[root@c910f02c01p02 ipvs]# ./get-ips httpserver
Virtual IP is:  "10.255.0.8/16"
httpserver.1.baz74f1t18ngis8vfgirs3sgi    c910f04x19k06    10.255.0.10
httpserver.2.dx9gz9dl90clmk8sgyceuhta8    c910f04x19k05    10.255.0.17
httpserver.3.4kb92su1g2tgvopupeytj8jef    c910f04x19k04    10.255.0.10
httpserver.4.7ko5vnlrmg8cpn8lnkbbby7os    c910f04x19k03    10.255.0.12
httpserver.5.ek1ybhyf96zh412fzwmgjfyh9    c910f04x19k05    10.255.0.9
[root@c910f02c01p02 ipvs]# ./get-ips httpclient
Virtual IP is:  "10.255.0.2/16"
httpclient.1.9r24q0i8q3w8c9xg6em222yfz    c910f04x19k07    10.255.0.16
httpclient.2.1gig6tceslpfzha9bqaw3wubk    c910f04x19k06    10.255.0.14
httpclient.3.9n8w6hrze4o8josyepu95xrk3    c910f04x19k03    10.255.0.11
httpclient.4.cdzjn63iwxyve8dofflxwvjjj    c910f04x19k07    10.255.0.10
httpclient.5.6kj4iisr6me3s05k0c9qn40kz    c910f04x19k04    10.255.0.9
[root@c910f02c01p02 ipvs]# 

We can see that there are several duplicate IP addresses after the reboot (10.255.0.10, for example, is assigned to three different tasks), and the service VIPs have changed.

@ligc
Author

ligc commented Jul 13, 2016

I was able to reproduce this "reboot" problem on both of my Docker swarm clusters. I now think the original problem in this bug might have been caused by a reboot; I did reboot some nodes.

@mavenugo
Contributor

@ligc yes, that seems to be the problem. In fact, I am currently chasing the same issue while debugging #24486, so this is either a dupe of #24486 or related. We will address this issue shortly.

Thanks for the detailed analysis.

@bitsofinfo

Experiencing a similar issue in RC4.

@thaJeztah
Member

@bitsofinfo are you able to test if the current master resolves this for you? You can download static binaries from https://master.dockerproject.org

@stszap

stszap commented Aug 25, 2016

I have the same issue. I used docker-machine with the virtualbox driver. How to recreate:
1 Create 2 VMs

docker-machine create -d virtualbox docker1
docker-machine create -d virtualbox docker2

2 Create a swarm cluster with 2 nodes
On docker1:

docker swarm init --advertise-addr 192.168.99.100

On docker2:

docker swarm join --token SWMTKN-1-05mip8t6db1y8dc58b7wg0c6mweadx8v5w691if7zspii6zclt-c6dr1l6vkim5gn8ukqhq1eq26 192.168.99.100:2377

3 Create an overlay network

docker network create -d overlay testnet

4 Create 2 services

docker service create --name mongo1 --network testnet --constraint "node.hostname == docker1" mongo
docker service create --name mongo2 --network testnet --constraint "node.hostname == docker2" mongo

5 "docker exec -it bash" into containers and test connectivity. both containers can resolve each other’s ip and connect to each other. everything is good.
6 restart manager docker

sudo /etc/init.d/docker stop
sudo /etc/init.d/docker start

7 Test connectivity again (see the sketch below). DNS resolution no longer works across nodes; TCP is also not working now.
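
The connectivity test in steps 5 and 7 looks roughly like this (a sketch; it assumes getent, ping, and the mongo shell are available inside the image):

# from inside the mongo1 container
getent hosts mongo2                                     # DNS: should return the mongo2 VIP
ping -c 3 tasks.mongo2                                  # ICMP against the task IPs (the VIP itself drops ICMP)
mongo --host mongo2 --eval 'db.runCommand({ping: 1})'   # TCP to port 27017 via the VIP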

Recreating the services doesn't help. Recreating the overlay network doesn't help (I also tried a different subnet). Only a full recreation of the whole swarm helps. I also tested on different VMs and on bare metal with the same results.

docker version
Client:
 Version:      1.12.1
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   23cf638
 Built:        Thu Aug 18 17:52:38 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.1
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   23cf638
 Built:        Thu Aug 18 17:52:38 2016
 OS/Arch:      linux/amd64

Please tell me if any other information is needed.

@eb-dteixeira

Also experiencing the exact same issue in the 1.12.1 release. Only recreating the entire cluster solves the issue.

@eb-dteixeira

An additional note: on a 3-manager Swarm (still 1.12.1) the issue manifests itself a little differently (maybe another issue altogether).

If one of the manager machines fails and later is recovered without user action, it will automatically rejoin the Swarm. Other managers will list the recovered node as a manager, and that node will be able to schedule tasks.

However, DNS resolution and published services will frequently be broken on the recovered node. DNS resolution only works for tasks that have at least one scheduled container on the node itself. Published ports stop working altogether, meaning that published services can no longer be accessed on that node.

We have found that the easiest way to solve this issue at the moment is to manually remove the offending node from the Swarm and add it again. This means demoting the manager, removing the node, forcing the manager itself to leave the Swarm, and finally rejoining it. Hope this last part helps as a workaround.
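
In command form, that workaround is roughly the following (a sketch; node names and the join token are placeholders):

# on a healthy manager
docker node demote <bad-node>
docker node rm <bad-node>

# on the affected node itself
docker swarm leave --force
docker swarm join --token <token> <healthy-manager-ip>:2377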

@flovouin

flovouin commented Sep 9, 2016

I'm experiencing the same issue, also using the virtualbox driver and docker 1.12.1.
I needed to manually restart the manager node. Once it restarted I can scale services from it, and it correctly lists networks, nodes, services, etc. However, DNS and the VIPs are broken across nodes. Note that I can still access local containers through the service name and the corresponding VIP, but it will only distribute requests between containers running on the restarted manager node. Running nslookup tasks.<service> shows the corresponding behaviour of only listing local containers.

@bryanrite

bryanrite commented Sep 15, 2016

We're experiencing the same issue on 1.12.1, build 23cf638: a 6-node swarm on Digital Ocean (Debian, kernel 4.6; 3 managers, 3 workers) with an overlay network containing 1 nginx container and several Rails containers. Initially it works great: I can hit the IP of any node, the request goes through the nginx container, and nginx load-balances across the Rails containers. No problems.

Upon upgrading the Rails service with docker service update --with-registry-auth --image our_team/our_image@sha256:1234 web, nginx starts to receive upstream timeouts. I can still hit nginx via any IP in the swarm, but it times out trying to reach the upstream Rails app (all instances of which are booted and listening). All docker inspect output, netstat, iptables, node status, etc. look fine.

Removing and re-creating the nginx service now makes nginx unreachable at any IP except that of the machine its container is running on. Restarting daemons, removing all services and recreating networks, or rebooting machines doesn't help. So far I've had to recreate the swarm from scratch to resolve it.

Quite easy for me to replicate, what can I provide to help diagnose?

Edit: Looks like same issue as #26563

@SvenAbels

I can reproduce the problem (see #26563).

Has anyone found an easier workaround than the one described by @eb-dteixeira? Reboots and server failures can happen at any time, and at the moment the result is that some services become unavailable after a reboot; manual fixing cannot be a good long-term solution...

@thaJeztah
Member

@bryanrite in your case, is nginx using individual containers as its "upstream" (i.e. connecting to the containers, and not the virtual-ip address of the service)? If so, then that's expected. The free version of nginx does a DNS lookup when the server is started, and caches the results until the next restart. If the IP address of a container changes (or, in your case, the container is replaced with a new container when doing a service update), it will keep on trying to use the IP-addresses of the "old" containers. You can find information about that in this blog post: https://tenzer.dk/nginx-with-dynamic-upstreams/ (it's not a bug / issue in Docker, but a deliberate limitation in the free version of nginx)

Possible solutions for that are:

  • restart the nginx proxy after doing a service update (a sketch of this follows below)
  • use the paid-for "nginx plus" version, which does support dynamic DNS updates
  • instead of using individual containers as upstream, use the built-in load balancer of Swarm mode and use the service name as upstream. The service has a "static" (virtual) IP address, so updating the service will not change it. Also, by doing so, you make optimum use of the built-in features of Swarm mode (load balancing)
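
For the first option, a blunt way to bounce the proxy after a backend update might be (a sketch; nginx here is a placeholder for the proxy service name):

# force nginx to re-resolve its upstreams by restarting its tasks
docker service scale nginx=0
docker service scale nginx=1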

@bryanrite

bryanrite commented Sep 15, 2016

@thaJeztah Thanks for responding so quickly!

No, the upstream is just (where web is the docker service name):

upstream rails {
  server web:3000;
}

location / {
  ...
  proxy_pass http://rails;
}

It works fine when I scale up and down as the containers move around the swarm. The problems start only when I run docker service update on the Rails service.

@thaJeztah thaJeztah modified the milestones: 1.12.2, 1.12.0 Sep 15, 2016
@thaJeztah thaJeztah added the kind/bug label Sep 15, 2016
@thaJeztah
Member

Thanks for the extra info @bryanrite. FWIW, I know the networking team is still looking into some issues in this area, and it's likely there are some duplicate issues, but sometimes it's difficult to determine whether something is a duplicate or specific to a particular setup, as they all may result in the same behavior.

@bryanrite

@thaJeztah No problem. Let me know if there is anything else I can provide to help diagnose as we're currently blocked by this. Thanks!

@SvenAbels

@thaJeztah @mrjana In a similar (or identical?) issue I just added some output from iptables and netstat on a cluster. Maybe this can help? #26563 (comment)

@michaelkrog

michaelkrog commented Sep 17, 2016

@bryanrite @thaJeztah Regarding nginx: I think it's really a source of confusion for many. The free nginx does indeed resolve DNS only at startup (and exits if a specified host name is not available). But as I understand it, the internal DNS always returns the same IP for a service no matter how many instances there are, so nginx continues to have the correct IP for some time.

My guess is that, for example, updating a service may cause a renewal of the IP address, leaving nginx with an old IP address - and it won't resolve it again without being restarted.

The solution would be to use a resolver in nginx's config.

@thaJeztah
Member

@michaelkrog correct, services have a VIP (virtual IP) by default, and that shouldn't change. However, if a service is created with dnsrr (DNS round-robin) mode, there is no virtual IP, and resolving the service's name returns the IP addresses of the individual containers backing the service (and those are not fixed).
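
For illustration, the two modes look like this at service-creation time (a sketch; names are placeholders):

# default: the service gets a stable virtual IP (VIP)
docker service create --name web --network testnet nginx

# DNS round-robin: no VIP; the name resolves to the individual task IPs
docker service create --name web-rr --endpoint-mode dnsrr --network testnet nginx

# check which mode a service uses
docker service inspect -f '{{.Spec.EndpointSpec.Mode}}' web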

@thaJeztah
Member

Perhaps we need more documentation about that (also an interesting read is https://github.com/stevvooe/sillyproxy)

@bryanrite

@michaelkrog @thaJeztah Yes, the default nginx doing round-robin DNS directly against individual containers would be a problem. But to confirm: since I am using the swarm's service name, nginx should be binding to a VIP that doesn't change, with load balancing done by swarm itself, so from my understanding this should not cause the nginx issue described above?

Restarting the nginx container does not resolve the issue, and updating it will actually cause nginx itself to time out (except on the actual host it is running on -- as explained above)... so I suspect it is a swarm issue with ingress routing rather than a misconfiguration of nginx.

Let me know if there is any more information I can provide! Thanks!

@thaJeztah
Member

@bryanrite there are some issues still being investigated with swarm networking, and I'm really no networking "expert", so I'm not sure I can personally help triage this further, but @mrjana or @mavenugo should be able to if needed (and they will know better what to ask for). Thanks for your offer to help!

@icecrime icecrime added the priority/P1 label Sep 19, 2016
@mrjana
Contributor

mrjana commented Sep 20, 2016

@bryanrite You are most likely having a gossip failure problem. You can confirm this by checking your daemon logs. When gossip fails there is a bug which makes the recovery problematic. moby/libnetwork#1446 has a fix for this.

But in general, if you are on DO, I would avoid using the public IP to form the swarm. I think that if raft and gossip are running over the public network, there is either network congestion or some kind of network throttling going on which causes this flapping. Instead, you can enable a private network and form the swarm over it. This may help, although I can't say for sure.
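
Concretely, that would look something like this (a sketch; the private address is a placeholder and the log patterns are examples that vary by version):

# check the daemon logs for gossip/heartbeat trouble (systemd hosts)
journalctl -u docker.service --since "1 hour ago" | grep -iE 'memberlist|heartbeat|gossip'

# form the swarm over the private network instead of the public IP
docker swarm init --advertise-addr <private-ip>
docker swarm join --token <token> <private-ip>:2377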

@bryanrite

@mrjana Thanks for the update! I have been following your discussion on #26563 ... I have tried running on both private and public networks but get the same result. Unlike @ascora, I am not load testing or hammering the swarm in any way, so maybe DO is throttling the gossip traffic for some reason? I don't remember offhand whether the docker daemon logs had any gossip issues, but I will take another look, and see if I can run it on the latest master with your libnetwork fix.

Thanks!

@SvenAbels

@bryanrite @mrjana Just to clarify: I can reproduce the problem quickly with load testing, but it originally appeared - and continues to appear - on our normal production servers as well. It only takes longer to appear if I don't do load testing. Also, for the load testing I used DO, but our production servers are not on DO, so I don't think that DO is throttling the traffic.

@mrjana
Contributor

mrjana commented Sep 20, 2016

@bryanrite Thanks and please let us know what you find from the logs.

@ascora If the problem is still happening on your normal production servers, is it possible to get daemon logs from there, just to make sure you are seeing the same problem? Originally we both thought it was caused by a node restart; if that is what is still triggering the issue on your production servers, that is a different issue (which is already fixed in docker/docker master, by the way). I just want to make sure we are not missing any other issues in the mix.

@bryanrite

@mrjana Would it be useful if I gave you access to a small cluster on DO that is exhibiting the problem? I've been able to reproduce it pretty consistently -- although like @ascora mentioned, sometimes it takes a few service updates or some load to kick it off.

If so, I can set up something for you in a day or so -- I have a deadline I have to meet first 😄

@SvenAbels

@mrjana I re-created the swarm on those servers after the 15th, but I have attached the (anonymized) logs from the 16th. I created a new service on the 16th and noticed yesterday that it was not reachable from all IPs. However, I don't see the ping problem in the logs from the 16th.

I also have the logs from the 15th and can send them as well; they do contain the ping message. But on the 15th I tested so many things that I don't know whether this was related to a reboot or not.

x3.manager.txt
x1.manager.txt
x2.manager.txt

@mrjana
Contributor

mrjana commented Sep 20, 2016

@bryanrite Sure, please provide access, as that would confirm (or not) whether this issue is related to gossip failure, so we can tag this issue accordingly.

@mrjana
Contributor

mrjana commented Sep 20, 2016

@ascora If what you provided here are the logs from the 16th, then these also show gossip failure, and that is still your problem, although in this case it failed faster, with the UDP ping retries exhausted before the TCP pings could time out. But in the end the effect is the same. You can also see from the logs (from x1.manager.txt) that at the same time gossip is failing, raft also underwent leader re-election and the GRPC sessions towards the workers also timed out. All of these processes inside Docker are completely independent, but they all see the same problem, which indicates congestion at some level below these processes.

@mrjana
Contributor

mrjana commented Sep 26, 2016

I am going to close this as a dupe of #26563 since the symptoms are similar. Feel free to reopen if you think this is a different bug.

@mrjana mrjana closed this as completed Sep 26, 2016