
Pods are unable to resolve DNS for any Azure services or other external sites. #2971

Closed
svick7859 opened this issue May 15, 2018 · 43 comments

@svick7859

svick7859 commented May 15, 2018

Is this a request for help?:

Yes

Is this an ISSUE or FEATURE REQUEST? (choose one):

Issue

What version of acs-engine?:
1.31.1

Kubernetes

If this is an ISSUE, please:

We've been running a couple of K8s clusters for a couple of months. Last weekend, everything stopped working; specifically, DNS requests were failing. We investigated our network for any surprise changes and nothing had changed in Azure. Our pods are unable to resolve DNS names.

Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)

Kubernetes

What happened:

Pods stopped resolving DNS names for Azure services such as Postgres, API Manager, Redis, MongoDB, Blob Store, etc., as well as some external services such as Auth0. Those same names can be resolved if we test from the nodes on which the pods are running.

What you expected to happen:
We should never experience DNS resolution issues. This was all working a week ago.

How to reproduce it (as minimally and precisely as possible):
I can easily reproduce it from within my pods. Not sure how you would reproduce it if you're not experiencing DNS issues.

Anything else we need to know:

We have tried a bunch of things to resolve it:

  1. Deleted the pods to get new replicas
  2. Rebooted the VMs
  3. Deleted the Kube-system DNS services to get new replicas

Nothing works.

As referenced above, we can telnet anywhere from any of the nodes without issue, but from within the pods it fails.
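
For example, the node-vs-pod difference shows up with a minimal check like this (busybox and example.com are just stand-ins for a real image and one of the failing hostnames):

# from a node (succeeds)
nslookup example.com

# from a throwaway pod on the same cluster (fails for us)
kubectl run -it --rm dns-test --image=busybox --restart=Never -- nslookup example.com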

Server:
Version: 1.13.1
API version: 1.26 (minimum version 1.12)
Go version: go1.7.5
Git commit: 092cba3
Built: Wed Feb 8 06:50:14 2017
OS/Arch: linux/amd64
Experimental: false

Master - resolv.conf
nameserver 168.63.129.16
search n2wydozmtkcurochzy4mep2cdc.ax.internal.cloudapp.net

@jackfrancis
Member

Hi @svick7859, thanks for the detail. What does kube-dns in the kube-system namespace look like? Is it Running? Anything interesting in the logs?

@svick7859
Author

@jackfrancis Yes, it's running. There are two instances. Which logs? Here is the dnsmasq log:

I0515 15:59:19.514245 1 main.go:76] opts: {{/usr/sbin/dnsmasq [-k --cache-size=1000 --no-resolv --server=127.0.0.1#10053 --server=/in-addr.arpa/127.0.0.1#10053 --server=/ip6.arpa/127.0.0.1#10053 --log-facility=-] true} /kube-dns-config 10000000000}
I0515 15:59:19.514462 1 nanny.go:94] Starting dnsmasq [-k --cache-size=1000 --no-resolv --server=127.0.0.1#10053 --server=/in-addr.arpa/127.0.0.1#10053 --server=/ip6.arpa/127.0.0.1#10053 --log-facility=-]
I0515 15:59:20.485408 1 nanny.go:119]
W0515 15:59:20.485484 1 nanny.go:120] Got EOF from stdout
I0515 15:59:20.485408 1 nanny.go:116] dnsmasq[11]: started, version 2.78 cachesize 1000
I0515 15:59:20.485525 1 nanny.go:116] dnsmasq[11]: compile time options: IPv6 GNU-getopt no-DBus no-i18n no-IDN DHCP DHCPv6 no-Lua TFTP no-conntrack ipset auth no-DNSSEC loop-detect inotify
I0515 15:59:20.485637 1 nanny.go:116] dnsmasq[11]: using nameserver 127.0.0.1#10053 for domain ip6.arpa
I0515 15:59:20.485658 1 nanny.go:116] dnsmasq[11]: using nameserver 127.0.0.1#10053 for domain in-addr.arpa
I0515 15:59:20.485677 1 nanny.go:116] dnsmasq[11]: using nameserver 127.0.0.1#10053
I0515 15:59:20.485744 1 nanny.go:116] dnsmasq[11]: read /etc/hosts - 7 addresses
I0516 02:15:52.475561 1 nanny.go:116] dnsmasq[11]: nameserver 127.0.0.1 refused to do a recursive query
I0516 12:10:02.816321 1 nanny.go:116] dnsmasq[11]: nameserver 127.0.0.1 refused to do a recursive query
I0516 12:10:02.826237 1 nanny.go:116] dnsmasq[11]: nameserver 127.0.0.1 refused to do a recursive query

@jackfrancis
Member

kubedns would be great as well

@jackfrancis
Member

cf. #2880

@svick7859
Author

Getting them now

@svick7859
Author

@jackfrancis

I0515 15:59:19.514245 1 main.go:76] opts: {{/usr/sbin/dnsmasq [-k --cache-size=1000 --no-resolv --server=127.0.0.1#10053 --server=/in-addr.arpa/127.0.0.1#10053 --server=/ip6.arpa/127.0.0.1#10053 --log-facility=-] true} /kube-dns-config 10000000000}
I0515 15:59:19.514462 1 nanny.go:94] Starting dnsmasq [-k --cache-size=1000 --no-resolv --server=127.0.0.1#10053 --server=/in-addr.arpa/127.0.0.1#10053 --server=/ip6.arpa/127.0.0.1#10053 --log-facility=-]
I0515 15:59:20.485408 1 nanny.go:119]
W0515 15:59:20.485484 1 nanny.go:120] Got EOF from stdout
I0515 15:59:20.485408 1 nanny.go:116] dnsmasq[11]: started, version 2.78 cachesize 1000
I0515 15:59:20.485525 1 nanny.go:116] dnsmasq[11]: compile time options: IPv6 GNU-getopt no-DBus no-i18n no-IDN DHCP DHCPv6 no-Lua TFTP no-conntrack ipset auth no-DNSSEC loop-detect inotify
I0515 15:59:20.485637 1 nanny.go:116] dnsmasq[11]: using nameserver 127.0.0.1#10053 for domain ip6.arpa
I0515 15:59:20.485658 1 nanny.go:116] dnsmasq[11]: using nameserver 127.0.0.1#10053 for domain in-addr.arpa
I0515 15:59:20.485677 1 nanny.go:116] dnsmasq[11]: using nameserver 127.0.0.1#10053
I0515 15:59:20.485744 1 nanny.go:116] dnsmasq[11]: read /etc/hosts - 7 addresses
I0516 02:15:52.475561 1 nanny.go:116] dnsmasq[11]: nameserver 127.0.0.1 refused to do a recursive query
I0516 12:10:02.816321 1 nanny.go:116] dnsmasq[11]: nameserver 127.0.0.1 refused to do a recursive query
I0516 12:10:02.826237 1 nanny.go:116] dnsmasq[11]: nameserver 127.0.0.1 refused to do a recursive query
root@k8s-master-13529063-0:/cnh-evo-baseline# kubectl logs -f --namespace=kube-system kube-dns-v20-5d9fdc7448-rpdl5
Error from server (BadRequest): a container name must be specified for pod kube-dns-v20-5d9fdc7448-rpdl5, choose one of: [kubedns dnsmasq healthz]
root@k8s-master-13529063-0:~/cnh-evo-baseline# kubectl logs --namespace=kube-system kube-dns-v20-5d9fdc7448-rpdl5 kubedns
I0515 15:59:19.031552 1 dns.go:48] version: 1.14.8
I0515 15:59:19.032672 1 server.go:71] Using configuration read from directory: /kube-dns-config with period 10s
I0515 15:59:19.032740 1 server.go:119] FLAG: --alsologtostderr="false"
I0515 15:59:19.032804 1 server.go:119] FLAG: --config-dir="/kube-dns-config"
I0515 15:59:19.032838 1 server.go:119] FLAG: --config-map=""
I0515 15:59:19.032856 1 server.go:119] FLAG: --config-map-namespace="kube-system"
I0515 15:59:19.032875 1 server.go:119] FLAG: --config-period="10s"
I0515 15:59:19.032979 1 server.go:119] FLAG: --dns-bind-address="0.0.0.0"
I0515 15:59:19.033015 1 server.go:119] FLAG: --dns-port="10053"
I0515 15:59:19.033038 1 server.go:119] FLAG: --domain="cluster.local."
I0515 15:59:19.033060 1 server.go:119] FLAG: --federations=""
I0515 15:59:19.033127 1 server.go:119] FLAG: --healthz-port="8081"
I0515 15:59:19.033148 1 server.go:119] FLAG: --initial-sync-timeout="1m0s"
I0515 15:59:19.033169 1 server.go:119] FLAG: --kube-master-url=""
I0515 15:59:19.033250 1 server.go:119] FLAG: --kubecfg-file=""
I0515 15:59:19.033270 1 server.go:119] FLAG: --log-backtrace-at=":0"
I0515 15:59:19.033344 1 server.go:119] FLAG: --log-dir=""
I0515 15:59:19.033387 1 server.go:119] FLAG: --log-flush-frequency="5s"
I0515 15:59:19.033419 1 server.go:119] FLAG: --logtostderr="true"
I0515 15:59:19.033467 1 server.go:119] FLAG: --nameservers=""
I0515 15:59:19.033497 1 server.go:119] FLAG: --stderrthreshold="2"
I0515 15:59:19.033516 1 server.go:119] FLAG: --v="2"
I0515 15:59:19.033535 1 server.go:119] FLAG: --version="false"
I0515 15:59:19.033589 1 server.go:119] FLAG: --vmodule=""
I0515 15:59:19.033797 1 server.go:201] Starting SkyDNS server (0.0.0.0:10053)
I0515 15:59:19.033908 1 server.go:222] Skydns metrics not enabled
I0515 15:59:19.033988 1 dns.go:146] Starting endpointsController
I0515 15:59:19.034003 1 dns.go:149] Starting serviceController
I0515 15:59:19.034063 1 logs.go:41] skydns: ready for queries on cluster.local. for tcp://0.0.0.0:10053 [rcache 0]
I0515 15:59:19.034082 1 logs.go:41] skydns: ready for queries on cluster.local. for udp://0.0.0.0:10053 [rcache 0]
I0515 15:59:19.534770 1 dns.go:170] Initialized services and endpoints from apiserver
I0515 15:59:19.534802 1 server.go:135] Setting up Healthz Handler (/readiness)
I0515 15:59:19.534809 1 server.go:140] Setting up cache handler (/cache)
I0515 15:59:19.534852 1 server.go:126] Status HTTP port 8081

@jackfrancis
Member

See temporary workaround here for another customer, if you're able to punt on RCA in order to restore service:

#2880 (comment)
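
For reference, the stubDomains/upstreamNameservers mechanism referred to there is the kube-dns ConfigMap; a minimal upstream-only sketch (the Google nameserver values are just an example) looks like:

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-dns
  namespace: kube-system
data:
  upstreamNameservers: |
    ["8.8.8.8", "8.8.4.4"]
EOF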

@svick7859
Author

Let me try that. Stay tuned . . .

@svick7859
Author

@jackfrancis are you talking about grabbing the newer versions of kubedns?

@jackfrancis
Member

@svick7859 I'm referring to the stubDomains workaround

@svick7859
Author

@jackfrancis We tried it and perhaps we are doing something wrong. nslookup resolves on the 1st attempt (the name server is the default kube-dns, 10.0.0.10; should that be the case if we're using upstream 8.8.8.8?), but subsequent attempts fail. This is similar to what I saw without using the stubDomain, so it seems like the stub (we're only using the upstream setting to point to Google) isn't having an effect.

Hope that wasn't too confusing! My head is spinning with this issue.

@jackfrancis
Member

Do you have the original api model used to deploy this cluster (or clusters)? Could you quickly use it to deploy a new (say, 2-node) cluster to a new resource group and validate that the same symptoms are not present? That would suggest that whatever upstream DNS breakage exists is not general but is perhaps limited to the source NAT(s) your broken cluster(s) originate from.
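
As a sketch, assuming the api model JSON is at hand (the file name, resource group, and location are placeholders, and <dnsPrefix> is whatever acs-engine generates), that test deployment can be done with acs-engine plus the Azure CLI:

# generate ARM templates from the existing api model (with the agent count reduced to 2)
acs-engine generate kubernetes.json

# deploy into a fresh resource group
az group create -n dns-repro-test -l westeurope
az group deployment create -g dns-repro-test \
  --template-file _output/<dnsPrefix>/azuredeploy.json \
  --parameters _output/<dnsPrefix>/azuredeploy.parameters.json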

@svick7859
Author

I have another environment experiencing the same issue. Stupid Q, but could there be something between 10.0.0.10 and 8.8.8.8 that is broken? I only ask because even though we added 8..., it still fails.

@svick7859
Author

Should the kubedns config look different after applying the changes to add the upstream?

I0516 19:03:25.264502 1 server.go:71] Using configuration read from directory: /kube-dns-config with period 10s
I0516 19:03:25.264543 1 server.go:119] FLAG: --alsologtostderr="false"
I0516 19:03:25.264558 1 server.go:119] FLAG: --config-dir="/kube-dns-config"
I0516 19:03:25.264586 1 server.go:119] FLAG: --config-map=""
I0516 19:03:25.264591 1 server.go:119] FLAG: --config-map-namespace="kube-system"
I0516 19:03:25.264596 1 server.go:119] FLAG: --config-period="10s"
I0516 19:03:25.264602 1 server.go:119] FLAG: --dns-bind-address="0.0.0.0"
I0516 19:03:25.264607 1 server.go:119] FLAG: --dns-port="10053"
I0516 19:03:25.264613 1 server.go:119] FLAG: --domain="cluster.local."
I0516 19:03:25.264621 1 server.go:119] FLAG: --federations=""
I0516 19:03:25.264652 1 server.go:119] FLAG: --healthz-port="8081"
I0516 19:03:25.264665 1 server.go:119] FLAG: --initial-sync-timeout="1m0s"
I0516 19:03:25.264671 1 server.go:119] FLAG: --kube-master-url=""
I0516 19:03:25.264677 1 server.go:119] FLAG: --kubecfg-file=""
I0516 19:03:25.264681 1 server.go:119] FLAG: --log-backtrace-at=":0"
I0516 19:03:25.264689 1 server.go:119] FLAG: --log-dir=""
I0516 19:03:25.264694 1 server.go:119] FLAG: --log-flush-frequency="5s"
I0516 19:03:25.264699 1 server.go:119] FLAG: --logtostderr="true"
I0516 19:03:25.264711 1 server.go:119] FLAG: --nameservers=""
I0516 19:03:25.264716 1 server.go:119] FLAG: --stderrthreshold="2"
I0516 19:03:25.264740 1 server.go:119] FLAG: --v="2"
I0516 19:03:25.264745 1 server.go:119] FLAG: --version="false"
I0516 19:03:25.264752 1 server.go:119] FLAG: --vmodule=""
I0516 19:03:25.264794 1 server.go:201] Starting SkyDNS server (0.0.0.0:10053)

@svick7859
Author

svick7859 commented May 16, 2018

Not sure if this helps, but using the netshoot tool (https://github.com/nicolaka/netshoot) I ran the test on the pod and on the master to display the DNS results.

I removed our service names and IPs.
NSLOOKUP from container

20:05:56.329078 IP hidden.57839 > 168.63.129.16.53: 64327+ AAAA? localhost.hiddeninternal.cloudapp.net. (79)
20:05:56.329094 IP hidden.46867 > 168.63.129.16.53: 18321+ A? localhost.hiddeninternal.cloudapp.net. (79)
20:05:56.331704 IP 168.63.129.16.53 > hidden.57839: 64327 NXDomain* 0/1/0 (158)
20:05:56.331841 IP 168.63.129.16.53 > hidden.46867: 18321 NXDomain* 0/1/0 (158)
20:05:56.332424 IP hidden.39784 > 168.63.129.16.53: 63068+ A? localhost. (27)
20:05:56.332859 IP hidden.44524 > 168.63.129.16.53: 21984+ AAAA? localhost. (27)
20:05:56.334913 IP 168.63.129.16.53 > hidden.39784: 63068 1/0/0 A 127.0.0.1 (43)
20:05:56.335089 IP 168.63.129.16.53 > hidden.44524: 21984 0/1/0 (102)
20:05:56.451932 IP hidden.33048 > 168.63.129.16.53: 25431+ AAAA? localhost.hiddeninternal.cloudapp.net. (79)
20:05:56.451933 IP hidden.41829 > 168.63.129.16.53: 46610+ A? localhost.hiddeninternal.cloudapp.net. (79)
20:05:56.453624 IP 168.63.129.16.53 > hidden.41829: 46610 NXDomain* 0/1/0 (158)
20:05:56.453780 IP 168.63.129.16.53 > hidden.33048: 25431 NXDomain* 0/1/0 (158)
20:05:56.454011 IP hidden.48403 > 168.63.129.16.53: 46391+ AAAA? localhost. (27)
20:05:56.454325 IP hidden.59040 > 168.63.129.16.53: 2168+ A? localhost. (27)
20:05:56.455805 IP 168.63.129.16.53 > hidden.48403: 46391 0/1/0 (102)
20:05:56.455970 IP 168.63.129.16.53 > hidden.59040: 2168 1/0/0 A 127.0.0.1 (43)
20:05:57.381365 IP hidden.41713 > 168.63.129.16.53: 45755+ A? localhost.hiddeninternal.cloudapp.net. (79)
20:05:57.381366 IP hidden.58635 > 168.63.129.16.53: 24526+ AAAA? localhost.hiddeninternal.cloudapp.net. (79)
20:05:57.383325 IP 168.63.129.16.53 > hidden.41713: 45755 NXDomain* 0/1/0 (158)
20:05:57.386363 IP 168.63.129.16.53 > hidden.58635: 24526 NXDomain* 0/1/0 (158)
20:05:57.386727 IP hidden.42573 > 168.63.129.16.53: 59180+ AAAA? localhost. (27)
20:05:57.387156 IP hidden.43274 > 168.63.129.16.53: 4704+ A? localhost. (27)
20:05:57.388001 IP 168.63.129.16.53 > hidden.42573: 59180 0/1/0 (102)
20:05:57.388151 IP 168.63.129.16.53 > hidden.43274: 4704 1/0/0 A 127.0.0.1 (43)
20:05:58.289632 IP hidden.41072 > 168.63.129.16.53: 56203+ A? localhost.hiddeninternal.cloudapp.net. (79)
20:05:58.289675 IP hidden.38089 > 168.63.129.16.53: 19754+ AAAA? localhost.hiddeninternal.cloudapp.net. (79)
20:05:58.291409 IP 168.63.129.16.53 > hidden.41072: 56203 NXDomain* 0/1/0 (158)
20:05:58.291595 IP 168.63.129.16.53 > hidden.38089: 19754 NXDomain* 0/1/0 (158)
20:05:58.291887 IP hidden.50245 > 168.63.129.16.53: 31362+ A? localhost. (27)
20:05:58.292162 IP hidden.34505 > 168.63.129.16.53: 55236+ AAAA? localhost. (27)
20:05:58.293812 IP 168.63.129.16.53 > hidden.50245: 31362 1/0/0 A 127.0.0.1 (43)
20:05:58.294006 IP 168.63.129.16.53 > hidden.34505: 55236 0/1/0 (102)

NSLOOKUP from master

20:06:53.923163 IP hidden.45502 > 168.63.129.16.53: 446+ A? AzureAPI-manager.net. (45)
20:06:53.934894 IP 168.63.129.16.53 > hidden.45502: 446 4/0/0 CNAME hidden.trafficmanager.net., CNAME hidden-westeurope-01.regional.azure-api.net., CNAME hiddden.cloudapp.net., A IP x.x.x.x (266)

@jackfrancis
Member

@slack Does any of the above suggest source address filtering from 168.63.129.16?

@jackfrancis
Member

@svick7859 am I interpreting the from-container-logs correctly in concluding that the resolver is returning NXDOMAIN responses for all requests? In other words, port 53 communications are established from the perspective of the container, and the resolver is receiving requests and returning responses (record not found)?
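
One way to check that interpretation from inside an affected pod is to compare an unqualified lookup with a fully-qualified one (the trailing dot skips the search suffixes) against the VNET resolver directly; a rough sketch, with the pod and hostname as placeholders and assuming the image has nslookup:

kubectl exec -it <affected-pod> -- sh

# unqualified: gets the search-domain expansion seen in the tcpdump above
nslookup <azure-service-hostname>

# fully qualified, against the VNET resolver: a clean NXDOMAIN means the resolver answered,
# a timeout means responses are not coming back at all
nslookup <azure-service-hostname>. 168.63.129.16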

@mbarion

mbarion commented May 17, 2018

I'm taking over the discussion from my colleague for today...

We actually tried to destroy and rebuild the cluster and everything is now working; the only change made is the acs-engine version used to create the manifest for deploying the VMs.
We still have one environment where we can reproduce the issue and we are continuing to investigate there.

@jackfrancis
Member

@mbarion Thanks for the update. If you still have a cluster that is able to reproduce the DNS problem, I would love to get this data:

  • set up a tcpdump for DNS traffic on a node that has a scheduled pod that is doing DNS requests
  • do some DNS lookups from the node itself (from the ubuntu CLI, for example) -- we expect these lookups to succeed
  • let the pod run long enough to do its own DNS lookups -- we expect these lookups to fail

What we'd like to see is: what is the difference, if any, between the DNS lookups going out on the wire? Are the pod-originating lookups being SNAT'd in a particular way compared to the DNS lookups from the node OS?
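
A rough version of that capture, with the interface, pod, and hostname as placeholders, could look like:

# on the node: capture all DNS traffic to a file
sudo tcpdump -i any -nn port 53 -w /tmp/dns.pcap &

# lookups from the node OS (expected to succeed)
nslookup <azure-service-hostname>

# lookups from a pod scheduled on this node (expected to fail)
kubectl exec <pod-on-this-node> -- nslookup <azure-service-hostname>

# stop the capture and compare source addresses/ports of the two sets of queries
kill %1
sudo tcpdump -nn -r /tmp/dns.pcap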

@mbarion

mbarion commented May 17, 2018 via email

@mbarion

mbarion commented May 17, 2018 via email

@mbarion

mbarion commented May 17, 2018 via email

@jackfrancis
Member

@slack @rite2nikhil see the tcpdump output above, all requests to Azure DNS are sourced from 10.250.10.10. Do we think this rules out any SNAT-related DNS filtering that would affect only pod-originating lookup requests?

@jackfrancis
Member

@svick7859 if you still have a repro-able environment are you willing to do some real-time troubleshooting?

@svick7859
Author

svick7859 commented May 18, 2018 via email

@jackfrancis
Member

Is everyone on this thread using the Azure CNI network implementation on clusters experiencing these symptoms? (--network-plugin=cni kubelet runtime config option)
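
A quick way to check on a node (the file names below are the usual Azure CNI/acs-engine ones and may vary by version):

# look for --network-plugin=cni on the kubelet command line
ps aux | grep [k]ubelet | tr ' ' '\n' | grep -- --network-plugin

# Azure CNI drops its config here when it is in use
ls /etc/cni/net.d/
cat /etc/cni/net.d/10-azure.conflist 2>/dev/null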

@khenidak
Contributor

khenidak commented May 24, 2018

Here is a quick status update:

There are two problems happening concurrently; while similar, they are not related:

  1. kubedns stops resolving names and logs an 'i/o timeout' while connecting to the VNET DNS server. A transient connection error to the metadata endpoint becomes a persistent error in kubedns. The pod network namespace can still connect to external and internal VNET IPs, and you can confirm this with:
kubectl --namespace=kube-system exec -it ${KUBE_DNS_POD_NAME} -c kubedns -- sh
# run ping or nslookup against the metadata endpoint from inside the container

Fix
Restarting the pod and/or the container should suffice to fix this.
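
For example, something like this restarts the kube-dns pods (the k8s-app=kube-dns label is the usual one on the acs-engine kube-dns deployment; adjust if yours differs):

kubectl -n kube-system delete pod -l k8s-app=kube-dns
kubectl -n kube-system get pods -l k8s-app=kube-dns -w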

Stop this from happening
Edit the Kubernetes DNS addon manifest on the master (repeat for every master):

vi /etc/kubernetes/addons/kube-dns-deployment.yaml

Change the args for the healthz container to the following:

- "--cmd=nslookup bing.com 127.0.0.1 >/dev/null"
- "--url=/healthz-dnsmasq"
- "--cmd=nslookup bing.com 127.0.0.1:10053 >/dev/null"
- "--url=/healthz-kubedns"
- "--port=8080"
- "--quiet"

Instead of using nslookup kubernetes.... This will force the kubedns container to restart if the above condition occurs.
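
To sanity-check that this health command actually catches the failure mode, the same lookups can be run by hand (the pod name is a placeholder); when dnsmasq/kubedns are wedged these should fail, which is what makes the probe restart the container:

# against dnsmasq on port 53
kubectl -n kube-system exec <kube-dns-pod> -c healthz -- nslookup bing.com 127.0.0.1

# against kubedns directly on port 10053
kubectl -n kube-system exec <kube-dns-pod> -c healthz -- nslookup bing.com 127.0.0.1:10053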

  2. The kubedns pod's entire network namespace loses connectivity to internal (metadata endpoint) and external IPs. This has been observed on Azure CNI but has not been confirmed on other CNIs yet. To confirm this, jump into any of the containers in the kubedns pod and test the network (even curl https://10.0.0.1 will fail, while other pods on the same node function properly).

Solutions:

  1. move the pod to a different node.
  2. restart the node.

We are actively working on getting an RCA for this issue.

@sanjusoftware

Thanks 🙏 @khenidak for helping us troubleshoot this issue. Looking forward to a permanent solution soon; in the meantime we'll keep moving with this temporary workaround.

@jackfrancis
Member

Folks on this thread: the next time you encounter this issue, please

  • share the output of “ebtables -t nat -L”
  • cat /var/log/kern.log

Thanks!
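
Something like this, run on the affected node, collects both into files that can be attached here (paths are just suggestions):

sudo ebtables -t nat -L > /tmp/ebtables-nat.txt
sudo cp /var/log/kern.log /tmp/kern.log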

@paulgmiller
Member

Hmm, I had a similar issue (can't resolve any external DNS) except it only affected new pods. We mitigated it by scaling up our nodes. I could still repro it by deploying many replicas of image: paulgmiller/dnsprobe. Is that close enough that you'd want to see ebtables and kern.log from the kube-dns pods if I force another repro?
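
If it helps others reproduce, the many-replica deployment can be sketched like this (the replica count is arbitrary; the image name is the one mentioned above):

kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dnsprobe
spec:
  replicas: 30
  selector:
    matchLabels:
      app: dnsprobe
  template:
    metadata:
      labels:
        app: dnsprobe
    spec:
      containers:
      - name: dnsprobe
        image: paulgmiller/dnsprobe
EOF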

@jackfrancis
Member

To update folks who are still running into this issue, installing the following daemonset onto an Azure CNI-based cluster has proven to make ebtables more resilient over time:

https://hub.docker.com/r/containernetworking/networkmonitor/

Here's a relatively painless way to do it:

$ kubectl create -f https://raw.githubusercontent.com/Azure/acs-engine/master/parts/k8s/addons/azure-cni-networkmonitor.yaml

And then, edit the daemonset to point to the actual image:

$ kubectl edit daemonset azure-cni-networkmonitor -n kube-system

Replace <azureCNINetworkMonitorImage> with containernetworking/networkmonitor:v0.0.4.
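
After the edit, the rollout can be checked with something like:

kubectl -n kube-system rollout status daemonset/azure-cni-networkmonitor
kubectl get pods -n kube-system -o wide | grep networkmonitor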

@huydinhle

huydinhle commented Jul 27, 2018

@jackfrancis Deploying this daemonset indeed solves our problem with new pods coming up whose network didn't get configured correctly (can't talk to other pods at all, can't query DNS names).

However, since this has affected our production cluster for the last 2 weeks, can you give me an explanation of why this daemonset helps?

Thank you

@jackfrancis
Member

@sharmasushant @nisheeth-ms can better explain what the azure CNI monitor daemonset does exactly (and what it doesn't do!)

@huydinhle

I assume that the Azure CNI didn't always make the correct changes to ebtables like it should, so this daemonset will help with that, based on what you posted above.

Thank you

@sukrit007

I keep running into this issue, and looking into the vnet logs I see:

2018/07/27 16:23:57 [cni-net] Plugin stopped.
2018/07/27 16:23:59 SendReport failed due to [Azure CNI] HTTP Post returned statuscode 500
2018/07/27 16:24:03 [cni-net] Plugin azure-vnet version v1.0.7.
2018/07/27 16:24:03 [cni-net] Running on Linux version 4.15.0-1013-azure (buildd@lcy01-amd64-006) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.9)) #13~16.04.2-Ubuntu SMP Wed May 30 01:39:27 UTC 2018
20

Here is the output of ebtables:

Bridge table: nat

Bridge chain: PREROUTING, entries: 36, policy: ACCEPT
-p ARP --arp-op Request --arp-ip-dst 10.100.192.226 -j arpreply --arpreply-mac 0:d:3a:97:ea:10
-p ARP -i eth0 --arp-op Reply -j dnat --to-dst ff:ff:ff:ff:ff:ff --dnat-target ACCEPT
-p ARP --arp-op Request --arp-ip-dst 10.100.192.237 -j arpreply --arpreply-mac 4a:f2:a8:ab:7f:be
-p IPv4 -i eth0 --ip-dst 10.100.192.237 -j dnat --to-dst 4a:f2:a8:ab:7f:be --dnat-target ACCEPT
-p ARP --arp-op Request --arp-ip-dst 10.100.192.235 -j arpreply --arpreply-mac f2:11:f1:f2:a4:f8
-p IPv4 -i eth0 --ip-dst 10.100.192.235 -j dnat --to-dst f2:11:f1:f2:a4:f8 --dnat-target ACCEPT
-p ARP --arp-op Request --arp-ip-dst 10.100.192.239 -j arpreply --arpreply-mac ea:55:9a:c3:40:3d
-p IPv4 -i eth0 --ip-dst 10.100.192.239 -j dnat --to-dst ea:55:9a:c3:40:3d --dnat-target ACCEPT
-p ARP --arp-op Request --arp-ip-dst 10.100.193.62 -j arpreply --arpreply-mac 22:f:5b:aa:2f:ce
-p IPv4 -i eth0 --ip-dst 10.100.193.62 -j dnat --to-dst 22:f:5b:aa:2f:ce --dnat-target ACCEPT
-p ARP --arp-op Request --arp-ip-dst 10.100.192.253 -j arpreply --arpreply-mac a2:54:83:37:27:58
-p IPv4 -i eth0 --ip-dst 10.100.192.253 -j dnat --to-dst a2:54:83:37:27:58 --dnat-target ACCEPT
-p ARP --arp-op Request --arp-ip-dst 10.100.193.20 -j arpreply --arpreply-mac 52:4d:30:c1:5c:2e
-p IPv4 -i eth0 --ip-dst 10.100.193.20 -j dnat --to-dst 52:4d:30:c1:5c:2e --dnat-target ACCEPT
-p ARP --arp-op Request --arp-ip-dst 10.100.192.247 -j arpreply --arpreply-mac be:4e:f5:3b:60:c1
-p IPv4 -i eth0 --ip-dst 10.100.192.247 -j dnat --to-dst be:4e:f5:3b:60:c1 --dnat-target ACCEPT
-p ARP --arp-op Request --arp-ip-dst 10.100.193.64 -j arpreply --arpreply-mac 36:63:78:8d:5:b9
-p IPv4 -i eth0 --ip-dst 10.100.193.64 -j dnat --to-dst 36:63:78:8d:5:b9 --dnat-target ACCEPT
-p ARP --arp-op Request --arp-ip-dst 10.100.193.55 -j arpreply --arpreply-mac e6:bc:e2:34:38:3d
-p IPv4 -i eth0 --ip-dst 10.100.193.55 -j dnat --to-dst e6:bc:e2:34:38:3d --dnat-target ACCEPT
-p ARP --arp-op Request --arp-ip-dst 10.100.193.34 -j arpreply --arpreply-mac fa:7c:82:3a:f7:1d
-p IPv4 -i eth0 --ip-dst 10.100.193.34 -j dnat --to-dst fa:7c:82:3a:f7:1d --dnat-target ACCEPT
-p ARP --arp-op Request --arp-ip-dst 10.100.192.248 -j arpreply --arpreply-mac 9e:ad:d6:77:83:c1
-p IPv4 -i eth0 --ip-dst 10.100.192.248 -j dnat --to-dst 9e:ad:d6:77:83:c1 --dnat-target ACCEPT
-p ARP --arp-op Request --arp-ip-dst 10.100.193.8 -j arpreply --arpreply-mac 26:8:39:54:1d:6
-p IPv4 -i eth0 --ip-dst 10.100.193.8 -j dnat --to-dst 26:8:39:54:1d:6 --dnat-target ACCEPT
-p ARP --arp-op Request --arp-ip-dst 10.100.193.11 -j arpreply --arpreply-mac 12:c4:75:38:69:b8
-p IPv4 -i eth0 --ip-dst 10.100.193.11 -j dnat --to-dst 12:c4:75:38:69:b8 --dnat-target ACCEPT
-p ARP --arp-op Request --arp-ip-dst 10.100.193.44 -j arpreply --arpreply-mac a2:f8:17:bd:7d:67
-p IPv4 -i eth0 --ip-dst 10.100.193.44 -j dnat --to-dst a2:f8:17:bd:7d:67 --dnat-target ACCEPT
-p ARP --arp-op Request --arp-ip-dst 10.100.193.14 -j arpreply --arpreply-mac 6a:d3:e8:4c:a9:b8
-p IPv4 -i eth0 --ip-dst 10.100.193.14 -j dnat --to-dst 6a:d3:e8:4c:a9:b8 --dnat-target ACCEPT
-p ARP --arp-op Request --arp-ip-dst 10.100.192.228 -j arpreply --arpreply-mac 56:ff:d7:3:92:15
-p IPv4 -i eth0 --ip-dst 10.100.192.228 -j dnat --to-dst 56:ff:d7:3:92:15 --dnat-target ACCEPT
-p ARP --arp-op Request --arp-ip-dst 10.100.193.7 -j arpreply --arpreply-mac 92:c5:c6:ab:64:52
-p IPv4 -i eth0 --ip-dst 10.100.193.7 -j dnat --to-dst 92:c5:c6:ab:64:52 --dnat-target ACCEPT

Bridge chain: OUTPUT, entries: 0, policy: ACCEPT

Bridge chain: POSTROUTING, entries: 1, policy: ACCEPT
-s Unicast -o eth0 -j snat --to-src 0:d:3a:97:ea:10 --snat-arp --snat-target ACCEPT

@sharmasushant
Contributor

@huydinhle This is a self-healing monitor. It is possible in failure scenarios that CNI was not able to clean up its state properly, e.g. the CNI binary crashes in the middle due to permissions, out of memory, or something else. This monitor detects any leftover state and cleans it up. The functionality will be built into the CNI itself in the coming releases, so the monitor is only temporary and will eventually not be needed.

@sukrit007

Even after the patch, I am seeing occasional issues in DNS logs:

skydns: failure to forward request "read udp xx.xx.xx.xx:42382->xx.xx.xx.xx:53: i/o timeout"

@sharmasushant
Contributor

By patch, do you mean the CNI monitor?
How occasional? Is the issue only in the logs, or do you see it surface in your application?

@rameezk

rameezk commented Nov 6, 2018

We are also starting to see a couple of "Temporary failure in name resolution" issues popping up when trying to resolve external hosts. This is bubbling up to our applications.

I can confirm that we are using @jackfrancis's recommendation at #2971 (comment)

@stale

stale bot commented Mar 9, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contribution. Note that acs-engine is deprecated--see https://github.com/Azure/aks-engine instead.

@stale stale bot added the stale label Mar 9, 2019
@stale stale bot closed this as completed Mar 16, 2019
@piotrgwiazda

What is the status of this issue? Is there a new one opened? We're facing the same issue on a new AKS cluster.

@jackfrancis
Member

@piotrgwiazda this issue itself is stale; perhaps there is something operational in your cluster. I would recommend engaging the standard Azure support channels that your AKS service includes.

@piotrgwiazda

piotrgwiazda commented Jul 24, 2019 via email
