short-form dns query *nslookup kubernetes.default* not working #109

Closed
jwfang opened this issue Jun 19, 2017 · 10 comments

jwfang commented Jun 19, 2017

In case someone encounters the same problem, I'm writing up my findings here.

kube-dns behaviour:

  1. the kube-dns pod's /etc/resolv.conf is usually the same as the host's;
  2. kube-dns forwards unknown domains to the nameservers in its resolv.conf;
    NOTE: a short-form query such as kubernetes.default is unknown to kube-dns
  3. kube-dns seems to use only the first nameserver (see the check below).
    EDIT: from https://github.com/skynetservices/skydns/blob/f694f5637b31e2b9c9871fb396773d5d18b9309e/server/exchange.go#L29, it is not doing NSRotate. Without NSRotate it always tries the first nameserver first, and it only retries on connection errors; for application errors it just forwards the upstream error code.
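
You can see points 2 and 3 directly by asking kube-dns for the bare name (a sketch; it assumes a test pod with dig installed, and 10.233.0.3 is this cluster's kube-dns service IP from the outputs below):

/ # dig @10.233.0.3 kubernetes.default. A
# dig does not apply the resolv.conf search list unless +search is given,
# so kube-dns has to forward "kubernetes.default." upstream, and the status
# in the reply (NXDOMAIN, REFUSED, ...) is the first external nameserver's
# answer passed straight through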

so the short-form query works like this:

  1. the client sends the short-form query to kube-dns;
  2. kube-dns knows nothing about it and forwards it to the external nameserver from resolv.conf;
  3. the external nameserver returns an error to kube-dns;
  4. kube-dns forwards the failure to the client;
  5. the client appends a search domain from its resolv.conf and goes back to step 1 to retry (demonstrated below).
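
The whole loop can be watched from a client pod with dig's search options (a sketch; it assumes an image that ships dig, such as the tutum/dnsutils image used later in this thread):

/ # dig +search +showsearch kubernetes.default A
# +search applies the pod's resolv.conf search list the way a normal client
# would, and +showsearch prints each intermediate attempt, so you should see
# kubernetes.default.default.svc.cluster.local fail with NXDOMAIN and then
# kubernetes.default.svc.cluster.local succeed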

client behaviour:
for step 5, different clients behave differently depending on the error from step 3.
busybox seems to append the search domains only for NXDOMAIN, not for REFUSED;
alpine and tutum/dnsutils append the search domains for both NXDOMAIN and REFUSED. A quick comparison is sketched below.
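
For example (a sketch; any recent tags of these images should do):

$ kubectl run -it --rm bb --image=busybox --restart=Never -- nslookup kubernetes.default
$ kubectl run -it --rm du --image=tutum/dnsutils --restart=Never -- nslookup kubernetes.default
# when the first upstream nameserver answers REFUSED, the busybox lookup
# fails while the dnsutils one keeps walking the search list and succeeds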

my installation is a bit unusual: although the nodes have identical /etc/resolv.conf files, the first nameserver behaves differently on different nodes. Some are not recursive and return REFUSED, while the others return NXDOMAIN.

so I got this weird behaviour:
when kube-dns is on an NXDOMAIN node, busybox nslookup testing works;
when kube-dns is on a REFUSED node, busybox nslookup testing fails;
and alpine/tutum/dnsutils always work, regardless of which node kube-dns is on.

so, when deploying kube-dns, you should ensure the first nameserver in your host's /etc/resolv.conf works as expected. A minimal check is sketched below.
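
Something like this, run on each node, shows which kind of node you have (a sketch; it assumes dig is installed on the host):

$ NS=$(awk '/^nameserver/ {print $2; exit}' /etc/resolv.conf)   # first nameserver only
$ dig @"$NS" kubernetes.default A
# check the status in the reply: NXDOMAIN keeps every client's search-domain
# retry working, while REFUSED will break busybox's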

=============== BELOW is the original question ===============================
I enabled RBAC for my on-premise k8s cluster, but found that cross-namespace DNS queries behave differently from the non-RBAC setup.
I didn't find any documentation for this behaviour, hence this issue.

without RBAC:

  1. I can do a cross-namespace DNS query for a service in ns1 from ns2 using svc1.ns1;
  2. I can query a service in the same namespace using svc1.ns1, i.e. with the namespace included.

but with RBAC, I have to use the FQDN:

  1. I can't do a cross-namespace query using svc1.ns1 from ns2;
  2. I can't query a service in the same namespace using svc1.ns1 with the namespace included.

From a busybox pod in the default namespace, I got the following output:

/ # nslookup nginx-deployment
Server:    10.233.0.3
Address 1: 10.233.0.3 kubedns.kube-system.svc.cluster.local

Name:      nginx-deployment
Address 1: 10.233.33.138 nginx-deployment.default.svc.cluster.local
/ # nslookup nginx-deployment.default
Server:    10.233.0.3
Address 1: 10.233.0.3 kubedns.kube-system.svc.cluster.local

nslookup: can't resolve 'nginx-deployment.default'
/ # nslookup nginx-deployment.kube-system
Server:    10.233.0.3
Address 1: 10.233.0.3 kubedns.kube-system.svc.cluster.local

nslookup: can't resolve 'nginx-deployment.kube-system'
/ # nslookup nginx-deployment.kube-system.svc.cluster.local
Server:    10.233.0.3
Address 1: 10.233.0.3 kubedns.kube-system.svc.cluster.local

Name:      nginx-deployment.kube-system.svc.cluster.local
Address 1: 10.233.18.29 nginx-deployment.kube-system.svc.cluster.local

Here is my container info:

  Containers:
   kubedns:
    Image:      gcr.io/google_containers/kubedns-amd64:1.9
   dnsmasq:
    Image:      gcr.io/google_containers/kube-dnsmasq-amd64:1.3
   healthz:
    Image:      gcr.io/google_containers/exechealthz-amd64:1.1
@cmluciano

@kubernetes/sig-apps-misc @kubernetes/sig-network-misc
/kind dns

This seems expected. @jwfang Have you set up the proper permissions for these services/accounts to communicate?


jwfang commented Jun 20, 2017

@cmluciano thank you.

I made the following changes to the kube-dns pod:

  1. run the kube-dns pod as the kube-system:kube-dns service account;
  2. patch system:kube-dns to add get verbs, since I am using an old kube-dns image, per "[kube-dns RBAC] Default ClusterRole is insufficient" (kubernetes#45084); a sketch of such a patch is below.
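
For reference, this is the kind of patch I mean (a sketch only; the rule index depends on how your system:kube-dns ClusterRole is laid out, so check kubectl get clusterrole system:kube-dns -o yaml first):

$ kubectl patch clusterrole system:kube-dns --type=json \
    -p='[{"op": "add", "path": "/rules/0/verbs/-", "value": "get"}]'
# appends "get" to the verbs of the first rule; repeat for each rule
# (endpoints, services, ...) that the old image needs to read with get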

Is there anything else I need to do?

What confuses me is that the short form svcName.Namespace doesn't work.
I think it worked at some point, but I can't reproduce that anymore.

I can't even query the short form within the same namespace, yet nslookup kubernetes.default is in the official docs as the check for whether DNS is working correctly.

My remaining questions are:

  1. is the short form not supported under RBAC, or not supported at all?
  2. what's the rationale behind this, i.e. behind having no short-form DNS queries?
  3. my busybox's /etc/resolv.conf has the correct search config, I think; why doesn't the short form pick up the search config, or is this controlled elsewhere?

Here is the /etc/resolv.conf from my busybox pod. I expect kubernetes.default to pick up the second search entry and return a result.

/ # cat /etc/resolv.conf
nameserver 10.233.0.3
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
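
With ndots:5, kubernetes.default (one dot) should indeed be expanded through the search list: first as kubernetes.default.default.svc.cluster.local, then as kubernetes.default.svc.cluster.local, which is the name that should match. The record itself can be checked by bypassing search entirely (a sketch; busybox's nslookup takes the server as an optional second argument):

/ # nslookup kubernetes.default.svc.cluster.local. 10.233.0.3
# the trailing dot marks the name as fully qualified; if this returns the
# kubernetes service ClusterIP, the data is fine and the problem is in the
# client's search-list handling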

/ # nslookup kubernetes.default
Server:    10.233.0.3
Address 1: 10.233.0.3 kubedns.kube-system.svc.cluster.local

nslookup: can't resolve 'kubernetes.default'


liggitt commented Jun 21, 2017

RBAC doesn't affect DNS queries at all... the kube-dns RBAC role was made in the kube 1.6 timeframe, and works with the version of kube-dns that was current at that time (gcr.io/google_containers/k8s-dns-kube-dns-amd64:1.14.2)


liggitt commented Jun 21, 2017

I would still check the kube-dns logs... it seems like the old image is likely making additional API calls that your updates to the role did not catch


jwfang commented Jun 22, 2017

@liggitt thanks for the help.

I didn't find anything suspicious in my kube-dns logs.

My kube-dns/kubedns logs look like this:

I0622 05:24:10.936977       1 dns.go:356] No service for endpoint "kube-controller-manager" in namespace "kube-system"
I0622 05:24:11.306899       1 dns.go:356] No service for endpoint "kube-scheduler" in namespace "kube-system"
I0622 05:24:11.871483       1 dns.go:515] Query for "kubernetes.default.svc.cluster.local.", exact: false
I0622 05:24:11.871541       1 dns.go:743] Not a federation query: len(["kubernetes" "default" "svc" "cluster" "local"]) != 4+len(["local" "cluster"])
I0622 05:24:11.871574       1 dns.go:634] Found 1 records for [local cluster svc default kubernetes] in the cache
I0622 05:24:11.871595       1 dns.go:641] getRecordsForPath retval=[{Host:10.233.0.1 Port:0 Priority:10 Weight:10 Text: Mail:false Ttl:30 TargetStrip:0 Group: Key:/skydns/local/cluster/svc/default/kubernetes/3766346139356635}], path=[local cluster svc default kubernetes]
I0622 05:24:11.871623       1 dns.go:544] Records for kubernetes.default.svc.cluster.local.: [{10.233.0.1 0 10 10  false 30 0  /skydns/local/cluster/svc/default/kubernetes/3766346139356635}]

And here are the logs from kube-dns/dnsmasq:

dnsmasq[1]: started, version 2.76 cachesize 1000
dnsmasq[1]: compile time options: IPv6 GNU-getopt no-DBus no-i18n no-IDN DHCP DHCPv6 no-Lua TFTP no-conntrack ipset auth no-DNSSEC loop-detect inotify
dnsmasq[1]: using local addresses only for domain com.svc.cluster.local
dnsmasq[1]: using local addresses only for domain svc.cluster.local.svc.cluster.local
dnsmasq[1]: using local addresses only for domain cluster.local.svc.cluster.local
dnsmasq[1]: using local addresses only for domain com.default.svc.cluster.local
dnsmasq[1]: using local addresses only for domain default.svc.cluster.local.default.svc.cluster.local
dnsmasq[1]: using local addresses only for domain cluster.local.default.svc.cluster.local
dnsmasq[1]: using nameserver 127.0.0.1#10053
dnsmasq[1]: read /etc/hosts - 7 addresses

The cluster.local.svc.cluster.local entry seems weird to me, but I'm not sure.


bowei commented Jun 22, 2017

Can you post the YAML for your kube-dns pod to a gist?


jwfang commented Jun 22, 2017

Sure, @bowei, here is my YAML from kubectl edit deployment kubedns -n kube-system:
https://gist.github.com/jwfang/d1a69d2d19beaaecbe61452a4d10642f

It's generated from: https://github.com/kubernetes-incubator/kubespray/blob/774c4d0d6fe2b5449432192ee2cde9c07ff1e897/roles/kubernetes-apps/ansible/templates/kubedns-deploy.yml.

Currently I am not using RBAC, but with my certs changes from kubernetes-sigs/kubespray#1351, kubernetes.default still doesn't resolve.

I will try a vanilla kubespray install and test; maybe it's due to how k8s is configured in kubespray.


jwfang commented Jun 22, 2017

I did a vanilla install and changed kube-dns/dnsmasq to the following:

      - args:
        - --log-facility=-
        - --cache-size=1000
        - --no-resolv
        - --server=/cluster.local/127.0.0.1#10053
        - --server=/in-addr.arpa/127.0.0.1#10053

It still can't resolve the short form svc1.ns1; only the FQDN works.
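
Both paths can be checked against dnsmasq directly (a sketch; it assumes a pod with dig, and 10.233.0.3 is the kube-dns service IP):

/ # dig @10.233.0.3 svc1.ns1.svc.cluster.local. A
# matches the /cluster.local/ --server rule and is forwarded to 127.0.0.1#10053
/ # dig @10.233.0.3 svc1.ns1. A
# matches no --server rule and, with --no-resolv, there is no default
# upstream either, so dnsmasq has nowhere to send it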


jwfang commented Jun 22, 2017

I think I've figured out the cause of the problem.

When doing a kubedns.kube-system query from the default namespace, I got the following in the dnsmasq log:

dnsmasq[1]: cached 10.233.0.3 is kubedns.kube-system.svc.cluster.local
dnsmasq[1]: query[AAAA] kubedns.kube-system from 10.233.90.0
dnsmasq[1]: query[AAAA] kubedns.kube-system from 10.233.90.0
dnsmasq[1]: query[AAAA] kubedns.kube-system from 10.233.90.0
dnsmasq[1]: query[AAAA] kubedns.kube-system from 10.233.90.0
dnsmasq[1]: query[A] kubedns.kube-system from 10.233.90.0
dnsmasq[1]: query[A] kubedns.kube-system from 10.233.90.0
dnsmasq[1]: query[A] kubedns.kube-system from 10.233.90.0
dnsmasq[1]: query[A] kubedns.kube-system from 10.233.90.0

but querying plain kubedns got me this:

dnsmasq[1]: reply 10.233.0.3 is kubedns.kube-system.svc.cluster.local
dnsmasq[1]: query[AAAA] kubedns.default.svc.cluster.local from 10.233.90.0
dnsmasq[1]: forwarded kubedns.default.svc.cluster.local to 127.0.0.1
dnsmasq[1]: reply kubedns.default.svc.cluster.local is NXDOMAIN
dnsmasq[1]: query[AAAA] kubedns.svc.cluster.local from 10.233.90.0
dnsmasq[1]: forwarded kubedns.svc.cluster.local to 127.0.0.1
dnsmasq[1]: reply kubedns.svc.cluster.local is NXDOMAIN
dnsmasq[1]: query[AAAA] kubedns.cluster.local from 10.233.90.0
dnsmasq[1]: forwarded kubedns.cluster.local to 127.0.0.1
dnsmasq[1]: reply kubedns.cluster.local is NXDOMAIN
dnsmasq[1]: query[AAAA] kubedns from 10.233.90.0
dnsmasq[1]: query[AAAA] kubedns.default.svc.cluster.local from 10.233.90.0
dnsmasq[1]: forwarded kubedns.default.svc.cluster.local to 127.0.0.1
dnsmasq[1]: reply kubedns.default.svc.cluster.local is NXDOMAIN
dnsmasq[1]: query[AAAA] kubedns.svc.cluster.local from 10.233.90.0
dnsmasq[1]: forwarded kubedns.svc.cluster.local to 127.0.0.1
dnsmasq[1]: reply kubedns.svc.cluster.local is NXDOMAIN
dnsmasq[1]: query[AAAA] kubedns.cluster.local from 10.233.90.0
dnsmasq[1]: forwarded kubedns.cluster.local to 127.0.0.1
dnsmasq[1]: reply kubedns.cluster.local is NXDOMAIN
dnsmasq[1]: query[AAAA] kubedns from 10.233.90.0
dnsmasq[1]: query[AAAA] kubedns.default.svc.cluster.local from 10.233.90.0
dnsmasq[1]: forwarded kubedns.default.svc.cluster.local to 127.0.0.1
dnsmasq[1]: reply kubedns.default.svc.cluster.local is NXDOMAIN
dnsmasq[1]: query[AAAA] kubedns.svc.cluster.local from 10.233.90.0
dnsmasq[1]: forwarded kubedns.svc.cluster.local to 127.0.0.1
dnsmasq[1]: reply kubedns.svc.cluster.local is NXDOMAIN
dnsmasq[1]: query[AAAA] kubedns.cluster.local from 10.233.90.0
dnsmasq[1]: forwarded kubedns.cluster.local to 127.0.0.1
dnsmasq[1]: reply kubedns.cluster.local is NXDOMAIN
dnsmasq[1]: query[AAAA] kubedns from 10.233.90.0
dnsmasq[1]: query[AAAA] kubedns.default.svc.cluster.local from 10.233.90.0
dnsmasq[1]: forwarded kubedns.default.svc.cluster.local to 127.0.0.1
dnsmasq[1]: reply kubedns.default.svc.cluster.local is NXDOMAIN
dnsmasq[1]: query[AAAA] kubedns.svc.cluster.local from 10.233.90.0
dnsmasq[1]: forwarded kubedns.svc.cluster.local to 127.0.0.1
dnsmasq[1]: reply kubedns.svc.cluster.local is NXDOMAIN
dnsmasq[1]: query[AAAA] kubedns.cluster.local from 10.233.90.0
dnsmasq[1]: forwarded kubedns.cluster.local to 127.0.0.1

It seems my busybox is not using the search config when the query name contains a dot.

My busybox /etc/resolv.conf looks like this:

/ # cat /etc/resolv.conf
nameserver 10.233.0.3
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5

So it's really weird that it didn't use the search domains for kubedns.kube-system.
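
Condensing what the logs above show (all from the same busybox pod in the default namespace):

/ # nslookup kubedns.kube-system
# one dot: busybox sends the name verbatim and never tries
# kubedns.kube-system.svc.cluster.local, so it fails
/ # nslookup kubedns
# no dot: the search list is applied (visible in the dnsmasq log), though
# every candidate is NXDOMAIN here since kubedns lives in kube-system
/ # nslookup kubedns.kube-system.svc.cluster.local
# FQDN: resolves directly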

jwfang changed the title from "cross-namespace dns query for RBAC environment" to "short-form dns query *nslookup kubernetes.default* not working" on Jun 22, 2017

jwfang commented Jun 22, 2017

I am closing this.

The short form works with alpine and tutum/dnsutils.
It seems to be a problem with busybox, which somehow doesn't pick up the search config if the query name contains a dot.
