The cluster-internal DNS server cannot be used from Windows containers #2027
Comments
@chweidling I will just let you know that you are not alone. My team and I have been battling this all day with no luck at all. I think @JiangtianLi is looking into it (or at least into similar issues). A quick search through the issues shows that there are multiple problems with Windows DNS and networking right now. |
I face an issue which sounds similar. I'm on AzureCloudGermany. However, I have trouble with Linux-based (Ubuntu, Debian, Alpine) containers when it comes to DNS resolution, but only in a multi-agent cluster. With only one k8s agent node, this does not seem to be a problem. |
Hi, we are facing the same issue described by @chweidling. We have a hybrid cluster with both Linux and Windows nodes, and only the Windows node suffers from this problem. @ITler yes, it seems that your issue is different... maybe it is better to open a new issue ;) |
I can confirm I'm seeing the exact same behavior: DNS doesn't work in Kubernetes containers (if I create a container on the node using docker, it works). |
@ITler Is your multi-agent cluster Linux only or hybrid? If it is Linux only, please file a different issue. |
Maybe this helps in diagnosing the issue: |
Nice catch @Josefczak !
|
Josefczak thanks! |
I don't have much to add here. I just came across this after much searching. I have the same timeout issue connecting to 10.0.0.10 using nslookup. While setting the container's DNS is a solution, having to muck about with my container entrypoint to work around this issue doesn't seem like the greatest solution. Fortunately we are still in an early testing phase. Is there a bug to track somewhere for this specific issue? |
@esheris I guess you are looking at it |
Oh man, this solution @Josefczak found is what I've been looking for literally for 2 months. :-) I can get this to work if I manually connect to my pods, but am struggling with the dockerfile commands to automate this. Can anyone offer nanoserver dockerfile commands that work? (ie: no powershell required!) |
@brobichaud You can try netsh, e.g., https://technet.microsoft.com/pt-pt/library/cc731521(v=ws.10).aspx#BKMK_setdnsserver |
Yeah I did see that in the thread above but the problem is that it requires the interface name, which appears to be unique to the pod. Surely someone has already automated this in a dockerfile. This is a HUGE fix for a longstanding DNS issue in 1709 for me. |
The only solution I can come up with is to modify my container's entrypoint to be a PowerShell script that runs the above commands and then executes what I really want to run. In my case I had some other things I needed to do with my web.config, so my dockerfile now looks like so:
entrypoint.ps1 essentially looks like this
|
@esheris @JiangtianLi I was able to come up with the PowerShell commands for servercore much like you have (though I put them inline in the dockerfile) but when I deploy my pod the DNS server hasn't changed. I suspect a permissions problem in the dockerfile. It's like it runs the commands but they fail to apply. I can still remote into my pod and manually issue the same commands and then my already running app suddenly starts working. Here is the relevant snippet from my dockerfile:
Does anyone know how to correctly elevate permissions in the dockerfile for Windows? |
You can't really do this in the dockerfile directly as the underlying nic of the container will change and you are setting the dns based on it. This is why I had to modify my container entrypoint. you have to set dns when the container starts. |
Ahhh, I see. That does explain why it failed to work in my dockerfile. Ugh, your workaround is ingenious but so ugly and feels so hacky. Alas it DOES work, and I thank you @esheris! A couple of questions maybe you can answer:
|
I certainly agree that it feels hacky; I expressed something similar in my original post. I just pulled microsoft/nanoserver:latest and launched it (docker run -it microsoft/nanoserver:latest powershell), and it seems Get-NetAdapter and Set-DnsClientServerAddress/Set-DnsClient are there. |
Unfortunately nanoserver:latest is Server 1607, and I really need Server 1709 (yeah, weird decision on Microsoft's part). Server 1709 removed PowerShell support. :-( I'll continue iterating on it and post a response here if I come up with a solution for nanoserver 1709. Or I may resort to using the new PowerShell Core in nanoserver 1709. |
You could assume the nic name which should always be the same and set it with netsh that way
run/exec into your container and validate its name first, "Ethernet" was what I had in my previously mentioned nanoserver container |
I can see if I open a new nanoserver container locally the name is always "Ethernet" but the interface name appears to be dynamic in an ACS k8s pod. For example mine is now:
But to prove this even can work with netsh I opened a command prompt in my pod and tried to do it manually, the result is:
Do you know how to elevate a command prompt in a container/pod? |
Argh. Roadblocked here with nanoserver. The elevation issue has prevented me from pursuing the netsh approach. I cannot find anything on how I can elevate to admin in a nanoserver command prompt. So then I thought maybe I'd explore the PowerShell Core path with nanoserver since I've got a script that works on servercore. Alas PowerShell Core does not support Set-DnsClientServerAddress. I suspect because that cmdlet is very Windows specific and Core is designed as x-plat. Dead-end. I can of course migrate my DotNet Core app to run on servercore, which I don't really want as it feels like a step backwards. And it means automating the install of DotNet Core since there is no pre-built servercore image with DotNet Core. I gotta say, nanoserver is easy to love and yet even easier to hate. :-( |
Runas is the general command I believe. Not sure what you would run it as though. Perhaps try setting up the entry point script with the netsh commands in it, perhaps being launched out of the main container process will give you perms
|
A good suggestion @esheris on the idea of running an entrypoint script. Alas I tried and see the same error about elevation being required. Feels like I am so close as I have discovered that the interface index is consistently 30, so if I had permissions I could use this command to set the DNS server:
As for runas, it does not exist in nanoserver. Blocked by nanoserver at every path it feels! I may have to step back and move my nanoserver use to servercore until Microsoft gets this fixed. Sooo not what I want to do, I really want to get some legs on nanoserver as we are building up this greenfield app, not migrate it later and see what breaks all at once! :-( |
This one is of interest: #2230. In the meantime I have this as a workaround to find the current IP addresses of the kube-dns pods. |
Just in case anyone runs into the same issue: for me the workaround didn't work unless I added |
I looked through #2230 and it does look interesting, but it's not clear to me that it addresses this issue. Clearly there are other DNS issues in Windows 1709 itself, but I wonder if the problem we are seeing is in fact Windows or the way k8s is set up with Windows nodes? This just feels like such a huge roadblocker of an issue that it should be of highest priority to get fixed. |
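One possible way to script such a lookup is sketched below. This is not the workaround referred to above, only an illustration under assumptions: the kube-dns pods are taken to carry the conventional k8s-app=kube-dns label in the kube-system namespace, and the input is taken to be the JSON printed by `kubectl get pods -n kube-system -l k8s-app=kube-dns -o json`. The function name and the sample addresses are made up for the example.

```python
import json
from typing import List


def kube_dns_pod_ips(pod_list_json: str) -> List[str]:
    """Extract the pod IPs of all Running pods from a `kubectl ... -o json` pod list."""
    pods = json.loads(pod_list_json).get("items", [])
    return [
        pod["status"]["podIP"]
        for pod in pods
        # Skip pods that are not Running or have no IP assigned yet.
        if pod.get("status", {}).get("phase") == "Running"
        and pod.get("status", {}).get("podIP")
    ]


# Hypothetical sample input; real input would come from kubectl.
sample = json.dumps({"items": [
    {"status": {"phase": "Running", "podIP": "10.244.0.5"}},
    {"status": {"phase": "Pending"}},
]})
print(kube_dns_pod_ips(sample))
```

The resulting addresses could then be used as DNS servers inside the Windows container in place of 10.0.0.10. Pod IPs change when the kube-dns pods are rescheduled, so the lookup would have to be repeated.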
Yeah it is a huge blocker. I actually pulled down the pull request and merged in the latest changes from acs-engine\master and built it. Still no DNS resolution from 1709 containers... |
The DNS issue does not seem unique to Windows (#2999, #2880) as of a few days ago. I am not able to access external addresses from Windows or Linux pods with the latest release. Update: I had made a mistake on the external calls, which are working, but internal names are not resolving on Windows nodes, as @sam-cogan pointed out, unless I specify the pod DNS:
|
acs-engine 18.8, this still is an issue. Some containers can reach the kube-dns and some cannot. AKS has moved to GA and still no Windows container support. Does Kubernetes in Azure actually work? Incredibly frustrating experience for the last 6 months trying to convince people this platform will work reliably. |
@SteveCurran technically AKS has nothing to do with this (even though it's using ACS engine behind the scenes). But overall yeah. This is a trainwreck. And it's getting worse. |
It does seem to be getting worse, and inconsistently so. I have a windows init container that works fine calling out to the internet, then the main container it spawns can't resolve anything, it makes no sense. |
Any updates on this? |
Since moving to acs-engine release 0.19 and using 1803 images this seems to have improved significantly. DNS is resolving as expected and has stayed that way for some time. |
I have moved to 19.3 and still using 1709 and I am seeing much more stability. Still using the DNS addresses to the individual DNS pods. The DNS server seems to get overwhelmed when trying to push many deployments at once. If I push deployments slowly then all is well. |
@SteveCurran how did you move to 19.3? I have a cluster that I would like to upgrade and I am already running Kubernetes 1.11.0, so the upgrade command is a no-op as there is nothing to upgrade. 0.19.2 seems to have some networking changes that I am hoping will solve my current issues, but I am unsure how to actually do this. I could always drop and re-create a complete new cluster, but that is a lot of work and the main thing I am unsure about is keeping the ingress LB IP. :-) |
@ocdi I dropped and recreated. |
I have the same problem. Cluster is unusable. Used acs-engine 0.19.1, windows server 1803, k8s version 1.11.0-rc.3 |
This is fixed in acs-engine 0.19.2. If you're still hitting it in that version or later, can you share details? Otherwise, can we close this issue? |
@PatrickLang This may be a silly question, but how do we upgrade an existing cluster to 0.19.2? I can see how to upgrade if I am changing the kubernetes version, however I used 0.19.1 to upgrade to k8s 1.11.0 already and it is a no-op as I am already at the target version. |
@PatrickLang does this require running 1803 on the host and in the container? When using 1709 host and container we still need to use the individual ip addresses of the DNS pods and not 10.0.0.10. |
@PatrickLang doesn't work. acs-engine v1.19.3, k8s 1.11.0_rc3, server 1803. Windows pods can't access kubedns. |
@atomaras could you try with 1.19.5 and 1.11.1 and server 1803? I was able to successfully do DNS queries from Windows and Linux pods with those versions. |
@jsturtevant Can I simply use acs-engine upgrade or do I have to recreate the cluster? |
I usually drop and recreate to make sure everything is deployed properly. |
@jsturtevant viable production approach |
I upgraded an existing cluster from 1.11.0 to 1.11.1 with acs-engine 0.19.5 with 1803 and so far so good. The baseline CPU usage has dropped, which is nice, from a constant 15-20% to maybe 10%. Not sure what in the previous version was using so much CPU, but the more that is left for containers to run, the better. Haven't observed any DNS issues so far but it's only been an hour. :-) |
@jsturtevant I recreated the cluster with k8s 1.11.1 and some Windows containers work but others don't. I don't know why. Specifically, I ran a windowsservercore:1803 busybox-style image in the default namespace and DNS worked. Then I ran my Windows jenkins agent image based on dotnet framework 1803 inside the jenkins namespace and it didn't work (same as before). Some extra observations: 1) the aci-networking container still gets scheduled on Windows nodes and fails, so I have to patch the deployment, and 2) initially I tried upgrading the cluster, which resulted in only the master node becoming 1.11.1 while the other nodes remained at 1.11.0-rc3, so I ended up recreating the cluster. |
@atomaras I believe the issue you're seeing is because you are in a separate namespace with the second pod. Could you exec into the pod in the jenkins namespace and run Additionally what happens when you run the jenkins deployment in the default namespace? @ocdi Thanks for the update. If you see pods drop network connectivity/DNS over a given time, drop a note here. |
Yes, there's a problem where only the pod's namespace is added to the DNS suffix resolution list. |
Here's the issue for the incomplete DNS suffix list: Azure/azure-container-networking#206 |
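To illustrate why an incomplete suffix list breaks cross-namespace lookups, here is a deliberately simplified sketch of how a resolver expands a short name. It assumes the conventional Kubernetes search list for a pod in the jenkins namespace and the default cluster.local cluster domain; it does not model the actual Windows resolver behavior, and the function and variable names are made up for the example.

```python
from typing import List


def candidate_names(name: str, suffixes: List[str]) -> List[str]:
    """Return the fully qualified names tried for `name`, in order (simplified model)."""
    if name.endswith("."):
        # Already fully qualified: no suffix expansion.
        return [name]
    return [f"{name}.{suffix}." for suffix in suffixes] + [f"{name}."]


# Assumed full search list for a pod in the "jenkins" namespace.
FULL = ["jenkins.svc.cluster.local", "svc.cluster.local", "cluster.local"]
# What the pod gets when only its own namespace suffix is configured.
POD_NAMESPACE_ONLY = ["jenkins.svc.cluster.local"]

TARGET = "kubernetes.default.svc.cluster.local."

print(TARGET in candidate_names("kubernetes.default", FULL))
print(TARGET in candidate_names("kubernetes.default", POD_NAMESPACE_ONLY))
```

In this model, a short name such as kubernetes.default only reaches the right record when svc.cluster.local is in the list. With just the pod's own namespace suffix, lookups for services in other namespaces need the fully qualified name.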
@atomaras - the failure you highlighted above is due to IP address exhaustion: "Failed to allocate address: … No available addresses". The error isn't being handled correctly due to Azure/azure-container-networking#195 |
Thank you @PatrickLang ! |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contribution. Note that acs-engine is deprecated--see https://github.com/Azure/aks-engine instead. |
Is this a request for help?: NO
Is this an ISSUE or FEATURE REQUEST? (choose one): ISSUE
What version of acs-engine?: canary, GitCommit 8fd4ac4
Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm) kubernetes 1.8.6
What happened:
I deployed a simple cluster with one master node and two Windows nodes. In this deployment, requests to the cluster's own DNS server kubedns time out. Requests to external DNS servers work.
Remark: This issue is somewhat related to #558 and #1949. The related issues suggest that the DNS problems are related to the Windows dnscache service or to the custom VNET feature. But the following description points in a different direction.
What you expected to happen: Requests to the internal DNS server should not time out.
Steps to reproduce:
Deploy a simple kubernetes cluster with one master node and two Windows nodes with the following api model:
Then run a Windows container. I used the following command:
kubectl run mycore --image microsoft/windowsservercore:1709 -it powershell
Then run the following nslookup session, where you try to resolve a DNS entry with the default (internal) DNS server and then with Google's DNS server:
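The same comparison can be scripted. The sketch below is not the nslookup session from this report; it is a minimal stand-in that sends one DNS A query over UDP to a given server and reports whether any matching reply arrives before a timeout. It assumes Python is available in the container (it is not part of the stock windowsservercore image), and the function names are made up for the example.

```python
import socket
import struct


def build_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query packet for an A record (class IN)."""
    # Header: ID, flags (recursion desired), 1 question, 0 answer/authority/additional.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE = 1 (A), QCLASS = 1 (IN).
    return header + qname + struct.pack("!HH", 1, 1)


def server_answers(server: str, name: str, timeout: float = 2.0) -> bool:
    """Return True if `server` sends back a reply with a matching ID within `timeout` seconds."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:
            # Covers timeouts as well as unreachable-network errors.
            return False
    return len(reply) >= 12 and reply[:2] == query[:2]
```

Inside an affected pod, `server_answers("10.0.0.10", "www.google.com")` would be expected to return False while `server_answers("8.8.8.8", "www.google.com")` returns True, matching the timeout described here. The function only checks that some reply arrives, not that the name resolves.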
Anything else we need to know:
As suggested in #558, the problem should vanish 15 minutes after a pod has started. In my deployment, the problem does not disappear even after one hour.
I observed the behavior independent of the values of the networkPolicy (none, azure) and orchestratorRelease (1.7, 1.8, 1.9) properties in the api model. With the model above, I get the following network configuration inside the Windows pod: