k8s: DNS resolution problems on windows nodes #558
Comments
@ajorkowski What is the output of
@JiangtianLi When I mention 'service' I am talking about the Kubernetes services, so we have some that are pointing to Windows deployments and some that are pointing to Linux deployments, but I haven't seen any problems with the connectivity between the Windows <-> Linux nodes (not confirmed though) or from the outside internet -> containers (confirmed). This seems to be a problem with accessing the outside internet from within the containers on Windows nodes. Later today I'll do some debugging and get the answers to your questions.
I'm having this exact same issue. After a (random) while the DNS lookup fails and never comes back fully until a DNS flush. Also the problem appears to be occurring with external hostnames most of the time (Mailgun, Azure Storage and SQL). As a workaround for now I have added a scheduled task (every 30 mins) to flush the DNS of my apps' containers.
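(For readers wanting to reproduce this workaround: a minimal sketch of such a scheduled flush on a Windows node, assuming the containers are reachable with docker exec. The task name, script path and 30-minute interval are illustrative, not taken from the thread.)

```powershell
# Hypothetical sketch: flush the DNS client cache inside every running container,
# wired to a scheduled task that repeats every 30 minutes.
# Save as C:\scripts\Flush-ContainerDns.ps1 on the node:
docker ps --format '{{.Names}}' | ForEach-Object { docker exec $_ ipconfig /flushdns }

# Register the task (run once on the node):
$action  = New-ScheduledTaskAction -Execute 'powershell.exe' `
           -Argument '-NoProfile -File C:\scripts\Flush-ContainerDns.ps1'
$trigger = New-ScheduledTaskTrigger -Once -At (Get-Date) `
           -RepetitionInterval (New-TimeSpan -Minutes 30)
Register-ScheduledTask -TaskName 'FlushContainerDns' -Action $action -Trigger $trigger -User 'SYSTEM'
```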
Ok, here is another example (this time it is cdn.raygun.io and not the table services).
Here is the
And just for completeness, I can query other urls:
This is interesting, if I do a
It looks like there may be intermittent failures and they are being cached?
@ajorkowski The failure is intermittent. I could repro once and as @skinny said, after
I have been using @skinny's fix for the last day now (i.e. using the scheduler on the node to flush the DNS in each container) and haven't had any problems so far. It is probably not ideal though...
Yeah, I needed to shorten the interval to 10 minutes to avoid any issues yesterday, but it does the job until we get a proper fix. I'm also going to experiment with the Windows DNS cache a bit to see if that helps. @ajorkowski I've modified my container run script to include the following two lines. These disable all DNS caching inside the container, so I don't need to flush it anymore and don't even have a small window of failed lookups.
Depending on the type of application, this might be useful for you too while we wait for a real fix.
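(The two lines themselves did not survive the copy of this thread; a plausible reconstruction, assumed rather than confirmed, is the common pattern of disabling the in-container DNS Client service.)

```powershell
# Hypothetical reconstruction of the "two lines": stop the DNS Client cache service
# inside the container so every lookup goes straight to the DNS server, uncached.
Set-Service dnscache -StartupType Disabled
Stop-Service dnscache
```

Note that a later comment in this thread reports that disabling the cache is not allowed on some images (and on 1709 Nano Server), so this may fail with an access-denied error.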
@skinny Awesome, I think for right now the 10 min ping is sufficient as we are really just running a test server at the moment, but it's good to know that there is an alternative that is even more reliable. Thanks for sharing.
After a day or two I did see this same issue occur in a v1.5.3 cluster.
Just an update on this - I recently upgraded our cluster using acs-engine 0.5.0 (Kubernetes 1.7.2) and tried turning the DNS cache back on, and I was still getting this issue. Still no idea what is causing it - it is like the DNS temporarily fails and then the cache seems to make that failure 'stick'.
I found setting the pod dnsPolicy to Default stopped my timeout issues. This is fine for me as I only need to resolve public DNS records. If you need cluster name resolution, try disabling the negative DNS cache in your container. If you do get a DNS timeout, the bad record won't get stuck in your cache and will usually resolve correctly next time.
@Dm3r yes, disabling the DNS cache is what I have been using for a couple of months now (see my previous comments). However I am still experiencing lookup failures, which cause the container to fail to start completely. After a couple of restarts (random, up to 20 restarts) the lookup I need at application startup succeeds and the container starts fine. So this issue is still present for me too unfortunately.
I didn't explain properly, but you can disable caching for failed DNS queries while still caching successful queries. Might be useful in certain situations. Are you using internal cluster lookups or do you just need to get out to the internet?
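(A minimal sketch of that negative-cache-only variant, assuming the container image lets you edit the DNS Client's registry parameters; whether a given Windows image honors the change is an assumption.)

```powershell
# Hypothetical sketch: keep caching successful lookups, but give failed lookups a
# zero TTL so a transient DNS timeout is retried on the next query instead of sticking.
New-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Services\Dnscache\Parameters' `
    -Name 'MaxNegativeCacheTtl' -Value 0 -PropertyType DWord -Force
Restart-Service dnscache   # pick up the new setting
```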
I need both internal and external lookups. As the lookups are pretty quick, I am just disabling all caching because that works, at least for my use case. Still, the failing lookups are the real issue ;-)
I too am experiencing this with clusters I built today in ACS via the Portal or CLI. I can repro this easily. The commands skinny added to his dockerfile to disable dnscache have worked for me as a temporary fix. I'd love to see this get some attention from Microsoft; I am willing to assist with data if needed.
Is there any update on this issue? We're hitting exactly this problem where DNS seems to just go away after a period and then that status sticks until the pod is restarted.
I am still experiencing this issue and it's pretty frustrating that nothing has changed since April. I have k8s 1.7.9 clusters experiencing this. I have clusters with Server 2016 LTSC and clusters with Server 2016 v1709, both experiencing this. I have found that turning off the dnscache service works well, but this is strongly not advised for production workloads. I have found this technique does not work with v1709 servers; I think there is a permission issue, as I am unable to disable the service at all on v1709 Nano Server images.
@brobichaud As you already know, acs-engine v0.9.2 and later uses Windows 1709 instead of Windows Server 2016. Windows 1709 has a different DNS issue, but it has been fixed and should be out in the next Windows update.
/cc @madhanrm
@JiangtianLi is there any chance you could circle back around and update this thread once the v1709 update you mention is being used by new cluster creations? I'd be happy to try my scenario once that is available.
@brobichaud Sure, will do. Currently the DNS issue on 1709 is random (due to a race condition) and happens only in the first 15 minutes after the container starts up. If you still have a DNS issue after that, then it should be due to a different root cause; please let us know.
What do you recommend I do if I have a StartUp.ps1 that currently fails to make an HTTP request via the Invoke-WebRequest cmdlet? I have been fighting this since I created a cluster in ACS. Do I need to disable the DNS cache? If so, how?
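(No answer to this question appears in the thread; one pattern consistent with the workarounds above is to retry the startup request and flush the container's DNS cache between attempts. The URL, retry count and delays below are placeholders, not from the thread.)

```powershell
# Hypothetical sketch for a startup script: retry the web request a few times,
# flushing the container's DNS cache between attempts so a cached failure
# does not block startup permanently.
$uri = 'https://example.com/health'    # placeholder URL
for ($attempt = 1; $attempt -le 5; $attempt++) {
    try {
        Invoke-WebRequest -Uri $uri -UseBasicParsing -TimeoutSec 15 | Out-Null
        break                          # success: stop retrying
    }
    catch {
        Write-Warning "Attempt ${attempt} failed: $($_.Exception.Message)"
        ipconfig /flushdns | Out-Null  # clear any cached failure before retrying
        Start-Sleep -Seconds 10
    }
}
```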
@JiangtianLi Disabling the cache in a Windows container is not allowed:
Deployed "simpleweb.yaml" service following these instructions:
The service & pod have been up for 21 minutes (screenshot). Not able to ping a server in the same vnet that I am able to ping from Linux containers in the cluster:
Not able to curl the internal DNS name of another service:
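(When reproducing reports like this, a few checks from inside the affected Windows pod help separate DNS failures from general connectivity failures. This is only a sketch: the pod/service names and the 10.0.0.4 vnet address are placeholders, and it assumes the image ships the DnsClient cmdlets.)

```powershell
# Hypothetical diagnostics, run from a shell inside the affected pod, e.g.:
#   kubectl exec -it <windows-pod> -- powershell
Get-DnsClientServerAddress                               # which DNS servers the container uses
Resolve-DnsName otherservice.default.svc.cluster.local   # in-cluster name via kube-dns
Resolve-DnsName www.bing.com                             # external name via kube-dns
Resolve-DnsName www.bing.com -Server 168.63.129.16       # same name via Azure's DNS IP, bypassing kube-dns
Test-NetConnection 10.0.0.4 -Port 443                    # raw TCP reachability to a vnet server (placeholder IP)
```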
This issue has become incredibly frustrating. I am using ACS 1.7.7 and can only occasionally resolve a remote URL even with DNS caching turned off. This is preventing us from moving forward with ACS.
@SteveCurran Sorry for the inconvenience. There is some update in another thread: #2027. With the next Windows/docker image update, it will mitigate the DNS issue in most scenarios.
@SteveCurran I was able to overcome this issue in a new cluster using the following: acs-engine 0.13.0. A couple of things I learned when trying to update my cluster using these instructions: When I tried updating the 1.7/0.11.0 cluster to 1.8.4 using 0.13.0, it failed almost immediately, so the update process has changed between the two acs-engine versions and has introduced new fields somewhere in the engine's output folder. When I tried updating 1.7/0.11.0 to 1.8.4 using the 0.11.0 engine, it failed during the ARM update steps. I have yet to debug the output (it didn't give me a helpful error). So at this point I'm looking at:
Side note/rant: My cluster needs to communicate with other external servers in the same vnet. These servers take a lot of time and effort to configure once they're provisioned (due to the legacy applications that need to run on the servers). Hybrid clusters using the acs-engine do not currently support deploying to an existing vnet. So I'm now on my third time deploying a new cluster, new vnet, and new legacy servers. I'm keeping my fingers crossed that this gets fixed before we need to update a production cluster because at this point it's going to require a complete rebuild of the vnet and all the resources within it.
@patrick-motard did you manage to mitigate the issue with a 1.8.4 cluster? I'm running a 1.8.8 cluster now, and one of my pods seems to have internet access (I'm using Azure AD amongst others), but another pod still cannot resolve the host name, so the problem isn't going away. I have the same issues you have: I have to rely on an Azure admin in my company to deploy the cluster according to my acs-engine templates (rights issue), and I've had to bother them too many times already :( @JiangtianLi Is there a way to obtain the new image with the fix already?
@roycornelissen after working on this issue today some more, I am still having this bug. On 1.7 I was not able to communicate with the internal k8s DNS from inside a Windows container in the cluster. On 1.8.4 I can now resolve service DNS from within a Windows container. I still cannot access the internet nor any IPs within the vnet (outside of the cluster) from a Windows container in k8s 1.8.4.
Also @JiangtianLi, this issue does not seem to be the same as the 15 min race condition bug. My containers have been deployed for hours and are exhibiting this behavior.
@skinny's DNS-flushing quick fix does not fix this issue for me either.
I'm having the same issue as @patrick-motard. Are there any updates to this?
@jdinard @patrick-motard There have been a lot of networking fixes as of acs-engine 0.19.2 and 0.19.3 -- do you still see this issue there (on Windows 1803)? For Windows 1709, there is a list of known DNS issues outlined here.
No new responses here - closing. If you hit this problem on a newer deployment, please open a new issue.
Hitting this problem on a new deployment, will file a new issue.
I'm using a fairly recent commit (#520 d852aba) of acs-engine with Kubernetes 1.6.2, including the winnat commit. However, unfortunately I'm seeing some issues with DNS resolution that are pretty intermittent (to be fair, I think they have been happening since I have been working on it, but there have been some other blocking issues).
Everything is working fairly well (I'm really enjoying the fast spin-up time) and then all of a sudden calls will start to time out in our backend. When I SSH into the box and run some PowerShell scripts I get something like the following:
These commands were executed right after each other, so I don't understand why the name is not being resolved.
If I do an ipconfig /flushdns call, the curl command (and my app) works again. After a little while (an hour or two?) things will start to fail again. I can confirm that not every external service is failing; I am able to curl other URLs without an issue while this is happening. And I haven't seen the problem happen with internal cluster DNS names, although this happens less frequently on our test app.
Maybe I don't really understand how the DNS lookup is working here. Anything I can do to alleviate this issue or debug it further?
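(A way to narrow this down while the failure is happening, sketched under the assumption that the container image ships the DnsClient cmdlets; cdn.raygun.io is used only because it appears earlier in the thread, and 8.8.8.8 is just an arbitrary public resolver.)

```powershell
# Hypothetical debugging pass, run inside the failing container while curl is failing.
Get-DnsClientCache -Name 'cdn.raygun.io' -ErrorAction SilentlyContinue  # is a stale/negative entry cached?
Resolve-DnsName 'cdn.raygun.io'                   # what the container's configured DNS returns right now
Resolve-DnsName 'cdn.raygun.io' -Server 8.8.8.8   # the same name asked of a public resolver, bypassing the cache
# If the last query succeeds while the first two fail, the stale answer is local
# (DNS cache or the cluster DNS path) rather than a problem with the name itself.
```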
Here's some more info that might be relevant: