K8S windows - pods are crashing due to CNI panic #3389
More details: I can see that the azure-cni-networkmonitor DaemonSet pods are failing.
And of course /var/run/ is not a directory, because my agents are Windows-based. Thanks,
A couple of new errors I'm getting while trying to start pods:
Error 1:
Error 2:
@AmitDaniel-wsc Could you please share your API model?
@idanshahar @AmitDaniel-wsc {
The first JSON parsing error is also reported at #3155. @saiyan86 Do you know why this is happening? The second panic error is a known issue for the Azure CNI plugin (see Azure/azure-container-networking#176). It has been fixed in Azure/azure-container-networking#177, and it may need some time for its release.
Hi, I opened #3153 with what seems to be the same issue. Checking the logs on my Windows nodes, I found IP-exhaustion errors coming from the CNI plugin.
Hi.
@AmitDaniel-wsc I tried to reproduce this.
Could you please share your Kubernetes deployment.yaml and the Dockerfile?
Hi @idanshahar, I must use the windows-server-core-1709 version. What do you mean by Kubernetes deployment.yaml and Dockerfile (which Dockerfile)?
@AmitDaniel-wsc The difference between my template and yours is that I removed this. By deployment.yaml and Dockerfile I mean the YAML that describes your pods, and the Dockerfile of your service.
@jackfrancis Can you please see why the CNI network monitor is starting on Windows VMs? It should not be started by default, and specifically not on Windows agents (even if requested). @AmitDaniel-wsc I suspect the error you are seeing is not due to Azure CNI or wincni. Can you please share the Azure CNI logs from C:\K\azure-vnet*? This will tell us the reason for the failure. @madhanrm Can you please take a look at whether HNS is failing to attach the endpoint? Also, can you confirm whether 1803 has some fixes that are not in 1709 (and could cause the above error)? What logs will you need to verify that?
@sharmasushant Thanks!
Let's use this to track the network monitor on Windows: #3404
#3405 is also reporting the JSON parsing error
@sharmasushant this will prevent the CNI network monitor from scheduling on Windows: #3407
Also, bonus points for issue 3389 being a favorite Windows service port!
@patrick,Francis there is going to be a release of CNI 1.0.7 today, so we can go ahead with installing the latest version.
@digeler We're testing Azure CNI v1.0.7 robustly, stand by...
Azure CNI v1.0.7 has landed in master |
@jackfrancis Please reopen this issue. I created a new cluster with the latest version of acs-engine that I built (I verified that I have all the master commits in this version) and I'm getting the same JSON parsing error:
Update 1: another error:
What do we do now?
@madhanrm
Sure.
@sharmasushant This is all the logs combined into one file.
@AmitDaniel-wsc Are you sure the logs are from the same node where you created "BLABLA"? In fact, I do not see any error in the logs you attached. There are multiple log files with .0, .1, and so on. Can you please attach all of them from the node where POD creation failed? I also do not find the logs of the second POD prod-wsc-cliprouploader-57878b4f9b-7986z_wsc that you pasted.
@sharmasushant I created 30 pods, so probably one of them was created on this node. Also, I'm not sure whether the errors come from the nodes or from the master. Let me know if you want more logs from the master.
@AmitDaniel-wsc Not from the master. To look into the issue, we will need the logs from the node where POD creation failed. I see that 38 containers successfully got allocated IPs from Azure CNI in the logs that you shared.
@sharmasushant It failed on this node. After a couple of minutes the deployment finished successfully, so I'm not sure what was written to the logs.
@AmitDaniel-wsc Ok, can you please share complete kubelet logs showing the failure with timestamps? It's strange that I don't see any calls in Azure CNI for the containers you are reporting failures for. @madhanrm Can you please take a look? Is it possible for k8s to not call CNI and assume failure in some code path? Or to think CNI failed even when it successfully finished ADD?
@sharmasushant Kubelet.log is empty.
Ok, from the look of it, it seems like some issue in Kubernetes. Take POD prod-wsc-cliprouploader-57878b4f9b-fntdp as an example: at 7:37:34, Azure CNI clearly finished ADD successfully.
However, the CNI code in github.com/containernetworking/cni/pkg/invoke/raw_exec.go thinks that ADD failed for some reason. The error finally bubbles up to github.com/kubernetes/pkg/kubelet/dockershim/network/cni/cni.go and shows up in kubelet. We will need someone from github.com/containernetworking to help look at what went wrong here (especially given that it does not fail all the time).
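To make the failure mode above concrete: the CNI invoke path expects the plugin's stdout to be a JSON result, so any non-JSON bytes (a stray log line, a malformed error object) make the caller report the ADD as failed even if the plugin actually finished configuring the network. A minimal sketch of that parsing step, simplified and not the actual containernetworking code:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// parseCNIResult mimics what the CNI invoke package does with a
// plugin's stdout: it must be a JSON object, otherwise the whole
// ADD is reported as failed -- even when the plugin completed its
// work before printing something unparseable.
func parseCNIResult(stdout []byte) (map[string]interface{}, error) {
	var result map[string]interface{}
	if err := json.Unmarshal(stdout, &result); err != nil {
		return nil, fmt.Errorf("error parsing CNI output %q: %v", stdout, err)
	}
	return result, nil
}

func main() {
	// A well-formed CNI result parses cleanly.
	if _, err := parseCNIResult([]byte(`{"cniVersion":"0.3.0","ips":[]}`)); err != nil {
		fmt.Println("unexpected:", err)
	}

	// A stray non-JSON line on stdout makes the ADD look failed,
	// which matches the "ADD succeeded but kubelet reports failure"
	// symptom described above.
	if _, err := parseCNIResult([]byte("Failed to connect to HNS")); err != nil {
		fmt.Println("reported as failed:", err)
	}
}
```

This is why the fix for the "unparseable error returned by CNI" (Azure#195, mentioned later in this thread) resolves the symptom: once the plugin emits valid JSON, the caller no longer misreports a successful ADD.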
@sharmasushant Ok.
We can post an issue to github.com/containernetworking with the above kubelet logs and the POD I mentioned, and ask for their help in understanding why cni.go thinks that ADD failed.
@jackfrancis @sharmasushant Please reopen this ticket.
Since we've made changes and the errors are different, can we use a new issue? Is #3447 the same deployment you're using? It's better for us to track what changes were made per issue, rather than reusing old ones.
Azure/azure-container-networking#195 is open for this error:
Resolves Azure#3389 / Azure#3447. Includes two important Azure CNI changes for Windows: a fix for the unparseable error returned by CNI (Azure#195) and a fix for an IP address leak in the HNS failure scenario in Windows CNI (Azure#218). Full notes at https://github.com/Azure/azure-container-networking/releases
Resolves Azure#3389 / Azure#3447 / Azure#3153. Includes two important Azure CNI changes for Windows: a fix for the unparseable error returned by CNI (Azure#195) and a fix for an IP address leak in the HNS failure scenario in Windows CNI (Azure#218). Full notes at https://github.com/Azure/azure-container-networking/releases
Is this a request for help?:
yes
Is this an ISSUE or FEATURE REQUEST? (choose one):
ISSUE
What version of acs-engine?:
Version: v0.19.0
GitCommit: 312770f
GitTreeState: clean
Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)
Kubernetes, v1.10.5
What happened:
Created a K8S cluster with 3 masters (Linux) and 10 agents (Windows)
agentWindowsSku = Datacenter-Core-1709-with-Containers-smalldisk
agentWindowsVersion = 1709.0.20180524
Tried to install with and without Azure CNI
Once I start adding pods, I get weird errors:
First error:
Second error:
What you expected to happen:
Pods start with no errors.
How to reproduce it (as minimally and precisely as possible):
Create a cluster with the latest version of acs-engine and change the Windows image to 1709.
Create the cluster with and without Azure CNI and start adding pods to your cluster.
After 10-15 pods you'll get the error.
Anything else we need to know:
I tried to start the cluster with an older version of acs-engine (0.18.5) and I'm still getting this error.
Thanks,
Amit