azure-cni returns unparseable output on IP address exhaustion, causes infinite retry loop #195
Comments
Ack! We will look and update if a fix is needed. Thanks Patrick!
I think it's worth noting that I've been having this issue on a brand new cluster with barely any containers running on the failed node.
@atomaras What version of azure-cni and acs-engine are you using? Also, can you please confirm whether you are using Windows Server version 1803 as the agent VMs?
@sharmasushant
@atomaras networkmonitor is not valid for a Windows cluster. Also, I think you mistyped the acs-engine version and missed the azure-cni version. Can you please provide those?
@sharmasushant Fixed the acs-engine version. I have a k8s cluster with 1 Linux master and 2 Windows nodes. The azure-cni image I mentioned is the actual image acs-engine selected for my cluster.
Can you please attach the log file C:\K\azure-vnet.log?
@sharmasushant This one? The only other CNI version I see is CNIVersion 0.3.0.
@PatrickLang Opened an issue in the containernetworking/cni GitHub repo: containernetworking/cni#571
Since this issue has not been fixed yet, is there any quick fix or workaround for now? Or do I have to re-deploy a totally new cluster?
We identified the cause of the issue and will fix it in the next release. The issue is that CNI should not write a result to stdout in case of error.
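For the curious, here is a minimal sketch of the convention described above (not azure-cni's actual code): on failure the plugin should write exactly one CNI error object to stdout and exit non-zero, rather than emitting a result followed by an error. The struct, error code, and messages below are illustrative assumptions.

```go
// errorreport.go - sketch of the CNI stdout convention on failure.
// Hypothetical example; not taken from azure-cni.
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// cniError mirrors the error object shape defined by the CNI spec.
type cniError struct {
	CNIVersion string `json:"cniVersion"`
	Code       int    `json:"code"`
	Msg        string `json:"msg"`
	Details    string `json:"details,omitempty"`
}

// reportError writes exactly one JSON error object to stdout and exits
// non-zero. No partial result should precede it; otherwise the kubelet
// cannot parse the output.
func reportError(code int, msg, details string) {
	_ = json.NewEncoder(os.Stdout).Encode(cniError{
		CNIVersion: "0.3.0",
		Code:       code,
		Msg:        msg,
		Details:    details,
	})
	os.Exit(1)
}

// cmdAdd stands in for the ADD handler; the failure here is a hypothetical
// stand-in for IP pool exhaustion.
func cmdAdd() error {
	return fmt.Errorf("Failed to allocate address: No available addresses")
}

func main() {
	if err := cmdAdd(); err != nil {
		// On error: emit only the error object, never result + error.
		reportError(11, "IP address exhaustion", err.Error())
	}
	// On success, a single result object would be printed here instead.
}
```

With exactly one JSON object on stdout, the caller can surface the real exhaustion error instead of a parse failure.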
kubelet.log before
kubelet.log after
Looks like the error results are parsed correctly now. There are still errors that need investigation, but at least they're reported correctly in the logs. Here's the corresponding error from azure-vnet.log, for the curious:
@tamilmani1989 @sharmasushant - can you get a new azure-cni release with this fix?
This fix is available in ACS-Engine now: |
Hi team, I am seeing a regression of this issue on my k8s cluster deployed by acs-engine 0.21.2, with a similar error in kubelet.log; another error is reported as:
When I describe the pods, I find that the azurefile PVC fails to mount, which results in "Pod sandbox changed"; see below:
This problem is happening in our cluster and seems quick to reproduce.
AKS Kubernetes 1.14.5, latest Windows VM image + KB4512534 installed.
Is this an ISSUE or FEATURE REQUEST? (choose one): Issue
Which release version?: 1.0.7
Which component (CNI/IPAM/CNM/CNS): CNI
Which Operating System (Linux/Windows): Windows Server version 1803
When the IP range for a node is exhausted, the kubelet needs to be able to parse the CNI output correctly to determine the error. On IP exhaustion, the pod should be evicted and scheduled on another node. Since the output cannot be parsed, the kubelet goes into a loop trying to schedule the container forever. If there are too many pods doing this, it can make the node unresponsive.
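To illustrate the parsing failure described above (a sketch, not kubelet's actual code; the sample plugin output is invented): a strict single-object JSON parse rejects a stream that contains a result object followed by an error object, so the caller sees a JSON parse error rather than the underlying IP-exhaustion error.

```go
// parsefailure.go - sketch of why mixed stdout output is unparseable.
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// What a broken plugin might emit: a (partial) result object followed
	// by an error object on the same stream.
	mixed := `{"cniVersion":"0.3.0","interfaces":[]}` +
		`{"cniVersion":"0.3.0","code":11,"msg":"no available addresses"}`

	var result map[string]interface{}
	// A single-object unmarshal rejects the trailing data, so the caller
	// sees "invalid character '{' after top-level value" instead of the
	// real exhaustion error.
	if err := json.Unmarshal([]byte(mixed), &result); err != nil {
		fmt.Println("parse error:", err)
	}
}
```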
Logs from kubelet:
Steps to reproduce:
kubectl apply -f https://gist.github.com/PatrickLang/0df013d20d32eb98bc57456c4f73461a
kubectl scale deploy/iis-1803 --replicas=30
kubectl get pod -o wide
to watch whether any pods fail over to the other node. They won't.
Check c:\k\kubelet.err.log on the node.