azure-cni returns unparseable output on IP address exhaustion, causes infinite retry loop #195

PatrickLang · 2018-07-12T23:03:57Z

Is this an ISSUE or FEATURE REQUEST? (choose one): Issue

Which release version?: 1.0.7

Which component (CNI/IPAM/CNM/CNS): CNI

Which Operating System (Linux/Windows): Windows Server version 1803

When the IP range for a node is exhausted, the kubelet needs to be able to parse the CNI output correctly to determine the error. in IP exhaustion - the pod should be evicted and scheduled on another node. Since the output cannot be parsed - the kubelet will go into a loop trying to schedule the container forever. If there are too many doing this, it can make the node unresponsive.

Logs from kubelet:

E0712 22:28:39.344476    2872 cni.go:260] Error adding network: netplugin failed but error parsing its diagnostic message "{\n    \"cniVersion\": \"0.3.0\",\n    \"interfaces\":
 [\n        {\n            \"name\": \"eth0\"\n        }\n    ],\n    \"dns\": {}\n}{\n    \"code\": 100,\n    \"msg\": \"Failed to allocate address: Failed to delegate: Failed
to allocate address: No available addresses\"\n}": invalid character '{' after top-level value
E0712 22:28:39.344476    2872 cni_windows.go:49] error while adding to cni network: netplugin failed but error parsing its diagnostic message "{\n    \"cniVersion\": \"0.3.0\",\
n    \"interfaces\": [\n        {\n            \"name\": \"eth0\"\n        }\n    ],\n    \"dns\": {}\n}{\n    \"code\": 100,\n    \"msg\": \"Failed to allocate address: Failed
to delegate: Failed to allocate address: No available addresses\"\n}": invalid character '{' after top-level value
W0712 22:28:39.344476    2872 docker_sandbox.go:372] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "iis-1803-688b96c57b-rqlxs_def
ault": netplugin failed but error parsing its diagnostic message "{\n    \"cniVersion\": \"0.3.0\",\n    \"interfaces\": [\n        {\n            \"name\": \"eth0\"\n        }\
n    ],\n    \"dns\": {}\n}{\n    \"code\": 100,\n    \"msg\": \"Failed to allocate address: Failed to delegate: Failed to allocate address: No available addresses\"\n}": invali
d character '{' after top-level value

Steps to reproduce:

Deploy a 2-node Windows cluster using acs-engine v0.19.2
Cordon one of the nodes to force all to be scheduled on 1 node
Create an IIS deployment with kubectl apply -f https://gist.github.com/PatrickLang/0df013d20d32eb98bc57456c4f73461a
Attempt to scale it to 30 kubectl scale deploy/iis-1803 --replicas=30
Uncordon the other node
run kubectl get pod -o wide to watch if any fail over to other node. they won't
Look at c:\k\kubelet.err.log on the node

The text was updated successfully, but these errors were encountered:

sharmasushant · 2018-07-12T23:43:08Z

Ack! We will look and update if fix is needed.. Thanks Patrick!

atomaras · 2018-07-21T02:43:57Z

I think it's worth noting that I've been having this issue on a brand new cluster with barely any containers running on the failed node.

sharmasushant · 2018-07-21T04:01:24Z

@atomaras What version of azure-cni and aci-engine are you using? Also, can you please confirm if you are using server 1803 as agent VMs?

atomaras · 2018-07-21T04:07:14Z

@sharmasushant
acs-engine: 0.19.5
azure-cni: containernetworking/networkmonitor:v0.0.4
k8s: 1.11.1
agent VMs: Server 1803

sharmasushant · 2018-07-21T04:15:25Z

@atomaras networkmonitor is not valid for windows cluster. Also, I think you mistyped acs-engine version, and missed azure-cni version. Can you please provide those?

atomaras · 2018-07-21T05:07:43Z

@sharmasushant Fixed the acs-engine version. I have a k8s cluster with 1 linux master and 2 windows nodes. The azure-cni image i mentioned is the actual image acs-engine selected for my cluster.

sharmasushant · 2018-07-21T05:20:46Z

Can you please attach the log file C:\K\azure-vnet.log
It should contain the version of azure-cni v1.0.xxx

atomaras · 2018-07-21T07:56:19Z

@sharmasushant This one?
2018/07/20 20:52:57 [cni-net] Plugin azure-vnet version v1.0.7.

The only other cni version i see is CNIVersion 0.3.0

tamilmani1989 · 2018-07-23T18:30:40Z

@PatrickLang Opened an issue in containernetworking/cni github repo - containernetworking/cni#571

Cherishty · 2018-07-27T02:55:10Z

Since this issue has not been fixed yet, do we have any quick fix or workaround now? Or I have to re-deploy a totally new cluster?

tamilmani1989 · 2018-07-30T19:59:29Z

We identified the cause of the issue and will fix it in next release. The issue is CNI should not write result to stdout in case of error.

PatrickLang · 2018-08-02T01:45:05Z

kubelet.log before

2018/08/01 22:04:51 "Start Flag true CniSucceeded false Name CNI Version v1.0.7 ErrorMessage A Compute System with the specified identifier does not exist. vnet [] 
				Context AzureCNI SubContext "
2018/08/01 22:04:51 OrchestratorDetails &{  kubectl command failed due to exit status 1}
2018/08/01 22:04:51 OSDetails &{windows    }
2018/08/01 22:04:51 SystemDetails &{0 0 0 0 0 0 }
2018/08/01 22:04:51 InterfaceDetails &{Primary 10.240.0.0/12 10.240.0.96 00:0d:3a:f9:09:bb vEthernet (Ethernet 2) 30 0 }
2018/08/01 22:04:51 BridgeDetails <nil>
E0801 22:04:53.995054    5092 cni.go:260] Error adding network: netplugin failed but error parsing its diagnostic message "{\n    \"cniVersion\": \"0.3.0\",\n    \"interfaces\": [\n        {\n            \"name\": \"eth0\"\n        }\n    ],\n    \"dns\": {}\n}{\n    \"code\": 100,\n    \"msg\": \"A Compute System with the specified identifier does not exist.\"\n}": invalid character '{' after top-level value
E0801 22:04:53.995054    5092 cni_windows.go:49] error while adding to cni network: netplugin failed but error parsing its diagnostic message "{\n    \"cniVersion\": \"0.3.0\",\n    \"interfaces\": [\n        {\n            \"name\": \"eth0\"\n        }\n    ],\n    \"dns\": {}\n}{\n    \"code\": 100,\n    \"msg\": \"A Compute System with the specified identifier does not exist.\"\n}": invalid character '{' after top-level value

kubelet.log after

2018/08/02 01:14:02 "Start Flag true CniSucceeded false Name CNI Version v1.0.10-2-g14efabc-dirty ErrorMessage  vnet [] 
				Context AzureCNI SubContext "
2018/08/02 01:14:02 OrchestratorDetails &{  kubectl command failed due to exit status 1}
2018/08/02 01:14:02 OSDetails &{windows    }
2018/08/02 01:14:02 SystemDetails &{0 0 0 0 0 0 }
2018/08/02 01:14:02 InterfaceDetails &{Primary 10.240.0.0/12 10.240.0.96 00:0d:3a:f9:09:bb vEthernet (Ethernet 2) 30 0 }
2018/08/02 01:14:02 BridgeDetails <nil>
E0802 01:14:04.307700    4616 cni.go:268] Error adding network: A Compute System with the specified identifier does not exist.
E0802 01:14:04.307700    4616 cni_windows.go:49] error while adding to cni network: A Compute System with the specified identifier does not exist.

Looks like the error results are parsed correctly now. There are still errors that need investigation but at least they're correct in the logs.

Here's the corresponding error from azure-vnet.log for the curious

2018/08/02 01:14:01 [cni-net] Plugin azure-vnet version v1.0.10-2-g14efabc-dirty.
2018/08/02 01:14:01 [cni-net] Running on windows
2018/08/02 01:14:01 [net] Network interface: {Index:12 MTU:1500 Name:vEthernet (Ethernet 2) HardwareAddr:00:0d:3a:f9:09:bb Flags:up|broadcast|multicast} with IP addresses: [fe80::4ff:f6a0:4d9a:b291/64 10.240.0.96/12]
2018/08/02 01:14:01 [net] Network interface: {Index:1 MTU:-1 Name:Loopback Pseudo-Interface 1 HardwareAddr: Flags:up|loopback|multicast} with IP addresses: [::1/128 127.0.0.1/8]
2018/08/02 01:14:01 [net] Network interface: {Index:5 MTU:1500 Name:vEthernet (nat) HardwareAddr:00:15:5d:b2:61:10 Flags:up|broadcast|multicast} with IP addresses: [fe80::e033:81f0:4fc2:1b1a/64 172.21.224.1/20]
2018/08/02 01:14:01 [net] reboot time 0001-01-01 00:00:00 +0000 UTC store mod time 2018-08-02 01:13:58.4576856 +0000 UTC
2018/08/02 01:14:01 [net] Restored state, &{Version:v1.0.10-2-g14efabc-dirty TimeStamp:2018-08-02 01:13:58.4516939 +0000 UTC ExternalInterfaces:map[vEthernet (Ethernet 2):0xc042165980] store:0xc04210a8a0 Mutex:{state:0 sema:0}}
2018/08/02 01:14:01 External Interface &{vEthernet (Ethernet 2) map[azure:0xc042168a80] [10.240.0.0/12]  00:0d:3a:f9:09:bb [] [] 0.0.0.0 ::}
2018/08/02 01:14:01 network &{azure dd561022-1e37-4078-91d3-a30b26ee3124 bridge 0 [{2 {10.240.0.0 fff00000} 10.240.0.1}] map[88e3a956-eth0:0xc0421ce1e0 e5e4f423-eth0:0xc0421ce2d0 054b1b95-eth0:0xc0421ce000 63354a97-eth0:0xc0421ce0f0] 0xc042165980 false}
2018/08/02 01:14:01 endpoint &{054b1b95-eth0 b540b79d-36ab-4d56-88a3-489c9f964f31 054b1b95bf9e23330bc1db7dcee74e0b061244b71c4a404278216a0de9325d09 eth0  00:15:5d:54:18:9d [{10.240.0.99 fff00000}] [10.240.0.1] {default.svc.cluster.local [10.0.0.10 168.63.129.16]} [{{0.0.0.0 00000000} 10.240.0.1 }] 0 false}
2018/08/02 01:14:01 endpoint &{63354a97-eth0 26b883bb-5929-4181-8b45-07caebac5eeb 63354a97ca8e48605e132e1cdf8cfe2345c138ea51b2857e40d7fc7d3286f559 eth0  00:15:5d:54:18:39 [{10.240.0.119 fff00000}] [10.240.0.1] {default.svc.cluster.local [10.0.0.10 168.63.129.16]} [{{0.0.0.0 00000000} 10.240.0.1 }] 0 false}
2018/08/02 01:14:01 endpoint &{88e3a956-eth0 fc2c34c3-e8c8-4ad1-bfb3-0ee7871b5b71 88e3a95615b475593c3037c21e4515716aeff6052ad6861494bc7ac518caab31 eth0  00:15:5d:54:1c:15 [{10.240.0.105 fff00000}] [10.240.0.1] {default.svc.cluster.local [10.0.0.10 168.63.129.16]} [{{0.0.0.0 00000000} 10.240.0.1 }] 0 false}
2018/08/02 01:14:01 endpoint &{e5e4f423-eth0 277ee58f-6a6a-43f0-a0d2-0523263a5001 e5e4f42351ac8f4d645b3c4ca1772b061b0e727171fea85d5a427f06dd15cc59 eth0  00:15:5d:54:1e:fb [{10.240.0.100 fff00000}] [10.240.0.1] {default.svc.cluster.local [10.0.0.10 168.63.129.16]} [{{0.0.0.0 00000000} 10.240.0.1 }] 0 false}
2018/08/02 01:14:01 [cni-net] Plugin started.
2018/08/02 01:14:01 [cni-net] Processing ADD command with args {ContainerID:9441e82f9379324696a34449317b9f013467db91780adf5babcd6f993280f538 Netns:container:e5e4f42351ac8f4d645b3c4ca1772b061b0e727171fea85d5a427f06dd15cc59 IfName:eth0 Args:IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=whoami-1803-78fd64846f-9r5qz;K8S_POD_INFRA_CONTAINER_ID=9441e82f9379324696a34449317b9f013467db91780adf5babcd6f993280f538 Path:c:\k\azurecni\bin}.
2018/08/02 01:14:01 [cni-net] Read network configuration &{CNIVersion:0.3.0 Name:azure Type:azure-vnet Mode:bridge Master: Bridge:azure0 LogLevel: LogTarget: MultiTenancy:false EnableSnatOnHost:false Ipam:{Type:azure-vnet-ipam Environment: AddrSpace: Subnet: Address: QueryInterval:} DNS:{Nameservers:[10.0.0.10 168.63.129.16] Domain: Search:[svc.cluster.local] Options:[]} AdditionalArgs:[{Name:EndpointPolicy Value:[123 34 69 120 99 101 112 116 105 111 110 76 105 115 116 34 58 91 34 49 48 46 50 52 48 46 48 46 48 47 49 50 34 44 34 49 48 46 50 52 48 46 48 46 48 47 49 50 34 93 44 34 84 121 112 101 34 58 34 79 117 116 66 111 117 110 100 78 65 84 34 125]} {Name:EndpointPolicy Value:[123 34 68 101 115 116 105 110 97 116 105 111 110 80 114 101 102 105 120 34 58 34 49 48 46 48 46 48 46 48 47 49 54 34 44 34 78 101 101 100 69 110 99 97 112 34 58 116 114 117 101 44 34 84 121 112 101 34 58 34 82 79 85 84 69 34 125]}]}.
2018/08/02 01:14:01 PrimaryInterfaceIdentifier :<nil>
2018/08/02 01:14:01 Trying to retrieve endpoint id e5e4f423-eth0
2018/08/02 01:14:01 [net] Found existing endpoint through hcsshim: &{Id:277ee58f-6a6a-43f0-a0d2-0523263a5001 Name:e5e4f423-eth0 VirtualNetwork:dd561022-1e37-4078-91d3-a30b26ee3124 VirtualNetworkName:azure Policies:[[123 32 34 69 120 99 101 112 116 105 111 110 76 105 115 116 34 32 58 32 91 32 34 49 48 46 50 52 48 46 48 46 48 47 49 50 34 44 32 34 49 48 46 50 52 48 46 48 46 48 47 49 50 34 32 93 44 32 34 84 121 112 101 34 32 58 32 34 79 117 116 66 111 117 110 100 78 65 84 34 32 125] [123 32 34 68 101 115 116 105 110 97 116 105 111 110 80 114 101 102 105 120 34 32 58 32 34 49 48 46 48 46 48 46 48 92 47 49 54 34 44 32 34 78 101 101 100 69 110 99 97 112 34 32 58 32 116 114 117 101 44 32 34 84 121 112 101 34 32 58 32 34 82 79 85 84 69 34 32 125] [123 32 34 84 121 112 101 34 32 58 32 34 76 50 68 114 105 118 101 114 34 32 125]] MacAddress:00-15-5D-54-1E-FB IPAddress:10.240.0.100 DNSSuffix:default.svc.cluster.local DNSServerList:10.0.0.10,168.63.129.16 GatewayAddress:10.240.0.1 EnableInternalDNS:false DisableICC:false PrefixLength:12 IsRemoteEndpoint:false Namespace:<nil>}
2018/08/02 01:14:01 [net] Attaching ep 277ee58f-6a6a-43f0-a0d2-0523263a5001 to container 9441e82f9379324696a34449317b9f013467db91780adf5babcd6f993280f538
2018/08/02 01:14:01 [cni-net] Failed to hot attach shared endpoint[277ee58f-6a6a-43f0-a0d2-0523263a5001] to container [9441e82f9379324696a34449317b9f013467db91780adf5babcd6f993280f538], err:A Compute System with the specified identifier does not exist..
2018/08/02 01:14:01 handleConsecutiveAdd failed with error A Compute System with the specified identifier does not exist.
2018/08/02 01:14:01 [cni-net] ADD command completed with result:Interfaces:[{Name:eth0 Mac: Sandbox:}], DNS:{Nameservers:[] Domain: Search:[] Options:[]} err:A Compute System with the specified identifier does not exist..
2018/08/02 01:14:01 Failed to execute network plugin, err:A Compute System with the specified identifier does not exist..
2018/08/02 01:14:01 Report plugin error
2018/08/02 01:14:04 SendReport failed due to [Azure CNI] HTTP Post returned statuscode 500
2018/08/02 01:14:04 [cni-net] Plugin stopped.

PatrickLang · 2018-08-02T01:45:50Z

@tamilmani1989 @sharmasushant - can you get a new azure-cni release with this fix?

PatrickLang · 2018-08-14T01:06:07Z

This fix is available in ACS-Engine now:
https://github.com/Azure/acs-engine/releases/tag/v0.20.9

Cherishty · 2018-12-13T07:37:59Z

Hi team, I was meeting a regression on this issue on my k8s cluster deployed by acs-engine 0.21.2, the with the similar error in the kubelet.log, another error reported as:

Failed to initialize key-value store of network plugin, err:Store is locked.

and when I describe pods I find the pvc of azurefile mounts error which results in Pod sandbox changed, see below:

Events:
  Type     Reason                 Age                From                   Message
  ----     ------                 ----               ----                   -------
  Normal   SandboxChanged         13m (x25 over 1h)  kubelet, 37359k8s9001  Pod sandbox changed, it will be killed and re-created.
  Normal   SuccessfulMountVolume  9m                 kubelet, 37359k8s9001  MountVolume.SetUp succeeded for volume "log"
  Normal   SuccessfulMountVolume  9m                 kubelet, 37359k8s9001  MountVolume.SetUp succeeded for volume "default-token-bkpkh"
  Warning  FailedMount            9m                 kubelet, 37359k8s9001  MountVolume.SetUp failed for volume "pvc-211d3e2f-eda4-11e8-a14f-000d3a408701" : exit status 1
  Normal   SuccessfulMountVolume  9m                 kubelet, 37359k8s9001  MountVolume.SetUp succeeded for volume "pvc-211d3e2f-eda4-11e8-a14f-000d3a408701"
  Normal   SandboxChanged         2m (x2 over 6m)    kubelet, 37359k8s9001  Pod sandbox changed, it will be killed and re-created.

ngrigoriev · 2019-09-09T13:57:51Z

This problem is happening and seems to be quick to reproduce in our cluster.

E0909 12:47:27.938338    6072 cni_windows.go:50] error while adding to cni network: invalid character '{' after top-level value
W0909 12:47:27.938338    6072 docker_sandbox.go:384] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "********": invalid character '{' after top-level value
2019/09/09 12:47:28 Formatted Boot time: 2019-09-06 17:06:01
2019/09/09 12:47:28 Connected to telemetry service
2019/09/09 12:47:28 [Azure-Utils] ver
I0909 12:47:28.369804    6072 fake_cpu_manager.go:45] [fake cpumanager] RemoveContainer (container id: 32ef256a8bc968aef122e9c4b8f78379460ef50ab5aabcd9b559121ca3ad3619)
I0909 12:47:28.370806    6072 fake_cpu_manager.go:45] [fake cpumanager] RemoveContainer (container id: 6e8aedba55d6639f42c66a5674dd1c04c9c71fb0687c4eed7e7775104f863d82)
I0909 12:47:28.370806    6072 fake_cpu_manager.go:45] [fake cpumanager] RemoveContainer (container id: 7ed04d04acb645b8d99a0b926559089dd12d1c5a04f55c5bb817657cc5314c09)
E0909 12:47:28.370806    6072 pod_workers.go:190] Error syncing pod dc207ac8-d2ce-11e9-9c76-a62f07b3f53e ("**********(dc207ac8-d2ce-11e9-9c76-a62f07b3f53e)"), skipping: failed to "StartContainer" for "*******" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=******** pod=********(dc207ac8-d2ce-11e9-9c76-a62f07b3f53e)"
E0909 12:47:28.370806    6072 pod_workers.go:190] Error syncing pod d6bf2baf-d2ce-11e9-9c76-a62f07b3f53e ("**********(d6bf2baf-d2ce-11e9-9c76-a62f07b3f53e)"), skipping: failed to "StartContainer" for "********" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=******** pod=************(d6bf2baf-d2ce-11e9-9c76-a62f07b3f53e)"
I0909 12:47:28.549806    6072 fake_cpu_manager.go:40] [fake cpumanager] AddContainer (pod: *******, container:********, container id: de1dc01965ed9756db581642846ef4d7b96c79a0b73eb6f7539390f0d9cce75c)
2019/09/09 12:47:28 Formatted Boot time: 2019-09-06 17:06:01
2019/09/09 12:47:28 Connected to telemetry service

AKS Kubernetes 1.14.5, latest Windows VM image + KB4512534 installed.

sharmasushant assigned tamilmani1989 Jul 12, 2018

This was referenced Jul 19, 2018

K8S windows - pods are crashing due to CNI panic Azure/acs-engine#3389

Closed

The cluster-internal DNS server cannot be used from Windows containers Azure/acs-engine#2027

Closed

PatrickLang mentioned this issue Jul 30, 2018

Pod will not start because of Unexpected end of JSON input kubernetes/kubernetes#65498

Closed

tamilmani1989 mentioned this issue Jul 30, 2018

Fix for unparseable error returned by Azure CNI #212

Merged

PatrickLang mentioned this issue Aug 2, 2018

azure-cni may leak IP allocations after failing to ADD them to a pod #214

Closed

PatrickLang closed this as completed Aug 14, 2018

MarcPow mentioned this issue Oct 12, 2018

Azure CNI fails to startup/teardown pods with a parse failure error #260

Closed

sophomeric mentioned this issue Oct 22, 2020

New pods failing to start with FailedCreatePodSandBox warning for CNI versions 1.7.x with Cilium aws/amazon-vpc-cni-k8s#1265

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

azure-cni returns unparseable output on IP address exhaustion, causes infinite retry loop #195

azure-cni returns unparseable output on IP address exhaustion, causes infinite retry loop #195

PatrickLang commented Jul 12, 2018

sharmasushant commented Jul 12, 2018

atomaras commented Jul 21, 2018

sharmasushant commented Jul 21, 2018

atomaras commented Jul 21, 2018 •

edited

Loading

sharmasushant commented Jul 21, 2018

atomaras commented Jul 21, 2018

sharmasushant commented Jul 21, 2018 •

edited

Loading

atomaras commented Jul 21, 2018 •

edited

Loading

tamilmani1989 commented Jul 23, 2018 •

edited

Loading

Cherishty commented Jul 27, 2018

tamilmani1989 commented Jul 30, 2018

PatrickLang commented Aug 2, 2018

PatrickLang commented Aug 2, 2018

PatrickLang commented Aug 14, 2018

Cherishty commented Dec 13, 2018

ngrigoriev commented Sep 9, 2019

azure-cni returns unparseable output on IP address exhaustion, causes infinite retry loop #195

azure-cni returns unparseable output on IP address exhaustion, causes infinite retry loop #195

Comments

PatrickLang commented Jul 12, 2018

sharmasushant commented Jul 12, 2018

atomaras commented Jul 21, 2018

sharmasushant commented Jul 21, 2018

atomaras commented Jul 21, 2018 • edited Loading

sharmasushant commented Jul 21, 2018

atomaras commented Jul 21, 2018

sharmasushant commented Jul 21, 2018 • edited Loading

atomaras commented Jul 21, 2018 • edited Loading

tamilmani1989 commented Jul 23, 2018 • edited Loading

Cherishty commented Jul 27, 2018

tamilmani1989 commented Jul 30, 2018

PatrickLang commented Aug 2, 2018

PatrickLang commented Aug 2, 2018

PatrickLang commented Aug 14, 2018

Cherishty commented Dec 13, 2018

ngrigoriev commented Sep 9, 2019

atomaras commented Jul 21, 2018 •

edited

Loading

sharmasushant commented Jul 21, 2018 •

edited

Loading

atomaras commented Jul 21, 2018 •

edited

Loading

tamilmani1989 commented Jul 23, 2018 •

edited

Loading