Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

azure-cni returns unparseable output on IP address exhaustion, causes infinite retry loop #195

Closed
PatrickLang opened this issue Jul 12, 2018 · 16 comments
Assignees

Comments

@PatrickLang
Copy link
Contributor

Is this an ISSUE or FEATURE REQUEST? (choose one): Issue


Which release version?: 1.0.7


Which component (CNI/IPAM/CNM/CNS): CNI


Which Operating System (Linux/Windows): Windows Server version 1803


When the IP range for a node is exhausted, the kubelet needs to be able to parse the CNI output correctly to determine the error. in IP exhaustion - the pod should be evicted and scheduled on another node. Since the output cannot be parsed - the kubelet will go into a loop trying to schedule the container forever. If there are too many doing this, it can make the node unresponsive.

Logs from kubelet:

E0712 22:28:39.344476    2872 cni.go:260] Error adding network: netplugin failed but error parsing its diagnostic message "{\n    \"cniVersion\": \"0.3.0\",\n    \"interfaces\":
 [\n        {\n            \"name\": \"eth0\"\n        }\n    ],\n    \"dns\": {}\n}{\n    \"code\": 100,\n    \"msg\": \"Failed to allocate address: Failed to delegate: Failed
to allocate address: No available addresses\"\n}": invalid character '{' after top-level value
E0712 22:28:39.344476    2872 cni_windows.go:49] error while adding to cni network: netplugin failed but error parsing its diagnostic message "{\n    \"cniVersion\": \"0.3.0\",\
n    \"interfaces\": [\n        {\n            \"name\": \"eth0\"\n        }\n    ],\n    \"dns\": {}\n}{\n    \"code\": 100,\n    \"msg\": \"Failed to allocate address: Failed
to delegate: Failed to allocate address: No available addresses\"\n}": invalid character '{' after top-level value
W0712 22:28:39.344476    2872 docker_sandbox.go:372] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "iis-1803-688b96c57b-rqlxs_def
ault": netplugin failed but error parsing its diagnostic message "{\n    \"cniVersion\": \"0.3.0\",\n    \"interfaces\": [\n        {\n            \"name\": \"eth0\"\n        }\
n    ],\n    \"dns\": {}\n}{\n    \"code\": 100,\n    \"msg\": \"Failed to allocate address: Failed to delegate: Failed to allocate address: No available addresses\"\n}": invali
d character '{' after top-level value

Steps to reproduce:

  • Deploy a 2-node Windows cluster using acs-engine v0.19.2
  • Cordon one of the nodes to force all to be scheduled on 1 node
  • Create an IIS deployment with kubectl apply -f https://gist.github.com/PatrickLang/0df013d20d32eb98bc57456c4f73461a
  • Attempt to scale it to 30 kubectl scale deploy/iis-1803 --replicas=30
  • Uncordon the other node
  • run kubectl get pod -o wide to watch if any fail over to other node. they won't
  • Look at c:\k\kubelet.err.log on the node
@sharmasushant
Copy link
Contributor

Ack! We will look and update if fix is needed.. Thanks Patrick!

@atomaras
Copy link

I think it's worth noting that I've been having this issue on a brand new cluster with barely any containers running on the failed node.

@sharmasushant
Copy link
Contributor

@atomaras What version of azure-cni and aci-engine are you using? Also, can you please confirm if you are using server 1803 as agent VMs?

@atomaras
Copy link

atomaras commented Jul 21, 2018

@sharmasushant
acs-engine: 0.19.5
azure-cni: containernetworking/networkmonitor:v0.0.4
k8s: 1.11.1
agent VMs: Server 1803

@sharmasushant
Copy link
Contributor

@atomaras networkmonitor is not valid for windows cluster. Also, I think you mistyped acs-engine version, and missed azure-cni version. Can you please provide those?

@atomaras
Copy link

@sharmasushant Fixed the acs-engine version. I have a k8s cluster with 1 linux master and 2 windows nodes. The azure-cni image i mentioned is the actual image acs-engine selected for my cluster.

@sharmasushant
Copy link
Contributor

sharmasushant commented Jul 21, 2018

Can you please attach the log file C:\K\azure-vnet.log
It should contain the version of azure-cni v1.0.xxx

@atomaras
Copy link

atomaras commented Jul 21, 2018

@sharmasushant This one?
2018/07/20 20:52:57 [cni-net] Plugin azure-vnet version v1.0.7.

The only other cni version i see is CNIVersion 0.3.0

@tamilmani1989
Copy link
Member

tamilmani1989 commented Jul 23, 2018

@PatrickLang Opened an issue in containernetworking/cni github repo - containernetworking/cni#571

@Cherishty
Copy link

Since this issue has not been fixed yet, do we have any quick fix or workaround now? Or I have to re-deploy a totally new cluster?

@tamilmani1989
Copy link
Member

We identified the cause of the issue and will fix it in next release. The issue is CNI should not write result to stdout in case of error.

@PatrickLang
Copy link
Contributor Author

kubelet.log before

2018/08/01 22:04:51 "Start Flag true CniSucceeded false Name CNI Version v1.0.7 ErrorMessage A Compute System with the specified identifier does not exist. vnet [] 
				Context AzureCNI SubContext "
2018/08/01 22:04:51 OrchestratorDetails &{  kubectl command failed due to exit status 1}
2018/08/01 22:04:51 OSDetails &{windows    }
2018/08/01 22:04:51 SystemDetails &{0 0 0 0 0 0 }
2018/08/01 22:04:51 InterfaceDetails &{Primary 10.240.0.0/12 10.240.0.96 00:0d:3a:f9:09:bb vEthernet (Ethernet 2) 30 0 }
2018/08/01 22:04:51 BridgeDetails <nil>
E0801 22:04:53.995054    5092 cni.go:260] Error adding network: netplugin failed but error parsing its diagnostic message "{\n    \"cniVersion\": \"0.3.0\",\n    \"interfaces\": [\n        {\n            \"name\": \"eth0\"\n        }\n    ],\n    \"dns\": {}\n}{\n    \"code\": 100,\n    \"msg\": \"A Compute System with the specified identifier does not exist.\"\n}": invalid character '{' after top-level value
E0801 22:04:53.995054    5092 cni_windows.go:49] error while adding to cni network: netplugin failed but error parsing its diagnostic message "{\n    \"cniVersion\": \"0.3.0\",\n    \"interfaces\": [\n        {\n            \"name\": \"eth0\"\n        }\n    ],\n    \"dns\": {}\n}{\n    \"code\": 100,\n    \"msg\": \"A Compute System with the specified identifier does not exist.\"\n}": invalid character '{' after top-level value

kubelet.log after

2018/08/02 01:14:02 "Start Flag true CniSucceeded false Name CNI Version v1.0.10-2-g14efabc-dirty ErrorMessage  vnet [] 
				Context AzureCNI SubContext "
2018/08/02 01:14:02 OrchestratorDetails &{  kubectl command failed due to exit status 1}
2018/08/02 01:14:02 OSDetails &{windows    }
2018/08/02 01:14:02 SystemDetails &{0 0 0 0 0 0 }
2018/08/02 01:14:02 InterfaceDetails &{Primary 10.240.0.0/12 10.240.0.96 00:0d:3a:f9:09:bb vEthernet (Ethernet 2) 30 0 }
2018/08/02 01:14:02 BridgeDetails <nil>
E0802 01:14:04.307700    4616 cni.go:268] Error adding network: A Compute System with the specified identifier does not exist.
E0802 01:14:04.307700    4616 cni_windows.go:49] error while adding to cni network: A Compute System with the specified identifier does not exist.

Looks like the error results are parsed correctly now. There are still errors that need investigation but at least they're correct in the logs.

Here's the corresponding error from azure-vnet.log for the curious

2018/08/02 01:14:01 [cni-net] Plugin azure-vnet version v1.0.10-2-g14efabc-dirty.
2018/08/02 01:14:01 [cni-net] Running on windows
2018/08/02 01:14:01 [net] Network interface: {Index:12 MTU:1500 Name:vEthernet (Ethernet 2) HardwareAddr:00:0d:3a:f9:09:bb Flags:up|broadcast|multicast} with IP addresses: [fe80::4ff:f6a0:4d9a:b291/64 10.240.0.96/12]
2018/08/02 01:14:01 [net] Network interface: {Index:1 MTU:-1 Name:Loopback Pseudo-Interface 1 HardwareAddr: Flags:up|loopback|multicast} with IP addresses: [::1/128 127.0.0.1/8]
2018/08/02 01:14:01 [net] Network interface: {Index:5 MTU:1500 Name:vEthernet (nat) HardwareAddr:00:15:5d:b2:61:10 Flags:up|broadcast|multicast} with IP addresses: [fe80::e033:81f0:4fc2:1b1a/64 172.21.224.1/20]
2018/08/02 01:14:01 [net] reboot time 0001-01-01 00:00:00 +0000 UTC store mod time 2018-08-02 01:13:58.4576856 +0000 UTC
2018/08/02 01:14:01 [net] Restored state, &{Version:v1.0.10-2-g14efabc-dirty TimeStamp:2018-08-02 01:13:58.4516939 +0000 UTC ExternalInterfaces:map[vEthernet (Ethernet 2):0xc042165980] store:0xc04210a8a0 Mutex:{state:0 sema:0}}
2018/08/02 01:14:01 External Interface &{vEthernet (Ethernet 2) map[azure:0xc042168a80] [10.240.0.0/12]  00:0d:3a:f9:09:bb [] [] 0.0.0.0 ::}
2018/08/02 01:14:01 network &{azure dd561022-1e37-4078-91d3-a30b26ee3124 bridge 0 [{2 {10.240.0.0 fff00000} 10.240.0.1}] map[88e3a956-eth0:0xc0421ce1e0 e5e4f423-eth0:0xc0421ce2d0 054b1b95-eth0:0xc0421ce000 63354a97-eth0:0xc0421ce0f0] 0xc042165980 false}
2018/08/02 01:14:01 endpoint &{054b1b95-eth0 b540b79d-36ab-4d56-88a3-489c9f964f31 054b1b95bf9e23330bc1db7dcee74e0b061244b71c4a404278216a0de9325d09 eth0  00:15:5d:54:18:9d [{10.240.0.99 fff00000}] [10.240.0.1] {default.svc.cluster.local [10.0.0.10 168.63.129.16]} [{{0.0.0.0 00000000} 10.240.0.1 }] 0 false}
2018/08/02 01:14:01 endpoint &{63354a97-eth0 26b883bb-5929-4181-8b45-07caebac5eeb 63354a97ca8e48605e132e1cdf8cfe2345c138ea51b2857e40d7fc7d3286f559 eth0  00:15:5d:54:18:39 [{10.240.0.119 fff00000}] [10.240.0.1] {default.svc.cluster.local [10.0.0.10 168.63.129.16]} [{{0.0.0.0 00000000} 10.240.0.1 }] 0 false}
2018/08/02 01:14:01 endpoint &{88e3a956-eth0 fc2c34c3-e8c8-4ad1-bfb3-0ee7871b5b71 88e3a95615b475593c3037c21e4515716aeff6052ad6861494bc7ac518caab31 eth0  00:15:5d:54:1c:15 [{10.240.0.105 fff00000}] [10.240.0.1] {default.svc.cluster.local [10.0.0.10 168.63.129.16]} [{{0.0.0.0 00000000} 10.240.0.1 }] 0 false}
2018/08/02 01:14:01 endpoint &{e5e4f423-eth0 277ee58f-6a6a-43f0-a0d2-0523263a5001 e5e4f42351ac8f4d645b3c4ca1772b061b0e727171fea85d5a427f06dd15cc59 eth0  00:15:5d:54:1e:fb [{10.240.0.100 fff00000}] [10.240.0.1] {default.svc.cluster.local [10.0.0.10 168.63.129.16]} [{{0.0.0.0 00000000} 10.240.0.1 }] 0 false}
2018/08/02 01:14:01 [cni-net] Plugin started.
2018/08/02 01:14:01 [cni-net] Processing ADD command with args {ContainerID:9441e82f9379324696a34449317b9f013467db91780adf5babcd6f993280f538 Netns:container:e5e4f42351ac8f4d645b3c4ca1772b061b0e727171fea85d5a427f06dd15cc59 IfName:eth0 Args:IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=whoami-1803-78fd64846f-9r5qz;K8S_POD_INFRA_CONTAINER_ID=9441e82f9379324696a34449317b9f013467db91780adf5babcd6f993280f538 Path:c:\k\azurecni\bin}.
2018/08/02 01:14:01 [cni-net] Read network configuration &{CNIVersion:0.3.0 Name:azure Type:azure-vnet Mode:bridge Master: Bridge:azure0 LogLevel: LogTarget: MultiTenancy:false EnableSnatOnHost:false Ipam:{Type:azure-vnet-ipam Environment: AddrSpace: Subnet: Address: QueryInterval:} DNS:{Nameservers:[10.0.0.10 168.63.129.16] Domain: Search:[svc.cluster.local] Options:[]} AdditionalArgs:[{Name:EndpointPolicy Value:[123 34 69 120 99 101 112 116 105 111 110 76 105 115 116 34 58 91 34 49 48 46 50 52 48 46 48 46 48 47 49 50 34 44 34 49 48 46 50 52 48 46 48 46 48 47 49 50 34 93 44 34 84 121 112 101 34 58 34 79 117 116 66 111 117 110 100 78 65 84 34 125]} {Name:EndpointPolicy Value:[123 34 68 101 115 116 105 110 97 116 105 111 110 80 114 101 102 105 120 34 58 34 49 48 46 48 46 48 46 48 47 49 54 34 44 34 78 101 101 100 69 110 99 97 112 34 58 116 114 117 101 44 34 84 121 112 101 34 58 34 82 79 85 84 69 34 125]}]}.
2018/08/02 01:14:01 PrimaryInterfaceIdentifier :<nil>
2018/08/02 01:14:01 Trying to retrieve endpoint id e5e4f423-eth0
2018/08/02 01:14:01 [net] Found existing endpoint through hcsshim: &{Id:277ee58f-6a6a-43f0-a0d2-0523263a5001 Name:e5e4f423-eth0 VirtualNetwork:dd561022-1e37-4078-91d3-a30b26ee3124 VirtualNetworkName:azure Policies:[[123 32 34 69 120 99 101 112 116 105 111 110 76 105 115 116 34 32 58 32 91 32 34 49 48 46 50 52 48 46 48 46 48 47 49 50 34 44 32 34 49 48 46 50 52 48 46 48 46 48 47 49 50 34 32 93 44 32 34 84 121 112 101 34 32 58 32 34 79 117 116 66 111 117 110 100 78 65 84 34 32 125] [123 32 34 68 101 115 116 105 110 97 116 105 111 110 80 114 101 102 105 120 34 32 58 32 34 49 48 46 48 46 48 46 48 92 47 49 54 34 44 32 34 78 101 101 100 69 110 99 97 112 34 32 58 32 116 114 117 101 44 32 34 84 121 112 101 34 32 58 32 34 82 79 85 84 69 34 32 125] [123 32 34 84 121 112 101 34 32 58 32 34 76 50 68 114 105 118 101 114 34 32 125]] MacAddress:00-15-5D-54-1E-FB IPAddress:10.240.0.100 DNSSuffix:default.svc.cluster.local DNSServerList:10.0.0.10,168.63.129.16 GatewayAddress:10.240.0.1 EnableInternalDNS:false DisableICC:false PrefixLength:12 IsRemoteEndpoint:false Namespace:<nil>}
2018/08/02 01:14:01 [net] Attaching ep 277ee58f-6a6a-43f0-a0d2-0523263a5001 to container 9441e82f9379324696a34449317b9f013467db91780adf5babcd6f993280f538
2018/08/02 01:14:01 [cni-net] Failed to hot attach shared endpoint[277ee58f-6a6a-43f0-a0d2-0523263a5001] to container [9441e82f9379324696a34449317b9f013467db91780adf5babcd6f993280f538], err:A Compute System with the specified identifier does not exist..
2018/08/02 01:14:01 handleConsecutiveAdd failed with error A Compute System with the specified identifier does not exist.
2018/08/02 01:14:01 [cni-net] ADD command completed with result:Interfaces:[{Name:eth0 Mac: Sandbox:}], DNS:{Nameservers:[] Domain: Search:[] Options:[]} err:A Compute System with the specified identifier does not exist..
2018/08/02 01:14:01 Failed to execute network plugin, err:A Compute System with the specified identifier does not exist..
2018/08/02 01:14:01 Report plugin error
2018/08/02 01:14:04 SendReport failed due to [Azure CNI] HTTP Post returned statuscode 500
2018/08/02 01:14:04 [cni-net] Plugin stopped.

@PatrickLang
Copy link
Contributor Author

@tamilmani1989 @sharmasushant - can you get a new azure-cni release with this fix?

@PatrickLang
Copy link
Contributor Author

This fix is available in ACS-Engine now:
https://github.com/Azure/acs-engine/releases/tag/v0.20.9

@Cherishty
Copy link

Hi team, I was meeting a regression on this issue on my k8s cluster deployed by acs-engine 0.21.2, the with the similar error in the kubelet.log, another error reported as:

Failed to initialize key-value store of network plugin, err:Store is locked.

and when I describe pods I find the pvc of azurefile mounts error which results in Pod sandbox changed, see below:

Events:
  Type     Reason                 Age                From                   Message
  ----     ------                 ----               ----                   -------
  Normal   SandboxChanged         13m (x25 over 1h)  kubelet, 37359k8s9001  Pod sandbox changed, it will be killed and re-created.
  Normal   SuccessfulMountVolume  9m                 kubelet, 37359k8s9001  MountVolume.SetUp succeeded for volume "log"
  Normal   SuccessfulMountVolume  9m                 kubelet, 37359k8s9001  MountVolume.SetUp succeeded for volume "default-token-bkpkh"
  Warning  FailedMount            9m                 kubelet, 37359k8s9001  MountVolume.SetUp failed for volume "pvc-211d3e2f-eda4-11e8-a14f-000d3a408701" : exit status 1
  Normal   SuccessfulMountVolume  9m                 kubelet, 37359k8s9001  MountVolume.SetUp succeeded for volume "pvc-211d3e2f-eda4-11e8-a14f-000d3a408701"
  Normal   SandboxChanged         2m (x2 over 6m)    kubelet, 37359k8s9001  Pod sandbox changed, it will be killed and re-created.

@ngrigoriev
Copy link

This problem is happening and seems to be quick to reproduce in our cluster.

E0909 12:47:27.938338    6072 cni_windows.go:50] error while adding to cni network: invalid character '{' after top-level value
W0909 12:47:27.938338    6072 docker_sandbox.go:384] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "********": invalid character '{' after top-level value
2019/09/09 12:47:28 Formatted Boot time: 2019-09-06 17:06:01
2019/09/09 12:47:28 Connected to telemetry service
2019/09/09 12:47:28 [Azure-Utils] ver
I0909 12:47:28.369804    6072 fake_cpu_manager.go:45] [fake cpumanager] RemoveContainer (container id: 32ef256a8bc968aef122e9c4b8f78379460ef50ab5aabcd9b559121ca3ad3619)
I0909 12:47:28.370806    6072 fake_cpu_manager.go:45] [fake cpumanager] RemoveContainer (container id: 6e8aedba55d6639f42c66a5674dd1c04c9c71fb0687c4eed7e7775104f863d82)
I0909 12:47:28.370806    6072 fake_cpu_manager.go:45] [fake cpumanager] RemoveContainer (container id: 7ed04d04acb645b8d99a0b926559089dd12d1c5a04f55c5bb817657cc5314c09)
E0909 12:47:28.370806    6072 pod_workers.go:190] Error syncing pod dc207ac8-d2ce-11e9-9c76-a62f07b3f53e ("**********(dc207ac8-d2ce-11e9-9c76-a62f07b3f53e)"), skipping: failed to "StartContainer" for "*******" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=******** pod=********(dc207ac8-d2ce-11e9-9c76-a62f07b3f53e)"
E0909 12:47:28.370806    6072 pod_workers.go:190] Error syncing pod d6bf2baf-d2ce-11e9-9c76-a62f07b3f53e ("**********(d6bf2baf-d2ce-11e9-9c76-a62f07b3f53e)"), skipping: failed to "StartContainer" for "********" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=******** pod=************(d6bf2baf-d2ce-11e9-9c76-a62f07b3f53e)"
I0909 12:47:28.549806    6072 fake_cpu_manager.go:40] [fake cpumanager] AddContainer (pod: *******, container:********, container id: de1dc01965ed9756db581642846ef4d7b96c79a0b73eb6f7539390f0d9cce75c)
2019/09/09 12:47:28 Formatted Boot time: 2019-09-06 17:06:01
2019/09/09 12:47:28 Connected to telemetry service

AKS Kubernetes 1.14.5, latest Windows VM image + KB4512534 installed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants