
Still seeing "error adding host side routes for interface: xxx, error: failed to add route file exists" in Calico 2.6.1 #1253

Closed
bradbehle opened this issue Oct 25, 2017 · 21 comments

@bradbehle
Contributor

Still seeing "error adding host side routes for interface: xxx, error: failed to add route file exists", which is keeping containers from starting (stuck in "ContainerCreating"). This is detailed in https://github.com/projectcalico/cni-plugin/issues/352, and more detail about the orphan WEPs that are part of this problem can be found in #1115. It was thought that the fixes in Calico 2.6.1 would solve this problem, but we are still seeing it.

Your Environment

  • Calico version: 2.6.1
  • Orchestrator version: Kubernetes 1.8.1
  • Operating System and version: Ubuntu 16.04.3

I will attach the logs. There are quite a few containers stuck in this ContainerCreating state; one is master-b4fe00b59ff948088731b4985367b705-6b987df84d-bz9zv in the kubx-masters namespace.
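For what it's worth, the underlying CNI error shows up in the stuck pod's events; a hedged example using the pod mentioned above:

# The "failed to add route ... file exists" message should appear under Events
kubectl -n kubx-masters describe pod master-b4fe00b59ff948088731b4985367b705-6b987df84d-bz9zv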

@bradbehle
Contributor Author

@bradbehle
Contributor Author

Above are the logs I could attach. I also have syslog, but it is too large to attach; let me know if you need it and I'll get it to you somehow.

@caseydavenport
Member

@bradbehle is this on a fresh cluster, or an upgraded cluster from a previous version? Could you do a quick double-check of the version of the CNI plugin in use?

This should do it:

/opt/cni/bin/calico -v 

@bradbehle
Contributor Author

@caseydavenport

CNI plugin is at 1.11.0:

behle@dev-mex01-carrier80-worker-01:~$ /opt/cni/bin/calico -v 
v1.11.0
behle@dev-mex01-carrier80-worker-01:~$

And it was an upgraded cluster that was originally at Kubernetes 1.5 and Calico 2.1.5.

@r0bj

r0bj commented Oct 25, 2017

I've seen similar behaviour:

2017-10-25 22:42:21.163 [ERROR][21214] k8s.go 273: Error setting up networking: error adding host side routes for interface: cali665822806f1, error: failed to add route (Dst: 10.200.144.163/32, Scope: %!s(netlink.Scope=253), Iface: cali665822806f1): file exists Workload="ops.logstash-k8s-3042342600-7jrkq"

My setup:

# /opt/cni/bin/calico -v
v1.11.0

k8s: 1.8.1
calico: 2.6.1
OS: ubuntu 14.04

kubelet logs:
https://gist.github.com/r0bj/a656c75a7b08e1b79be4671206b70779

calico-node logs:
https://gist.github.com/r0bj/1279b3fa321ea5d47d66f5e4c945eadb

The cluster was upgraded from a previous version, but all pods have already been recreated.

@gunjan5
Contributor

gunjan5 commented Nov 3, 2017

@bradbehle @r0bj I have 2 PRs up for the fix, one for CNI v2.0 and one backported to CNI v1.11.x. I've made a CNI image with the fix backported to CNI v1.11.0 so you can try it out. I haven't been able to reproduce the issue, but I've added a test that replicates it as best I can. You can try the debug image with the fix at gunjan5/cni:route2 on Dockerhub; it should also have the CNI binary. It would be great if you could try it out and let me know if it works for you.
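One possible way to swap in the debug image on a self-hosted install (a sketch only; the DaemonSet name calico-node and container name install-cni are assumptions about a standard manifest, so adjust them to match your deployment):

# Point the CNI install container at the debug image
kubectl -n kube-system set image daemonset/calico-node install-cni=gunjan5/cni:route2

# After the rollout, confirm the binary on the node was replaced
/opt/cni/bin/calico -v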

@gunjan5
Contributor

gunjan5 commented Nov 6, 2017

@bradbehle @r0bj just checking to see if you've had a chance to try out the debug image. If it works, then we can get the PRs merged and the fix included in the next patch releases.

@r0bj

r0bj commented Nov 6, 2017

@gunjan5 for me it's just difficult to reproduce. I have encountered it twice on a production cluster, and after a node restart everything worked as expected again. If I find a better way of reproducing it, I'll test your debug image for sure.

@msavlani

msavlani commented Nov 8, 2017

Hi @gunjan5,

I have the exact same problem as @r0bj. I took your patch and applied it in my environment, and now I see the error below:

Nov 8 00:53:33 node1 kubelet: E1108 00:53:33.577354 15218 pod_workers.go:182] Error syncing pod fb3638af-c443-11e7-9f0f-0894ef42f61e ("test1-test1-1193689166-zqh8b_namespace1(fb3638af-c443-11e7-9f0f-0894ef42f61e)"), skipping: failed to "SetupNetwork" for "aio-stage3-serviceavailability-mobilezip-1193689166-zqh8b_staging3" with SetupNetworkError: "NetworkPlugin cni failed to set up pod "test1-test1-1193689166-zqh8b_namespace1" network: error adding host side routes for interface: cali38f5e43d7eb, error: route (Ifindex: 8374, Dst: 172.40.107.158/32, Scope: %!!(MISSING)s(netlink.Scope=253)) already exists for an interface other than 'cali38f5e43d7eb'"

The error is a little different this time.

It says the route already exists for an interface other than 'calixxxxx'.

What needs to be done to resolve this?
Is there any workaround you can suggest for now?

@gunjan5
Contributor

gunjan5 commented Nov 8, 2017

@msavlani can you post the CNI debug logs?

@r0bj

r0bj commented Nov 10, 2017

@gunjan5 I used your debug image and I was able to reproduce it.

kubelet error message:

2017-11-10 16:34:00.247 [WARNING][27292] calico-ipam.go 236: Asked to release address but it doesn't exist. Ignoring Workload="ops.nginx-stateful-test-0" workloadId="ops.nginx-stateful-test-0"
E1110 16:34:00.256432    1467 cni.go:301] Error adding network: error adding host side routes for interface: cali05ad595351c, error: route (Ifindex: 93, Dst: 10.203.131.144/32, Scope: %!!(MISSING)s(netlink.Scope=253)) already exists for an interface other than 'cali05ad595351c'
E1110 16:34:00.256462    1467 cni.go:250] Error while adding to cni network: error adding host side routes for interface: cali05ad595351c, error: route (Ifindex: 93, Dst: 10.203.131.144/32, Scope: %!!(MISSING)s(netlink.Scope=253)) already exists for an interface other than 'cali05ad595351c'
I1110 16:34:00.306370    1467 kubelet.go:1871] SyncLoop (PLEG): "nginx-stateful-test-0_ops(90302ede-c634-11e7-883a-5254009ef6db)", event: &pleg.PodLifecycleEvent{ID:"90302ede-c634-11e7-883a-5254009ef6db", Type:"ContainerStarted", Data:"85e04cf0fcb9830d0824805bb9ca4849b5c53ee93124332a746d22a8956de823"}
E1110 16:34:00.457030    1467 remote_runtime.go:92] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "nginx-stateful-test-0_ops" network: error adding host side routes for interface: cali05ad595351c, error: route (Ifindex: 93, Dst: 10.203.131.144/32, Scope: %!!(MISSING)s(netlink.Scope=253)) already exists for an interface other than 'cali05ad595351c'

calico error:

2017-11-10 16:33:16.862 [WARNING][156] route_table.go 477: Failed to add route error=file exists ifaceName="cali05ad595351c" ipVersion=0x4 targetCIDR=10.203.131.144/32
2017-11-10 16:33:16.862 [INFO][156] route_table.go 200: Trying to connect to netlink
2017-11-10 16:33:16.863 [WARNING][156] route_table.go 577: Failed to access interface but it now appears to be up error=netlink update operation failed ifaceName="cali05ad595351c" ipVersion=0x4 link=&netlink.Veth{LinkAttrs:netlink.LinkAttrs{Index:78, MTU:1500, TxQLen:0, Name:"cali05ad595351c", HardwareAddr:net.HardwareAddr{0x6e, 0x40, 0x72, 0xce, 0xb0, 0xb3}, Flags:0x13, RawFlags:0x11043, ParentIndex:4, MasterIndex:0, Namespace:interface {}(nil), Alias:"", Statistics:(*netlink.LinkStatistics)(0xc420ad7140), Promisc:0, Xdp:(*netlink.LinkXdp)(nil), EncapType:"ether", Protinfo:(*netlink.Protinfo)(nil), OperState:0x6}, PeerName:""}

kubelet logs:
https://gist.github.com/r0bj/9a96a0041733e9cb9a2e126d4ef224b2
calico logs:
https://gist.github.com/r0bj/ab637b6ab3deeaa4a88646d5f22dad4d

# /opt/cni/bin/calico -v
v1.11.0-1-g8694471

@r0bj

r0bj commented Nov 10, 2017

@fasaxc
Member

fasaxc commented Nov 13, 2017

@r0bj @msavlani Thanks for providing the logs. Both the kubelet and calico/node logs make it look like we've assigned the IP to one endpoint and then tried to assign it to another, but it's not clear from the logs how that happened, because the logs start around the time that we're trying to assign the IP to the second endpoint.

If you can get a node in the bad state again, it'd help to have:

  • any logs from before the problem; if you could look back to see whether the IP of the problematic route occurs earlier in the log, that'd be good
  • the output of ip route on the affected node; that'll confirm that the bad route belongs to Calico and that it's still in place
  • the output of calicoctl get workloadendpoint --node=<nodename> -o yaml for the affected node; that should dump all the workloads that Calico thinks it has on the node (a combined command sketch follows below).
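Putting those together, a rough collection sketch (the IP below is the one from r0bj's failing route; substitute your own IP and node name):

# On the affected node: is the bad route still present, and on which interface?
ip route | grep 10.203.131.144

# Dump the workload endpoints Calico thinks it has on the node
calicoctl get workloadendpoint --node=<nodename> -o yaml

# Look for the same IP earlier in the host logs, if they go back far enough
grep 10.203.131.144 /var/log/syslog*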

@fasaxc
Member

fasaxc commented Nov 13, 2017

@gunjan5 From code reading, I spot one issue: after a failed networking attempt, we always clean up the IPAM allocation. However, if this is an attempt to re-network a pod, then we've already written the IP into the workload endpoint when we first networked the pod, so I think we end up deleting the IPAM reservation without removing the IP from the workload endpoint. Later, we'll try to re-use that IP for another pod and hit this failure mode.

If we're in the existing endpoint case, I think we just need to leave the IPAM allocation as is and fail the CNI request so that it gets retried. I guess we could delete the workload endpoint.

@r0bj

r0bj commented Nov 13, 2017

@fasaxc data you requested:

  1. logs from the time when the node started:
    kubelet logs:
    https://gist.github.com/r0bj/f795bf18dec385ebd5e0035544649d17
    calico-node logs:
    https://gist.github.com/r0bj/02819862d4e30c19501b06ecd9dc67c3

  2. ip route on the node:
    https://gist.github.com/r0bj/58c931dec5e795d8fa068be21cf5c747

  3. output of calicoctl get workloadendpoint --node=dev-k8s-worker-p4 -o yaml:
    https://gist.github.com/r0bj/4eda1531de42bb3d18df439604dd4db6

@fasaxc
Member

fasaxc commented Nov 13, 2017

Awesome, thanks @r0bj. It looks like you may have hit the case I described above: you have two workload endpoints with the same IP assigned. In addition, the IPAM error in the log indicates that the IPAM allocation was incorrectly lost or cleaned up.

Assuming this is the only instance of the problem on your node, a temporary workaround would be to delete and recreate these two pods: ops.nsqlookupd2-7574c7fbd5-rnntr and ops.nginx-stateful-test-0.

I found those by searching for the IP address of the failing route in the workload endpoint dump.
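For anyone else hitting this, a rough sketch of that workaround using the values from r0bj's logs above (the IP, node, and pod names come from this thread; substitute your own):

# Find the workload endpoints that claim the failing IP
# (-B 15 is arbitrary, just enough context to read the endpoint names)
calicoctl get workloadendpoint --node=dev-k8s-worker-p4 -o yaml | grep -B 15 10.203.131.144

# Delete the two pods so they get re-networked with fresh allocations
kubectl -n ops delete pod nsqlookupd2-7574c7fbd5-rnntr nginx-stateful-test-0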

@msavlani

@fasaxc, I am encountering the exact same problem.
Lately I am facing this with every pod running on that one node and have had to delete each pod with the same endpoint manually (going in cycles).
Can you provide your fix patch so that I can try it in my environment and see if that resolves the problem?

@mcluseau

@fasaxc thanks for the fix. I upgraded to the v1.x-series/v2.x-series docker tags for cni and node, and it solved the issue for me.

@fasaxc
Member

fasaxc commented Nov 23, 2017

@msavlani The fix is in this release of the CNI plugin: https://github.com/projectcalico/cni-plugin/releases/tag/v1.11.1. We're about to release a Calico v2.6.3 patch release to include it.

Note: after taking the fix, you'll need to remove and recreate all the pods on affected nodes.
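For reference, a rough sketch of one way to do that (node name is a placeholder; pods owned by Deployments/StatefulSets will be recreated automatically):

# List the pods on the affected node, then delete them one by one...
kubectl get pods --all-namespaces -o wide | grep <nodename>
kubectl -n <namespace> delete pod <pod-name>

# ...or evict everything at once and put the node back in service
# (add --force if some pods are not controller-managed)
kubectl drain <nodename> --ignore-daemonsets --delete-local-data
kubectl uncordon <nodename>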

@fasaxc
Member

fasaxc commented Nov 24, 2017

Fixed by #425 #418 #408 #406, to be released in Calico v2.6.3 (CNI plugin v1.11.1) and Calico v3.0.

Please open a new issue if you see this again after upgrading (the Calico release should be out in the next few days). Note: as mentioned above, one issue was that, after a failure, we were cleaning up the IPAM entry in the datastore even though it was still in use. After that has occurred, the datastore is inconsistent. To resolve:

  • upgrade to a release of Calico with the fix
  • list all the pods on nodes with the problem
  • delete/recreate any pods that share an IP address (see the sketch below).
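A quick way to spot pods sharing an IP on a problem node (a sketch only; the awk column assumes the usual kubectl -o wide layout, so adjust it if your output differs):

# Any IP printed here is assigned to two or more pods on the node
kubectl get pods --all-namespaces -o wide | grep <nodename> | awk '{print $7}' | sort | uniq -d

Delete the pods that report a duplicated IP and let their controllers recreate them.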

@msavlani

I am still facing this issue and have raised another issue, #1406, as per the suggestion above.
