
Still seeing "error adding host side routes for interface: xxx, error: failed to add route file exists" in Calico 2.6.1 #1253

Closed
bradbehle opened this issue Oct 25, 2017 · 21 comments

@bradbehle
Contributor

Still seeing "error adding host side routes for interface: xxx, error: failed to add route file exists", which is keeping containers from starting (stuck in "ContainerCreating"). This is detailed in https://github.com/projectcalico/cni-plugin/issues/352, and more detail about the orphan WEPs that are part of this problem can be found in #1115. It was thought that the fixes in Calico 2.6.1 would solve this problem, but we are still seeing it.

Your Environment

  • Calico version: 2.6.1
  • Orchestrator version: Kubernetes 1.8.1
  • Operating System and version: Ubuntu 16.04.3

I will attach the logs. There are quite a few containers stuck in this ContainerCreating state; one is master-b4fe00b59ff948088731b4985367b705-6b987df84d-bz9zv in the kubx-masters namespace.
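For what it's worth, the underlying CNI error shows up in the stuck pod's events; a hedged example using the pod mentioned above:

# The "failed to add route ... file exists" message should appear under Events
kubectl -n kubx-masters describe pod master-b4fe00b59ff948088731b4985367b705-6b987df84d-bz9zv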

@bradbehle
Contributor Author

@bradbehle
Contributor Author

Above are the logs I could attach. I also have syslog, but it is too large to attach; let me know if you need it and I'll get it to you somehow.

@caseydavenport
Member

@bradbehle is this on a fresh cluster, or an upgraded cluster from a previous version? Could you do a quick double-check of the version of the CNI plugin in use?

This should do it:

/opt/cni/bin/calico -v 

@bradbehle
Contributor Author

@caseydavenport

CNI plugin is at 1.11.0:

behle@dev-mex01-carrier80-worker-01:~$ /opt/cni/bin/calico -v 
v1.11.0
behle@dev-mex01-carrier80-worker-01:~$

And it was an upgraded cluster that was originally at Kubernetes 1.5 and Calico 2.1.5.

@r0bj

r0bj commented Oct 25, 2017

I've seen similar behaviour:

2017-10-25 22:42:21.163 [ERROR][21214] k8s.go 273: Error setting up networking: error adding host side routes for interface: cali665822806f1, error: failed to add route (Dst: 10.200.144.163/32, Scope: %!s(netlink.Scope=253), Iface: cali665822806f1): file exists Workload="ops.logstash-k8s-3042342600-7jrkq"

My setup:

# /opt/cni/bin/calico -v
v1.11.0

k8s: 1.8.1
calico: 2.6.1
OS: ubuntu 14.04

kubelet logs:
https://gist.github.com/r0bj/a656c75a7b08e1b79be4671206b70779

calico-node logs:
https://gist.github.com/r0bj/1279b3fa321ea5d47d66f5e4c945eadb

The cluster was upgraded from a previous version, but all pods have already been recreated.

@gunjan5
Contributor

gunjan5 commented Nov 3, 2017

@bradbehle @r0bj I have 2 PRs up for the fix, one for CNI v2.0 and one backported to CNI v1.11.x. I've made a CNI image with the fix backported to CNI v1.11.0 so you can try it out. I haven't been able to reproduce the issue, but I've added a test that replicates it as best I can. You can try the debug image with the fix at gunjan5/cni:route2 on Dockerhub; it should also have the CNI binary. It would be great if you could try it out and let me know if it works for you.
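One possible way to swap in the debug image on a self-hosted install (a sketch only; the DaemonSet name calico-node and container name install-cni are assumptions about a standard manifest, so adjust them to match your deployment):

# Point the CNI install container at the debug image
kubectl -n kube-system set image daemonset/calico-node install-cni=gunjan5/cni:route2

# After the rollout, confirm the binary on the node was replaced
/opt/cni/bin/calico -v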

@gunjan5
Contributor

gunjan5 commented Nov 6, 2017

@bradbehle @r0bj just checking to see if you've had a chance to try out the debug image. If it works, then we can get the PRs merged and the fix included in the next patch releases.

@r0bj

r0bj commented Nov 6, 2017

@gunjan5 for me it's just difficult to reproduce. I have encountered it twice on a production cluster, and after a node restart everything worked as expected again. If I find a better way of reproducing it, I'll test your debug image for sure.

@msavlani

msavlani commented Nov 8, 2017

Hi @gunjan5,

I have the exact same problem as @r0bj. I took your patch and applied it in my environment, and now I see the error below:

Nov 8 00:53:33 node1 kubelet: E1108 00:53:33.577354 15218 pod_workers.go:182] Error syncing pod fb3638af-c443-11e7-9f0f-0894ef42f61e ("test1-test1-1193689166-zqh8b_namespace1(fb3638af-c443-11e7-9f0f-0894ef42f61e)"), skipping: failed to "SetupNetwork" for "aio-stage3-serviceavailability-mobilezip-1193689166-zqh8b_staging3" with SetupNetworkError: "NetworkPlugin cni failed to set up pod "test1-test1-1193689166-zqh8b_namespace1" network: error adding host side routes for interface: cali38f5e43d7eb, error: route (Ifindex: 8374, Dst: 172.40.107.158/32, Scope: %!!(MISSING)s(netlink.Scope=253)) already exists for an interface other than 'cali38f5e43d7eb'"

The error is a little different this time.

It says the route already exists for an interface other than 'calixxxxx'.

What needs to be done to resolve this?
Is there any workaround you can suggest for now?

@gunjan5
Contributor

gunjan5 commented Nov 8, 2017

@msavlani can you post the CNI debug logs?

@r0bj

r0bj commented Nov 10, 2017

@gunjan5 I used your debug image and I was able to reproduce it.

kubelet error message:

2017-11-10 16:34:00.247 [WARNING][27292] calico-ipam.go 236: Asked to release address but it doesn't exist. Ignoring Workload="ops.nginx-stateful-test-0" workloadId="ops.nginx-stateful-test-0"
E1110 16:34:00.256432    1467 cni.go:301] Error adding network: error adding host side routes for interface: cali05ad595351c, error: route (Ifindex: 93, Dst: 10.203.131.144/32, Scope: %!!(MISSING)s(netlink.Scope=253)) already exists for an interface other than 'cali05ad595351c'
E1110 16:34:00.256462    1467 cni.go:250] Error while adding to cni network: error adding host side routes for interface: cali05ad595351c, error: route (Ifindex: 93, Dst: 10.203.131.144/32, Scope: %!!(MISSING)s(netlink.Scope=253)) already exists for an interface other than 'cali05ad595351c'
I1110 16:34:00.306370    1467 kubelet.go:1871] SyncLoop (PLEG): "nginx-stateful-test-0_ops(90302ede-c634-11e7-883a-5254009ef6db)", event: &pleg.PodLifecycleEvent{ID:"90302ede-c634-11e7-883a-5254009ef6db", Type:"ContainerStarted", Data:"85e04cf0fcb9830d0824805bb9ca4849b5c53ee93124332a746d22a8956de823"}
E1110 16:34:00.457030    1467 remote_runtime.go:92] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "nginx-stateful-test-0_ops" network: error adding host side routes for interface: cali05ad595351c, error: route (Ifindex: 93, Dst: 10.203.131.144/32, Scope: %!!(MISSING)s(netlink.Scope=253)) already exists for an interface other than 'cali05ad595351c'

calico error:

2017-11-10 16:33:16.862 [WARNING][156] route_table.go 477: Failed to add route error=file exists ifaceName="cali05ad595351c" ipVersion=0x4 targetCIDR=10.203.131.144/32
2017-11-10 16:33:16.862 [INFO][156] route_table.go 200: Trying to connect to netlink
2017-11-10 16:33:16.863 [WARNING][156] route_table.go 577: Failed to access interface but it now appears to be up error=netlink update operation failed ifaceName="cali05ad595351c" ipVersion=0x4 link=&netlink.Veth{LinkAttrs:netlink.LinkAttrs{Index:78, MTU:1500, TxQLen:0, Name:"cali05ad595351c", HardwareAddr:net.HardwareAddr{0x6e, 0x40, 0x72, 0xce, 0xb0, 0xb3}, Flags:0x13, RawFlags:0x11043, ParentIndex:4, MasterIndex:0, Namespace:interface {}(nil), Alias:"", Statistics:(*netlink.LinkStatistics)(0xc420ad7140), Promisc:0, Xdp:(*netlink.LinkXdp)(nil), EncapType:"ether", Protinfo:(*netlink.Protinfo)(nil), OperState:0x6}, PeerName:""}

kubelet logs:
https://gist.github.com/r0bj/9a96a0041733e9cb9a2e126d4ef224b2
calico logs:
https://gist.github.com/r0bj/ab637b6ab3deeaa4a88646d5f22dad4d

# /opt/cni/bin/calico -v
v1.11.0-1-g8694471

@r0bj

r0bj commented Nov 10, 2017

@fasaxc
Member

fasaxc commented Nov 13, 2017

@r0bj @msavlani Thanks for providing the logs. Both the kubelet and calico/node logs make it look like we've assigned the IP to one endpoint and then tried to assign it to another, but it's not clear from the logs how that happened, because the logs start around the time that we're trying to assign the IP to the second endpoint.

If you can get a node in the bad state again, it'd help to have:

  • any logs from before the problem; if you could look back to see whether the IP of the problematic route occurs earlier in the log, that'd be good
  • the output of ip route on the affected node; that'll confirm that the bad route belongs to Calico and that it's still in place
  • the output of calicoctl get workloadendpoint --node=<nodename> -o yaml for the affected node; that should dump all the workloads that Calico thinks it has on the node (a combined command sketch follows below).
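Putting those together, a rough collection sketch (the IP below is the one from r0bj's failing route; substitute your own IP and node name):

# On the affected node: is the bad route still present, and on which interface?
ip route | grep 10.203.131.144

# Dump the workload endpoints Calico thinks it has on the node
calicoctl get workloadendpoint --node=<nodename> -o yaml

# Look for the same IP earlier in the host logs, if they go back far enough
grep 10.203.131.144 /var/log/syslog*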

@fasaxc
Member

fasaxc commented Nov 13, 2017

@gunjan5 From code reading, I spot one issue: after a failed networking attempt, we always clean up the IPAM allocation. However, if this is an attempt to re-network a pod, then we've already written the IP into the workload endpoint when we first networked the pod, so I think we end up deleting the IPAM reservation without removing the IP from the workload endpoint. Later, we'll try to re-use that IP for another pod and hit this failure mode.

If we're in the existing endpoint case, I think we just need to leave the IPAM allocation as is and fail the CNI request so that it gets retried. I guess we could delete the workload endpoint.

@r0bj

r0bj commented Nov 13, 2017

@fasaxc data you requested:

  1. logs from the time when the node started:
    kubelet logs:
    https://gist.github.com/r0bj/f795bf18dec385ebd5e0035544649d17
    calico-node logs:
    https://gist.github.com/r0bj/02819862d4e30c19501b06ecd9dc67c3

  2. ip route on the node:
    https://gist.github.com/r0bj/58c931dec5e795d8fa068be21cf5c747

  3. output of calicoctl get workloadendpoint --node=dev-k8s-worker-p4 -o yaml:
    https://gist.github.com/r0bj/4eda1531de42bb3d18df439604dd4db6

@fasaxc
Member

fasaxc commented Nov 13, 2017

Awesome, thanks @r0bj. It looks like you may have hit the case I described above: you have two workload endpoints with the same IP assigned. In addition, the IPAM error in the log indicates that the IPAM allocation was incorrectly lost or cleaned up.

Assuming this is the only instance of the problem on your node, a temporary workaround would be to delete and recreate these two pods: ops.nsqlookupd2-7574c7fbd5-rnntr and ops.nginx-stateful-test-0.

I found those by searching for the IP address of the failing route in the workload endpoint dump.
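For anyone else hitting this, a rough sketch of that workaround using the values from r0bj's logs above (the IP, node, and pod names come from this thread; substitute your own):

# Find the workload endpoints that claim the failing IP
# (-B 15 is arbitrary, just enough context to read the endpoint names)
calicoctl get workloadendpoint --node=dev-k8s-worker-p4 -o yaml | grep -B 15 10.203.131.144

# Delete the two pods so they get re-networked with fresh allocations
kubectl -n ops delete pod nsqlookupd2-7574c7fbd5-rnntr nginx-stateful-test-0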

@msavlani

@fasaxc, I am encountering the exact same problem.
Lately I am facing this with every pod running on that one node and have had to delete each pod with the same endpoint manually (going in cycles).
Can you provide your fix patch so that I can try it in my environment and see if that resolves the problem?

@mcluseau

@fasaxc thanks for the fix. I upgraded to the v1.x-series/v2.x-series docker tags for cni and node, and it solved the issue for me.

@fasaxc
Member

fasaxc commented Nov 23, 2017

@msavlani The fix is in this release of the CNI plugin: https://github.com/projectcalico/cni-plugin/releases/tag/v1.11.1. We're about to release a Calico v2.6.3 patch release to include it.

Note: after taking the fix, you'll need to remove and recreate all the pods on affected nodes.
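For reference, a rough sketch of one way to do that (node name is a placeholder; pods owned by Deployments/StatefulSets will be recreated automatically):

# List the pods on the affected node, then delete them one by one...
kubectl get pods --all-namespaces -o wide | grep <nodename>
kubectl -n <namespace> delete pod <pod-name>

# ...or evict everything at once and put the node back in service
# (add --force if some pods are not controller-managed)
kubectl drain <nodename> --ignore-daemonsets --delete-local-data
kubectl uncordon <nodename>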

@fasaxc
Member

fasaxc commented Nov 24, 2017

Fixed by #425 #418 #408 #406, to be released in Calico v2.6.3 (CNI plugin v1.11.1) and Calico v3.0.

Please open a new issue if you see this again after upgrading (the Calico release should be out in the next few days). Note: as mentioned above, one issue was that, after a failure, we were cleaning up the IPAM entry in the datastore even though it was still in use. After that has occurred, the datastore is inconsistent. To resolve:

  • upgrade to a release of Calico with the fix
  • list all the pods on nodes with the problem
  • delete/recreate any pods that share an IP address (see the sketch below).
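A quick way to spot pods sharing an IP on a problem node (a sketch only; the awk column assumes the usual kubectl -o wide layout, so adjust it if your output differs):

# Any IP printed here is assigned to two or more pods on the node
kubectl get pods --all-namespaces -o wide | grep <nodename> | awk '{print $7}' | sort | uniq -d

Delete the pods that report a duplicated IP and let their controllers recreate them.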

@msavlani

I am still facing this issue and have raised another issue, #1406, as per the suggestion above.
