Still seeing "error adding host side routes for interface: xxx, error: failed to add route file exists" in Calico 2.6.1 #1253
Comments
Above are the logs I could attach. I also have the syslog, but it is too large to attach. Let me know if you need it and I'll get it to you somehow.
@bradbehle is this on a fresh cluster, or an upgraded cluster from a previous version? Could you do a quick double-check of the version of the CNI plugin in use? This should do it:
CNI plugin is at 1.11.0:
And it was an upgraded cluster that was originally at kube 1.5 and Calico 2.1.5.
I've seen similar behaviour:
My setup:
k8s: 1.8.1
kubelet logs:
calico-node logs:
The cluster was upgraded from a previous version, but all pods have already been recreated.
@bradbehle @r0bj I have 2 PRs up for the fix: one for CNI v2.0 and one backported to CNI v1.11.x. I haven't been able to reproduce the issue, but I've added a test that replicates it as best I can. I've also built a CNI image with the fix backported to CNI v1.11.0 so you can try it out. You can try the debug image with the fix at:
@bradbehle @r0bj Just checking to see if you've had a chance to try out the debug image. If it works, we can get the PRs merged and include the fix in the next patch releases.
@gunjan5 For me it's just difficult to reproduce. I have encountered it twice on a production cluster, and after a node restart everything worked as expected again. If I find a better way of reproducing it, I'll test your debug image for sure.
Hi @gunjan5, I have the exact same problem as @r0bj:
Nov 8 00:53:33 node1 kubelet: E1108 00:53:33.577354 15218 pod_workers.go:182] Error syncing pod fb3638af-c443-11e7-9f0f-0894ef42f61e ("test1-test1-1193689166-zqh8b_namespace1(fb3638af-c443-11e7-9f0f-0894ef42f61e)"), skipping: failed to "SetupNetwork" for "aio-stage3-serviceavailability-mobilezip-1193689166-zqh8b_staging3" with SetupNetworkError: "NetworkPlugin cni failed to set up pod "test1-test1-1193689166-zqh8b_namespace1" network: error adding host side routes for interface: cali38f5e43d7eb, error: route (Ifindex: 8374, Dst: 172.40.107.158/32, Scope: %!!(MISSING)s(netlink.Scope=253)) already exists for an interface other than 'cali38f5e43d7eb'"
The error is a little different this time. It says the route already exists for an interface other than 'calixxxxx'. What needs to be done to resolve this?
@msavlani Can you post the CNI debug logs?
@gunjan5 I used your debug image and I was able to reproduce it. kubelet error message:
calico error:
kubelet logs:
Another example:
@r0bj @msavlani Thanks for providing the logs. Both the kubelet and calico/node logs make it look like we assigned the IP to one endpoint and then tried to assign it to another. It's not clear from the logs how that happened, because they start around the time we were trying to assign the IP to the second endpoint. If you can get a node into the bad state again, it'd help to have:
@gunjan5 From reading the code, I spotted one issue: after a failed networking attempt, we always clean up the IPAM allocation. However, if this is an attempt to re-network a pod, then we already wrote the IP into the workload endpoint when we first networked the pod. So, I think, we end up deleting the IPAM reservation without removing the IP from the workload endpoint. Later, we try to re-use that IP for another pod and hit this failure mode. In the existing-endpoint case, I think we just need to leave the IPAM allocation as is and fail the CNI request so that it gets retried. I guess we could delete the workload endpoint instead.
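To make the proposed behaviour concrete, here is a minimal Python sketch of the cleanup rule described above. It is not the CNI plugin's actual code: `Datastore`, `network_pod`, `always_ok`, and `always_fail` are hypothetical names, and the datastore is modelled as an in-memory set of reserved IPs plus a pod-to-IP map. The only point it illustrates is that the IPAM reservation is released on failure for a new pod, but left in place when the pod already has a workload endpoint.

```python
from dataclasses import dataclass, field


@dataclass
class Datastore:
    """Toy stand-in for IPAM and workload endpoint state (hypothetical)."""
    ipam: set = field(default_factory=set)         # reserved IPs
    endpoints: dict = field(default_factory=dict)  # pod name -> assigned IP


class NetworkingError(Exception):
    """Raised so the caller (e.g. kubelet) can retry the request."""


def network_pod(store, pod, new_ip, setup):
    """Network a pod, releasing the IP on failure only if we just reserved it."""
    existing = pod in store.endpoints
    if existing:
        # Re-networking: the IP is already recorded in the workload endpoint.
        ip = store.endpoints[pod]
    else:
        ip = new_ip
        store.ipam.add(ip)
    try:
        setup(pod, ip)
    except Exception as err:
        if not existing:
            # New pod: nothing else references the IP, so release it.
            store.ipam.discard(ip)
        # Existing endpoint: leave the IPAM allocation as is and just fail.
        raise NetworkingError(f"failed to network {pod}") from err
    store.endpoints[pod] = ip
    return ip


def always_ok(pod, ip):
    """Hypothetical setup step that succeeds."""


def always_fail(pod, ip):
    """Hypothetical setup step that fails, e.g. a route could not be added."""
    raise RuntimeError("failed to add route")
```

With this rule, a failed re-network of an existing pod leaves its IP reserved, so the address cannot be handed to a second pod while the first endpoint still refers to it.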
@fasaxc data you requested:
Awesome, thanks @r0bj. It looks like you may have hit the case I described above: you have two workload endpoints with the same IP assigned. In addition, the IPAM error in the log indicates that the IPAM allocation was incorrectly lost or cleaned up. Assuming this is the only instance of the problem on your node, a temporary workaround would be to delete and recreate these two pods: I found those by searching for the IP address of the failing route in the workload endpoint dump.
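The search described above can be sketched in Python as follows. This is an illustration only: it assumes the error text contains a `Dst: <ip>/<prefix>` field, as in the kubelet message quoted earlier in this thread, and it assumes the workload endpoint dump has already been reduced to `(endpoint name, IP)` pairs. The function names and the sample endpoint names in the usage note are hypothetical.

```python
import re
from collections import defaultdict

# Matches the destination of the failing route, e.g. "Dst: 172.40.107.158/32".
ROUTE_DST = re.compile(r"Dst: (\d{1,3}(?:\.\d{1,3}){3})/\d+")


def failing_route_ip(message):
    """Return the route destination IP from an error message, or None."""
    match = ROUTE_DST.search(message)
    return match.group(1) if match else None


def duplicate_ips(endpoints):
    """Map each IP held by more than one endpoint to the endpoint names.

    `endpoints` is an iterable of (name, ip) pairs; this layout is an
    assumption, not the format of any real dump.
    """
    by_ip = defaultdict(list)
    for name, ip in endpoints:
        by_ip[ip].append(name)
    return {ip: names for ip, names in by_ip.items() if len(names) > 1}
```

For example, calling `duplicate_ips([("wep-a", "172.40.107.158"), ("wep-b", "172.40.107.158"), ("wep-c", "172.40.107.159")])` reports `wep-a` and `wep-b` as sharing one address, and those would be the pods to delete and recreate.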
@fasaxc, I am encountering the exact same problem.
@fasaxc Thanks for the fix. I upgraded to the v1.x-series/v2.x-series Docker tags for cni and node, and that solved the issue for me.
@msavlani The fix is in this release of the CNI plugin: https://github.com/projectcalico/cni-plugin/releases/tag/v1.11.1
We're about to release a Calico v2.6.3 patch release to include it. Note: after taking the fix, you'll need to remove and recreate all the pods on affected nodes.
Fixed by #425 #418 #408 #406, to be released in Calico v2.6.3 (CNI plugin v1.11.1) and Calico v3.0. Please open a new issue if you see this again after upgrading (the Calico release should be out in the next few days). Note: as mentioned above, one issue was that, after a failure, we were cleaning up the IPAM entry in the datastore even though it was in use. Once that has occurred, the datastore is inconsistent. To resolve:
I am still facing this issue and have raised another issue, #1406, as per the suggestion above.
Still seeing
error adding host side routes for interface: xxx, error: failed to add route file exists
which is keeping containers from starting (stuck in "ContainerCreating"). This is detailed in https://github.com/projectcalico/cni-plugin/issues/352 (more detail about orphan WEPs, which are part of this problem, can be found in #1115). It was thought that the fix(es) in Calico 2.6.1 would solve this problem, but we are still seeing it.
Your Environment
I will attach the logs. There are quite a few containers stuck in this ContainerCreating state; one is master-b4fe00b59ff948088731b4985367b705-6b987df84d-bz9zv in the kubx-masters namespace.