New pods failing to start with `FailedCreatePodSandBox` warning for CNI versions 1.7.x with Cilium #1265

YesemKebede · 2020-10-19T16:47:00Z

What happened:

New pods started failing to come up after upgrading the EKS CNI from v1.6.0 to v1.7.0. I was able to upgrade to v1.6.3 without any issue; the errors started when I upgraded to 1.7.0. I also tried other versions (v1.7.2 and v1.7.5) and see the same issue.

Here is the error I am seeing.

 Warning  FailedCreatePodSandBox  28s                 kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "7e3423d27fc6f36276de03aa7f41ef6b6f02121f800b65b64b8073c6a207b696" network for pod "spinnaker-get-resource-type-3fc73e4e3611d9f4-ps4b7": networkPlugin cni failed to set up pod "spinnaker-get-resource-type-3fc73e4e3611d9f4-ps4b7_default" network: invalid character '{' after top-level value

Here is the cni log

Anything else we need to know?:

We have Cilium running in chaining mode (v1.8.4)

Environment:

Kubernetes version: v1.17.9-eks-4c6976
CNI version: tried several versions and see the same issue on each (1.7.0, 1.7.2, 1.7.5)
Kernel: 5.4.58-27.104.amzn2.x86_64


jayanthvn · 2020-10-19T16:58:05Z

Hi @YesemKebede

Can you please confirm if you have set AWS_VPC_K8S_PLUGIN_LOG_FILE to stdout?

I checked the IPAMD logs, and IP allocation looks fine at first glance. We will investigate the issue further.

Thanks.

YesemKebede · 2020-10-19T17:03:15Z

@jayanthvn AWS_VPC_K8S_PLUGIN_LOG_FILE is set to /var/log/aws-routed-eni/plugin.log

jayanthvn · 2020-10-19T17:05:28Z

Thanks @YesemKebede . We will look into it asap.

jayanthvn · 2020-10-19T18:11:53Z

Hi @YesemKebede

Can you also please confirm how you upgraded from 1.6.3 to 1.7.x?

Thank you!

YesemKebede · 2020-10-19T18:30:25Z

@jayanthvn I followed this Doc

sophomeric · 2020-10-22T16:57:25Z

I upgraded from 1.6.3 to 1.7.5 and had the same problem. No new pod could be started and they had that same error. I had both AWS_VPC_K8S_CNI_LOG_FILE and AWS_VPC_K8S_PLUGIN_LOG_FILE set to stdout and had this same problem. Removing them so they get sent to files as per their default config solved the issue for me.

Google led me here: Azure/azure-container-networking#195 (comment)

jayanthvn · 2020-10-22T17:10:48Z

@sophomeric Yes, setting AWS_VPC_K8S_PLUGIN_LOG_FILE to stdout will cause a similar issue (#1251). But here it wasn't set.
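A minimal sketch of why log output on stdout could produce this kind of error, assuming the caller parses the plugin's stdout as exactly one JSON document. That assumption, the sample log line, the sample result, and the helper name `parse_plugin_stdout` are illustrative and not taken from the plugin's code. Python's parser reports the trailing object as "Extra data"; Go's encoding/json words the same condition as "invalid character '{' after top-level value", which is the message in the kubelet warning at the top of this issue.

```python
import json

# Hypothetical stand-ins: a structured log line and a CNI result,
# both written to the same stdout stream.
log_line = '{"level":"info","msg":"Received CNI add request"}'
cni_result = '{"cniVersion":"0.3.1","ips":[]}'


def parse_plugin_stdout(stdout: str) -> dict:
    """Parse stdout as a single JSON document (illustrative helper)."""
    return json.loads(stdout)


# Only the result on stdout: parses cleanly.
clean = parse_plugin_stdout(cni_result)

# A log line followed by the result: two top-level values, so parsing fails.
try:
    parse_plugin_stdout(log_line + "\n" + cni_result)
    polluted_error = None
except json.JSONDecodeError as exc:
    polluted_error = exc.msg

print(polluted_error)
```

Sending logs to a file or to stderr keeps stdout limited to the single result document, which is consistent with removing the stdout setting resolving the problem in the comment above.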

Aggouri · 2020-10-26T19:51:41Z

We are experiencing the same issue on newly provisioned clusters with the following difference in versions:

Kubernetes version: v1.17.11-eks-cfdc40
Cilium v1.9.0-rc2 in chaining mode

If it helps, although I am not 100% sure about the Kubernetes version being exactly the same patch version, that same configuration was working last week on a different cluster with the same characteristics.

jayanthvn · 2020-10-26T20:35:02Z

Hi @Aggouri

Can you please confirm the CNI version for the two clusters?

kubectl describe daemonset aws-node -n kube-system | grep Image | cut -d "/" -f 2

Thanks.

Aggouri · 2020-10-26T22:09:18Z

@jayanthvn

> Can you please confirm the CNI version for the two clusters?

The cluster was provisioned a few hours ago:

$ kubectl describe daemonset aws-node --namespace kube-system | grep Image | cut -d "/" -f 2

amazon-k8s-cni-init:v1.7.5-eksbuild.1
amazon-k8s-cni:v1.7.5-eksbuild.1

Sadly, I am unable to provide the CNI plugin version of the previous cluster as it was already torn down. If it helps, I know it was provisioned at the beginning of last week and used the defaults EKS came with for Kubernetes version 1.17.x.

jayanthvn · 2020-10-26T22:21:43Z

Thanks for confirming, @Aggouri. We are actively looking into the issue and will update asap.

Arsen-Uulu · 2020-10-27T14:55:54Z

@jayanthvn I upgraded from 1.6.3 to 1.7.5 and am having a problem.

{"level":"error","ts":"2020-10-27T10:44:04.889-0400","caller":"routed-eni-cni-plugin/cni.go:249","msg":"Error received from DelNetwork gRPC call for container ba592f75d2b25963c4bd64f218ae0930917fa39e3efffd2231b313f8eb42d344: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:50051

jayanthvn · 2020-10-27T16:06:44Z

Hi,

We have found the root cause. For now, please add pluginLogFile and pluginLogLevel to 05-cilium.conflist. We will fix this issue in the next release.

cat /etc/cni/net.d/05-cilium.conflist
{
  "cniVersion": "0.3.1",
  "name": "aws-cni",
  "plugins": [
    {
      "name": "aws-cni",
      "type": "aws-cni",
      "vethPrefix": "eni",
      "mtu": "9001",
      "pluginLogFile": "/var/log/aws-routed-eni/plugin.log",
      "pluginLogLevel": "Debug"
    },
    {
      "name": "cilium",
      "type": "cilium-cni",
      "enable-debug": false
    }
  ]
}
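
The same change can be applied programmatically. A rough sketch, assuming the conflist is plain JSON and the AWS entry is identified by "type": "aws-cni" as in the file above; `add_plugin_logging` and `SAMPLE` are hypothetical names used for illustration, not part of any AWS or Cilium tooling.

```python
import json

# Sample conflist WITHOUT the two logging keys (illustrative input).
SAMPLE = """
{
  "cniVersion": "0.3.1",
  "name": "aws-cni",
  "plugins": [
    {"name": "aws-cni", "type": "aws-cni", "vethPrefix": "eni", "mtu": "9001"},
    {"name": "cilium", "type": "cilium-cni", "enable-debug": false}
  ]
}
"""


def add_plugin_logging(conflist_text: str,
                       log_file: str = "/var/log/aws-routed-eni/plugin.log",
                       log_level: str = "Debug") -> str:
    """Return the conflist with pluginLogFile/pluginLogLevel set on aws-cni."""
    conflist = json.loads(conflist_text)
    for plugin in conflist.get("plugins", []):
        if plugin.get("type") == "aws-cni":
            # setdefault leaves any value that is already present untouched.
            plugin.setdefault("pluginLogFile", log_file)
            plugin.setdefault("pluginLogLevel", log_level)
    return json.dumps(conflist, indent=2)


print(add_plugin_logging(SAMPLE))
```

Only the aws-cni entry is modified; the cilium entry and any logging values that already exist are left as they are, so running the function twice gives the same result.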

I was able to reproduce the issue; below is the output after fixing the conflist:

dev-dsk-varavaj-2b-72f02457 % kubectl describe daemonset aws-node -n kube-system | grep 1.7.5
    Image:      602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni-init:v1.7.5
    Image:      602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.7.5

NAME                       READY   STATUS    RESTARTS   AGE   IP               NODE                                           NOMINATED NODE   READINESS GATES
my-nginx-86b7cfc89-jvzvw   1/1     Running   0          18h   192.168.10.206   ip-192-168-0-43.us-west-2.compute.internal     <none>           <none>
my-nginx-86b7cfc89-p4q2t   1/1     Running   0          18m   192.168.67.156   ip-192-168-81-109.us-west-2.compute.internal   <none>           <none>

NAME                               READY   STATUS    RESTARTS   AGE   IP               NODE                                           NOMINATED NODE   READINESS GATES
aws-node-95jtw                     1/1     Running   0          23m   192.168.0.43     ip-192-168-0-43.us-west-2.compute.internal     <none>           <none>
aws-node-cnrkq                     1/1     Running   0          24m   192.168.81.109   ip-192-168-81-109.us-west-2.compute.internal   <none>           <none>
aws-node-j64z5                     1/1     Running   0          23m   192.168.51.208   ip-192-168-51-208.us-west-2.compute.internal   <none>           <none>
cilium-5gr4s                       1/1     Running   0          18h   192.168.51.208   ip-192-168-51-208.us-west-2.compute.internal   <none>           <none>
cilium-d4nff                       1/1     Running   0          18h   192.168.0.43     ip-192-168-0-43.us-west-2.compute.internal     <none>           <none>
cilium-node-init-kwsj6             1/1     Running   0          18h   192.168.0.43     ip-192-168-0-43.us-west-2.compute.internal     <none>           <none>
cilium-node-init-pv4jw             1/1     Running   0          18h   192.168.51.208   ip-192-168-51-208.us-west-2.compute.internal   <none>           <none>
cilium-node-init-pxdfv             1/1     Running   0          18h   192.168.81.109   ip-192-168-81-109.us-west-2.compute.internal   <none>           <none>
cilium-operator-6554b44b9d-f88zj   1/1     Running   0          18h   192.168.51.208   ip-192-168-51-208.us-west-2.compute.internal   <none>           <none>
cilium-operator-6554b44b9d-j8tlb   1/1     Running   0          18h   192.168.0.43     ip-192-168-0-43.us-west-2.compute.internal     <none>           <none>
cilium-qg6tf                       1/1     Running   0          18h   192.168.81.109   ip-192-168-81-109.us-west-2.compute.internal   <none>           <none>
coredns-5c97f79574-9nnkk           1/1     Running   0          18h   192.168.68.203   ip-192-168-81-109.us-west-2.compute.internal   <none>           <none>
coredns-5c97f79574-jnsm2           1/1     Running   0          18h   100.64.95.97     ip-192-168-51-208.us-west-2.compute.internal   <none>           <none>
kube-proxy-bmv86                   1/1     Running   0          18h   192.168.81.109   ip-192-168-81-109.us-west-2.compute.internal   <none>           <none>
kube-proxy-j7c8f                   1/1     Running   0          18h   192.168.0.43     ip-192-168-0-43.us-west-2.compute.internal     <none>           <none>
kube-proxy-ss98z                   1/1     Running   0          18h   192.168.51.208   ip-192-168-51-208.us-west-2.compute.internal   <none>           <none>

Thank you!

jayanthvn · 2020-11-20T18:48:52Z

#1275 is merged so closing this issue.

bogarcia · 2020-11-24T17:12:58Z

Is there any ETA for a new release including this fix?
Thanks!

part-time-githubber · 2021-03-08T03:22:18Z

I tried the workaround suggested in #1265 (comment) 👍

After this, coredns is Running but not Ready. Here are the logs from the two pods:

pankaj.tolani@tolani-mac ~/afterpay/inception/cilium/alpha $ kl coredns-74fcbd4cb4-k4dhc
.:53
[INFO] plugin/reload: Running configuration MD5 = 47d57903c0f0ba4ee0626a17181e5d94
CoreDNS-1.7.0
linux/amd64, go1.13.15, f59c03d0
[ERROR] plugin/errors: 2 1687429144305681147.3333962346685544537. HINFO: read udp 10.240.35.84:60370->10.240.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 1687429144305681147.3333962346685544537. HINFO: read udp 10.240.35.84:48293->10.240.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 1687429144305681147.3333962346685544537. HINFO: read udp 10.240.35.84:49938->10.240.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 1687429144305681147.3333962346685544537. HINFO: read udp 10.240.35.84:38861->10.240.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 1687429144305681147.3333962346685544537. HINFO: read udp 10.240.35.84:56928->10.240.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 1687429144305681147.3333962346685544537. HINFO: read udp 10.240.35.84:52537->10.240.0.2:53: i/o timeout

pankaj.tolani@tolani-mac ~/afterpay/inception/cilium/alpha $ kl coredns-74fcbd4cb4-x68m8
.:53
[INFO] plugin/reload: Running configuration MD5 = 47d57903c0f0ba4ee0626a17181e5d94
CoreDNS-1.7.0
linux/amd64, go1.13.15, f59c03d0
[ERROR] plugin/errors: 2 5763478806533751487.8973578589187692515. HINFO: read udp 10.240.29.242:47668->10.240.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 5763478806533751487.8973578589187692515. HINFO: read udp 10.240.29.242:48540->10.240.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 5763478806533751487.8973578589187692515. HINFO: read udp 10.240.29.242:57593->10.240.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 5763478806533751487.8973578589187692515. HINFO: read udp 10.240.29.242:37493->10.240.0.2:53: i/o timeout
[ERROR] plugin/errors: 2 5763478806533751487.8973578589187692515. HINFO: read udp 10.240.29.242:42574->10.240.0.2:53: i/o timeout
I0308 03:17:11.104617 1 trace.go:116] Trace[1427131847]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:125 (started: 2021-03-08 03:16:41.103761797 +0000 UTC m=+0.020357165) (total time: 30.000769225s):
Trace[1427131847]: [30.000769225s] [30.000769225s] END
E0308 03:17:11.104647 1 reflector.go:178] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:125: Failed to list *v1.Namespace: Get https://172.20.0.1:443/api/v1/namespaces?limit=500&resourceVersion=0: dial tcp 172.20.0.1:443: i/o timeout
I0308 03:17:11.105067 1 trace.go:116] Trace[911902081]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:125 (started: 2021-03-08 03:16:41.104625732 +0000 UTC m=+0.021221092) (total time: 30.000420935s):
Trace[911902081]: [30.000420935s] [30.000420935s] END
E0308 03:17:11.105080 1 reflector.go:178] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:125: Failed to list *v1.Endpoints: Get https://172.20.0.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 172.20.0.1:443: i/o timeout
I0308 03:17:11.105165 1 trace.go:116] Trace[1474941318]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:125 (started: 2021-03-08 03:16:41.104545953 +0000 UTC m=+0.021141323) (total time: 30.000607402s):
Trace[1474941318]: [30.000607402s] [30.000607402s] END
E0308 03:17:11.105172 1 reflector.go:178] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:125: Failed to list *v1.Service: Get https://172.20.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 172.20.0.1:443: i/o timeout

Thoughts?

jayanthvn · 2021-03-08T07:22:21Z

Hi @pankajmt

Which image version are you using? Release 1.7.9 has the fix for #1265: https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.7.9.
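
For anyone checking several clusters, the image tag can be compared against 1.7.9 in a few lines. A small sketch, assuming tags of the form that appears earlier in this thread (for example amazon-k8s-cni:v1.7.5-eksbuild.1); `parse_cni_version` and `has_fix` are hypothetical helper names, not part of any AWS tooling.

```python
import re

# First release reported in this thread to contain the fix.
FIXED_IN = (1, 7, 9)


def parse_cni_version(image: str) -> tuple:
    """Extract (major, minor, patch) from an image reference such as
    'amazon-k8s-cni:v1.7.5-eksbuild.1'. Build suffixes are ignored."""
    match = re.search(r":v(\d+)\.(\d+)\.(\d+)", image)
    if match is None:
        raise ValueError(f"no vX.Y.Z tag found in {image!r}")
    return tuple(int(part) for part in match.groups())


def has_fix(image: str) -> bool:
    """True if the image tag is at or above the fixed release."""
    return parse_cni_version(image) >= FIXED_IN


for image in ("amazon-k8s-cni:v1.7.5-eksbuild.1", "amazon-k8s-cni:v1.7.9"):
    print(image, has_fix(image))
```

Comparing integer tuples rather than strings avoids mis-ordering versions such as 1.10.0 and 1.7.9.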

tgraf · 2021-03-08T08:14:33Z

Can we point to a particular set of EKS releases in the Cilium docs somehow? What versions of EKS will ship with 1.7.9?

part-time-githubber · 2021-03-08T11:24:29Z

I am on aws cni 1.7.8.

amazon-k8s-cni-init:v1.7.8 amazon-k8s-cni:v1.7.8

So it looks like there is hope, assuming the EKS version we need is GA in our region. While the docs improve, does anyone know which EKS version I should be looking for?

Many thanks,
Pankaj

jayanthvn · 2021-03-08T17:32:21Z

Hi,

Yes, it would be great if the Cilium docs could point to specific EKS CNI versions; when there is a known issue, customers could then easily fall back or look for a newer version. Currently, the default EKS CNI version for new clusters is 1.7.5. We will keep you updated if we plan to make 1.7.9 or a later version the default for EKS.

Thank you!

part-time-githubber · 2021-03-08T21:58:59Z

So it looks like ours is a custom install of the EKS CNI. I will figure out how it was done and how I can upgrade it to 1.7.9.

part-time-githubber · 2021-03-09T23:10:37Z

Worked well with AWS CNI 1.7.9. Many thanks.

YesemKebede added the bug label Oct 19, 2020

jayanthvn added the priority/P1 Must be staffed and worked currently or soon. Is a candidate for next release label Oct 19, 2020

jayanthvn assigned couralex6 Oct 19, 2020

jayanthvn changed the title ~~New pods failing to start with FailedCreatePodSandBox warning for CNI versions 1.7.x~~ New pods failing to start with FailedCreatePodSandBox warning for CNI versions 1.7.x with Cilium Oct 27, 2020

couralex6 mentioned this issue Nov 2, 2020

Output to stderr when no log file path is passed #1275

Merged

jayanthvn closed this as completed Nov 20, 2020

jayanthvn mentioned this issue Dec 7, 2020

Coredns stuck on ContainerCreating with FailedCreatePodSandBox warning for CNI versions 1.7.6 with Cilium 1.9.1 #1314

Closed

couralex6 mentioned this issue Jan 30, 2021

Output to stderr when no log file path is passed #1365

Merged
