Coredns stuck on ContainerCreating with FailedCreatePodSandBox warning for CNI versions 1.7.6 with Cilium 1.9.1 #1314

Closed
mmochan opened this issue Dec 7, 2020 · 28 comments

@mmochan

mmochan commented Dec 7, 2020

What happened:
New cluster with nodes restarted.
coredns stuck on ContainerCreating when using CNI v1.7.6 and Cilium 1.9.1.
Other pods are also experiencing the same behavior ( ContainerCreating )

coredns:v1.6.6-eksbuild.1

Attach logs

Warning  FailedCreatePodSandBox  29s  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "112861d5995ca8f44c1dc17f00c947d72a44cf69c9deda34fbaf56b204742874" network for pod "coredns-6d857998c6-gxsd7": networkPlugin cni failed to set up pod "coredns-6d857998c6-gxsd7_kube-system" network: invalid character '{' after top-level value
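For readers unfamiliar with the message: "invalid character '{' after top-level value" is the wording Go's encoding/json uses when a second JSON value follows a complete one. The sketch below reproduces the same condition in Python, which words it differently ("Extra data"). The sample string is hypothetical and was not captured from the failing node, and the helper name split_top_level_values is made up for illustration.

```python
import json


def split_top_level_values(text):
    """Decode consecutive JSON values from text and return them as a list."""
    decoder = json.JSONDecoder()
    values, pos = [], 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it by hand.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        value, pos = decoder.raw_decode(text, pos)
        values.append(value)
    return values


# Hypothetical output: two JSON objects back to back (an assumption, not real plugin output).
sample = '{"level": "info", "msg": "hypothetical log line"}{"cniVersion": "0.3.1"}'

try:
    json.loads(sample)
    strict_error = None
except json.JSONDecodeError as exc:
    # A strict parser rejects anything after the first complete value.
    strict_error = exc.msg

values = split_top_level_values(sample)
print(strict_error, len(values))
```

A strict decoder stops at the first complete value and rejects whatever follows, which is why anything extra in a plugin's output stream can surface as this kind of error.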

What you expected to happen:
I expected coredns and other pods to be in running state

How to reproduce it (as minimally and precisely as possible):
Deploy cni version 1.7.6 and cilium 1.9.1 on EKS 1.17

Anything else we need to know?:
We have Cilium running in chaining mode (v1.9.1)
https://docs.cilium.io/en/v1.9/gettingstarted/cni-chaining-aws-cni/

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.3", GitCommit:"1e11e4a2108024935ecfcb2912226cedeafd99df", GitTreeState:"clean", BuildDate:"2020-10-14T12:50:19Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"17+", GitVersion:"v1.17.12-eks-7684af", GitCommit:"7684af4ac41370dd109ac13817023cb8063e3d45", GitTreeState:"clean", BuildDate:"2020-10-20T22:57:40Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
  • CNI Version
amazon-k8s-cni-init:v1.7.6
amazon-k8s-cni:v1.7.6
  • OS (e.g: cat /etc/os-release):
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
@mmochan mmochan added the bug label Dec 7, 2020
@jayanthvn
Contributor

Hi @mmochan ,

Can you please check whether you are hitting issue #1265? The root cause is described in #1265 (comment).

Thanks.

@mmochan
Author

mmochan commented Dec 7, 2020

Hi jayanthvn

AWS_VPC_K8S_PLUGIN_LOG_FILE is being set as expected

    Environment:
      ADDITIONAL_ENI_TAGS:                 {}
      AWS_VPC_CNI_NODE_PORT_SUPPORT:       true
      AWS_VPC_ENI_MTU:                     9001
      AWS_VPC_K8S_CNI_CONFIGURE_RPFILTER:  false
      AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG:  false
      AWS_VPC_K8S_CNI_EXTERNALSNAT:        false
      AWS_VPC_K8S_CNI_LOGLEVEL:            DEBUG
      AWS_VPC_K8S_CNI_LOG_FILE:            /host/var/log/aws-routed-eni/ipamd.log
      AWS_VPC_K8S_CNI_RANDOMIZESNAT:       prng
      AWS_VPC_K8S_CNI_VETHPREFIX:          eni
      AWS_VPC_K8S_PLUGIN_LOG_FILE:         /var/log/aws-routed-eni/plugin.log
      AWS_VPC_K8S_PLUGIN_LOG_LEVEL:        DEBUG
      DISABLE_INTROSPECTION:               false
      DISABLE_METRICS:                     false
      ENABLE_POD_ENI:                      false
      MY_NODE_NAME:                         (v1:spec.nodeName)
      WARM_ENI_TARGET:                     1

But /host/etc/cni/net.d/05-cilium.conflist doesn't match the 05-cilium.conflist shown in issue #1265:

{
  "cniVersion": "0.3.1",
  "name": "aws-cni",
  "plugins": [
    {
      "name": "aws-cni",
      "type": "aws-cni",
      "vethPrefix": "eni"
    },
    {
      "type": "portmap",
      "capabilities": {"portMappings": true},
      "snat": true
    },
    {
       "name": "cilium",
       "type": "cilium-cni",
       "enable-debug": false
    }
  ]
}

Thanks

@jayanthvn
Contributor

Hi @mmochan

Yes, you will have to add these two lines:

"pluginLogFile": "/var/log/aws-routed-eni/plugin.log",
"pluginLogLevel": "Debug"

Something like this:

{
  "cniVersion": "0.3.1",
  "name": "aws-cni",
  "plugins": [
    {
      "name": "aws-cni",
      "type": "aws-cni",
      "vethPrefix": "eni",
      "pluginLogFile": "/var/log/aws-routed-eni/plugin.log",
      "pluginLogLevel": "Debug"
    },
    {
      "type": "portmap",
      "capabilities": {"portMappings": true},
      "snat": true
    },
    {
       "name": "cilium",
       "type": "cilium-cni",
       "enable-debug": false
    }
  ]
}
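The manual edit above can also be scripted. This is a minimal sketch, assuming the conflist has the shape shown in this thread; the function name add_plugin_log_settings is hypothetical, and the two key/value pairs are copied from the workaround. It only transforms text and does not touch files on the node.

```python
import json

# Values copied from the workaround in this thread.
WORKAROUND_KEYS = {
    "pluginLogFile": "/var/log/aws-routed-eni/plugin.log",
    "pluginLogLevel": "Debug",
}

# The conflist as originally reported (without the two keys).
ORIGINAL_CONFLIST = """
{
  "cniVersion": "0.3.1",
  "name": "aws-cni",
  "plugins": [
    {"name": "aws-cni", "type": "aws-cni", "vethPrefix": "eni"},
    {"type": "portmap", "capabilities": {"portMappings": true}, "snat": true},
    {"name": "cilium", "type": "cilium-cni", "enable-debug": false}
  ]
}
"""


def add_plugin_log_settings(conflist_text):
    """Return the conflist with the workaround keys added to the aws-cni entry."""
    conf = json.loads(conflist_text)
    for plugin in conf.get("plugins", []):
        if plugin.get("type") == "aws-cni":
            for key, value in WORKAROUND_KEYS.items():
                # setdefault keeps any value that is already configured.
                plugin.setdefault(key, value)
    return json.dumps(conf, indent=2)


patched_text = add_plugin_log_settings(ORIGINAL_CONFLIST)
print(patched_text)
```

Because the output goes through json.dumps, it is always valid JSON, which avoids hand-editing mistakes such as a missing comma.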

@mmochan
Author

mmochan commented Dec 8, 2020

Hi @jayanthvn

Great that works, all pods now running.

Are you able to give an ETA for a permanent fix?

Thanks for your help.

Mike

@jayanthvn
Contributor

Good to know it works. #1275 is merged and we are planning the next release; I will provide the dates in a week or so.

@mmochan
Author

mmochan commented Dec 8, 2020

Thanks again @jayanthvn

@kovalyukm

kovalyukm commented Dec 14, 2020

Hi @mmochan

Yes, you will have to add these two lines:

"pluginLogFile": "/var/log/aws-routed-eni/plugin.log",
"pluginLogLevel": "Debug"

Something like this:

{
  "cniVersion": "0.3.1",
  "name": "aws-cni",
  "plugins": [
    {
      "name": "aws-cni",
      "type": "aws-cni",
      "vethPrefix": "eni",
      "pluginLogFile": "/var/log/aws-routed-eni/plugin.log",
      "pluginLogLevel": "Debug"
    },
    {
      "type": "portmap",
      "capabilities": {"portMappings": true},
      "snat": true
    },
    {
       "name": "cilium",
       "type": "cilium-cni",
       "enable-debug": false
    }
  ]
}

Unfortunately, this does not work for me.

Containers are now stuck with a different error:
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "bcea7cf2eac9dc94fed5316d1cde99c999102f732c3e32708bf5e1e05c666086" network for pod "coredns-59458dc98-7fqnj": networkPlugin cni failed to set up pod "coredns-59458dc98-7fqnj_kube-system" network: unable to create endpoint: Cilium API client timeout exceeded

There are also errors with stack traces in the cilium-agent log on the node, such as:
2020-12-14T16:17:03.756239102Z level=warning msg="Error fetching program/map!" subsys=datapath-loader
2020-12-14T16:17:03.756242896Z level=warning msg="Unable to load program" subsys=datapath-loader
2020-12-14T16:17:03.756820568Z level=warning msg="JoinEP: Failed to load program" containerID= datapathPolicyRevision=0 desiredPolicyRevision=17 endpointID=529 error="Failed to load tc filter: exit status 1" file-path=529_next/bpf_lxc.o identity=16387 ipv4= ipv6= k8sPodName=/ subsys=datapath-loader veth=eni684e9679747
2020-12-14T16:17:03.756830693Z level=error msg="Error while rewriting endpoint BPF program" containerID= datapathPolicyRevision=0 desiredPolicyRevision=17 endpointID=529 error="Failed to load tc filter: exit status 1" identity=16387 ipv4= ipv6= k8sPodName=/ subsys=endpoint
2020-12-14T16:17:03.756870133Z level=warning msg="generating BPF for endpoint failed, keeping stale directory." containerID= datapathPolicyRevision=0 desiredPolicyRevision=17 endpointID=529 file-path=529_next_fail identity=16387 ipv4= ipv6= k8sPodName=/ subsys=endpoint
2020-12-14T16:17:03.757081515Z level=warning msg="Regeneration of endpoint failed" bpfCompilation=0s bpfLoadProg=6.262089438s bpfWaitForELF="4.951µs" bpfWriteELF="150.352µs" containerID= datapathPolicyRevision=0 desiredPolicyRevision=17 endpointID=529 error="Failed to load tc filter: exit status 1" identity=16387 ipv4= ipv6= k8sPodName=/ mapSync="6.33µs" policyCalculation="9.073µs" prepareBuild="304.815µs" proxyConfiguration="12.346µs" proxyPolicyCalculation="25.738µs" proxyWaitForAck=0s reason="retrying regeneration" subsys=endpoint total=6.263766628s waitingForCTClean="465.631µs" waitingForLock="4.574µs"
2020-12-14T16:17:03.757217667Z level=error msg="endpoint regeneration failed" containerID= datapathPolicyRevision=0 desiredPolicyRevision=17 endpointID=529 error="Failed to load tc filter: exit status 1" identity=16387 ipv4= ipv6= k8sPodName=/ subsys=endpoint
2020-12-14T16:17:11.932298540Z level=error msg="Command execution failed" cmd="[tc filter replace dev eni684e9679747 egress prio 1 handle 1 bpf da obj 529_next/bpf_lxc.o sec to-container]" error="exit status 1" subsys=datapath-loader
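These agent lines are dense key=value records, so pulling out a few fields makes them easier to scan. The following is a rough sketch that assumes each line is a timestamp followed by space-separated key=value pairs with optional double quotes, as in the excerpt above; parse_agent_line is a hypothetical helper, not part of Cilium.

```python
import shlex

# Last line of the excerpt above, split across string literals for readability.
SAMPLE_LINE = (
    '2020-12-14T16:17:11.932298540Z level=error msg="Command execution failed" '
    'cmd="[tc filter replace dev eni684e9679747 egress prio 1 handle 1 bpf da '
    'obj 529_next/bpf_lxc.o sec to-container]" error="exit status 1" '
    'subsys=datapath-loader'
)


def parse_agent_line(line):
    """Split one log line into its timestamp and a dict of key=value fields."""
    timestamp, _, rest = line.partition(" ")
    fields = {}
    # shlex.split honours double quotes, so quoted values keep their spaces.
    for token in shlex.split(rest):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return timestamp, fields


timestamp, fields = parse_agent_line(SAMPLE_LINE)
print(fields["subsys"], fields["error"])
print(fields["cmd"])
```

In this excerpt, the fields worth comparing across lines are subsys, error, and cmd, which point at the tc filter load for the endpoint's veth device.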

Could you please help?

By the way, is there a Docker image with the fix above that I could use to check it?

Thanks!

P.S. This is my issue: cilium/cilium#14379 (comment)
I upgraded to:
Kubernetes version | 1.18
Amazon VPC CNI plug-in | 1.7.5
DNS (CoreDNS) | 1.7.0
KubeProxy | 1.18.9

@kovalyukm

CNI plugin v1.7 does not work with Cilium 1.9!
I've tested this on an EKS cluster created from scratch.
The workaround above does not help.

@jayanthvn @mmochan Could you please re-check?
Thanks!

@couralex6
Contributor

Hi @kovalyukm,

I was just able to run Cilium 1.9 in chaining mode with CNI v1.7.5. I added the following lines to /etc/cni/net.d/05-cilium.conflist (as mentioned in this comment).

"mtu": "9001",
"pluginLogFile": "/var/log/aws-routed-eni/plugin.log",
"pluginLogLevel": "Debug"

Can you make sure that:

  • you correctly updated /etc/cni/net.d/05-cilium.conflist on your instances
  • you installed Cilium in chaining mode, as described here

Please let me know if that works.

@kovalyukm

Hi @couralex6 ,

  • Yes, I've updated /etc/cni/net.d/05-cilium.conflist as described:
cat /etc/cni/net.d/05-cilium.conflist
{
  "cniVersion": "0.3.1",
  "name": "aws-cni",
  "plugins": [
    {
      "name": "aws-cni",
      "type": "aws-cni",
      "vethPrefix": "eni",
      "mtu": "9001",
      "pluginLogFile": "/var/log/aws-routed-eni/plugin.log",
      "pluginLogLevel": "Debug"
    },
    {
      "type": "portmap",
      "capabilities": {"portMappings": true},
      "snat": true
    },
    {
       "name": "cilium",
       "type": "cilium-cni",
       "enable-debug": false
    }
  ]
}

and I also tried the same settings in:

cat /etc/cni/net.d/10-aws.conflist
{
  "cniVersion": "0.3.1",
  "name": "aws-cni",
  "plugins": [
    {
      "name": "aws-cni",
      "type": "aws-cni",
      "vethPrefix": "eni",
      "mtu": "9001",
      "pluginLogFile": "/var/log/aws-routed-eni/plugin.log",
      "pluginLogLevel": "DEBUG"
    },
    {
      "type": "portmap",
      "capabilities": {"portMappings": true},
      "snat": true
    }
  ]
}
  • And installed Cilium in chaining mode like:
    cni:
#      customConf: true
      chainingMode: aws-cni
    masquerade: false

    tunnel: disabled
    nodeinit:
      # enables node initialization DaemonSet
      enabled: true

Maybe the issue is in the software versions. I use EKS Kubernetes 1.18 and Cilium 1.9.1.
Have you tried with these versions?

Thanks!

@couralex6
Contributor

@kovalyukm

It was also an EKS 1.18 cluster.

Your /etc/cni/net.d/10-aws.conflist looks right. You shouldn't have to modify it though.

Did you install Cilium through Helm as described here: https://docs.cilium.io/en/v1.9/gettingstarted/cni-chaining-aws-cni/ ?

@kovalyukm

@couralex6

It seems you are using Cilium 1.9.0.

Yes, I followed the Cilium docs to install it.

The workaround works with Cilium 1.9.0, but doesn't work with Cilium 1.9.1. (It seems this version is broken: cilium/cilium#14403 (comment))

Thanks, I am waiting for CNI v1.7.8 with the fix.

@kovalyukm

@couralex6 @jayanthvn

CNI v1.7.8 does not work; I get the same error: "invalid character '{' after top-level value".

@jayanthvn
Contributor

jayanthvn commented Dec 17, 2020

Hi @kovalyukm

Sure, we will try Cilium 1.9.1 and get back to you. Note, however, that @mmochan has tried Cilium 1.9.1 with the recommended workaround.

@shaikatz

@jayanthvn what is the ETA for releasing the fix, so that no manual modification of the nodes is required? Two versions have already been released since this PR was merged, but the fix was not included in either of them.

@jayanthvn
Contributor

Hi @shaikatz

Sorry for the delay. We will take this up in release 1.7.9, planned for January.

@mmochan
Author

mmochan commented Jan 7, 2021

Hi @jayanthvn,

Can you give an ETA on 1.7.9 release date?

Thanks

Mike

@springroll12

Any movement on this? This is a blocker for us as well.

@jayanthvn
Contributor

Thanks for your patience and sorry for the delay. We are working on the release; the fix will be part of 1.8. There are other changes that need to be tested, so we are still working on the timeline. I will keep you all updated.

@jayanthvn
Contributor

Hi,

We have integrated the fix as part of 1.7.9. Release candidate is out - https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.7.9-rc1. 1.7.9 Release should be out this week if there are no issues with the deployment. Will update if the date changes.

Thanks.

@mmochan
Author

mmochan commented Feb 18, 2021

@jayanthvn I have applied 1.7.9 and the same issue exists when new nodes are added.

kubectl describe daemonset aws-node -n kube-system | grep Image | cut -d "/" -f 2
amazon-k8s-cni-init:v1.7.9
amazon-k8s-cni:v1.7.9
cat /host/etc/cni/net.d/05-cilium.conflist
{
  "cniVersion": "0.3.1",
  "name": "aws-cni",
  "plugins": [
    {
      "name": "aws-cni",
      "type": "aws-cni",
      "vethPrefix": "eni"
    },
    {
      "type": "portmap",
      "capabilities": {"portMappings": true},
      "snat": true
    },
    {
       "name": "cilium",
       "type": "cilium-cni",
       "enable-debug": false
    }
  ]
}
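A quick way to tell whether a node's conflist carries the two settings from the earlier workaround is to parse the file and look at the aws-cni entry. This is a sketch only: missing_plugin_log_keys is a hypothetical helper, and the sample text mirrors the file shown above rather than being read from a node.

```python
import json

# The two settings named in the workaround earlier in this thread.
REQUIRED_KEYS = ("pluginLogFile", "pluginLogLevel")

# Mirrors the 05-cilium.conflist printed above.
NODE_CONFLIST = """
{
  "cniVersion": "0.3.1",
  "name": "aws-cni",
  "plugins": [
    {"name": "aws-cni", "type": "aws-cni", "vethPrefix": "eni"},
    {"type": "portmap", "capabilities": {"portMappings": true}, "snat": true},
    {"name": "cilium", "type": "cilium-cni", "enable-debug": false}
  ]
}
"""


def missing_plugin_log_keys(conflist_text):
    """Return the required keys that the aws-cni plugin entry does not define."""
    conf = json.loads(conflist_text)
    missing = []
    for plugin in conf.get("plugins", []):
        if plugin.get("type") == "aws-cni":
            missing.extend(key for key in REQUIRED_KEYS if key not in plugin)
    return missing


missing = missing_plugin_log_keys(NODE_CONFLIST)
print(missing)
```

An empty list would mean both keys are present; for the file above, both are reported missing.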

@couralex6
Contributor

Hi @mmochan,

Could you give me a little more context around your issue?

  • Which cilium version are you installing?
  • Which EKS version?
  • What steps are you following to add new nodes to your cluster?
  • What's in 10-aws.conflist on a node with and without the issue?

I just ran the following test to confirm the fix was working:

I created a new EKS 1.18 cluster, which was running CNI 1.7.5 (the default version). I then installed Cilium 1.9.4, which broke the cluster as expected. Installing CNI 1.7.9 then resolved the issue on all nodes. I haven't tried adding new nodes yet; I am waiting for your response before trying to reproduce.

Thanks

@mmochan
Author

mmochan commented Feb 19, 2021

Hi @couralex6

It turns out my CNI upgrade was being overwritten.

I can confirm that it is now working.

Apologies...
Mike

@couralex6
Contributor

Awesome, glad to hear it's working @mmochan!

@kovalyukm

Hello, @couralex6,

Something seems wrong with CNI 1.7.9 and Cilium 1.9.4 on an EKS 1.19 cluster (kube-proxy:v1.19.6-eksbuild.2, AMI v1.19.6-eks-49a6c0). IP assignment to pods works, but there are connectivity issues, and most pods are restarting with failed probes and timeouts.

Have you tested it on an EKS 1.19 cluster? Is everything fine there?

Thanks!

@couralex6
Contributor

Hi @kovalyukm , I just tested again on EKS 1.19 with CNI v1.7.9 chained with both Cilium v1.9.3 and v1.9.4.

I deployed a sample Nginx deployment and performed basic connectivity tests (ping between pods on same node and across nodes). Everything looked fine and I am not seeing any failed probes or timeouts.

Are you still experiencing the issue?

@kovalyukm

Hi @couralex6 , thank you for testing.

Yes, with Cilium 1.9.0 there is an issue on new nodes after nodes in the cluster are replaced (i/o timeouts in the kube-dns logs when reaching the VPC DNS servers, the EKS API, and so on; failed pod probes and pods in CrashLoopBackOff).

Upgrading to 1.9.4 fixes the issue, but if there are old nodes, another issue appears with pod creation (the cilium-agents throw exceptions in their logs).

Thanks!

@mmochan
Author

mmochan commented Mar 15, 2021

Hi @couralex6,

Today our EKS cluster had a node refresh and we are now facing the same issues originally reported in this issue/ticket.

We have been running Cilium 1.9.1 and CNI 1.7.9 on EKS v1.18.9 for the last 3 weeks, but as a result of the node refresh, CNI is not populating 05-cilium.conflist correctly:

cat /host/etc/cni/net.d/05-cilium.conflist
{
  "cniVersion": "0.3.1",
  "name": "aws-cni",
  "plugins": [
    {
      "name": "aws-cni",
      "type": "aws-cni",
      "vethPrefix": "eni"
    },
    {
      "type": "portmap",
      "capabilities": {"portMappings": true},
      "snat": true
    },
    {
       "name": "cilium",
       "type": "cilium-cni",
       "enable-debug": false
    }
  ]
}

We upgraded to Cilium 1.9.5 and refreshed all the nodes hoping that might fix it, but we are still facing the same issue.
