TCP offloading on vxlan.calico adaptor causing 63 second delays in VXLAN communications node->nodeport or node->clusterip:port. #3145

Closed
iamcaje opened this issue Jan 20, 2020 · 31 comments · Fixed by projectcalico/felix#2811

Comments

@iamcaje

iamcaje commented Jan 20, 2020

I am experiencing a 63 second delay in VXLAN communications node->node:nodeport or node->clusterip:port. After inspecting pcaps on both sending and receiving nodes it appears related to TCP offloading on the vxlan.calico interface. Disabling this through ethtool appears to 'resolve' the issue, but I'm entirely unsure whether this is a good idea or not, or if there's a better fix?

Expected Behavior

From a node, the following should work in all cases:

curl localhost:[nodeport]
curl [nodeip]:[nodeport]
curl [clusterip]:[port]

Current Behavior

consider the following:

A: node running pod [web], a simple web service (containous/whoami)
B: node running pod [alpine], a base container; exec sh
C: node not doing anything in particular.

With service defined:

$ kubectl describe svc whoami-cluster-nodeport
  ...
    Type:                     NodePort
    NodePort:                 web  32081/TCP
    External Traffic Policy:  Cluster

I get the following results when trying to access the web service:

from/to        [web_ip]:80  [cluster_ip]:80  localhost:32081  [a_ip]:32081  [b_ip]:32081  [c_ip]:32081
node A         ok           ok               ok               ok            ok            ok
node B         ok           63 seconds       63 seconds       ok            63 seconds    ok
node C         ok           63 seconds       63 seconds       ok            ok            63 seconds
pod alpine     ok           ok               -                ok            63 seconds    ok
external host  -            -                -                ok            ok            ok

Further, if I change the replica count so that the pods run on A and C (round-robin load balancing between the two nodes), the 63-second delay occurs about half of the time from those hosting nodes:

  • from A: curl localhost:32081
  • from A: curl [node_a_ip]:32081

The problem seems to occur when traffic originating on one node is routed to a pod on another node. I ran a trace and a tcpdump from C -> A (via localhost:32081 on C; see below). On both nodes, the tcpdump shows repeated SYN packets attempting to establish the connection. They all show "bad udp cksum 0xffff -> 0x76dc!" in the results. After 63 seconds, a SYN packet is sent with 'no cksum' and the connection is established.

I had to disable TCP Offloading... after issuing this command, curl localhost:32081 worked consistently on all nodes.

# ethtool --offload vxlan.calico rx off tx off
Actual changes:
rx-checksumming: off
tx-checksumming: off
        tx-checksum-ip-generic: off
tcp-segmentation-offload: off
        tx-tcp-segmentation: off [requested on]
        tx-tcp-ecn-segmentation: off [requested on]
        tx-tcp6-segmentation: off [requested on]
        tx-tcp-mangleid-segmentation: off [requested on]
udp-fragmentation-offload: off [requested on]
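
To confirm the state before and after running the command above, the output of ethtool --show-offload can be checked for the two top-level checksumming lines. This is only a minimal sketch: checksum_offload_state is a hypothetical helper name (not part of ethtool), and it assumes the "rx-checksumming: on/off" line format shown in the output above.

```shell
#!/bin/sh
# Sketch: report whether rx or tx checksum offload is still on.
# Reads the text printed by `ethtool --show-offload <iface>` on stdin.
# checksum_offload_state is a hypothetical helper, not an ethtool feature.
checksum_offload_state() {
  # Keep only the top-level "rx-checksumming:" / "tx-checksumming:" lines,
  # then look for any that are still "on".
  if grep -E '^[rt]x-checksumming:' | grep -q ': on'; then
    echo "enabled"
  else
    echo "disabled"
  fi
}
```

Usage would be something like: ethtool --show-offload vxlan.calico | checksum_offload_state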

So I'm entirely unsure whether this is a good idea. Is there a way to fix this through iptables? Or does this need to be fixed in the OS or hosting layer (vSphere)?

TRACE from Node C (client)

# sudo perf trace --no-syscalls --event 'net:*' wget -q -O /dev/null localhost:32081
     0.000 net:net_dev_queue:dev=vxlan.calico skbaddr=0xffff8c6ed93d00f8 len=66
     0.033 net:net_dev_queue:dev=eth0 skbaddr=0xffff8c6ed93d00f8 len=116
     0.084 net:net_dev_xmit:dev=eth0 skbaddr=0xffff8c6ed93d00f8 len=116 rc=0
     0.088 net:net_dev_xmit:dev=vxlan.calico skbaddr=0xffff8c6ed93d00f8 len=66     rc=0
 63122.053 net:net_dev_queue:dev=vxlan.calico skbaddr=0xffff8c607b6be0f8     len=165
 63122.070 net:net_dev_queue:dev=eth0 skbaddr=0xffff8c607b6be0f8 len=215
 63122.078 net:net_dev_xmit:dev=eth0 skbaddr=0xffff8c607b6be0f8 len=215 rc=0
 63122.080 net:net_dev_xmit:dev=vxlan.calico skbaddr=0xffff8c607b6be0f8     len=165 rc=0
 63123.135 net:net_dev_queue:dev=vxlan.calico skbaddr=0xffff8c607b6b90f8 len=54
 63123.154 net:net_dev_queue:dev=eth0 skbaddr=0xffff8c607b6b90f8 len=104
 63123.162 net:net_dev_xmit:dev=eth0 skbaddr=0xffff8c607b6b90f8 len=104 rc=0
 63123.165 net:net_dev_xmit:dev=vxlan.calico skbaddr=0xffff8c607b6b90f8 len=54 rc=0

TCPDUMP from Node C (client)

# tcpdump -vv host [node_a_ip]
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
11:51:31.656624 IP (tos 0x0, ttl 64, id 59791, offset 0, flags [none], proto UDP (17), length 102)
    nodeC.local.26898 > nodeA.local.4789: [bad udp cksum 0xffff -> 0x76dc!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8153, offset 0, flags [DF], proto TCP (6), length 52)
    nodeC.39274 > 10.244.143.193.http: Flags [S], cksum 0xe849 (correct), seq 2295512834, win 43690, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
11:51:32.657185 IP (tos 0x0, ttl 64, id 60020, offset 0, flags [none], proto UDP (17), length 102)
    nodeC.local.26898 > nodeA.local.4789: [bad udp cksum 0xffff -> 0x76dc!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8154, offset 0, flags [DF], proto TCP (6), length 52)
    nodeC.39274 > 10.244.143.193.http: Flags [S], cksum 0xe849 (correct), seq 2295512834, win 43690, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
11:51:34.661166 IP (tos 0x0, ttl 64, id 61933, offset 0, flags [none], proto UDP (17), length 102)
    nodeC.local.26898 > nodeA.local.4789: [bad udp cksum 0xffff -> 0x76dc!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8155, offset 0, flags [DF], proto TCP (6), length 52)
    nodeC.39274 > 10.244.143.193.http: Flags [S], cksum 0xe849 (correct), seq 2295512834, win 43690, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
11:51:36.669116 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has nodeA.local tell nodeC.local, length 28
11:51:36.669385 ARP, Ethernet (len 6), IPv4 (len 4), Reply nodeA.local is-at 00:50:56:a3:d4:91 (oui Unknown), length 46
11:51:38.669155 IP (tos 0x0, ttl 64, id 65370, offset 0, flags [none], proto UDP (17), length 102)
    nodeC.local.26898 > nodeA.local.4789: [bad udp cksum 0xffff -> 0x76dc!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8156, offset 0, flags [DF], proto TCP (6), length 52)
    nodeC.39274 > 10.244.143.193.http: Flags [S], cksum 0xe849 (correct), seq 2295512834, win 43690, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
11:51:46.685204 IP (tos 0x0, ttl 64, id 3142, offset 0, flags [none], proto UDP (17), length 102)
    nodeC.local.26898 > nodeA.local.4789: [bad udp cksum 0xffff -> 0x76dc!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8157, offset 0, flags [DF], proto TCP (6), length 52)
    nodeC.39274 > 10.244.143.193.http: Flags [S], cksum 0xe849 (correct), seq 2295512834, win 43690, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
11:52:02.733178 IP (tos 0x0, ttl 64, id 5028, offset 0, flags [none], proto UDP (17), length 102)
    nodeC.local.26898 > nodeA.local.4789: [bad udp cksum 0xffff -> 0x76dc!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8158, offset 0, flags [DF], proto TCP (6), length 52)
    nodeC.39274 > 10.244.143.193.http: Flags [S], cksum 0xe849 (correct), seq 2295512834, win 43690, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
11:52:34.797203 IP (tos 0x0, ttl 64, id 30608, offset 0, flags [none], proto UDP (17), length 102)
    nodeC.local.60563 > nodeA.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8159, offset 0, flags [DF], proto TCP (6), length 52)
    nodeC.39274 > 10.244.143.193.http: Flags [S], cksum 0xe849 (correct), seq 2295512834, win 43690, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
11:52:34.797731 IP (tos 0x0, ttl 64, id 33394, offset 0, flags [none], proto UDP (17), length 102)
    nodeA.local.36276 > nodeC.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 52)
    10.244.143.193.http > nodeC.39274: Flags [S.], cksum 0x4348 (correct), seq 4013223269, ack 2295512835, win 28000, options [mss 1400,nop,nop,sackOK,nop,wscale 7], length 0
11:52:34.797854 IP (tos 0x0, ttl 64, id 30609, offset 0, flags [none], proto UDP (17), length 90)
    nodeC.local.60563 > nodeA.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8160, offset 0, flags [DF], proto TCP (6), length 40)
    nodeC.39274 > 10.244.143.193.http: Flags [.], cksum 0xefe8 (correct), seq 1, ack 1, win 342, length 0
11:52:34.797936 IP (tos 0x0, ttl 64, id 30610, offset 0, flags [none], proto UDP (17), length 203)
    nodeC.local.60563 > nodeA.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8161, offset 0, flags [DF], proto TCP (6), length 153)
    nodeC.39274 > 10.244.143.193.http: Flags [P.], cksum 0x7ffc (correct), seq 1:114, ack 1, win 342, length 113: HTTP, length: 113
        GET / HTTP/1.1
        User-Agent: Wget/1.14 (linux-gnu)
        Accept: */*
        Host: localhost:32081
        Connection: Keep-Alive

11:52:34.798117 IP (tos 0x0, ttl 64, id 33395, offset 0, flags [none], proto UDP (17), length 90)
    nodeA.local.37789 > nodeC.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 26251, offset 0, flags [DF], proto TCP (6), length 40)
    10.244.143.193.http > nodeC.39274: Flags [.], cksum 0xeff2 (correct), seq 1, ack 114, win 219, length 0
11:52:34.798547 IP (tos 0x0, ttl 64, id 33396, offset 0, flags [none], proto UDP (17), length 458)
    nodeA.local.37789 > nodeC.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 26252, offset 0, flags [DF], proto TCP (6), length 408)
    10.244.143.193.http > nodeC.39274: Flags [P.], cksum 0xe168 (correct), seq 1:369, ack 114, win 219, length 368: HTTP, length: 368
        HTTP/1.1 200 OK
        Date: Mon, 20 Jan 2020 16:52:34 GMT
        Content-Length: 250
        Content-Type: text/plain; charset=utf-8

        Hostname: whoami-66686d967d-mzk8p
        IP: 127.0.0.1
        IP: ::1
        IP: 10.244.143.193
        IP: fe80::d830:89ff:fe7f:2703
        RemoteAddr: 10.244.90.192:39274
        GET / HTTP/1.1
        Host: localhost:32081
        User-Agent: Wget/1.14 (linux-gnu)
        Accept: */*
        Connection: Keep-Alive

11:52:34.798602 IP (tos 0x0, ttl 64, id 30611, offset 0, flags [none], proto UDP (17), length 90)
    nodeC.local.60563 > nodeA.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8162, offset 0, flags [DF], proto TCP (6), length 40)
    nodeC.39274 > 10.244.143.193.http: Flags [.], cksum 0xedff (correct), seq 114, ack 369, win 350, length 0
11:52:34.799551 IP (tos 0x0, ttl 64, id 30612, offset 0, flags [none], proto UDP (17), length 90)
    nodeC.local.60563 > nodeA.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8163, offset 0, flags [DF], proto TCP (6), length 40)
    nodeC.39274 > 10.244.143.193.http: Flags [F.], cksum 0xedfe (correct), seq 114, ack 369, win 350, length 0
11:52:34.799731 IP (tos 0x0, ttl 64, id 33397, offset 0, flags [none], proto UDP (17), length 90)
    nodeA.local.37789 > nodeC.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 26253, offset 0, flags [DF], proto TCP (6), length 40)
    10.244.143.193.http > nodeC.39274: Flags [F.], cksum 0xee80 (correct), seq 369, ack 115, win 219, length 0
11:52:34.799779 IP (tos 0x0, ttl 64, id 30613, offset 0, flags [none], proto UDP (17), length 90)
    nodeC.local.60563 > nodeA.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8164, offset 0, flags [DF], proto TCP (6), length 40)
    nodeC.39274 > 10.244.143.193.http: Flags [.], cksum 0xedfd (correct), seq 115, ack 370, win 350, length 0
11:52:39.805144 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has nodeA.local tell nodeC.local, length 28
11:52:39.805399 ARP, Ethernet (len 6), IPv4 (len 4), Reply nodeA.local is-at 00:50:56:a3:d4:91 (oui Unknown), length 46

TCPDUMP from Node A (hosting service pod)

# tcpdump -vv host [node_c_ip]
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
11:51:31.656705 IP (tos 0x0, ttl 64, id 59791, offset 0, flags [none], proto UDP (17), length 102)
    nodeC.local.26898 > nodeA.local.4789: [bad udp cksum 0xffff -> 0x76dc!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8153, offset 0, flags [DF], proto TCP (6), length 52)
    10.244.90.192.39274 > 10.244.143.193.http: Flags [S], cksum 0xe849 (correct), seq 2295512834, win 43690, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
11:51:32.657297 IP (tos 0x0, ttl 64, id 60020, offset 0, flags [none], proto UDP (17), length 102)
    nodeC.local.26898 > nodeA.local.4789: [bad udp cksum 0xffff -> 0x76dc!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8154, offset 0, flags [DF], proto TCP (6), length 52)
    10.244.90.192.39274 > 10.244.143.193.http: Flags [S], cksum 0xe849 (correct), seq 2295512834, win 43690, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
11:51:34.661301 IP (tos 0x0, ttl 64, id 61933, offset 0, flags [none], proto UDP (17), length 102)
    nodeC.local.26898 > nodeA.local.4789: [bad udp cksum 0xffff -> 0x76dc!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8155, offset 0, flags [DF], proto TCP (6), length 52)
    10.244.90.192.39274 > 10.244.143.193.http: Flags [S], cksum 0xe849 (correct), seq 2295512834, win 43690, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
11:51:36.669200 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has nodeA.local tell nodeC.local, length 46
11:51:36.669224 ARP, Ethernet (len 6), IPv4 (len 4), Reply nodeA.local is-at 00:50:56:a3:d4:91 (oui Unknown), length 28
11:51:38.669297 IP (tos 0x0, ttl 64, id 65370, offset 0, flags [none], proto UDP (17), length 102)
    nodeC.local.26898 > nodeA.local.4789: [bad udp cksum 0xffff -> 0x76dc!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8156, offset 0, flags [DF], proto TCP (6), length 52)
    10.244.90.192.39274 > 10.244.143.193.http: Flags [S], cksum 0xe849 (correct), seq 2295512834, win 43690, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
11:51:46.685305 IP (tos 0x0, ttl 64, id 3142, offset 0, flags [none], proto UDP (17), length 102)
    nodeC.local.26898 > nodeA.local.4789: [bad udp cksum 0xffff -> 0x76dc!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8157, offset 0, flags [DF], proto TCP (6), length 52)
    10.244.90.192.39274 > 10.244.143.193.http: Flags [S], cksum 0xe849 (correct), seq 2295512834, win 43690, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
11:52:02.733268 IP (tos 0x0, ttl 64, id 5028, offset 0, flags [none], proto UDP (17), length 102)
    nodeC.local.26898 > nodeA.local.4789: [bad udp cksum 0xffff -> 0x76dc!] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8158, offset 0, flags [DF], proto TCP (6), length 52)
    10.244.90.192.39274 > 10.244.143.193.http: Flags [S], cksum 0xe849 (correct), seq 2295512834, win 43690, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
11:52:34.797359 IP (tos 0x0, ttl 64, id 30608, offset 0, flags [none], proto UDP (17), length 102)
    nodeC.local.60563 > nodeA.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8159, offset 0, flags [DF], proto TCP (6), length 52)
    10.244.90.192.39274 > 10.244.143.193.http: Flags [S], cksum 0xe849 (correct), seq 2295512834, win 43690, options [mss 65495,nop,nop,sackOK,nop,wscale 7], length 0
11:52:34.797533 IP (tos 0x0, ttl 64, id 33394, offset 0, flags [none], proto UDP (17), length 102)
    nodeA.local.36276 > nodeC.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 52)
    10.244.143.193.http > 10.244.90.192.39274: Flags [S.], cksum 0x4348 (correct), seq 4013223269, ack 2295512835, win 28000, options [mss 1400,nop,nop,sackOK,nop,wscale 7], length 0
11:52:34.797895 IP (tos 0x0, ttl 64, id 30609, offset 0, flags [none], proto UDP (17), length 90)
    nodeC.local.60563 > nodeA.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8160, offset 0, flags [DF], proto TCP (6), length 40)
    10.244.90.192.39274 > 10.244.143.193.http: Flags [.], cksum 0xefe8 (correct), seq 1, ack 1, win 342, length 0
11:52:34.797960 IP (tos 0x0, ttl 64, id 30610, offset 0, flags [none], proto UDP (17), length 203)
    nodeC.local.60563 > nodeA.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8161, offset 0, flags [DF], proto TCP (6), length 153)
    10.244.90.192.39274 > 10.244.143.193.http: Flags [P.], cksum 0x7ffc (correct), seq 1:114, ack 1, win 342, length 113: HTTP, length: 113
        GET / HTTP/1.1
        User-Agent: Wget/1.14 (linux-gnu)
        Accept: */*
        Host: localhost:32081
        Connection: Keep-Alive

11:52:34.797995 IP (tos 0x0, ttl 64, id 33395, offset 0, flags [none], proto UDP (17), length 90)
    nodeA.local.37789 > nodeC.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 26251, offset 0, flags [DF], proto TCP (6), length 40)
    10.244.143.193.http > 10.244.90.192.39274: Flags [.], cksum 0xeff2 (correct), seq 1, ack 114, win 219, length 0
11:52:34.798460 IP (tos 0x0, ttl 64, id 33396, offset 0, flags [none], proto UDP (17), length 458)
    nodeA.local.37789 > nodeC.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 26252, offset 0, flags [DF], proto TCP (6), length 408)
    10.244.143.193.http > 10.244.90.192.39274: Flags [P.], cksum 0xe168 (correct), seq 1:369, ack 114, win 219, length 368: HTTP, length: 368
        HTTP/1.1 200 OK
        Date: Mon, 20 Jan 2020 16:52:34 GMT
        Content-Length: 250
        Content-Type: text/plain; charset=utf-8

        Hostname: whoami-66686d967d-mzk8p
        IP: 127.0.0.1
        IP: ::1
        IP: 10.244.143.193
        IP: fe80::d830:89ff:fe7f:2703
        RemoteAddr: 10.244.90.192:39274
        GET / HTTP/1.1
        Host: localhost:32081
        User-Agent: Wget/1.14 (linux-gnu)
        Accept: */*
        Connection: Keep-Alive

11:52:34.798635 IP (tos 0x0, ttl 64, id 30611, offset 0, flags [none], proto UDP (17), length 90)
    nodeC.local.60563 > nodeA.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8162, offset 0, flags [DF], proto TCP (6), length 40)
    10.244.90.192.39274 > 10.244.143.193.http: Flags [.], cksum 0xedff (correct), seq 114, ack 369, win 350, length 0
11:52:34.799589 IP (tos 0x0, ttl 64, id 30612, offset 0, flags [none], proto UDP (17), length 90)
    nodeC.local.60563 > nodeA.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8163, offset 0, flags [DF], proto TCP (6), length 40)
    10.244.90.192.39274 > 10.244.143.193.http: Flags [F.], cksum 0xedfe (correct), seq 114, ack 369, win 350, length 0
11:52:34.799655 IP (tos 0x0, ttl 64, id 33397, offset 0, flags [none], proto UDP (17), length 90)
    nodeA.local.37789 > nodeC.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 63, id 26253, offset 0, flags [DF], proto TCP (6), length 40)
    10.244.143.193.http > 10.244.90.192.39274: Flags [F.], cksum 0xee80 (correct), seq 369, ack 115, win 219, length 0
11:52:34.799814 IP (tos 0x0, ttl 64, id 30613, offset 0, flags [none], proto UDP (17), length 90)
    nodeC.local.60563 > nodeA.local.4789: [no cksum] VXLAN, flags [I] (0x08), vni 4096
IP (tos 0x0, ttl 64, id 8164, offset 0, flags [DF], proto TCP (6), length 40)
    10.244.90.192.39274 > 10.244.143.193.http: Flags [.], cksum 0xedfd (correct), seq 115, ack 370, win 350, length 0
11:52:39.805231 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has nodeA.local tell nodeC.local, length 46
11:52:39.805256 ARP, Ethernet (len 6), IPv4 (len 4), Reply nodeA.local is-at 00:50:56:a3:d4:91 (oui Unknown), length 28
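
For anyone comparing their own captures against the two dumps above, a rough way to summarise them is to count how many VXLAN packets carry a bad outer UDP checksum versus no checksum at all. This is a sketch only: vxlan_cksum_summary is a hypothetical helper, and it assumes the capture was saved as text from tcpdump -vv in the same format as above.

```shell
#!/bin/sh
# Sketch: summarise outer-UDP checksum status in saved `tcpdump -vv` text.
# vxlan_cksum_summary is a hypothetical helper; it reads the capture on stdin.
vxlan_cksum_summary() {
  awk '
    /VXLAN/ && /bad udp cksum/ { bad++ }    # outer checksum present but wrong
    /VXLAN/ && /no cksum/      { none++ }   # outer checksum field left at zero
    END { printf "bad=%d none=%d\n", bad, none }
  '
}
```

In the captures above, only the packets counted as "none" belong to the connection attempt that finally succeeds.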

Possible Solution

# ethtool --offload vxlan.calico rx off tx off

Steps to Reproduce (for bugs)

  1. kubeadm init
  2. install calico configured for vxlan
  3. install a simple web pod and a nodeport service
  4. attempt to access nodeport from cluster

Context

There are times when we need to access a service from a node (e.g., log shipping from the node to a hosted service, hosted app API access, a k8s-hosted registry), and this defect interferes with normal communications.

Your Environment

  • Calico version: v3.11
  • Orchestrator version (e.g. kubernetes, mesos, rkt): k8s 1.17
  • Operating System and version: RHEL 7 (version 3.10.0-1062.9.1.el7.x86_64)
  • Link to your project (optional):
@KarlHerler

KarlHerler commented Jan 26, 2020

Hello,

We have the exact same issue on CentOS 7 (3.10.0-1062.9.1.el7.x86_64) running Calico/Canal and flannel. After spending the better part of a week trying to figure out why two thirds of our cluster was unable to reliably talk to the other third, I stumbled upon this issue.

I can report that disabling offloading completely works around this issue. In our case, the command is: sudo ethtool --offload flannel.1 rx off tx off (because we are running flannel).

Your Environment

  • Calico version: v3.11
  • Orchestrator version (e.g. kubernetes, mesos, rkt): k8s 1.17.1
  • Operating System and version: CentOS 7 (3.10.0-1062.9.1.el7.x86_64)
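
Since the interface name differs by plugin (vxlan.calico for Calico, flannel.1 for flannel), one option is to discover the VXLAN devices instead of hard-coding the name. The following is a sketch under assumptions: vxlan_iface_names is a hypothetical helper, and it expects the one-line-per-device output of ip -o link show type vxlan from a reasonably recent iproute2.

```shell
#!/bin/sh
# Sketch: extract interface names from `ip -o link show type vxlan` output.
# vxlan_iface_names is a hypothetical helper; it reads that output on stdin.
# Each input line looks like: "7: vxlan.calico: <BROADCAST,...> mtu 1410 ..."
vxlan_iface_names() {
  awk -F': ' '{
    name = $2
    sub(/@.*/, "", name)   # drop any "@parent" suffix
    print name
  }'
}
```

The names could then be fed to the workaround, for example: ip -o link show type vxlan | vxlan_iface_names | while read -r nic; do ethtool --offload "$nic" rx off tx off; done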

@3cky

3cky commented Feb 6, 2020

Same issue here: we see a 63-second delay when connecting to ClusterIP services in a CentOS 7 / k8s 1.7.2 / Calico 3.12.0 cluster running on Hetzner Cloud. Disabling Ethernet offloading resolves the connection issues.

@phantooom

phantooom commented Feb 6, 2020

ethtool --offload interface rx off tx off

Thanks, it works (flannel + k8s 1.17.2 on CentOS 7).
But how did you work out that disabling TCP offload would help?

@davesargrad

davesargrad commented Mar 20, 2020

@jelaryma I am having a similar problem. kubernetes/kubernetes#88986
I also measured the 63 second delay and I am using flannel.

I came to think it was a flannel issue, but you are also seeing this with Calico.
flannel-io/flannel#1268

Is it your thought that this is a k8s bug?

Also, would I run the following command on every node of the cluster?
sudo ethtool --offload flannel.1 rx off tx off

@caseydavenport
Member

There's a thread in SIG network about this: https://groups.google.com/forum/#!topic/kubernetes-sig-network/JxkTLd4M8WM

Summary so far seems to be that this is a kernel bug related to VXLAN offload where the checksum calculation is not properly offloaded.

@hien

hien commented Mar 30, 2020

Same issue with k8s 1.18 + Ubuntu 16.04 with Calico 3.11, although here it only causes 3-second delays.

ethtool --offload interface rx off tx off works around it perfectly!

@gamer22026

Flannel now has an open PR addressing this. flannel-io/flannel#1282

@zhangguanzhang
Contributor

see this flannel-io/flannel#1282 (comment)

@giordyb

giordyb commented Jun 12, 2020

I have been hit by this bug as well. Does anyone know where to add the ethtool command to make it persistent after a reboot on CentOS 7? I tried adding it to rc.local, but it looks like the device is created after the script runs, because I get a "Cannot get device feature names: No such device" error.

@Bowser1704

@jelaryma
> After 63 seconds, a SYN packet is sent with 'no cksum' and the connection is established.

The 63 seconds may correspond to 5 retransmissions. But I'm still confused about the cause of this issue. Do you know of any articles or blog posts about the 'no cksum' flag?
Thanks.
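
As a quick check of that arithmetic: in the Node C capture earlier in the thread, the SYNs are spaced roughly 1, 2, 4, 8, 16 and 32 seconds apart, which looks like six retransmissions rather than five. A minimal sketch of the sum, assuming a 1-second initial timeout that doubles on every retry (syn_backoff_total is a hypothetical helper name):

```shell
#!/bin/sh
# Sketch: total time spent waiting across N SYN retransmissions, assuming
# the timeout starts at 1 s and doubles each time (1 + 2 + 4 + ...).
# syn_backoff_total is a hypothetical helper; $1 is the retransmission count.
syn_backoff_total() {
  retries=$1
  delay=1
  total=0
  i=0
  while [ "$i" -lt "$retries" ]; do
    total=$((total + delay))   # wait this long before the next SYN
    delay=$((delay * 2))       # exponential backoff
    i=$((i + 1))
  done
  echo "$total"
}
```

With these assumptions, syn_backoff_total 6 gives 63, matching the delay reported in this issue, while syn_backoff_total 5 gives 31.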

@zhangguanzhang
Contributor

> @jelaryma
> After 63 seconds, a SYN packet is sent with 'no cksum' and the connection is established.
>
> The 63 seconds may correspond to 5 retransmissions. But I'm still confused about the cause of this issue. Do you know of any articles or blog posts about the 'no cksum' flag?
> Thanks.

https://zhangguanzhang.github.io/2020/05/23/k8s-vxlan-63-timeout/

@balleon

balleon commented Jul 4, 2020

> I have been hit by this bug as well. Does anyone know where to add the ethtool command to make it persistent after a reboot on CentOS 7? I tried adding it to rc.local, but it looks like the device is created after the script runs, because I get a "Cannot get device feature names: No such device" error.

Did you find a solution for a persistent fix after reboot?

@giordyb

giordyb commented Jul 4, 2020

> Did you find a solution for a persistent fix after reboot?

@balleon Nope, thankfully the servers don't get rebooted very often...

@xiaods

xiaods commented Jul 9, 2020

any better solution?

@balleon

balleon commented Jul 9, 2020

> any better solution?

On my Kubernetes 1.18.5 and CentOS 7 cluster, I use a custom kube-proxy image:

FROM k8s.gcr.io/kube-proxy:v1.18.5
RUN rm -f /usr/sbin/iptables && clean-install iptables

@ctopher7

> Hello,
>
> We have the exact same issue on CentOS 7 (3.10.0-1062.9.1.el7.x86_64) running Calico/Canal and flannel. After spending the better part of a week trying to figure out why two thirds of our cluster was unable to reliably talk to the other third, I stumbled upon this issue.
>
> I can report that disabling offloading completely works around this issue. In our case, the command is: sudo ethtool --offload flannel.1 rx off tx off (because we are running flannel).
>
> Your Environment
>
>   • Calico version: v3.11
>   • Orchestrator version (e.g. kubernetes, mesos, rkt): k8s 1.17.1
>   • Operating System and version: CentOS 7 (3.10.0-1062.9.1.el7.x86_64)

THIS WORKS!!!

@balleon

balleon commented Aug 9, 2020

The latest Kubernetes release should fix this issue.
https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.18.md#changes-by-kind

Fixes a problem with 63-second or 1-second connection delays with some VXLAN-based network plugins which was first widely noticed in 1.16 (though some users saw it earlier than that, possibly only with specific network plugins). If you were previously using ethtool to disable checksum offload on your primary network interface, you should now be able to stop doing that. (#92035, @danwinship) [SIG Network and Node]

I tried it in the following environment:

  • CentOS 7.8 (3.10.0-1127)
  • Kubernetes 1.18.6
  • Calico 1.15.1 (VXLAN)

The problem is still there; I have to run sudo ethtool --offload vxlan.calico rx off tx off on all hosts to have a working cluster.
Did you find a fix that doesn't require disabling offload on the vxlan.calico interface?
This workaround doesn't persist across reboots, so it can't be applied in a production environment.

@zhangguanzhang
Contributor

> The latest Kubernetes release should fix this issue.
> https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.18.md#changes-by-kind
>
> Fixes a problem with 63-second or 1-second connection delays with some VXLAN-based network plugins which was first widely noticed in 1.16 (though some users saw it earlier than that, possibly only with specific network plugins). If you were previously using ethtool to disable checksum offload on your primary network interface, you should now be able to stop doing that. (#92035, @danwinship) [SIG Network and Node]
>
> I tried it in the following environment:
>
>   • CentOS 7.8 (3.10.0-1127)
>   • Kubernetes 1.18.6
>   • Calico 1.15.1 (VXLAN)
>
> The problem is still there; I have to run sudo ethtool --offload vxlan.calico rx off tx off on all hosts to have a working cluster.
> Did you find a fix that doesn't require disabling offload on the vxlan.calico interface?
> This workaround doesn't persist across reboots, so it can't be applied in a production environment.

@danwinship PTAL

@couloum

couloum commented Nov 26, 2020

I have the same issue here, tested on Kubernetes v1.18.12 with Calico v3.17.

I tried the new option FELIX_FEATUREDETECTOVERRIDE="MASQFullyRandom=false", but kube-proxy still generates a MASQUERADE rule with random-fully.

With FELIX_FEATUREDETECTOVERRIDE="MASQFullyRandom=false":

# iptables -t nat -L -n | grep fully
MASQUERADE  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service traffic requiring SNAT */ random-fully

Without FELIX_FEATUREDETECTOVERRIDE="MASQFullyRandom=false":

# iptables -t nat -L -n | grep fully
MASQUERADE  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service traffic requiring SNAT */ random-fully
MASQUERADE  all  --  0.0.0.0/0            0.0.0.0/0            /* cali:e9dnSgSVNmIcpVhP */ ADDRTYPE match src-type !LOCAL limit-out ADDRTYPE match src-type LOCAL random-fully
MASQUERADE  all  --  0.0.0.0/0            0.0.0.0/0            /* cali:flqWnvo8yq4ULQLa */ match-set cali40masq-ipam-pools src ! match-set cali40all-ipam-pools dst random-fully

I was able to solve it by manually adding a MASQUERADE rule without random-fully option:

# dig +timeout=2 @10.96.0.10 google.com
; <<>> DiG 9.16.1-Ubuntu <<>> +timeout @10.96.0.10 google.com
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached

# iptables -t nat -L KUBE-POSTROUTING -n -v
Chain KUBE-POSTROUTING (1 references)
 pkts bytes target     prot opt in     out     source               destination
 1049 59949 RETURN     all  --  *      *       0.0.0.0/0            0.0.0.0/0            mark match ! 0x4000/0x4000
   10   807 MARK       all  --  *      *       0.0.0.0/0            0.0.0.0/0            MARK xor 0x4000
   10   807 MASQUERADE  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes service traffic requiring SNAT */ random-fully

# iptables -t nat -I KUBE-POSTROUTING 3 -j MASQUERADE

# iptables -t nat -L KUBE-POSTROUTING -n -v
Chain KUBE-POSTROUTING (1 references)
 pkts bytes target     prot opt in     out     source               destination
 1064 60774 RETURN     all  --  *      *       0.0.0.0/0            0.0.0.0/0            mark match ! 0x4000/0x4000
   10   807 MARK       all  --  *      *       0.0.0.0/0            0.0.0.0/0            MARK xor 0x4000
    0     0 MASQUERADE  all  --  *      *       0.0.0.0/0            0.0.0.0/0
   10   807 MASQUERADE  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes service traffic requiring SNAT */ random-fully

# dig +timeout=2 @10.96.0.10 google.com
; <<>> DiG 9.16.1-Ubuntu <<>> +timeout @10.96.0.10 google.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 7434
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;google.com.                    IN      A
;; ANSWER SECTION:
google.com.             30      IN      A       142.250.74.238
;; Query time: 0 msec
;; SERVER: 10.96.0.10#53(10.96.0.10)
;; WHEN: Thu Nov 26 11:05:49 UTC 2020
;; MSG SIZE  rcvd: 65

But that's not reboot proof :( We should have an option in kube-proxy to disable the random-fully option.
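
To see which rules would be affected on a given node, the nat table can be dumped and filtered for MASQUERADE rules that carry the flag. This is a sketch, not a fix: masq_random_fully_rules is a hypothetical helper, and it assumes the input is in iptables-save format, where the option appears as --random-fully.

```shell
#!/bin/sh
# Sketch: print the MASQUERADE rules that use --random-fully.
# masq_random_fully_rules is a hypothetical helper; it reads
# `iptables-save -t nat` output on stdin.
masq_random_fully_rules() {
  # -e is needed because the pattern itself starts with a dash.
  grep -E -e '-j MASQUERADE.*--random-fully'
}
```

Usage would be something like: iptables-save -t nat | masq_random_fully_rules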

@couloum

couloum commented Nov 26, 2020

If anyone is interested in a reboot-proof way to apply the workaround (disabling offloading on vxlan.calico), you can use these 2 files:

/etc/systemd/system/disable-offloading-on-vxlan.service

[Unit]
Description=Disable offloading on vxlan.calico network interface
Requires=kubelet.service
Documentation=https://github.com/projectcalico/calico/issues/3145

[Service]
Type=oneshot
ExecStart=/usr/local/bin/disable-offloading-on-vxlan
RemainAfterExit=yes
Restart=no
TimeoutStartSec=660

[Install]
WantedBy=multi-user.target

/usr/local/bin/disable-offloading-on-vxlan

#!/bin/bash

# This script applies a workaround for a bug encountered on Kubernetes with vxlan
# and iptables >= 1.6.2.
# You can find more details on this bug here:
# https://github.com/kubernetes/kubernetes/issues/96868
# https://github.com/projectcalico/calico/issues/3145
#
# The workaround is to disable offloading on the vxlan interface

# Wait until an interface named vxlan.calico appears
# Wait a maximum of 10 minutes (= 60 checks every 10 seconds)

sleep_interval=10
max_retries=60
nb_tries=0
nic_name="vxlan.calico"

is_nic_available() {
  ip a show dev "$nic_name" > /dev/null 2>&1
}

deactivate_offloading() {
  ethtool --offload "$nic_name" rx off tx off
}

check_offloading() {
  # Return an error if at least one offload is enabled (rx or tx)
  if ethtool --show-offload "$nic_name" | grep -E '^.x-checksumming:' | grep -q ': on'; then
    return 1
  else
    return 0
  fi
}

echo "Starting $(basename "$0")"
echo "This will disable RX and TX offloading on network interface $nic_name"

while [[ $nb_tries -lt $max_retries ]]; do
  if is_nic_available; then
    echo "Network interface $nic_name found! Disabling offloading on it..."
    deactivate_offloading
    sleep 2
    if check_offloading; then
      echo "Offloading successfully disabled on interface $nic_name"
      exit 0
    else
      echo "Offloading has not been disabled correctly on interface $nic_name. Please check what happened"
      exit 2
    fi
  fi

  nb_tries=$((nb_tries + 1))

  echo "Network interface $nic_name does not exist yet. Waiting ${sleep_interval}s for it to appear (attempt $nb_tries/$max_retries)"

  sleep "$sleep_interval"
done

# If we are here, then we have timed out
echo "Exiting after $nb_tries attempts to detect interface $nic_name"
exit 1
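
To confirm afterwards that checksum offloading is really off, the check in the script above can be reused on its own. This is a minimal sketch, assuming the usual `ethtool --show-offload` layout where `rx-checksumming` and `tx-checksumming` appear unindented at the start of a line. `checksum_offloads_on` is a hypothetical helper name, and the sample text is made up for illustration:

```shell
#!/bin/bash
# Reads `ethtool --show-offload` output on stdin and prints the name of every
# top-level checksumming feature (rx or tx) that is still "on".
# Prints nothing when both are off.
checksum_offloads_on() {
  awk -F': *' '/^[rt]x-checksumming:/ && $2 ~ /^on/ { print $1 }'
}

# Hypothetical sample output, used here so the sketch runs without ethtool.
sample='Features for vxlan.calico:
rx-checksumming: on
tx-checksumming: off
    tx-checksum-ip-generic: off
scatter-gather: on'

printf '%s\n' "$sample" | checksum_offloads_on

# On a real node:
#   ethtool --show-offload vxlan.calico | checksum_offloads_on
```

With the sample text this prints `rx-checksumming`. An empty result on a real node would mean both offloads are disabled.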

@dcbw

dcbw commented Dec 10, 2020

FWIW, the upstream kernel patch that fixed the issues we had root-caused for OpenShift and some Kubernetes use-cases is torvalds/linux@ea64d8d. It is present in the 5.7 and later kernels, and in RHEL 8.2's kernel-4.18.0-193.13.2.el8_2 and later as of 2020-Jul-21. I presume CentOS 8.2 has this fix already.

Other distros (Ubuntu 20.04.1) may not yet have it, if they haven't updated their kernel or backported the patch.
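
A rough way to check a node against the 5.7 threshold mentioned above is to compare the running kernel release by version order. This is only a sketch: it assumes GNU `sort -V` is available, and `version_at_least` is a hypothetical helper. It cannot detect distro backports such as the RHEL 8.2 kernel named above, so an "older" result only means the distro's changelog needs checking.

```shell
#!/bin/bash
# Returns 0 if kernel release $1 (e.g. "5.10.21-flatcar") is at least version $2.
# Everything from the first "-" onwards in $1 is ignored.
version_at_least() {
  local have="${1%%-*}" want="$2"
  # In version order, the smaller of the two comes first.
  [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)" = "$want" ]
}

if version_at_least "$(uname -r)" 5.7; then
  echo "kernel $(uname -r) is 5.7 or later"
else
  echo "kernel $(uname -r) is older than 5.7; check whether the distro backported the patch"
fi
```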

@fasaxc
Member

fasaxc commented Mar 31, 2021

FTR, Calico does SNAT traffic from hosts to service VIPs that gets load balanced to remote pods when in VXLAN (or IPIP) mode. We do that because the source IP selection algorithm chooses the source IP based on the route that appears to apply to the service VIP. Then, after the source IP selection, the dest IP is changed by kube-proxy's DNAT. The source IP is then "wrong"; it'll be the eth0 IP, when it needs to be the Calico "tunnel IP" to reach a remote pod. Hence, we SNAT to the tunnel IP in that case.
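
The source IP selection described above can be observed by asking the kernel which source address it would pick for a service VIP, before any DNAT applies. A minimal sketch, assuming iproute2's `ip route get` and its usual `src <addr>` field. `route_src` is a hypothetical helper, and the sample line uses made-up addresses apart from the 10.96.0.10 VIP seen earlier in this thread:

```shell
#!/bin/bash
# Extracts the address that follows "src" in `ip route get` output read on stdin.
route_src() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "src") { print $(i + 1); exit } }'
}

# Hypothetical sample line so the sketch runs anywhere.
echo '10.96.0.10 via 192.168.1.1 dev eth0 src 192.168.1.20 uid 0' | route_src

# On a real node:
#   ip route get 10.96.0.10 | route_src
```

With the sample line this prints `192.168.1.20`. On a node it would be expected to show the eth0 address rather than the address on vxlan.calico, which is the mismatch the SNAT corrects.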

@davesargrad

davesargrad commented Apr 8, 2021

I am running a Kubernetes 1.20 cluster with Calico. I configured this 10-node cluster on bare metal about 2 weeks ago.

I have a simple application tier with a jellyfin streaming service and a restful application service (express+react). I see flaky communications between my react application and the jellyfin HLS service. These applications work and interact just fine outside of this kubernetes+calico environment.

Though it's been a year since I had to worry about such things (since prior workarounds served me), the flaky behaviour reminds me of what I saw that led to the recognition of this 63 second delay.

These are the interfaces I see on one of my worker nodes:
[image: screenshot of the worker node's network interfaces]

I've been tracking this issue for a year now, since I was the author of a related k8s issue created when I spent over a week isolating such a network unreliability problem.

I am using CentOS 7 with the latest Calico and the latest Kubernetes. I've seen some discussion (both in the k8s issue and in the Calico issue) about disabling TX offloading. I've seen the k8s issue closed (despite the fact that I very much see this as a k8s problem) and I've seen 3145 stay open (glad it's still open, if the issue really still lurks and bites people).

  • Can someone please articulate the progress towards closing this issue properly?
  • Can someone please provide clear guidance on how to work around this issue on both CentOS 7 clusters and Ubuntu clusters (for the benefit of those folks)? How should we disable TX offloading? Which nodes should we update? Which interfaces need to have TX offloading disabled? How do we verify that it is disabled?
  • Is this issue really fixed in k8s 1.18 (as indicated above), or is it still something that I must be concerned with?
  • If I reconfigure my cluster, what operating system should I use? What version of calico should I use? What version of the kernel should I use? What version of kubernetes should I use? ... to avoid this nasty 63 second delay problem?

These are the kinds of painful issues that drive people away from powerful infrastructure elements like kubernetes and calico. I really, really, hope to see some well articulated guidance.

@fasaxc
Member

fasaxc commented Apr 8, 2021

  • The easiest workaround is probably to switch to IPIP mode. I believe this issue is VXLAN-specific.
  • Failing that, you can upgrade to RHEL 8.2 or the equivalent CentOS release.
  • Failing that, upgrade to latest k8s and Calico and then set the env var FELIX_FEATUREDETECTOVERRIDE="MASQFullyRandom=false" in order to disable our use of --random-fully.
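
For the third option, one way to set the variable is on the calico-node DaemonSet. The sketch below only prints the command for review. It assumes a manifest-based install where the DaemonSet is named `calico-node` in the `kube-system` namespace; both names are assumptions, and an operator-managed install may revert a manual change like this. `set_override_cmd` is a hypothetical helper.

```shell
#!/bin/bash
# Assumed names for a manifest-based Calico install; adjust for your cluster.
namespace="kube-system"
daemonset="calico-node"
# Value taken from the comment above.
override="MASQFullyRandom=false"

# Prints the kubectl command that would set the Felix env var on the DaemonSet.
set_override_cmd() {
  printf 'kubectl -n %s set env daemonset/%s FELIX_FEATUREDETECTOVERRIDE=%s\n' \
    "$namespace" "$daemonset" "$override"
}

# Dry run: show the command. Pipe the output into `sh` to actually apply it.
set_override_cmd
```

After the calico-node pods restart, `iptables -t nat -S | grep random-fully` on a node should no longer list Calico's MASQUERADE rules. kube-proxy's own rules may still use the flag.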

@caseydavenport
Member

We can continue discussing, but I'm going to close this for now since the kernel has been patched, and Calico now provides a way to turn off the feature that triggers it.

@owenthomas17

owenthomas17 commented Aug 3, 2021

I'm updating this thread based on a recent Slack conversation about a very similar problem that has a valid workaround. After converting the overlay to VXLAN we saw traffic fail, specifically UDP traffic going from a Kubernetes node to a Kubernetes service clusterIP; TCP seemed fine. @fasaxc recommended that we try Calico 3.20.0 and use a new feature detection override that's available, and I can confirm that this traffic flow now works.

Details of the setup are

Kubernetes: 1.20.8
OS: Flatcar 2765.2.1
Kernel: 5.10.21
Calico: v3.20.0
Overlay Type: VXLAN

Example configuration used to apply the ChecksumOffloadBroken=true config:

$ kubectl get felixconfiguration default -o yaml
apiVersion: crd.projectcalico.org/v1
kind: FelixConfiguration
metadata:
  name: default
spec:
  bpfLogLevel: ""
  featureDetectOverride: ChecksumOffloadBroken=true
  logSeverityScreen: Info
  reportingInterval: 0s
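
The output above shows the resulting resource rather than how the field was set. One plausible way to set it is a merge patch on the same `felixconfiguration default` resource. This is a sketch that only prints the command; `felix_override_patch` is a hypothetical helper, and `calicoctl patch` would be an alternative on clusters that do not use the Kubernetes datastore.

```shell
#!/bin/bash
# Builds a JSON merge patch that sets spec.featureDetectOverride to $1.
felix_override_patch() {
  printf '{"spec":{"featureDetectOverride":"%s"}}' "$1"
}

patch="$(felix_override_patch 'ChecksumOffloadBroken=true')"

# Dry run: print the command instead of running it against a cluster.
printf "kubectl patch felixconfiguration default --type merge -p '%s'\n" "$patch"
```

Running the printed command and then repeating the `kubectl get` above should show the `featureDetectOverride` line in the spec.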
