Flatcar 3975.2.1 Bonding Config Bug #1580

Open
DFYT42 opened this issue Nov 13, 2024 · 7 comments
Labels
kind/bug Something isn't working


DFYT42 commented Nov 13, 2024

Description

We've encountered a problem with bonding configs after our most recent Flatcar upgrade from v3760.2.0 to v3975.2.1. The behavior is odd in that the bond0 interface actor churn does not always begin after the initial upgrade reboot. Instead, it most frequently appears after a subsequent reboot.

  • We can commonly recover from this by rebooting, but that does not always fix it
  • We have tried bringing the affected bond0 interface down and back up (see the sketch below), but that doesn't seem to have any effect
  • We tried upgrading to the next known stable, 3975.2.2, but we see the same problem
  • We tried downgrading to v3760.2.0 and that worked: the interface no longer enters churn
  • We then tried upgrading back to 3975.2.1, rebooting after the upgrade reboot, and churn reappeared
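
For reference, a typical way to bounce a networkd-managed bond from the shell (a sketch; the exact commands we used aren't recorded here, and networkctl up/down needs a reasonably recent systemd):

# Administratively bring the bond down and back up via networkd.
sudo networkctl down bond0
sudo networkctl up bond0
# Re-check the LACP state afterwards.
cat /proc/net/bonding/bond0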

Impact

Nodes rebooted after the initial upgrade reboot go into churn on the secondary bond0 slave interface and are subsequently unable to communicate with other nodes in the cluster.

Environment and steps to reproduce

  1. Set-up: Bare-metal Flatcar OS 3760.2.0, upgraded via Nebraska to Flatcar OS 3975.2.1
  2. Task: After the node is upgraded and rebooted, it is rebooted a second time; churn appears, causing lag during node login and while running commands
  3. Action(s):
    a. Rebooted the node after the initial upgrade reboot
    b. Node login and commands begin to hang, taking seconds to minutes to complete
    c. /proc/net/bonding/bond0 shows churn on the secondary interface, with no system MAC address present (see the sketch after this list)
  4. Error: Other nodes were unable to communicate with the affected node
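
A quick way to check for both symptoms from a shell (a sketch; in 802.3ad mode the kernel exposes these fields in the bonding proc file):

# Show per-slave LACP churn state and the advertised system MAC addresses.
grep -iE 'slave interface|churn|system mac' /proc/net/bonding/bond0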

Expected behavior

Nodes are expected to communicate with the other nodes in the cluster.


DFYT42 commented Nov 14, 2024

We were asked for the following:

  • underlying hardware (network devices / virt environments etc.)
    • Bare metal
    • Problematic Flatcar OS 3975.2.1 & 3975.2.2
    • Working Flatcar OS 3760.2.0
    • systemd 252
    • Kubernetes versions 1.28.x to 1.31.x
    • Server: Dell R6515
    • Switch: Juniper EX4300-48T
  • dmesg
[   18.646307] ice 0000:41:00.1 enp65s0f1np1: NIC Link is up 25 Gbps Full Duplex, Requested FEC: RS-FEC, Negotiated FEC: FC-FEC/BASE-R, Autoneg Advertised: On, Autoneg Negotiated: False, Flow Control: None
[   18.666616] bond0: (slave enp65s0f1np1): Enslaving as a backup interface with an up link
[   18.675470] bond0: Warning: No 802.3ad response from the link partner for any adapters in the bond
[   18.686452] ice 0000:41:00.1 enp65s0f1np1: Error ADDING CP rule for fail-over
[   18.693782] ice 0000:41:00.1 enp65s0f1np1: Shared SR-IOV resources in bond are active
[   18.702648] ice 0000:41:00.0: Primary interface not in switchdev mode - VF LAG disabled

We were asked to try the following but are still seeing issues:

  • Create /etc/systemd/network/98-bond-mac.link with the following contents:
[Match]
Type=bond

[Link]
MACAddressPolicy=none
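
To confirm the new link file actually matches bond0 at boot, udev records which .link file was applied to each device (a sketch):

# ID_NET_LINK_FILE is set by udev's net_setup_link builtin.
udevadm info /sys/class/net/bond0 | grep ID_NET_LINK_FILE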

jepio commented Nov 15, 2024

Can you try the alpha releases between 3760 and 3975? This would help narrow it down (one way to step through them is sketched after the list):

  • 3794
  • 3815
  • 3850 -> first with kernel 6.6
  • 3874
  • 3913 -> first with systemd 255
  • 3941
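
A sketch of stepping through these on a test node, assuming the flatcar-update helper that ships with Flatcar (flags may vary by release; see flatcar-update --help):

# Move the node to a specific release, then reboot and inspect the bond.
sudo flatcar-update --to-version 3850.0.0
sudo systemctl reboot
# After boot: cat /proc/net/bonding/bond0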

DFYT42 commented Nov 16, 2024

@jepio

Please see the upgrade process and results below:

  • 3975.2.1
    • Reboot X 3
    • Have churn on all 3 reboots
      • No system mac address on 2nd Bond0 interface during all 3 reboots
    • Downgraded to 3760.2.0
  • 3760.2.0
    • Reboot X 3
    • No churn on all three reboots
      • Has system mac address on 2nd Bond0 interface during all 3 reboots
    • Upgraded to 3794.0.0
  • 3794.0.0
    • Reboot X 3
    • No churn on all three reboots
      • Has system mac address on 2nd Bond0 interface during all 3 reboots
    • Upgraded to 3815.0.0
  • 3815.0.0
    • Reboot X 3
    • No churn on all three reboots
      • Has system mac address on 2nd Bond0 interface during all 3 reboots
    • Upgraded to 3850.0.0

Please note the following is the first appearance of churn:

  • 3850.0.0
    • Reboot X 3
    • Have churn on 1st reboot
      • No system mac address on 2nd Bond0 interface during 1st reboot
    • No churn on 2nd reboot
      • Has system mac address on 2nd Bond0 interface during 2nd reboot
    • Have churn on 3rd reboot
      • No system mac address on 2nd Bond0 interface during 3rd reboot
    • Upgraded to 3874.0.0
  • 3874.0.0
    • Reboot X 3
    • Have churn on 1st reboot
      • No system mac address on 2nd Bond0 interface during 1st reboot
    • Have churn on 2nd reboot
      • No system mac address on 2nd Bond0 interface during 2nd reboot
    • No churn on 3rd reboot
      • Has system mac address on 2nd Bond0 interface during 3rd reboot
    • Upgraded to 3913.0.0

Please note the following shows no churn:

  • 3913.0.0
    • Reboot X 3
    • No churn on all three reboots
      • Has system mac address on 2nd Bond0 interface during all 3 reboots
    • Upgraded to 3941.0.0

Please note churn returns in the following:

  • 3941.0.0
    • Reboot X 3
    • systemd 255
    • No churn on first 2 reboots
      • Has system mac address on 2nd Bond0 interface during first 2 reboots
    • Have churn on 3rd reboot
      • No system mac address on 2nd Bond0 interface during 3rd reboot
    • Upgraded to 3975.0.0
  • 3975.0.0
    • Reboot X 3
    • systemd 255
    • No churn on first 2 reboots
      • Has system mac address on 2nd Bond0 interface during first 2 reboots
    • Have churn on 3rd reboot
      • No system mac address on 2nd Bond0 interface during 3rd reboot
    • Upgraded to 3975.2.1
  • 3975.2.1
    • Reboot X 3
    • systemd 255
    • Have churn on 1st reboot
      • No system mac address on 2nd Bond0 interface during 1st reboot
    • No churn on 2nd reboot
      • Has system mac address on 2nd Bond0 interface during 2nd reboot
    • Have churn on 3rd reboot
      • No system mac address on 2nd Bond0 interface during 3rd reboot

ader1990 commented
Hello, this looks to be a concurrency issue between the unit that enforces/creates the bond and the unit that enforces /etc/systemd/network/98-bond-mac.link. Can you give more details, if possible, on how the bonds are configured? Is it a Butane/Ignition config or another configuration file/agent? This would be valuable for reproducing the issue locally.

Also, would it be possible to try a version of Flatcar with a different kernel/systemd to see whether the issue still happens? You can find a Flatcar image artifact with kernel 6.11 at https://github.com/flatcar/scripts/actions/runs/11594744048 and one with systemd 256 at https://github.com/flatcar/scripts/actions/runs/11557455799. My bet would be on a different systemd version.
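
One way to fetch those CI image artifacts is with the GitHub CLI (a sketch; it assumes gh is installed and authenticated, and that the runs still retain their artifacts):

# Download all artifacts of a flatcar/scripts workflow run by its run ID.
gh run download 11594744048 -R flatcar/scripts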

Thanks.

jepio commented Nov 18, 2024

Can we also compare networkctl list and networkctl status between a working and a broken version?

DFYT42 commented Dec 3, 2024

@jepio

Please see the information you requested below:

3975.2.1

  • networkctl list
IDX LINK         TYPE     OPERATIONAL SETUP
  1 lo           loopback carrier     unmanaged
  2 enp65s0f0np0 ether    enslaved    configured
  3 enp65s0f1np1 ether    enslaved    configured
  4 bond0        bond     routable    configured
  • networkctl status
Interfaces: 1, 2, 3, 4
       State: routable
Online state: online
     Address: x.x.x.x on bond0
              x.x.x.x on bond0
              x:x:x:x on bond0
              x:x:x:x on bond0
     Gateway: x.x.x.x on bond0
              x:x:x:x on bond0
         DNS: x.x.x.x
              x.x.x.x

Dec 03 20:23:17 systemd[1]: Starting systemd-networkd-wait-online.service - Wait for Network to be Configured...
Dec 03 20:23:17  systemd-networkd[1990]: bond0: Configuring with /etc/systemd/network/20-bond0.network.
Dec 03 20:23:17  systemd-networkd[1990]: enp65s0f0np0: Link UP
Dec 03 20:23:17  systemd-networkd[1990]: enp65s0f1np1: Link UP
Dec 03 20:23:17  systemd-networkd[1990]: bond0: Link UP
Dec 03 20:23:17  systemd-networkd[1990]: enp65s0f0np0: Gained carrier
Dec 03 20:23:17  systemd-networkd[1990]: enp65s0f1np1: Gained carrier
Dec 03 20:23:17  systemd-networkd[1990]: bond0: Gained carrier
Dec 03 20:23:17  systemd[1]: Finished systemd-networkd-wait-online.service - Wait for Network to be Configured.
Dec 03 20:23:19  systemd-networkd[1990]: bond0: Gained IPv6LL

3760.2.0

  • networkctl list
IDX LINK   TYPE     OPERATIONAL SETUP
  1 lo     loopback carrier     unmanaged
  2 ens3f0 ether    enslaved    configured
  3 ens3f1 ether    enslaved    configured
  4 bond0  bond     routable    configured
  • networkctl status
State: routable
  Online state: online
       Address: x.x.x.x on bond0
                x.x.x.x on bond0
                x:x:x:x on bond0
                x:x:x:x on bond0
       Gateway: x.x.x.x on bond0
                x:x:x:x on bond0
           DNS: x.x.x.x
                x.x.x.x

Dec 03 20:50:49  systemd[1]: Starting systemd-networkd-wait-online.service - Wait for Network to be Configured...
Dec 03 20:50:49  systemd-networkd[1714]: ens3f0: Configuring with /etc/systemd/network/00-nic.network.
Dec 03 20:50:49  systemd-networkd[1714]: bond0: Link UP
Dec 03 20:50:49  systemd-networkd[1714]: ens3f1: Link UP
Dec 03 20:50:49  systemd-networkd[1714]: ens3f0: Link UP
Dec 03 20:50:49  systemd-networkd[1714]: ens3f1: Gained carrier
Dec 03 20:50:49  systemd-networkd[1714]: bond0: Gained carrier
Dec 03 20:50:49  systemd-networkd[1714]: ens3f0: Gained carrier
Dec 03 20:50:49  systemd[1]: Finished systemd-networkd-wait-online.service - Wait for Network to be Configured.
Dec 03 20:50:51  systemd-networkd[1714]: bond0: Gained IPv6LL

DFYT42 commented Dec 9, 2024

@ader1990

First Question

Hello, this looks to be a concurrency issue between the unit that enforces/creates the bond and the unit that enforces /etc/systemd/network/98-bond-mac.link. Can you give more details, if possible, on how the bonds are configured? Is it a Butane/Ignition config or another configuration file/agent? This would be valuable for reproducing the issue locally.
  • We use an Ignition file. Here is the networkd part of that file:
networkd:
  units:
    - name: 00-nic.network
      contents: |
        [Match]
        Name=!bond0
        MACAddress={{.mac1}} {{.mac2}} {{.mac_add}}

        [Network]
        Bond=bond0
    - name: 10-bond0.netdev
      contents: |
        [NetDev]
        Name=bond0
        Kind=bond
        MACAddress={{.mac1}}

        [Bond]
        TransmitHashPolicy=layer3+4
        MIIMonitorSec=.1
        UpDelaySec=.2
        DownDelaySec=.2
        Mode=802.3ad
        LACPTransmitRate=fast
    - name: 20-bond0.network
      contents: |
        [Match]
        Name=bond0

        [Network]
        DNS={{ .dns1 }}
        DNS={{ .dns2 }}

        [Address]
        Address={{.public_ip4}}

        [Address]
        Address={{.public_ip6}}

        [Address]
        Address={{.private_ip4}}

        [Route]
        Destination=x.x.x.x/x
        Gateway={{.public_gw4}}

        [Route]
        Destination=x:x:x/x
        Gateway={{.public_gw6}}

        [Route]
        Destination=x.x.x.x/x
        Gateway={{.private_gw4}}
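
For reference, the 98-bond-mac.link suggested above could be carried as one more unit in the same networkd section (a sketch in the same format; it assumes .link units are accepted here, otherwise the file can be written via the storage/files section):

    - name: 98-bond-mac.link
      contents: |
        [Match]
        Type=bond

        [Link]
        # Keep udev from rewriting the bond's MAC at link setup.
        MACAddressPolicy=none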

Second Question

Also, would it be possible to try a version of Flatcar with a different kernel/systemd to see whether the issue still happens? You can find a Flatcar image artifact with kernel 6.11 at https://github.com/flatcar/scripts/actions/runs/11594744048 and one with systemd 256 at https://github.com/flatcar/scripts/actions/runs/11557455799. My bet would be on a different systemd version.
  • I couldn't find a link to a release specifically
  • What do I need to do?
