-
This sounds like a cloud-init template issue. This playbook doesn't do anything to the machine's networking. I've seen similar things when machines are cloned but not cleaned.
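In case it helps rule that out, a common cleanup (purely a sketch, assuming cloud-init is in use on the template) is to run sudo cloud-init clean --logs before turning the clone into a template, or to stop cloud-init from rendering network config at all:

```yaml
# /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg
# Tells cloud-init not to generate any network configuration,
# so the netplan files you manage yourself are left untouched.
network:
  config: disabled
```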
-
So far I have had success adding the following to my netplan configs. It seems weird that, on a physical install where I set a static IP, it still uses DHCP. Note this issue happened when I manually ran through the Ubuntu 22.04 live install media and selected manual IP settings.
network:
  version: 2
  renderer: networkd
  ethernets:
    eth0:
      dhcp4: no
      addresses:
        - 10.0.99.4/24
      match:
        macaddress: 26:e6:21:2a:6a:80
      nameservers:
        addresses:
          - 10.0.44.50
          - 10.0.44.52
        search:
          - turtleware.au
      routes:
        - to: default
          via: 10.0.99.1
      set-name: eth0
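Assuming this lives in a file under /etc/netplan/ (the exact filename varies by install), the change can be applied with sudo netplan apply, or tested first with sudo netplan try.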
-
I have 2 clusters:
- Production: 6 nodes (3 etcd, 3 worker)
- Development: 3 nodes (3 etcd)
Production is cloud-init-backed Ubuntu 22.04 machines with static IPs set in cloud-init. When building the cluster the first time, it comes online and runs without issues. After a power outage or reboot, all nodes come back online but the cluster is not available.
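For reference, the static IPs end up expressed as a cloud-init network-config (v2) document; a minimal sketch of that shape, with placeholder address and interface values rather than my real ones, looks like this:

```yaml
# Sketch of a cloud-init network-config (v2) with placeholder values;
# the real one is generated from the Terraform/Proxmox clone settings.
version: 2
ethernets:
  eth0:
    dhcp4: false
    addresses:
      - 10.0.99.11/24
    nameservers:
      addresses: [10.0.44.50, 10.0.44.52]
    routes:
      - to: default
        via: 10.0.99.1
```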
On reboot of Development my nodes get these IPs:
1 - 10.0.99.104/24
2 - 10.0.99.105/24 + 10.0.99.104/32
3 - 10.0.99.106/24
Example: running ip a on the node which kills the cluster.
Expected Behavior
Cluster starts up without issues
Current Behavior
The cluster is unable to start; a restart is required on every node other than the first to trigger the IP move.
Steps to Reproduce
Context (variables)
Operating system: Ubuntu 22.04 | Debian 12
Hardware: Lenovo Tiny m900 (Production) | Lenovo Tiny M703 (Ubuntu 22.04 Server)
Variables Used
all.yml
Hosts
host.ini
Possible Solution
I am wondering if this is a cloud-init | Proxmox | k3s issue. I am only seeing this issue on nodes 1 and 2 of my clusters. It started happening about 2 months ago when I was using Debian 12; I saw there was a cloud-init bug about IPs, switched to Ubuntu 22.04, and am seeing the same issues.
Terraform VM clone config
Extra Testing 6 March 2024
I have set DHCP leasing to start at 10.0.99.50 and end at 10.0.99.90; however, that still does not solve the issue.
I have had some success with the following: adding the following line to the cloud-init config on my secondary node moved the IP address issue to the third node. I don't see this as being a viable option, but maybe it will help.