Container control plane #1318

Draft · wants to merge 69 commits into base: main

Conversation

@aerosouund (Member) commented Nov 6, 2024

What this PR does / why we need it:

Runs the control plane of the kubevirtCI cluster in containers to reduce resource consumption; this is expected to save one runner per CI lane run across all repos.
It is based on #1230

In a typical kubevirtCI cluster, the control plane node is unschedulable. As seen in this snippet, only the system pods are on it:

[vagrant@node01 ~]$ sudo kubectl --kubeconfig=/etc/kubernetes/admin.conf get pods -A -o wide | grep 101
kube-system   calico-node-xwvhg                          1/1     Running   0          3m47s   192.168.66.101   node01   <none>           <none>
kube-system   etcd-node01                                1/1     Running   1          4m1s    192.168.66.101   node01   <none>           <none>
kube-system   kube-apiserver-node01                      1/1     Running   1          4m1s    192.168.66.101   node01   <none>           <none>
kube-system   kube-controller-manager-node01             1/1     Running   1          4m1s    192.168.66.101   node01   <none>           <none>
kube-system   kube-proxy-qcwmn                           1/1     Running   0          3m47s   192.168.66.101   node01   <none>           <none>
kube-system   kube-scheduler-node01                      1/1     Running   1          4m2s    192.168.66.101   node01   <none>           <none>

Design

There is no standardized tool or technology that achieves what this PR tries to achieve, at least not in a way that matches the needs of kubevirtCI. One of the big requirements is that the joining process for workers (through kubeadm) remains the same whether they are joining a VM control plane or a container control plane, so the PR provides its own way of provisioning certificates, running DNS in the cluster, and several other rudimentary Kubernetes building blocks.

The code for the control plane container lives in cluster-provision/gocli/control-plane.
The creation of the control plane happens in a way similar to how kubeadm provisions a cluster, through the concept of phases. Below is a snippet from its main function to illustrate how some of the phases are called:

	if err := NewCertsPhase(defaultPkiPath).Run(); err != nil {
		return nil, err
	}

	if err := NewRunETCDPhase(cp.dnsmasqID, cp.containerRuntime, defaultPkiPath).Run(); err != nil {
		return nil, err
	}

	if err := NewKubeConfigPhase(defaultPkiPath).Run(); err != nil {
		return nil, err
	}

	if err := NewRunControlPlaneComponentsPhase(cp.dnsmasqID, cp.containerRuntime, defaultPkiPath, cp.k8sVersion).Run(); err != nil {
		return nil, err
	}

The phases it runs are listed below, followed by a sketch of the shape they share:

  • Certs: provisions the cluster certificate authority and the certificates of the individual components, signed by this CA
  • ETCD: runs etcd
  • Kubeconfig: creates the admin, controller-manager and scheduler kubeconfig files
  • Bootstrappers RBAC: creates a bootstrap token secret and the RBAC roles the kubelet needs to register itself with the API
  • Bootstrap auth resources: creates important resources that kubeadm expects to find in the cluster when joining
  • Kube Proxy: deploys kube-proxy
  • CNI: creates the Calico CNI resources that would previously have been created by node01
  • CoreDNS: deploys CoreDNS
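A minimal sketch of the shape these phases share, as implied by the NewXPhase(...).Run() calls above; the names and fields here are illustrative assumptions, the real implementations live in cluster-provision/gocli/control-plane:

	package controlplane

	// phase is the contract each control plane step is assumed to satisfy.
	type phase interface {
		Run() error
	}

	// CertsPhase provisions the cluster CA and the per-component certificates
	// under pkiPath (fields are illustrative).
	type CertsPhase struct {
		pkiPath string
	}

	func NewCertsPhase(pkiPath string) *CertsPhase {
		return &CertsPhase{pkiPath: pkiPath}
	}

	func (p *CertsPhase) Run() error {
		// generate the CA, then sign the individual component certificates into p.pkiPath
		return nil
	}

	// runPhases executes phases in order and stops at the first error,
	// mirroring the early-return pattern in the snippet above.
	func runPhases(phases ...phase) error {
		for _, p := range phases {
			if err := p.Run(); err != nil {
				return err
			}
		}
		return nil
	}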

The control plane runner then gets instantiated in the KV provider to start it:

	if kp.Nodes > 1 {
		runner := controlplane.NewControlPlaneRunner(dnsmasq, strings.Split(kp.Version, "-")[1], uint(kp.APIServerPort))
		c, err = runner.Start()
		if err != nil {
			return err
		}
		k8sClient, err := k8s.NewDynamicClient(c)
		if err != nil {
			return err
		}
		kp.Client = k8sClient
	}

and the node01 provisioner will now only be called if the node count is 1:

	if nodeIdx == 1 && kp.Nodes == 1 {
		n := node01.NewNode01Provisioner(sshClient, kp.SingleStack, kp.NoEtcdFsync)

Changes to the networking setup

Only one change is required in dnsmasq.sh:

  if [ ${NUM_NODES} -gt 1 ] && [ $i -eq 1 ]; then
    ip tuntap add dev tap101 mode tap user $(whoami)
    ip link set tap101 master br0
    ip link set dev tap101 up
    ip addr add 192.168.66.110/24 dev tap101
    ip -6 addr add fd00::110 dev tap101
    iptables -t nat -A PREROUTING -p tcp -i eth0 -m tcp --dport 6443 -j DNAT --to-destination 192.168.66.110:6443
  fi

If the node count is higher than 1, create an interface called tap101, manually assign it the required IPs, and forward port 6443 arriving on eth0 to it. Since the API server container is launched in the same netns as dnsmasq, all that's needed afterwards is for the server to advertise 192.168.66.110 as the API server endpoint, and this is taken care of in the code.
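As an illustration only (the exact flag wiring lives in the control plane code and is not shown here), the key point is that the kube-apiserver container is started with the tap101 address as its advertise address, roughly along these lines:

	package main

	import "fmt"

	func main() {
		// Illustrative sketch, not the PR's actual code: the API server container
		// shares dnsmasq's netns, so it only needs to advertise the tap101 IP.
		const advertiseIP = "192.168.66.110" // assigned to tap101 in dnsmasq.sh

		args := []string{
			"kube-apiserver",
			"--advertise-address=" + advertiseIP, // the endpoint workers reach when joining
			"--secure-port=6443",                 // matches the DNAT rule on eth0
		}
		fmt.Println(args)
	}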

Current state

The PR is being tested locally to verify that all previously existing functionality is retained, and the code is being cleaned up for the final presentation. Given its size, it is opened as a draft so the community can see it and give feedback on any points, to help drive the direction of the project.

Extra notes
cc: @dhiller @brianmcarey @acardace @xpivarc

Checklist

This checklist is not enforcing, but it's a reminder of items that could be relevant to every PR.
Approvers are expected to review this list.

Release note:


Two new opts represent the two scripts used in the provision phase (provision linux and provision k8s).
go:embed is used to include any necessary config files, and the commands are then run on a node using libssh.

Signed-off-by: aerosouund <[email protected]>
The KubevirtProvider is a struct representing an arbitrary running Kubevirtci cluster. It holds all config flags and options used by the run and provision commands.
A KubevirtProvider can be created in two ways: through the normal constructor, which uses the option pattern to avoid a bloated function signature and creates a cluster via the Start method, or from an already running cluster. To make the latter possible, a JSON representation of the struct is persisted on the dnsmasq container and later read to recover the deployed settings.

The logic that was previously in run.go has been split into several methods to improve readability and testing (runNFSGanesha, runRegistry, prepareQemuCmd, prepareDeviceMappings), and the dnsmasq creation logic has been moved to its own method instead of living in a package of its own.

Free-floating functions such as waitForVMToBeUp, nodeNameFromIndex, nodeContainer, etc. were grouped as methods of the struct.

Signed-off-by: aerosouund <[email protected]>
To avoid having to read each flag and return an error if it is unset, leverage the FlagMap, a map of flag name to FlagConfig. A FlagConfig holds the type of the flag (string, int, uint16, bool or string array) and the option function that sets the flag's value on the KubevirtProvider struct. While the flags are parsed, this map is iterated over and each option is appended to an array that is later passed to the KubevirtProvider constructor.

The run method's role is now to parse the flags, pass them to the provider and just call Start. All the floating methods in run.go have been removed after being moved to the provider.

Signed-off-by: aerosouund <[email protected]>
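As an illustration of the FlagMap idea described in the commit message above, a minimal sketch; the type names and fields are assumptions, not the PR's actual definitions:

	package cmd

	// KubevirtProvider stands in for the provider struct that the options mutate.
	type KubevirtProvider struct {
		Nodes  uint
		Memory string
	}

	// KubevirtProviderOption is the option function stored alongside each flag.
	type KubevirtProviderOption func(*KubevirtProvider)

	// FlagConfig couples a flag's type with the option that applies its value.
	type FlagConfig struct {
		FlagType string
		Option   func(value interface{}) KubevirtProviderOption
	}

	// flagMap maps flag names to their FlagConfig.
	var flagMap = map[string]FlagConfig{
		"nodes": {
			FlagType: "uint",
			Option: func(v interface{}) KubevirtProviderOption {
				return func(p *KubevirtProvider) { p.Nodes = v.(uint) }
			},
		},
		"memory": {
			FlagType: "string",
			Option: func(v interface{}) KubevirtProviderOption {
				return func(p *KubevirtProvider) { p.Memory = v.(string) }
			},
		},
	}

	// NewKubevirtProvider applies the collected options to a fresh provider.
	func NewKubevirtProvider(opts ...KubevirtProviderOption) *KubevirtProvider {
		p := &KubevirtProvider{}
		for _, opt := range opts {
			opt(p)
		}
		return p
	}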
This functionality now exists in the KubevirtProvider type and doesn't need a package of its own

Signed-off-by: aerosouund <[email protected]>
The KubevirtProvider type is what provides the methods that run a node or run the k8s options.
Testing logic has been moved to a Base Provider Suite

Signed-off-by: aerosouund <[email protected]>
…uint16

All references to ports in the codebase use uint, not uint16.
There is no reason to keep the ports as they are.

Signed-off-by: aerosouund <[email protected]>
@kubevirt-bot (Contributor) commented:

Thanks for your pull request. Before we can look at it, you'll need to add a 'DCO signoff' to your commits.

📝 Please follow instructions in the contributing guide to update your commits with the DCO

Full details of the Developer Certificate of Origin can be found at developercertificate.org.

The list of commits missing DCO signoff:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@brianmcarey (Member) left a comment

Hi @aerosouund - thanks for all the effort on this.

I have a couple of concerns about this.

Overall, I am not convinced of the resources that this will save as we will just be running the kubernetes control plane components somewhere else.

We will still have a requirement to have a control plane node for our testing in kubevirt/kubevirt as a lot of the KubeVirt infra components require a control plane node to be scheduled - kubevirt/kubevirt#11659

These KubeVirt infra components can be sensitive to things like selinux policies and kernel modules so running in a controlled VM adds some benefits here.

We could make this container control plane configurable, but I don't see it running in the main CI workloads, and I am not sure how much it will be used in local dev environments: if someone wants to test against a lightweight cluster setup, we have the kind cluster providers, which provide a very lightweight cluster.

Let me know what you think. These are just a couple of the issues that came to my mind on this.

@aerosouund (Member, Author) replied:

@brianmcarey

Valid concerns for sure, let me discuss them

Overall, I am not convinced of the resources that this will save as we will just be running the kubernetes control plane components somewhere else.

True that the control plane will run somewhere else, but you would not be reserving a big amount of compute for it. Currently, whatever amount you reserve for VMs in the cluster is what gets allocated to the control plane, which can be rather big sometimes.

In general, the majority of the resource saving comes from being able to schedule workload pods (Istio, CDI, Multus, etc.) on any node in the cluster, meaning you have unlocked the total sum of resources used by those containers to be allocatable on node01, which was previously the control plane.

We will still have a requirement to have a control plane node for our testing in kubevirt/kubevirt as a lot of the KubeVirt infra components require a control plane node to be scheduled - kubevirt/kubevirt#11659

Yes, I ran into this issue while testing this PR. For now, the hacky fix I am using is labeling a random node as the control plane even though it isn't, and that seems to be enough to make it work.
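For reference, a rough sketch of that workaround using client-go; this is illustrative only and not necessarily how the PR applies the label:

	package controlplane

	import (
		"context"

		metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
		"k8s.io/apimachinery/pkg/types"
		"k8s.io/client-go/kubernetes"
	)

	// labelAsControlPlane adds the control-plane role label to a worker node so
	// that components selecting on that label can still be scheduled.
	func labelAsControlPlane(ctx context.Context, client kubernetes.Interface, nodeName string) error {
		patch := []byte(`{"metadata":{"labels":{"node-role.kubernetes.io/control-plane":""}}}`)
		_, err := client.CoreV1().Nodes().Patch(ctx, nodeName, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
		return err
	}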

But if we check the reasons mentioned in the PR, they say it's because:

  • if you take over virt-controller, you can create a pod in every namespace with an image of your choosing, mounting any secret you like and dumping it
  • if you take over virt-operator, you can directly create privileged pods
  • if you take over virt-api, you can inject your own malicious kernel or image via modifying webhooks

I am still reading their full rationale behind this, but based on what they say, these risks will be present wherever the components are scheduled. Also, to my knowledge (and correct me if I am wrong), we aren't providing any additional security hardening on the control plane node, so the problems are present in all cases.

An opinionated take I have on this is that KubeVirtCI is an ephemeral cluster creator and not meant for any long-lived clusters. Also, the clusters it creates run in isolated environments (at least in CI that's how it is). So in terms of priorities, CI efficiency and resources beat security.

These KubeVirt infra components can be sensitive to things like selinux policies and kernel modules so running in a controlled VM adds some benefits here.

If I understand correctly, you are saying KV infra components are best suited to run in a VM. Well, under this PR they still do.
What has changed is that the control plane components run elsewhere and all nodes are labeled as workers (which frees up node01).

We could make this container control plane configurable, but I don't see it running in the main CI workloads, and I am not sure how much it will be used in local dev environments: if someone wants to test against a lightweight cluster setup, we have the kind cluster providers, which provide a very lightweight cluster.

It has challenges for sure, but I believe it is a very strong candidate to run in CI, and no challenge (so far) seems so glaring as to imply it's impossible to run it in CI. The latest of these was validation webhooks timing out because the API server is not in the pod network; this was overcome by using Konnectivity.

Happy to discuss this further. Let me know if you have any opinions on what I said or if anything needs correction.

@brianmcarey (Member) replied:

True that the control plane will run somewhere else, but you would not be reserving a big amount of compute for it. Currently, whatever amount you reserve for VMs in the cluster is what gets allocated to the control plane, which can be rather big sometimes.

In general, the majority of the resource saving comes from being able to schedule workload pods (Istio, CDI, Multus, etc.) on any node in the cluster, meaning you have unlocked the total sum of resources used by those containers to be allocatable on node01, which was previously the control plane.

We do use the control plane node (node01) for scheduling test workloads so I am not sure if the resources are wasted. node01 is schedulable. For instance here you can see a test VM on node01 - https://storage.googleapis.com/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/13208/pull-kubevirt-e2e-k8s-1.31-sig-compute/1859000686096683008/artifacts/k8s-reporter/3/1_overview.log

What would the benefit of this approach be over just running a single-node kubevirtci cluster? What kind of resource savings are we seeing by moving the Kubernetes control components out of the VM? I don't think those components are that heavy on resources, but maybe I am wrong.

@aerosouund (Member, Author) commented Nov 22, 2024

@brianmcarey

We do use the control plane node (node01) for scheduling test workloads so I am not sure if the resources are wasted. node01 is schedulable. For instance here you can see a test VM on node01 - https://storage.googleapis.com/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/13208/pull-kubevirt-e2e-k8s-1.31-sig-compute/1859000686096683008/artifacts/k8s-reporter/3/1_overview.log

Based on this, it seems that some components are indeed scheduled on the control plane. I need to investigate per component why that is, as some components actively check for the control plane label on the node they get scheduled on (the KubeVirt CR job, for example), but by default the control plane does not take any pods.
My take is that at sufficient test scale those components (the KubeVirtCI components) can indeed take up sizable resources. I can try to provide hard numbers for this.

What would the benefit of this approach be over just running a single-node kubevirtci cluster? What kind of resource savings are we seeing by moving the Kubernetes control components out of the VM? I don't think those components are that heavy on resources, but maybe I am wrong.

You are right that moving the control plane components isn't going to result in high savings, as they aren't the main culprit in resource consumption. The benefit is that by taking them out, all your nodes are workers and you can treat them all as a shared resource pool, rather than giving a particular node special treatment. This means the kubevirtci components don't need to take scheduling constraints into account and can sit on any node, effectively achieving better resource utilization and savings across the cluster.

In general, the sensible way to move forward with this project is to:

1. Get it working for all KubeVirtCI test cases and lanes
2. See if we can actually start downsizing some testing workloads (by changing the job definitions), and if we see improvements we can make this our new standard

Let me know what you think

Labels: dco-signoff: no · do-not-merge/work-in-progress · needs-rebase · size/XXL