
Add cgroup support #734

Merged: 2 commits into kata-containers:master from add-cgroup-support on Oct 27, 2018

Conversation

@WeiZhang555 (Member)

cgroups: add host cgroup support

Fixes #344

Add host cgroup support for Kata.

This commit only adds cpu.cfs_period and cpu.cfs_quota support.

It creates a 3-level hierarchy; take the "cpu" cgroup as an example:

```
/sys/fs/cgroup
|---cpu
   |---vc
      |---<sandbox-id>
         |--vcpu
      |---<sandbox-id>
```
  • The vc cgroup is the common parent for all Kata Containers sandboxes; it is not
    removed after a sandbox is removed. This cgroup has no limits.
  • The <sandbox-id> cgroup is the per-sandbox layer; it contains all qemu threads
    except the vcpu threads. In the future, we could consider putting the shim
    processes and proxy process here as well. This cgroup has no limits yet.
  • The vcpu cgroup contains the vcpu threads from qemu. Currently the cpu quota and
    period constraints apply to this cgroup.

Signed-off-by: Wei Zhang <[email protected]>
Signed-off-by: Jingxiao Lu <[email protected]>
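
Below is a minimal sketch of how such a hierarchy can be built with the containerd/cgroups v1 API that this PR vendors; the sandbox ID, quota values, and error handling are illustrative, not the actual kata-runtime code:

```go
package main

import (
	"github.com/containerd/cgroups"
	specs "github.com/opencontainers/runtime-spec/specs-go"
)

func main() {
	quota := int64(60000) // 0.6 core, as in the Docker test below
	period := uint64(100000)

	// /sys/fs/cgroup/<subsystem>/vc: shared parent for all sandboxes, no limits.
	vc, err := cgroups.New(cgroups.V1, cgroups.StaticPath("/vc"), &specs.LinuxResources{})
	if err != nil {
		panic(err)
	}

	// /vc/<sandbox-id>: per-sandbox layer, no limits yet.
	sandbox, err := vc.New("sandbox-id", &specs.LinuxResources{})
	if err != nil {
		panic(err)
	}

	// /vc/<sandbox-id>/vcpu: carries the actual CPU quota/period constraint.
	if _, err := sandbox.New("vcpu", &specs.LinuxResources{
		CPU: &specs.LinuxCPU{Quota: &quota, Period: &period},
	}); err != nil {
		panic(err)
	}
}
```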

@WeiZhang555 (Member Author)

Replaces #416


@WeiZhang555 (Member Author) commented Sep 15, 2018

Note 1:

Currently I import "github.com/WeiZhang555/cgroups"; this repo is a fork of the master branch of github.com/containerd/cgroups. The reasons I made a fork instead of using the original code:

  1. The latest code of github.com/containerd/cgroups uses a higher version of runtime-spec, which contains the new RDMA cgroup support (https://github.com/containerd/cgroups/blob/master/rdma.go); importing it requires bumping runtime-spec for both kata-runtime and kata-agent.
  2. The latest code has a bug according to my tests; I raised a PR to fix it: "Bugfix: can't write to cpuset cgroup" containerd/cgroups#54.
  3. Alternatively, we could use an older version of github.com/containerd/cgroups without the RDMA code, but according to my tests the last version without RDMA support also lacks an important function, AddTask(); without it, I can't apply fine-grained resource limits.

Combining the issues above, I suggest: we fork the latest code of github.com/containerd/cgroups under kata-containers, remove the RDMA code, and include the bugfix I mentioned; then this PR can vendor github.com/kata-containers/cgroups instead of github.com/WeiZhang555/cgroups.


Note 2:

govmm needs a vendor update; I will do it once this PR looks good.

@WeiZhang555 WeiZhang555 force-pushed the add-cgroup-support branch 2 times, most recently from ef0d364 to b121ac4 Compare September 15, 2018 07:39

@WeiZhang555 (Member Author)

Because there's already cgroup support in kata-agent, we need a small trick to verify that the host cgroup support really works.

Test steps with Docker:

```
# docker run --rm -it --cpu-quota 60000 --cpu-period 100000 --runtime kata progrium/stress --cpu 4 --timeout 600s
```

Run top on the host: you should see the qemu process taking about 60% CPU.

Then enter /sys/fs/cgroup/cpu/vc/<container-id>/vcpu and change cpu.cfs_quota_us from 60000 to 40000; the qemu CPU usage should drop from 60% to 40%. That tells you it really works.
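
The same adjustment can also be made through the vendored library rather than by writing the file directly; a sketch, assuming the hierarchy above (the sandbox-id path segment is an illustrative placeholder):

```go
package main

import (
	"github.com/containerd/cgroups"
	specs "github.com/opencontainers/runtime-spec/specs-go"
)

func main() {
	// Load the sandbox's existing vcpu cgroup (path is illustrative).
	vcpu, err := cgroups.Load(cgroups.V1, cgroups.StaticPath("/vc/sandbox-id/vcpu"))
	if err != nil {
		panic(err)
	}

	// Lower the quota from 60000 to 40000; qemu CPU usage should drop to ~40%.
	quota := int64(40000)
	if err := vcpu.Update(&specs.LinuxResources{
		CPU: &specs.LinuxCPU{Quota: &quota},
	}); err != nil {
		panic(err)
	}
}
```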

Test steps with k8s + cri-containerd:

Use this pod spec:

```
apiVersion: v1
kind: Pod
metadata:
  name: nginx-untrusted
  annotations:
    io.kubernetes.cri.untrusted-workload: "true"
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"
  - name: busybox
    image: busybox
    command: ['top']
    resources:
      requests:
        memory: "64Mi"
        cpu: "100m"
      limits:
        memory: "128Mi"
        cpu: "200m"
```

There will be 2 containers inside the pod: nginx with a cpu quota/period of "50000/100000" and busybox with a cpu quota/period of "20000/100000". So the values of cpu.cfs_quota_us and cpu.cfs_period_us in /sys/fs/cgroup/cpu/vc/<sandbox-id>/vcpu/ should be "70000" and "100000" respectively, which is what we see. That's expected.

Issue found:

When testing with cri-containerd and the above pod spec, the resource limit in the cgroup is set to 0.7 core as expected, but the VM gets 3 vcpus; that's not right.

My guess: we hotplug one vcpu for each of the two containers, plus the one default vcpu. We should calculate the vCPU number more accurately.
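
One way to make that calculation more accurate is to round the combined quota up to whole cores instead of hotplugging one vCPU per container. A runnable sketch under that assumption (the function name is hypothetical, not an existing kata-runtime helper):

```go
package main

import "fmt"

// vCPUsForQuota maps a combined CFS quota/period pair to a vCPU count,
// rounding up to a whole core. Illustrative sketch only.
func vCPUsForQuota(quota, period int64) uint32 {
	if quota <= 0 || period <= 0 {
		return 0 // unlimited/unset: caller falls back to the default vCPU count
	}
	return uint32((quota + period - 1) / period) // ceil(quota / period)
}

func main() {
	// Pod above: 50000 + 20000 quota over period 100000 = 0.7 core.
	fmt.Println(vCPUsForQuota(70000, 100000)) // prints 1, not 3
}
```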

@WeiZhang555 WeiZhang555 force-pushed the add-cgroup-support branch 3 times, most recently from 2a6fede to a231690 Compare September 15, 2018 09:30
@katacontainersbot (Contributor)

PSS Measurement:
Qemu: 167489 KB
Proxy: 4350 KB
Shim: 9018 KB

Memory inside container:
Total Memory: 2043460 KB
Free Memory: 2006696 KB

@katacontainersbot (Contributor)

PSS Measurement:
Qemu: 167264 KB
Proxy: 4163 KB
Shim: 8891 KB

Memory inside container:
Total Memory: 2043460 KB
Free Memory: 2006696 KB

@katacontainersbot (Contributor)

PSS Measurement:
Qemu: 173229 KB
Proxy: 4047 KB
Shim: 8897 KB

Memory inside container:
Total Memory: 2043460 KB
Free Memory: 2006884 KB

@codecov bot commented Sep 15, 2018

Codecov Report

Merging #734 into master will decrease coverage by 0.34%.
The diff coverage is 46.29%.

```
@@            Coverage Diff             @@
##           master     #734      +/-   ##
==========================================
- Coverage   66.09%   65.75%   -0.35%
==========================================
  Files          87       88       +1
  Lines       10705    10685      -20
==========================================
- Hits         7076     7026      -50
- Misses       2897     2919      +22
- Partials      732      740       +8
```

@bergwolf (Member)

A huge PR but it is really something we need. I'll take a closer look later today. Thanks @WeiZhang555 !

@bergwolf (Member) left a comment:

Generally looks good. A few comments inline.

```
@@ -31,7 +31,9 @@ import (

 // vmStartTimeout represents the time in seconds a sandbox can wait before
 // to consider the VM starting operation failed.
-const vmStartTimeout = 10
+const (
+	vmStartTimeout = 10
```
@bergwolf (Member):

Please move the comments as well if the braces are intentional.

@WeiZhang555 (Member Author):

Got it!

```
@@ -122,6 +122,11 @@ func createSandboxFromConfig(ctx context.Context, sandboxConfig SandboxConfig, f
 		return nil, err
 	}

+	// Setup host cgroups
+	if err := s.setupCgroups(); err != nil {
```
@bergwolf (Member):

This is already called in createContainers(). Are you trying to make sure there is host cpu cg even for an empty sandbox?

@WeiZhang555 (Member Author):

No, the s.setupCgroups() in createSandboxFromConfig is for the initial sandbox container; that sandbox has exactly one container. The other call to s.setupCgroups() is in s.CreateContainer(). The setup happens only once per container (including the sandbox container); there is no duplicate call.

@bergwolf (Member):

@WeiZhang555 There is a code path createSandboxFromConfig -> s.createContainers -> createContainer -> setupCgroups for the initial container in the sandbox as well.

@WeiZhang555 (Member Author):

createContainer -> setupCgroups doesn't exist.

Maybe you mixed up createContainer() from virtcontainers/container.go with the one from cli/create.go?

@bergwolf (Member):

Oops, you are right! I mixed up the two createContainer() functions in sandbox.go and container.go, though they differ only in the initial capital. Sorry for the noise...


```
// TODO: how to handle empty/unlimited resource?
// maybe we should add a default CPU/Memory delta when no
// resource limit is given. -- @WeiZhang555
```
@bergwolf (Member):

A default cpu quota might conflict globally on the host if we exceed the total available CPU time. While users might deliberately set it to exceed, we should not make that happen unintentionally. docker/runc does not set a default quota for containers either.

@WeiZhang555 (Member Author):

The worry is: if there are two containers inside a pod, one with quota 6000 and the other with quota -1 (unlimited), the total quota will be 6000.

That means, with my calculation, one container with 0.6 core + one container with unlimited cores = 0.6 core. I'm not sure this satisfies everyone; I can only advise users that if you set a limit for one container, you should set limits for every container too...

@devimc:

Please file an issue.

@WeiZhang555 (Member Author):

@devimc Here it is: #743 😄

```
// so use previous period 10000 as a baseline, container B
// has proportional resource of quota 4000 and period 10000, calculated as
// delta := 40 / 100 * 10000 = 4000
// and final `*resource.CPU.Quota` = 5000 + 4000 = 9000
```
@bergwolf (Member):

Nice! I like the proportional calculation and the comments!
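
For reference, a runnable sketch of the proportional merge that the comment above describes; mergeCPUQuota is a hypothetical name, not the actual kata-runtime helper:

```go
package main

import "fmt"

// mergeCPUQuota folds one container's (quota, period) pair into the
// sandbox-level quota, normalized to the sandbox period. Illustrative
// sketch of the proportional calculation only.
func mergeCPUQuota(sandboxQuota, sandboxPeriod, quota, period int64) int64 {
	if quota <= 0 || period == 0 {
		return sandboxQuota // unlimited container leaves the total unchanged
	}
	delta := quota * sandboxPeriod / period // proportional share at the sandbox period
	return sandboxQuota + delta
}

func main() {
	// Example from the comment: sandbox 5000@10000 plus container 40@100.
	fmt.Println(mergeCPUQuota(5000, 10000, 40, 100)) // prints 9000
}
```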

@bergwolf (Member)

@WeiZhang555 Nice patch, thanks! w.r.t. the containerd bug, is it possible to push containerd/cgroups#54 forward? Also I think we should update our spec version even though we do not support rdma right now. Then we can import from containerd directly ;)

@WeiZhang555 (Member Author)

@bergwolf

> Nice patch, thanks! w.r.t. the containerd bug, is it possible to push containerd/cgroups#54 forward?

I think this is achievable; I just can't be sure how long it will take.

> Also I think we should update our spec version even though we do not support rdma right now. Then we can import from containerd directly ;)

I need to check whether bumping runtime-spec breaks anything. Currently kata-runtime and kata-agent use the latest stable release, v1.0.1; I think we have more reasons to stay on v1.0.1, as it's a stable release, compared to an in-development master branch.

@crosbymichael

We bumped the spec in containerd; you shouldn't have any backward-incompatibility issues.

@WeiZhang555 (Member Author)

Hi @crosbymichael, thank you for your response! Then I think bumping kata-runtime's runtime-spec version is the right way; kata-runtime/agent should use the same or a similar runtime-spec version as containerd 😄

@WeiZhang555 WeiZhang555 changed the title Add cgroup support [WIP ]Add cgroup support (but ready for review) Sep 18, 2018
@WeiZhang555 WeiZhang555 changed the title [WIP ]Add cgroup support (but ready for review) [WIP]Add cgroup support (ready for review) Sep 18, 2018
@raravena80 (Member)

@WeiZhang555 ping from your weekly Kata herder.

@WeiZhang555 (Member Author)

Rebased.

ping @devimc , what do you think of this #734 (comment) ?

@caoruidong (Member)

/test

@WeiZhang555 (Member Author) commented Oct 25, 2018

It seems we got some LGTMs from @bergwolf, @devimc and @jshachm.
@sboeuf @jodh-intel @egernst, do you want to take another look?

Two remaining issues need to be resolved in follow-up PRs:

  • honor cgroupsPath in config.json
  • honor the --systemd-cgroup flag

```
// immediately as default behaviour.
if len(tids.vcpus) > 0 {
	if err := s.cgroup.sandboxSub.Add(cgroups.Process{
		Pid: tids.vcpus[0],
```
A reviewer (Member):

So are vcpus[] thread ids or cpu ids? I'm a little confused.

@WeiZhang555 (Member Author):

They are thread IDs.
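
For context, here is a sketch of how such thread IDs can be constrained via the vendored containerd/cgroups API; the function and variable names are assumptions, not the actual kata-runtime code. Note that AddTask writes to the cgroup's tasks file (thread granularity), unlike Add, which writes to cgroup.procs (process granularity):

```go
package vcpu

import "github.com/containerd/cgroups"

// constrainVCPUs moves each QEMU vCPU thread into the vcpu sub-cgroup so
// the CPU constraint applies only to guest vCPUs. Illustrative sketch.
func constrainVCPUs(vcpuCgroup cgroups.Cgroup, vcpuThreadIDs []int) error {
	for _, tid := range vcpuThreadIDs {
		// Thread-level placement: write the TID to the tasks file.
		if err := vcpuCgroup.AddTask(cgroups.Process{Pid: tid}); err != nil {
			return err
		}
	}
	return nil
}
```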

Add new vendor library "github.com/containerd/cgroups"
commit: 5017d4e9a9cf2d4381db99eacd9baf84b95bfb14

This library is needed by the host cgroup support in the next commit.

Signed-off-by: Wei Zhang <[email protected]>
Fixes kata-containers#344

Add host cgroup support for Kata.

This commit only adds cpu.cfs_period and cpu.cfs_quota support.

It creates a 3-level hierarchy; take the "cpu" cgroup as an example:

```
/sys/fs/cgroup
|---cpu
   |---kata
      |---<sandbox-id>
         |--vcpu
      |---<sandbox-id>
```

* The `kata` cgroup is the common parent for all Kata Containers sandboxes; it is not
removed after a sandbox is removed. This cgroup has no limits.
* The `<sandbox-id>` cgroup is the per-sandbox layer; it contains all qemu
threads except the vcpu threads. In the future, we could consider putting the shim
processes and proxy process here as well. This cgroup has no limits yet.
* The `vcpu` cgroup contains the vcpu threads from qemu. Currently the cpu quota and
period constraints apply to this cgroup.

Signed-off-by: Wei Zhang <[email protected]>
Signed-off-by: Jingxiao Lu <[email protected]>
@WeiZhang555 (Member Author)

/test

@WeiZhang555 (Member Author)

CI passed. Merging, as we have enough LGTMs and this has been pending for a long time.

@WeiZhang555 WeiZhang555 merged commit 95386fb into kata-containers:master Oct 27, 2018
@WeiZhang555 WeiZhang555 deleted the add-cgroup-support branch October 27, 2018 08:04
@crosbymichael
Yeah! Congrats

@WeiZhang555 (Member Author) commented Nov 7, 2018

@liangxianlong I'm not sure what you mean by "reuse" my code. If you want to reuse kata-runtime code to implement a new feature, please go ahead! This is an open source project and you're welcome to use it and contribute!

@liangxianlong commented Nov 7, 2018

@WeiZhang555 I tested your code: I ran "docker run -ti --cpuset-cpus 1 busybox /bin/sh" and saw this directory on my host: /sys/fs/cgroup/cpuset/kata/e76fe6c34d9e75333b06091e0a68095a470ba4333c3de4440e99010029f1674a/vcpu. But if we have two vcpus, I think there should be directories like /sys/fs/cgroup/cpuset/kata/e76fe6c34d9e75333b06091e0a68095a470ba4333c3de4440e99010029f1674a/vcpu0 and /sys/fs/cgroup/cpuset/kata/e76fe6c34d9e75333b06091e0a68095a470ba4333c3de4440e99010029f1674a/vcpu1.

@WeiZhang555 (Member Author):

@liangxianlong So you are trying to support cpuset; currently only cfs_quota and cfs_period are supported.

cpuset support could be more complicated; it depends on our cgroup-setting policy:

  1. We have both host and guest cgroup support; supporting cpuset needs coordination between the guest and host cgroups.
  2. Suppose a container has 1.5 cores and cpuset "0-1": we don't need to set the cpuset separately for vcpu 0 and vcpu 1. We can put the vcpus in the same cgroup and write "0-1" into the cgroup config, so there is no need for separate directories (see the sketch below).
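
A sketch of that shared-cgroup cpuset write using the vendored containerd/cgroups API; pinVCPUs and its arguments are hypothetical names, not existing kata-runtime code:

```go
package vcpu

import (
	"github.com/containerd/cgroups"
	specs "github.com/opencontainers/runtime-spec/specs-go"
)

// pinVCPUs writes a cpuset such as "0-1" into the shared vcpu cgroup,
// pinning all vCPU threads at once. Illustrative sketch.
func pinVCPUs(vcpuCgroup cgroups.Cgroup, cpus string) error {
	return vcpuCgroup.Update(&specs.LinuxResources{
		CPU: &specs.LinuxCPU{Cpus: cpus},
	})
}
```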

@liangxianlong commented Nov 8, 2018

@WeiZhang555 For now I don't care about the container. In my test, after "docker run -ti --cpuset-cpus 1 busybox /bin/sh", there are two results: (1) the container's process in the VM is bound to vcpu1, and (2) a directory is created on the host: /sys/fs/cgroup/cpuset/kata/${sandbox-id}/vcpu. Each vcpu is just a qemu thread, so if we want to bind a vcpu to a physical cpu, does the code need some modifications?

@WeiZhang555 (Member Author):

@liangxianlong This is achievable; you can enhance the code to add cpuset support. It should be easy.

@liangxianlong:

> @liangxianlong This is achievable, you can enhance the code to add cpuset support, it should be easy.

Thanks. Another question: if I run "docker run -ti busybox /bin/sh", two directories are created on my host: (1) /sys/fs/cgroup/cpuset/kata/${sandbox-id}/vcpu and (2) /sys/fs/cgroup/cpu/kata/${sandbox-id}/vcpu. I think only /sys/fs/cgroup/cpu/kata/${sandbox-id}/vcpu should be created. Why does the code create both?

@WeiZhang555 (Member Author):

@liangxianlong That's because github.com/containerd/cgroups doesn't give us an interface to build only cpu/kata/vcpu without also building cpuset/kata/vcpu; it's a gap in the cgroups library API.

By the way, it's better to open a separate issue for discussing and tracking this; a discussion under a closed PR may be missed by other people, so this is not the right place 😄

@liangxianlong

@WeiZhang555 Thanks! I'm new to Kata and interested in it, so please bear with me, hehe.

@liangxianlong commented Nov 12, 2018

@WeiZhang555 Regarding this PR, I asked a question in #901; please take a look, thank you.

egernst pushed a commit to egernst/runtime that referenced this pull request on Feb 9, 2021: protocols: client: Add timeout for hybrid vsock handshake