
Add cgroup support #734

Merged: 2 commits into kata-containers:master from add-cgroup-support on Oct 27, 2018

Conversation

@WeiZhang555 (Member)

cgroups: add host cgroup support

Fixes #344

Add host cgroup support for Kata.

This commit only adds cpu.cfs_period and cpu.cfs_quota support.

It creates a 3-level hierarchy; take the "cpu" cgroup as an example:

```
/sys/fs/cgroup
|---cpu
   |---vc
      |---<sandbox-id>
         |--vcpu
      |---<sandbox-id>
```
  • The vc cgroup is the common parent for all Kata Containers sandboxes; it is not
    removed after a sandbox is removed. This cgroup has no limits.
  • The <sandbox-id> cgroup is the per-sandbox layer; it contains all qemu threads
    except the vcpu threads. In the future, we could consider putting the shim
    processes and proxy process here as well. This cgroup has no limits yet.
  • The vcpu cgroup contains the vcpu threads from qemu. Currently the cpu quota and
    period constraints apply to this cgroup.

Signed-off-by: Wei Zhang <[email protected]>
Signed-off-by: Jingxiao Lu <[email protected]>
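
Below is a minimal sketch of how such a hierarchy can be built with the containerd/cgroups v1 API that this PR vendors; the sandbox ID, quota values, and error handling are illustrative, not the actual kata-runtime code:

```go
package main

import (
	"github.com/containerd/cgroups"
	specs "github.com/opencontainers/runtime-spec/specs-go"
)

func main() {
	quota := int64(60000) // 0.6 core, as in the Docker test below
	period := uint64(100000)

	// /sys/fs/cgroup/<subsystem>/vc: shared parent for all sandboxes, no limits.
	vc, err := cgroups.New(cgroups.V1, cgroups.StaticPath("/vc"), &specs.LinuxResources{})
	if err != nil {
		panic(err)
	}

	// /vc/<sandbox-id>: per-sandbox layer, no limits yet.
	sandbox, err := vc.New("sandbox-id", &specs.LinuxResources{})
	if err != nil {
		panic(err)
	}

	// /vc/<sandbox-id>/vcpu: carries the actual CPU quota/period constraint.
	if _, err := sandbox.New("vcpu", &specs.LinuxResources{
		CPU: &specs.LinuxCPU{Quota: &quota, Period: &period},
	}); err != nil {
		panic(err)
	}
}
```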

@WeiZhang555 (Member Author)

Replaces #416


@WeiZhang555 (Member Author) commented Sep 15, 2018

Note 1:

Currently I import "github.com/WeiZhang555/cgroups"; this repo is a fork of the master branch of github.com/containerd/cgroups. The reasons I made a fork instead of using the original code:

  1. The latest code of github.com/containerd/cgroups uses a higher version of runtime-spec, which contains the new RDMA cgroup support (https://github.com/containerd/cgroups/blob/master/rdma.go); importing it requires bumping runtime-spec for both kata-runtime and kata-agent.
  2. The latest code has a bug according to my tests; I raised a PR to fix it: "Bugfix: can't write to cpuset cgroup" containerd/cgroups#54.
  3. Alternatively, we could use an older version of github.com/containerd/cgroups without the RDMA code, but according to my tests the last version without RDMA support also lacks an important function, AddTask(); without it, I can't apply fine-grained resource limits.

Combining the issues above, I suggest: we fork the latest code of github.com/containerd/cgroups under kata-containers, remove the RDMA code, and include the bugfix I mentioned; then this PR can vendor github.com/kata-containers/cgroups instead of github.com/WeiZhang555/cgroups.


Note 2:

govmm needs a vendor update; I will do it once this PR looks good.

@WeiZhang555 WeiZhang555 force-pushed the add-cgroup-support branch 2 times, most recently from ef0d364 to b121ac4 Compare September 15, 2018 07:39

@WeiZhang555 (Member Author)

Because there's already cgroup support in kata-agent, we need a small trick to verify that the host cgroup support really works.

Test steps with Docker:

```
# docker run --rm -it --cpu-quota 60000 --cpu-period 100000 --runtime kata progrium/stress --cpu 4 --timeout 600s
```

Run top on the host: you should see the qemu process taking about 60% CPU.

Then enter /sys/fs/cgroup/cpu/vc/<container-id>/vcpu and change cpu.cfs_quota_us from 60000 to 40000; the qemu CPU usage should drop from 60% to 40%. That tells you it really works.
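
The same adjustment can also be made through the vendored library rather than by writing the file directly; a sketch, assuming the hierarchy above (the sandbox-id path segment is an illustrative placeholder):

```go
package main

import (
	"github.com/containerd/cgroups"
	specs "github.com/opencontainers/runtime-spec/specs-go"
)

func main() {
	// Load the sandbox's existing vcpu cgroup (path is illustrative).
	vcpu, err := cgroups.Load(cgroups.V1, cgroups.StaticPath("/vc/sandbox-id/vcpu"))
	if err != nil {
		panic(err)
	}

	// Lower the quota from 60000 to 40000; qemu CPU usage should drop to ~40%.
	quota := int64(40000)
	if err := vcpu.Update(&specs.LinuxResources{
		CPU: &specs.LinuxCPU{Quota: &quota},
	}); err != nil {
		panic(err)
	}
}
```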

Test steps with k8s + cri-containerd:

Use this pod spec:

```
apiVersion: v1
kind: Pod
metadata:
  name: nginx-untrusted
  annotations:
    io.kubernetes.cri.untrusted-workload: "true"
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"
  - name: busybox
    image: busybox
    command: ['top']
    resources:
      requests:
        memory: "64Mi"
        cpu: "100m"
      limits:
        memory: "128Mi"
        cpu: "200m"
```

There will be 2 containers inside the pod: nginx with a cpu quota/period of "50000/100000" and busybox with a cpu quota/period of "20000/100000". So the values of cpu.cfs_quota_us and cpu.cfs_period_us in /sys/fs/cgroup/cpu/vc/<sandbox-id>/vcpu/ should be "70000" and "100000" respectively, which is what we see. That's expected.

Issue found:

When testing with cri-containerd and the above pod spec, the resource limit in the cgroup is set to 0.7 core as expected, but the VM gets 3 vcpus; that's not right.

My guess: we hotplug one vcpu for each of the two containers, plus the one default vcpu. We should calculate the vCPU number more accurately.
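
One way to make that calculation more accurate is to round the combined quota up to whole cores instead of hotplugging one vCPU per container. A runnable sketch under that assumption (the function name is hypothetical, not an existing kata-runtime helper):

```go
package main

import "fmt"

// vCPUsForQuota maps a combined CFS quota/period pair to a vCPU count,
// rounding up to a whole core. Illustrative sketch only.
func vCPUsForQuota(quota, period int64) uint32 {
	if quota <= 0 || period <= 0 {
		return 0 // unlimited/unset: caller falls back to the default vCPU count
	}
	return uint32((quota + period - 1) / period) // ceil(quota / period)
}

func main() {
	// Pod above: 50000 + 20000 quota over period 100000 = 0.7 core.
	fmt.Println(vCPUsForQuota(70000, 100000)) // prints 1, not 3
}
```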

@WeiZhang555 WeiZhang555 force-pushed the add-cgroup-support branch 3 times, most recently from 2a6fede to a231690 Compare September 15, 2018 09:30
@katacontainersbot (Contributor)

PSS Measurement:
Qemu: 167489 KB
Proxy: 4350 KB
Shim: 9018 KB

Memory inside container:
Total Memory: 2043460 KB
Free Memory: 2006696 KB

@katacontainersbot (Contributor)

PSS Measurement:
Qemu: 167264 KB
Proxy: 4163 KB
Shim: 8891 KB

Memory inside container:
Total Memory: 2043460 KB
Free Memory: 2006696 KB

@katacontainersbot (Contributor)

PSS Measurement:
Qemu: 173229 KB
Proxy: 4047 KB
Shim: 8897 KB

Memory inside container:
Total Memory: 2043460 KB
Free Memory: 2006884 KB

@codecov bot commented Sep 15, 2018

Codecov Report

Merging #734 into master will decrease coverage by 0.34%.
The diff coverage is 46.29%.

```
@@            Coverage Diff             @@
##           master     #734      +/-   ##
==========================================
- Coverage   66.09%   65.75%   -0.35%
==========================================
  Files          87       88       +1
  Lines       10705    10685      -20
==========================================
- Hits         7076     7026      -50
- Misses       2897     2919      +22
- Partials      732      740       +8
```

@bergwolf (Member)

A huge PR but it is really something we need. I'll take a closer look later today. Thanks @WeiZhang555 !

@bergwolf (Member) left a comment:

Generally looks good. A few comments inline.

```
@@ -31,7 +31,9 @@ import (

 // vmStartTimeout represents the time in seconds a sandbox can wait before
 // to consider the VM starting operation failed.
-const vmStartTimeout = 10
+const (
+	vmStartTimeout = 10
```
@bergwolf (Member):

Please move the comments as well if the braces are intentional.

@WeiZhang555 (Member Author):

Got it!

```
@@ -122,6 +122,11 @@ func createSandboxFromConfig(ctx context.Context, sandboxConfig SandboxConfig, f
 		return nil, err
 	}

+	// Setup host cgroups
+	if err := s.setupCgroups(); err != nil {
```
@bergwolf (Member):

This is already called in createContainers(). Are you trying to make sure there is host cpu cg even for an empty sandbox?

@WeiZhang555 (Member Author):

No, the s.setupCgroups() in createSandboxFromConfig is for the initial sandbox container; that sandbox has exactly one container. The other call to s.setupCgroups() is in s.CreateContainer(). The setup happens only once per container (including the sandbox container); there is no duplicate call.

@bergwolf (Member):

@WeiZhang555 There is a code path createSandboxFromConfig -> s.createContainers -> createContainer -> setupCgroups for the initial container in the sandbox as well.

@WeiZhang555 (Member Author):

createContainer -> setupCgroups doesn't exist.

Maybe you mixed up createContainer() from virtcontainers/container.go with the one from cli/create.go?

@bergwolf (Member):

Oops, you are right! I mixed up the two createContainer() functions in sandbox.go and container.go, though they differ only in the initial capital. Sorry for the noise...


```
// TODO: how to handle empty/unlimited resource?
// maybe we should add a default CPU/Memory delta when no
// resource limit is given. -- @WeiZhang555
```
@bergwolf (Member):

A default cpu quota might conflict globally on the host if we exceed the total available CPU time. While users might deliberately set it to exceed, we should not make that happen unintentionally. docker/runc does not set a default quota for containers either.

@WeiZhang555 (Member Author):

The worry is: if there are two containers inside a pod, one with quota 6000 and the other with quota -1 (unlimited), the total quota will be 6000.

That means, with my calculation, one container with 0.6 core + one container with unlimited cores = 0.6 core. I'm not sure this satisfies everyone; I can only advise users that if you set a limit for one container, you should set limits for every container too...

@devimc:

Please file an issue.

@WeiZhang555 (Member Author):

@devimc Here it is: #743 😄

```
// so use previous period 10000 as a baseline, container B
// has proportional resource of quota 4000 and period 10000, calculated as
// delta := 40 / 100 * 10000 = 4000
// and final `*resource.CPU.Quota` = 5000 + 4000 = 9000
```
@bergwolf (Member):

Nice! I like the proportional calculation and the comments!
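
For reference, a runnable sketch of the proportional merge that the comment above describes; mergeCPUQuota is a hypothetical name, not the actual kata-runtime helper:

```go
package main

import "fmt"

// mergeCPUQuota folds one container's (quota, period) pair into the
// sandbox-level quota, normalized to the sandbox period. Illustrative
// sketch of the proportional calculation only.
func mergeCPUQuota(sandboxQuota, sandboxPeriod, quota, period int64) int64 {
	if quota <= 0 || period == 0 {
		return sandboxQuota // unlimited container leaves the total unchanged
	}
	delta := quota * sandboxPeriod / period // proportional share at the sandbox period
	return sandboxQuota + delta
}

func main() {
	// Example from the comment: sandbox 5000@10000 plus container 40@100.
	fmt.Println(mergeCPUQuota(5000, 10000, 40, 100)) // prints 9000
}
```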

@bergwolf (Member)

@WeiZhang555 Nice patch, thanks! w.r.t. the containerd bug, is it possible to push containerd/cgroups#54 forward? Also I think we should update our spec version even though we do not support rdma right now. Then we can import from containerd directly ;)

@WeiZhang555 (Member Author)

@bergwolf

> Nice patch, thanks! w.r.t. the containerd bug, is it possible to push containerd/cgroups#54 forward?

I think this is achievable; I just can't be sure how long it will take.

> Also I think we should update our spec version even though we do not support rdma right now. Then we can import from containerd directly ;)

I need to check whether bumping runtime-spec breaks anything. Currently kata-runtime and kata-agent use the latest stable release, v1.0.1; I think we have more reasons to stay on v1.0.1, as it's a stable release, compared to an in-development master branch.

@crosbymichael

We bumped the spec in containerd; you shouldn't have any backward-incompatibility issues.

@WeiZhang555 (Member Author)

Hi @crosbymichael, thank you for your response! Then I think bumping kata-runtime's runtime-spec version is the right way; kata-runtime/agent should use the same or a similar runtime-spec version as containerd 😄

@WeiZhang555 WeiZhang555 changed the title Add cgroup support [WIP ]Add cgroup support (but ready for review) Sep 18, 2018
@WeiZhang555 WeiZhang555 changed the title [WIP ]Add cgroup support (but ready for review) [WIP]Add cgroup support (ready for review) Sep 18, 2018
@raravena80 (Member)

@WeiZhang555 ping from your weekly Kata herder.

@WeiZhang555 (Member Author)

Rebased.

ping @devimc , what do you think of this #734 (comment) ?

@caoruidong (Member)

/test

@WeiZhang555 (Member Author) commented Oct 25, 2018

It seems we got some LGTMs from @bergwolf, @devimc and @jshachm.
@sboeuf @jodh-intel @egernst, do you want to take another look?

Two remaining issues need to be resolved in follow-up PRs:

  • honor cgroupsPath in config.json
  • honor the --systemd-cgroup flag

```
// immediately as default behaviour.
if len(tids.vcpus) > 0 {
	if err := s.cgroup.sandboxSub.Add(cgroups.Process{
		Pid: tids.vcpus[0],
```
A reviewer (Member):

So are vcpus[] thread ids or cpu ids? I'm a little confused.

@WeiZhang555 (Member Author):

They are thread IDs.
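
For context, here is a sketch of how such thread IDs can be constrained via the vendored containerd/cgroups API; the function and variable names are assumptions, not the actual kata-runtime code. Note that AddTask writes to the cgroup's tasks file (thread granularity), unlike Add, which writes to cgroup.procs (process granularity):

```go
package vcpu

import "github.com/containerd/cgroups"

// constrainVCPUs moves each QEMU vCPU thread into the vcpu sub-cgroup so
// the CPU constraint applies only to guest vCPUs. Illustrative sketch.
func constrainVCPUs(vcpuCgroup cgroups.Cgroup, vcpuThreadIDs []int) error {
	for _, tid := range vcpuThreadIDs {
		// Thread-level placement: write the TID to the tasks file.
		if err := vcpuCgroup.AddTask(cgroups.Process{Pid: tid}); err != nil {
			return err
		}
	}
	return nil
}
```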

Add new vendor library "github.com/containerd/cgroups"
commit: 5017d4e9a9cf2d4381db99eacd9baf84b95bfb14

This library is needed by the host cgroup support in the next commit.

Signed-off-by: Wei Zhang <[email protected]>
Fixes kata-containers#344

Add host cgroup support for Kata.

This commit only adds cpu.cfs_period and cpu.cfs_quota support.

It creates a 3-level hierarchy; take the "cpu" cgroup as an example:

```
/sys/fs/cgroup
|---cpu
   |---kata
      |---<sandbox-id>
         |--vcpu
      |---<sandbox-id>
```

* The `kata` cgroup is the common parent for all Kata Containers sandboxes; it is not
removed after a sandbox is removed. This cgroup has no limits.
* The `<sandbox-id>` cgroup is the per-sandbox layer; it contains all qemu
threads except the vcpu threads. In the future, we could consider putting the shim
processes and proxy process here as well. This cgroup has no limits yet.
* The `vcpu` cgroup contains the vcpu threads from qemu. Currently the cpu quota and
period constraints apply to this cgroup.

Signed-off-by: Wei Zhang <[email protected]>
Signed-off-by: Jingxiao Lu <[email protected]>
@WeiZhang555 (Member Author)

/test

@WeiZhang555 (Member Author)

CI passed. Merging, as we have enough LGTMs and this has been pending for a long time.

@WeiZhang555 WeiZhang555 merged commit 95386fb into kata-containers:master Oct 27, 2018
@WeiZhang555 WeiZhang555 deleted the add-cgroup-support branch October 27, 2018 08:04
@crosbymichael
Yeah! Congrats

@WeiZhang555 (Member Author) commented Nov 7, 2018

@liangxianlong I'm not sure what you mean by "reuse" my code. If you want to reuse kata-runtime code to implement a new feature, please go ahead! This is an open source project and you're welcome to use it and contribute!

@liangxianlong commented Nov 7, 2018

@WeiZhang555 I tested your code: I ran "docker run -ti --cpuset-cpus 1 busybox /bin/sh" and saw this directory on my host: /sys/fs/cgroup/cpuset/kata/e76fe6c34d9e75333b06091e0a68095a470ba4333c3de4440e99010029f1674a/vcpu. But if we have two vcpus, I think there should be directories like /sys/fs/cgroup/cpuset/kata/e76fe6c34d9e75333b06091e0a68095a470ba4333c3de4440e99010029f1674a/vcpu0 and /sys/fs/cgroup/cpuset/kata/e76fe6c34d9e75333b06091e0a68095a470ba4333c3de4440e99010029f1674a/vcpu1.

@WeiZhang555 (Member Author):

@liangxianlong So you are trying to support cpuset; currently only cfs_quota and cfs_period are supported.

cpuset support could be more complicated; it depends on our cgroup-setting policy:

  1. We have both host and guest cgroup support; supporting cpuset needs coordination between the guest and host cgroups.
  2. Suppose a container has 1.5 cores and cpuset "0-1": we don't need to set the cpuset separately for vcpu 0 and vcpu 1. We can put the vcpus in the same cgroup and write "0-1" into the cgroup config, so there is no need for separate directories (see the sketch below).
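
A sketch of that shared-cgroup cpuset write using the vendored containerd/cgroups API; pinVCPUs and its arguments are hypothetical names, not existing kata-runtime code:

```go
package vcpu

import (
	"github.com/containerd/cgroups"
	specs "github.com/opencontainers/runtime-spec/specs-go"
)

// pinVCPUs writes a cpuset such as "0-1" into the shared vcpu cgroup,
// pinning all vCPU threads at once. Illustrative sketch.
func pinVCPUs(vcpuCgroup cgroups.Cgroup, cpus string) error {
	return vcpuCgroup.Update(&specs.LinuxResources{
		CPU: &specs.LinuxCPU{Cpus: cpus},
	})
}
```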

@liangxianlong commented Nov 8, 2018

@WeiZhang555 For now I don't care about the container. In my test, after "docker run -ti --cpuset-cpus 1 busybox /bin/sh", there are two results: (1) the container's process in the VM is bound to vcpu1, and (2) a directory is created on the host: /sys/fs/cgroup/cpuset/kata/${sandbox-id}/vcpu. Each vcpu is just a qemu thread, so if we want to bind a vcpu to a physical cpu, does the code need some modifications?

@WeiZhang555 (Member Author):

@liangxianlong This is achievable; you can enhance the code to add cpuset support. It should be easy.

@liangxianlong:

> @liangxianlong This is achievable, you can enhance the code to add cpuset support, it should be easy.

Thanks. Another question: if I run "docker run -ti busybox /bin/sh", two directories are created on my host: (1) /sys/fs/cgroup/cpuset/kata/${sandbox-id}/vcpu and (2) /sys/fs/cgroup/cpu/kata/${sandbox-id}/vcpu. I think only /sys/fs/cgroup/cpu/kata/${sandbox-id}/vcpu should be created. Why does the code create both?

@WeiZhang555 (Member Author):

@liangxianlong That's because github.com/containerd/cgroups doesn't give us an interface to build only cpu/kata/vcpu without also building cpuset/kata/vcpu; it's a gap in the cgroups library API.

By the way, it's better to open a separate issue for discussing and tracking this; a discussion under a closed PR may be missed by other people, so this is not the right place 😄

@liangxianlong

@WeiZhang555 Thanks! I'm new to Kata and interested in it, so please bear with me, hehe.

@liangxianlong commented Nov 12, 2018

@WeiZhang555 Regarding this PR, I asked a question in #901; please take a look, thank you.

egernst pushed a commit to egernst/runtime that referenced this pull request on Feb 9, 2021: protocols: client: Add timeout for hybrid vsock handshake