
virtcontainers: reimplement sandbox cgroup #1189

Merged
merged 5 commits into kata-containers:master from topic/fixCpuCgroup on Feb 20, 2019

Conversation


@devimc commented Jan 28, 2019

All containers run in different cgroups, including the sandbox. With this new
implementation the sandbox CPU cgroup will be equal to the sum of all its
containers' cgroups, and the hypervisor process will be placed there, so its
cost is accounted against the containers running in the sandbox (VM). The
default number of vcpus is used when the sandbox has no constraints. For
example, if default_vcpus is 2, then the quota will be 200000 and the period 100000.
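
For clarity, here is a minimal sketch of the quota/period arithmetic described above; the constant and function names are illustrative, not the runtime's actual ones:

```go
package main

import "fmt"

// defaultCFSPeriodUs is the kernel's default cfs_period_us (100ms).
const defaultCFSPeriodUs = 100000

// sandboxCFSQuota returns the CFS quota (in microseconds) equivalent to
// `vcpus` fully usable CPUs: quota = vcpus * period.
func sandboxCFSQuota(vcpus uint32) int64 {
	return int64(vcpus) * defaultCFSPeriodUs
}

func main() {
	// With default_vcpus = 2 this prints quota=200000 period=100000,
	// matching the example above.
	fmt.Printf("quota=%d period=%d\n", sandboxCFSQuota(2), defaultCFSPeriodUs)
}
```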

c-ray test
http://www.futuretech.blinkenlights.nl/c-ray.html

+=============================================+
|         | 6 threads 6cpus | 1 thread 1 cpu  |
+=============================================+
| current |   40 seconds    |   122 seconds   |
+=============================================+
|   new   |   37 seconds    |   124 seconds   |
+=============================================+

current = current cgroups implementation
new = new cgroups implementation

workload

apiVersion: v1
kind: Pod
metadata:
  name: c-ray
  annotations:
    io.kubernetes.cri.untrusted-workload: "true"
spec:
  restartPolicy: Never
  containers:
  - name: c-ray-1
    image: docker.io/devimc/c-ray:latest
    imagePullPolicy: IfNotPresent
    args: ["-t", "6", "-s", "1600x1200", "-r", "8", "-i",
          "/c-ray-1.1/sphfract", "-o", "/tmp/output.ppm"]
    resources:
      limits:
        cpu: 6
  - name: c-ray-2
    image: docker.io/devimc/c-ray:latest
    imagePullPolicy: IfNotPresent
    args: ["-t", "1", "-s", "1600x1200", "-r", "8", "-i",
          "/c-ray-1.1/sphfract", "-o", "/tmp/output.ppm"]
    resources:
      limits:
        cpu: 1

fixes #1125

Signed-off-by: Julio Montes [email protected]

@devimc requested a review from WeiZhang555 on January 28, 2019 21:56
@devimc force-pushed the topic/fixCpuCgroup branch from 009a12a to 2da5814 on January 28, 2019 21:57

devimc commented Jan 28, 2019

@WeiZhang555 please take a look

@devimc requested a review from jcvenegas on January 29, 2019 15:22

devimc commented Jan 29, 2019

@jcvenegas please take a look

@WeiZhang555

/test

@WeiZhang555 left a comment

Thanks for the effort! I left some comments, great work!

virtcontainers/container.go (outdated, resolved)
virtcontainers/container.go (outdated, resolved)
virtcontainers/cgroups.go (outdated, resolved)
for _, i := range tids.vcpus {
if i <= 0 {
continue
}
Member

Adding only the vcpu threads is an optimization, which can benefit small-chunk IO performance according to our previous tests.

It's OK to remove this if it makes the handling logic too complicated; I just want to share the performance tuning result with the community.
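
A hedged sketch of that optimization, assuming the containerd/cgroups v1 API and a hypothetical `vcpuThreadIDs` struct mirroring the `tids.vcpus` field quoted above (this is not the merged code):

```go
package vc

import "github.com/containerd/cgroups"

// vcpuThreadIDs is a stand-in for the runtime's hypervisor thread-ID struct.
type vcpuThreadIDs struct {
	vcpus []int
}

// addVCPUsToCgroup places only the vCPU threads under the constrained
// cgroup, so the hypervisor's I/O threads are not throttled by the
// container's CPU quota.
func addVCPUsToCgroup(cg cgroups.Cgroup, tids vcpuThreadIDs) error {
	for _, tid := range tids.vcpus {
		if tid <= 0 {
			// Skip unset or invalid thread IDs.
			continue
		}
		// AddTask writes the TID to the cgroup's tasks file, constraining
		// just that thread rather than the whole process.
		if err := cg.AddTask(cgroups.Process{Pid: tid}); err != nil {
			return err
		}
	}
	return nil
}
```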

Author

Sure, I can check that. What tests did you run?

Member

You can test with network benchmark tools such as netperf, or disk IO benchmark tools such as FIO or iozone. Try to send/receive with small chunks (e.g. 4k packets); for one VM with 1 cpu, compare with and without the "cpu cgroup of 1 cpu" constraint, and I believe you will see how different the performance is.

Author

You're right, performance is not good when moving the whole hypervisor. I have updated the pull request and now the performance is good, thanks.

[charts: cgroup1 and cgroup2 benchmark results]

current - current implementation
new - new implementation
no-cgroup - container without constraints

// first move all processes in subgroup to parent in case live process blocks
// deletion of cgroup
if err := s.cgroup.sandboxSub.MoveTo(s.cgroup.commonParent); err != nil {
return fmt.Errorf("failed to clear cgroup processes")
Member

In my previous tests, a left-over process (a zombie, or one removed too slowly) can block the cgroup's deletion, which leaves the cgroup dir on disk after the sandbox is destroyed.

Have you tested this case? I don't see similar code in deleteCgroups.

Author

I ran a simple test, and it seems like cgroups are removed properly.

Member

I noticed leftover cgroups occasionally before, without a specific test case.

I think you can emulate the behaviour by adding some other pid from the host OS to that cgroup; when you delete the container, the cgroup dir will be left behind. This is an easy emulation, not the real case.
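
A minimal sketch of the "drain, then delete" pattern being discussed here, assuming the containerd/cgroups v1 API (the function name is illustrative, not the merged code):

```go
package vc

import (
	"fmt"

	"github.com/containerd/cgroups"
)

// drainAndDelete moves every process out of the sandbox sub-cgroup into its
// parent before removing it; a left-over task inside the cgroup would make
// the rmdir fail and leak the cgroup directory on the host.
func drainAndDelete(sub, parent cgroups.Cgroup) error {
	if err := sub.MoveTo(parent); err != nil {
		return fmt.Errorf("failed to move sandbox cgroup processes to parent: %v", err)
	}
	return sub.Delete()
}
```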

Author

Yes, cgroups are removed. I filed an issue to test it once this PR lands.


virtcontainers/cgroups.go (resolved)

// CgroupPath is the cgroup hierarchy where sandbox's processes
// including the hypervisor are placed.
CgroupPath string `json:"cgroupPath,omitempty"`
Member

This can potentially break compatibility.

Suppose sandboxA is created by an old kata-runtime and runs for some time. After we upgrade kata-runtime and use the new kata-runtime to delete the old container, it will search the state.json file on disk and find nothing; the cgroup dir can then fail to be deleted and garbage is left behind.

That's why https://github.com/kata-containers/runtime/pull/883 is important.

The influence of this modification is not that big, so I can accept it for now.

@devimc force-pushed the topic/fixCpuCgroup branch from 2da5814 to 747636e on January 30, 2019 19:45

devimc commented Jan 30, 2019

@WeiZhang555 changes applied, thanks

@devimc force-pushed the topic/fixCpuCgroup branch from 747636e to df502f1 on January 31, 2019 21:08
func (c *Container) newCgroups() error {
ann := c.GetAnnotations()

config, ok := ann[annotations.ConfigJSONKey]
Member

We use the config JSON in many places; at some point it would be better to move it out of the annotations, as virtcontainers is not agnostic at all.

Member

Agree. We can definitely save the config in Container. That way we avoid many disk reads and JSON unmarshalings.
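
A sketch of that caching idea, with illustrative field and key names rather than the runtime's actual ones: unmarshal the spec from the annotation once and keep it on the Container.

```go
package vc

import (
	"encoding/json"
	"fmt"

	specs "github.com/opencontainers/runtime-spec/specs-go"
)

// Container is a stripped-down stand-in for the real type; only the fields
// needed for the example are shown.
type Container struct {
	annotations map[string]string
	ociSpec     *specs.Spec // cached after the first load
}

// spec returns the OCI spec stored in the annotations, unmarshalling it
// only once so later callers avoid the disk read and JSON decoding.
func (c *Container) spec() (*specs.Spec, error) {
	if c.ociSpec != nil {
		return c.ociSpec, nil
	}
	config, ok := c.annotations["example.config-json"] // illustrative key
	if !ok {
		return nil, fmt.Errorf("OCI config not found in annotations")
	}
	var spec specs.Spec
	if err := json.Unmarshal([]byte(config), &spec); err != nil {
		return nil, err
	}
	c.ociSpec = &spec
	return c.ociSpec, nil
}
```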

@jcvenegas left a comment

I already tested cpuset with k8s and docker and they work as I expected. I added a quick review; I'll do a second round later.

sandboxSub, err := parent.New(s.id, &specs.LinuxResources{})
// V1Constraints returns the cgroups that are compatible with the VC architecture
// and hypervisor, constraints can be applied to these cgroups.
func V1Constraints() ([]cgroups.Subsystem, error) {
Member

I don't fully get the naming of the functions. Is it because of cgroups v1? I'd prefer to remove that prefix if it does not add any information.


fields = strings.Split(text, " ")
// safe as mountinfo encodes mountpoints with spaces as \040.
index = strings.Index(text, " - ")
postSeparatorFields = strings.Fields(text[index+3:])
Member

Magic number


if numPostFields == 0 {
return "", fmt.Errorf("Found no fields post '-' in %q", text)
}
if postSeparatorFields[0] == "cgroup" {
Member

Index access without checking the length.


)
// this is an error as we can't detect if the mount is for "cgroup"
if numPostFields == 0 {
return "", fmt.Errorf("Found no fields post '-' in %q", text)
Member

Not really useful error (at least for me)

}
if postSeparatorFields[0] == "cgroup" {
// check that the mount is properly formatted.
if numPostFields < 3 {
Member

Magic number


if numPostFields < 3 {
return "", fmt.Errorf("Error found less than 3 fields post '-' in %q", text)
}
return filepath.Dir(fields[4]), nil
Member

Index access without checking the length.


return filepath.Dir(fields[4]), nil
}
}
return "", cgroups.ErrMountPointNotExist
Member

I wonder whether the cgroups API doesn't already provide a function like this.
Given this file, what string is it supposed to return?

cat /proc/self/mountinfo  | grep cgroup
27 20 0:24 / /sys/fs/cgroup ro,nosuid,nodev,noexec shared:4 - tmpfs tmpfs ro,seclabel,mode=755
28 27 0:25 / /sys/fs/cgroup/unified rw,nosuid,nodev,noexec,relatime shared:5 - cgroup2 cgroup2 rw,seclabel,nsdelegate
29 27 0:26 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:6 - cgroup cgroup rw,seclabel,xattr,name=systemd
33 27 0:30 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime shared:7 - cgroup cgroup rw,seclabel,cpu,cpuacct
34 27 0:31 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime shared:8 - cgroup cgroup rw,seclabel,freezer
35 27 0:32 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime shared:9 - cgroup cgroup rw,seclabel,blkio
36 27 0:33 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime shared:10 - cgroup cgroup rw,seclabel,cpuset
37 27 0:34 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime shared:11 - cgroup cgroup rw,seclabel,devices
38 27 0:35 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime shared:12 - cgroup cgroup rw,seclabel,perf_event
39 27 0:36 / /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime shared:13 - cgroup cgroup rw,seclabel,hugetlb
40 27 0:37 / /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime shared:14 - cgroup cgroup rw,seclabel,memory
41 27 0:38 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime shared:15 - cgroup cgroup rw,seclabel,net_cls,net_prio
42 27 0:39 / /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime shared:16 - cgroup cgroup rw,seclabel,pids
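
For the mountinfo above, the function returns the parent directory of the first cgroup v1 mount point, i.e. /sys/fs/cgroup (it takes filepath.Dir of the mount point field). Below is a hedged sketch of the same scan with named constants and explicit length checks, addressing the "magic number" and unchecked-index comments; it is not the merged code, and the function name is illustrative.

```go
package vc

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// Field layout of a /proc/self/mountinfo line (see proc(5)), e.g.:
// 33 27 0:30 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,... shared:7 - cgroup cgroup rw,cpu,cpuacct
//              ^ mount point (index 4)                             ^ fstype
const (
	mountPointFieldIdx = 4     // mount point, 5th field before the " - " separator
	minPostSepFields   = 3     // fstype, mount source, super options
	postSeparator      = " - " // separates per-mount fields from filesystem fields
)

// cgroupV1MountDir returns the directory that holds the cgroup v1
// hierarchies, e.g. "/sys/fs/cgroup" for the mountinfo shown above.
func cgroupV1MountDir() (string, error) {
	f, err := os.Open("/proc/self/mountinfo")
	if err != nil {
		return "", err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		idx := strings.Index(line, postSeparator)
		if idx < 0 {
			continue
		}
		fields := strings.Fields(line[:idx])
		post := strings.Fields(line[idx+len(postSeparator):])
		// Bounds-check both halves before indexing into them.
		if len(fields) <= mountPointFieldIdx || len(post) < minPostSepFields {
			continue
		}
		if post[0] == "cgroup" {
			// e.g. /sys/fs/cgroup/cpu,cpuacct -> /sys/fs/cgroup
			return filepath.Dir(fields[mountPointFieldIdx]), nil
		}
	}
	if err := scanner.Err(); err != nil {
		return "", err
	}
	return "", fmt.Errorf("no cgroup v1 mount found in /proc/self/mountinfo")
}
```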


@devimc force-pushed the topic/fixCpuCgroup branch from df502f1 to 37a8448 on February 5, 2019 16:17

devimc commented Feb 5, 2019

/test

@devimc force-pushed the topic/fixCpuCgroup branch 2 times, most recently from eb36e74 to 14e5ea3 on February 5, 2019 19:56

devimc commented Feb 5, 2019

/test


devimc commented Feb 5, 2019

ping @raravena80

@raravena80

@devimc did you mean to ping @WeiZhang555? 😄


devimc commented Feb 5, 2019

@raravena80 nope, this needs review 😄

@raravena80 left a comment

Looks fine to me, I assume it passed all tests. Did you address @WeiZhang555's comment? I believe I'm not an approver.


devimc commented Feb 5, 2019

@raravena80 thanks

@devimc force-pushed the topic/fixCpuCgroup branch from 14e5ea3 to ca4b9cf on February 5, 2019 21:10

devimc commented Feb 5, 2019

/test

@devimc force-pushed the topic/fixCpuCgroup branch from ca4b9cf to 498c7bc on February 6, 2019 18:11

devimc commented Feb 6, 2019

/test

@devimc force-pushed the topic/fixCpuCgroup branch 2 times, most recently from 89e8893 to 01c86c3 on February 6, 2019 21:49

devimc commented Feb 6, 2019

Shimv2 tests are failing with this change

STEP: stop PodSandbox
E0206 21:45:40.234455   28101 remote_runtime.go:115] StopPodSandbox "762c340ac84f4b554cfbee6b8fd058d9b7dfbd832dd30b77acf6e8b2431a4d54" from runtime service failed: rpc error: code = Unknown desc = failed to stop container "91cd574ea84883dc442e90fb04f1a0af1880437212bf53720825c087f2f3054f": failed to kill container "91cd574ea84883dc442e90fb04f1a0af1880437212bf53720825c087f2f3054f": Could not find the container "91cd574ea84883dc442e90fb04f1a0af1880437212bf53720825c087f2f3054f" from the sandbox "762c340ac84f4b554cfbee6b8fd058d9b7dfbd832dd30b77acf6e8b2431a4d54" containers list: unknown
STEP: delete PodSandbox
E0206 21:45:40.236155   28101 remote_runtime.go:229] StopContainer "91cd574ea84883dc442e90fb04f1a0af1880437212bf53720825c087f2f3054f" from runtime service failed: rpc error: code = Unknown desc = failed to kill container "91cd574ea84883dc442e90fb04f1a0af1880437212bf53720825c087f2f3054f": Could not find the container "91cd574ea84883dc442e90fb04f1a0af1880437212bf53720825c087f2f3054f" from the sandbox "762c340ac84f4b554cfbee6b8fd058d9b7dfbd832dd30b77acf6e8b2431a4d54" containers list: unknown
Feb  6 21:45:40.236: INFO: Unexpected error occurred: rpc error: code = Unknown desc = failed to kill container "91cd574ea84883dc442e90fb04f1a0af1880437212bf53720825c087f2f3054f": Could not find the container "91cd574ea84883dc442e90fb04f1a0af1880437212bf53720825c087f2f3054f" from the sandbox "762c340ac84f4b554cfbee6b8fd058d9b7dfbd832dd30b77acf6e8b2431a4d54" containers list: unknown
E0206 21:45:40.236587   28101 remote_runtime.go:132] RemovePodSandbox "762c340ac84f4b554cfbee6b8fd058d9b7dfbd832dd30b77acf6e8b2431a4d54" from runtime service failed: rpc error: code = Unknown desc = sandbox container "762c340ac84f4b554cfbee6b8fd058d9b7dfbd832dd30b77acf6e8b2431a4d54" is not fully stopped
[AfterEach] [k8s.io] Container
  /tmp/cri-install-crictl.sDoz/src/github.com/kubernetes-sigs/cri-tools/pkg/framework/framework.go:51


• Failure [67.015 seconds]
[k8s.io] Container
/tmp/cri-install-crictl.sDoz/src/github.com/kubernetes-sigs/cri-tools/pkg/framework/framework.go:72
  runtime should support basic operations on container
  /tmp/cri-install-crictl.sDoz/src/github.com/kubernetes-sigs/cri-tools/pkg/validate/container.go:68
    runtime should support stopping container [Conformance] [It]
    /tmp/cri-install-crictl.sDoz/src/github.com/kubernetes-sigs/cri-tools/pkg/validate/container.go:100

    Feb  6 21:45:40.229: stop container "91cd574ea84883dc442e90fb04f1a0af1880437212bf53720825c087f2f3054f" timeout.

I'm trying to debug it but I don't see logs in the journal. @lifupan any thoughts?

@devimc force-pushed the topic/fixCpuCgroup branch from 5dcce06 to 746f6ad on February 18, 2019 18:50

devimc commented Feb 18, 2019

/test


devimc commented Feb 18, 2019

ping @kata-containers/runtime

@devimc force-pushed the topic/fixCpuCgroup branch from 746f6ad to 107faaf on February 18, 2019 21:47

devimc commented Feb 18, 2019

/test


devimc commented Feb 19, 2019

@WeiZhang555 @lifupan @jcvenegas please take a look

// Add shim into cgroup
if c.process.Pid > 0 {
if err := cgroup.Add(cgroups.Process{Pid: c.process.Pid}); err != nil {
return fmt.Errorf("Could not add PID %d to cgroup %v: %v", c.process.Pid, spec.Linux.CgroupsPath, err)
Member

What about deleting the cgroup if the operation was not successful? In a defer, maybe?

Author

I'd prefer to leave it there and delete it once the delete command is used; otherwise we'll face some issues with tools that monitor the cgroups.

Member

In that case, I wonder whether we shouldn't delay c.state.CgroupPath = spec.Linux.CgroupsPath to the end? If we return early we don't save the path in the state.

Author

you're right, fixing, thanks
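
A small sketch of the agreed ordering, with illustrative (stripped-down) types: record the cgroup path in the container state only after the PID has been added successfully, so an early return does not persist a path for a cgroup that was never populated.

```go
package vc

import (
	"fmt"

	"github.com/containerd/cgroups"
)

// Process and State are stripped-down stand-ins for the real types.
type Process struct{ Pid int }
type State struct{ CgroupPath string }

type Container struct {
	process Process
	state   State
}

// addShimToCgroup adds the shim PID to the cgroup and records the cgroup
// path in the container state only after the Add succeeds.
func (c *Container) addShimToCgroup(cg cgroups.Cgroup, cgroupsPath string) error {
	if c.process.Pid > 0 {
		if err := cg.Add(cgroups.Process{Pid: c.process.Pid}); err != nil {
			return fmt.Errorf("Could not add PID %d to cgroup %v: %v",
				c.process.Pid, cgroupsPath, err)
		}
	}
	// Persist the path only once the cgroup is populated.
	c.state.CgroupPath = cgroupsPath
	return nil
}
```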

Julio Montes added 4 commits February 19, 2019 13:13
CPU cgroups are container-specific, hence all containers, even the sandbox,
should be able to create, delete and update their cgroups. The cgroup created
matches the cgroup path passed by the container manager.

fixes kata-containers#1117
fixes kata-containers#1118
fixes kata-containers#1021

Signed-off-by: Julio Montes <[email protected]>
All containers run in different cgroups, including the sandbox. With this new
implementation the sandbox CPU cgroup will be equal to the sum of all its
containers' cgroups, and the hypervisor process will be placed there, so its
cost is accounted against the containers running in the sandbox (VM). The
default number of vcpus is used when the sandbox has no constraints. For
example, if default_vcpus is 2, then the quota will be 200000 and the period 100000.

**c-ray test**
http://www.futuretech.blinkenlights.nl/c-ray.html

```
+=============================================+
|         | 6 threads 6cpus | 1 thread 1 cpu  |
+=============================================+
| current |   40 seconds    |   122 seconds   |
+=============================================+
|   new   |   37 seconds    |   124 seconds   |
+=============================================+
```

current = current cgroups implementation
new = new cgroups implementation

**workload**

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: c-ray
  annotations:
    io.kubernetes.cri.untrusted-workload: "true"
spec:
  restartPolicy: Never
  containers:
  - name: c-ray-1
    image: docker.io/devimc/c-ray:latest
    imagePullPolicy: IfNotPresent
    args: ["-t", "6", "-s", "1600x1200", "-r", "8", "-i",
          "/c-ray-1.1/sphfract", "-o", "/tmp/output.ppm"]
    resources:
      limits:
        cpu: 6
  - name: c-ray-2
    image: docker.io/devimc/c-ray:latest
    imagePullPolicy: IfNotPresent
    args: ["-t", "1", "-s", "1600x1200", "-r", "8", "-i",
          "/c-ray-1.1/sphfract", "-o", "/tmp/output.ppm"]
    resources:
      limits:
        cpu: 1
```

fixes kata-containers#1153

Signed-off-by: Julio Montes <[email protected]>
When a container is killed by force, the container MUST change its state to
stopped immediately to avoid leaving it in a bad state.

fixes kata-containers#1088

Signed-off-by: Julio Montes <[email protected]>
To avoid long timeouts, the runtime shouldn't try to talk with the proxy
when it's not running.

Signed-off-by: Julio Montes <[email protected]>
@devimc force-pushed the topic/fixCpuCgroup branch from 107faaf to 58d2785 on February 19, 2019 19:15
@jcvenegas

lgtm

@jcvenegas

/test

@GabyCT merged commit 60f7c4f into kata-containers:master on Feb 20, 2019

sboeuf commented Feb 20, 2019

Oh no... I wanted to review this, and I was expecting more approvals for this PR since it is a pretty important/big change...

// container was killed by force, container MUST change its state
// as soon as possible just in case one of below operations fail leaving
// the containers in a bad state.
if err := c.setContainerState(types.StateStopped); err != nil {

How is this related to the PR? I'm not saying it doesn't fix something, but it's been hidden inside a PR that is supposed to fix cgroups, not the lifecycle of a container. This is the type of change that can have a big impact and that should be reviewed standalone in its own PR.

@@ -1569,6 +1569,13 @@ func (k *kataAgent) sendReq(request interface{}) (interface{}, error) {
span.SetTag("request", request)
defer span.Finish()

if k.state.ProxyPid > 0 {

Same thing here, I don't think this is related to cgroups, and it should not be part of the same PR.

}

func (q *qemu) pid() int {
data, err := ioutil.ReadFile(q.pidFile())

The PID is stored in its own file? Don't you think we could improve this by making the PID part of the qemu state structure? It would be saved as part of the state.json file associated with the VM, and it would avoid creating yet another file.
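
A hedged sketch of the alternative being suggested, with illustrative struct and field names: keep the PID in qemu's persisted state, and fall back to the pid file only when the state has no PID recorded.

```go
package vc

import (
	"io/ioutil"
	"strconv"
	"strings"
)

// QemuState is an illustrative stand-in for qemu's persisted state,
// saved alongside the rest of state.json.
type QemuState struct {
	Pid int `json:"pid,omitempty"`
}

// qemu is a stripped-down stand-in for the real hypervisor type.
type qemu struct {
	state   QemuState
	pidPath string
}

func (q *qemu) pidFile() string { return q.pidPath }

// pid prefers the persisted state and falls back to the pid file only for
// VMs created before the PID was recorded in the state.
func (q *qemu) pid() int {
	if q.state.Pid > 0 {
		return q.state.Pid
	}
	data, err := ioutil.ReadFile(q.pidFile())
	if err != nil {
		return 0
	}
	pid, err := strconv.Atoi(strings.TrimSpace(string(data)))
	if err != nil {
		return 0
	}
	return pid
}
```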


Successfully merging this pull request may close these issues.

[RFC] [Proposal] support cgroups in the host
7 participants