virtcontainers: reimplement sandbox cgroup #1189
Conversation
Force-pushed from 009a12a to 2da5814
@WeiZhang555 please take a look
@jcvenegas please take a look
/test
Thanks for the effort! I left some comments, great work!
for _, i := range tids.vcpus {
    if i <= 0 {
        continue
    }
Adding only the vCPU threads is an optimization that can benefit small-chunk IO performance, according to our previous tests.
It's OK to remove this if it makes the handling logic too complicated; I just wanted to share the performance tuning result with the community.
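For illustration, a minimal sketch of that optimization, assuming a containerd/cgroups Cgroup handle like the ones used elsewhere in this PR and a tids.vcpus-style slice of vCPU thread IDs reported by the hypervisor (the real code may use a thread-level API rather than Add):

```go
package vcexample

import (
    "fmt"

    "github.com/containerd/cgroups"
)

// addVCPUThreads places only the hypervisor's vCPU threads into the sandbox
// CPU cgroup, skipping invalid thread IDs, so CPU constraints throttle vCPU
// work without also throttling the hypervisor's I/O helper threads.
func addVCPUThreads(cg cgroups.Cgroup, vcpuTIDs []int) error {
    for _, tid := range vcpuTIDs {
        if tid <= 0 {
            continue
        }
        if err := cg.Add(cgroups.Process{Pid: tid}); err != nil {
            return fmt.Errorf("could not add vCPU thread %d to cgroup: %v", tid, err)
        }
    }
    return nil
}
```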
Sure, I can check that. What tests did you run?
You can test with network benchmark tools such as netperf, or disk IO benchmark tools such as FIO or iozone. Try to send/receive small chunks (e.g. 4k packets). For a VM with 1 CPU, compare with and without the "cpu cgroup of 1 CPU constraint"; I believe you will see how the performance differs.
// first move all processes in subgroup to parent in case live process blocks
// deletion of cgroup
if err := s.cgroup.sandboxSub.MoveTo(s.cgroup.commonParent); err != nil {
    return fmt.Errorf("failed to clear cgroup processes")
In my previous tests, a left-over process (a zombie, or one removed too slowly) can block the cgroup's deletion, which leaves the cgroup directory on disk after the sandbox is destroyed.
Have you tested this case? I don't see similar code in deleteCgroups.
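For reference, a minimal sketch of the clean-up order being suggested, assuming containerd/cgroups handles like the sandboxSub/commonParent ones quoted above (illustrative, not the PR's deleteCgroups code):

```go
package vcexample

import (
    "fmt"

    "github.com/containerd/cgroups"
)

// deleteSandboxCgroup first moves any remaining tasks to the parent cgroup so
// a zombie or slow-to-exit process cannot block removal, then deletes the
// (now empty) child cgroup.
func deleteSandboxCgroup(sandboxSub, commonParent cgroups.Cgroup) error {
    if err := sandboxSub.MoveTo(commonParent); err != nil {
        return fmt.Errorf("failed to clear cgroup processes: %v", err)
    }
    return sandboxSub.Delete()
}
```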
I ran a simple test, and it seems the cgroups are removed properly.
I noticed a leftover cgroup occasionally before, without a specific test case.
I think you can emulate the behaviour by adding some other PID from the host OS to that cgroup; when you delete the container, the cgroup directory will be left behind. This is an easy emulation, not the real case.
Yes, cgroups are removed. I filed an issue to test this once this PR lands.
// CgroupPath is the cgroup hierarchy where sandbox's processes
// including the hypervisor are placed.
CgroupPath string `json:"cgroupPath,omitempty"`
This can potentially break compatibility.
Suppose sandboxA is created by an old kata-runtime and runs for some time. After we upgrade kata-runtime and use the new kata-runtime to delete the old container, it will search for the state.json file on disk and find nothing, so the cgroup directory can fail to be deleted and garbage is left behind.
That's why https://github.com/kata-containers/runtime/pull/883 is important.
The impact of this modification is not that big, so I can accept it for now.
Force-pushed from 2da5814 to 747636e
@WeiZhang555 changes applied, thanks
Force-pushed from 747636e to df502f1
func (c *Container) newCgroups() error {
    ann := c.GetAnnotations()

    config, ok := ann[annotations.ConfigJSONKey]
We use the config JSON in many places; at some point it would be better to move it out of the annotations, virtcontainers is not agnostic at all.
Agree. We can definitely save the config in Container. This way we can avoid many disk reads and JSON unmarshalings.
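For context, a sketch of the pattern being discussed; the function name is hypothetical and the annotation key is passed in rather than quoting the real annotations.ConfigJSONKey value:

```go
package vcexample

import (
    "encoding/json"
    "fmt"

    specs "github.com/opencontainers/runtime-spec/specs-go"
)

// ociSpecFromAnnotations decodes the OCI spec that the container manager
// stores in the annotations. Today every consumer repeats this unmarshal;
// caching the decoded spec on the Container, as suggested above, would avoid
// the extra disk reads and JSON decoding.
func ociSpecFromAnnotations(ann map[string]string, configJSONKey string) (*specs.Spec, error) {
    config, ok := ann[configJSONKey]
    if !ok {
        return nil, fmt.Errorf("OCI config not found in annotations")
    }
    var spec specs.Spec
    if err := json.Unmarshal([]byte(config), &spec); err != nil {
        return nil, fmt.Errorf("could not unmarshal OCI spec: %v", err)
    }
    return &spec, nil
}
```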
I already tested cpuset with k8s and docker and they work as I expected. I added a quick review; I'll do a second round later.
sandboxSub, err := parent.New(s.id, &specs.LinuxResources{})
// V1Constraints returns the cgroups that are compatible with the VC architecture
// and hypervisor; constraints can be applied to these cgroups.
func V1Constraints() ([]cgroups.Subsystem, error) {
I don't fully get the naming of the functions; is it because of cgroups v1? I'd prefer to remove that prefix if it doesn't add any information.
This implementation is based on containerd's cgroups package: https://github.com/kata-containers/runtime/blob/master/vendor/github.com/containerd/cgroups/v1.go#L28
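As an illustration of what the V1 prefix refers to (cgroups v1 subsystems), here is a sketch in the style of the linked containerd code; the fixed mount point and the subsystem list are assumptions for the example:

```go
package vcexample

import (
    "os"

    "github.com/containerd/cgroups"
)

// v1ConstraintsSketch returns cgroups v1 subsystems that constraints can be
// applied to. The real implementation discovers the mount point from
// /proc/self/mountinfo instead of hard-coding it.
func v1ConstraintsSketch() ([]cgroups.Subsystem, error) {
    root := "/sys/fs/cgroup" // assumed cgroups v1 mount point
    if _, err := os.Stat(root); err != nil {
        return nil, err
    }
    return []cgroups.Subsystem{
        cgroups.NewCpuset(root),
        cgroups.NewCpu(root),
        cgroups.NewCpuacct(root),
    }, nil
}
```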
fields = strings.Split(text, " ")
// safe as mountinfo encodes mountpoints with spaces as \040.
index = strings.Index(text, " - ")
postSeparatorFields = strings.Fields(text[index+3:])
Magic number
Copy & paste of a containerd private function: https://github.com/kata-containers/runtime/blob/master/vendor/github.com/containerd/cgroups/v1.go#L49
if numPostFields == 0 {
    return "", fmt.Errorf("Found no fields post '-' in %q", text)
}
if postSeparatorFields[0] == "cgroup" {
Index access without checking the length.
)
// this is an error as we can't detect if the mount is for "cgroup"
if numPostFields == 0 {
    return "", fmt.Errorf("Found no fields post '-' in %q", text)
Not a really useful error message (at least for me).
}
if postSeparatorFields[0] == "cgroup" {
    // check that the mount is properly formatted.
    if numPostFields < 3 {
Magic number
if numPostFields < 3 {
    return "", fmt.Errorf("Error found less than 3 fields post '-' in %q", text)
}
return filepath.Dir(fields[4]), nil
Index access without checking the length.
        return filepath.Dir(fields[4]), nil
    }
}
return "", cgroups.ErrMountPointNotExist
I wonder if the cgroups API doesn't already provide a function like this.
Given this file, what string is it supposed to return?
cat /proc/self/mountinfo | grep cgroup
27 20 0:24 / /sys/fs/cgroup ro,nosuid,nodev,noexec shared:4 - tmpfs tmpfs ro,seclabel,mode=755
28 27 0:25 / /sys/fs/cgroup/unified rw,nosuid,nodev,noexec,relatime shared:5 - cgroup2 cgroup2 rw,seclabel,nsdelegate
29 27 0:26 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:6 - cgroup cgroup rw,seclabel,xattr,name=systemd
33 27 0:30 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime shared:7 - cgroup cgroup rw,seclabel,cpu,cpuacct
34 27 0:31 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime shared:8 - cgroup cgroup rw,seclabel,freezer
35 27 0:32 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime shared:9 - cgroup cgroup rw,seclabel,blkio
36 27 0:33 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime shared:10 - cgroup cgroup rw,seclabel,cpuset
37 27 0:34 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime shared:11 - cgroup cgroup rw,seclabel,devices
38 27 0:35 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime shared:12 - cgroup cgroup rw,seclabel,perf_event
39 27 0:36 / /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime shared:13 - cgroup cgroup rw,seclabel,hugetlb
40 27 0:37 / /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime shared:14 - cgroup cgroup rw,seclabel,memory
41 27 0:38 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime shared:15 - cgroup cgroup rw,seclabel,net_cls,net_prio
42 27 0:39 / /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime shared:16 - cgroup cgroup rw,seclabel,pids
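To make the question concrete: for the output above, the helper is expected to return "/sys/fs/cgroup". Below is a sketch of the same parsing with the bounds checks requested in the comments (not the PR's exact code):

```go
package vcexample

import (
    "bufio"
    "io"
    "path/filepath"
    "strings"

    "github.com/containerd/cgroups"
)

// cgroupV1MountPoint scans mountinfo for the first cgroup (v1) mount and
// returns its parent directory, e.g. "/sys/fs/cgroup" for the sample above.
func cgroupV1MountPoint(mountinfo io.Reader) (string, error) {
    scanner := bufio.NewScanner(mountinfo)
    for scanner.Scan() {
        text := scanner.Text()
        // Splitting on spaces is safe: mountinfo encodes spaces in
        // mount points as \040.
        fields := strings.Split(text, " ")
        if len(fields) < 5 {
            continue
        }
        // Everything after " - " is the filesystem type and its options.
        sep := strings.Index(text, " - ")
        if sep < 0 {
            continue
        }
        post := strings.Fields(text[sep+len(" - "):])
        if len(post) == 0 || post[0] != "cgroup" {
            continue
        }
        // Field 5 (index 4) is the mount point, e.g. /sys/fs/cgroup/cpuset.
        return filepath.Dir(fields[4]), nil
    }
    if err := scanner.Err(); err != nil {
        return "", err
    }
    return "", cgroups.ErrMountPointNotExist
}
```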
Force-pushed from df502f1 to 37a8448
/test
Force-pushed from eb36e74 to 14e5ea3
/test
ping @raravena80
@devimc did you mean to ping @WeiZhang555? 😄
@raravena80 nope, this needs review 😄
Looks fine to me, I assume it passed all tests. Did you address @WeiZhang555's comment? I believe I'm not an approver.
@raravena80 thanks
Force-pushed from 14e5ea3 to ca4b9cf
/test
Force-pushed from ca4b9cf to 498c7bc
/test
Force-pushed from 89e8893 to 01c86c3
Shimv2 tests are failing with this change. I'm trying to debug it, but I don't see logs in the journal. @lifupan any thoughts?
Force-pushed from 5dcce06 to 746f6ad
/test
ping @kata-containers/runtime
Force-pushed from 746f6ad to 107faaf
/test
@WeiZhang555 @lifupan @jcvenegas please take a look
// Add shim into cgroup
if c.process.Pid > 0 {
    if err := cgroup.Add(cgroups.Process{Pid: c.process.Pid}); err != nil {
        return fmt.Errorf("Could not add PID %d to cgroup %v: %v", c.process.Pid, spec.Linux.CgroupsPath, err)
What about deleting the cgroup if the operation was not successful? In a defer, maybe?
I'd prefer to leave it there and delete it once the delete command is used; otherwise we'll face some issues with tools that monitor the cgroups.
In that case, I wonder if we shouldn't delay c.state.CgroupPath = spec.Linux.CgroupsPath to the end? If we return early, we don't save the path in the state.
You're right, fixing it, thanks.
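A sketch of the ordering agreed on here (the types below are minimal stand-ins for the real virtcontainers structs): create the cgroup, add the shim, and only then record the path in the state, so an early return never stores a path that was not fully set up.

```go
package vcexample

import (
    "fmt"

    "github.com/containerd/cgroups"
    specs "github.com/opencontainers/runtime-spec/specs-go"
)

type containerState struct {
    CgroupPath string
}

type container struct {
    shimPid int
    state   containerState
}

// newCgroup creates the container cgroup, adds the shim process to it and,
// only once both steps have succeeded, records the cgroup path in the state.
func (c *container) newCgroup(cgroupsPath string) error {
    cg, err := cgroups.New(cgroups.V1, cgroups.StaticPath(cgroupsPath), &specs.LinuxResources{})
    if err != nil {
        return fmt.Errorf("could not create cgroup %s: %v", cgroupsPath, err)
    }

    if c.shimPid > 0 {
        if err := cg.Add(cgroups.Process{Pid: c.shimPid}); err != nil {
            return fmt.Errorf("could not add PID %d to cgroup %s: %v", c.shimPid, cgroupsPath, err)
        }
    }

    // Saved last, as discussed above.
    c.state.CgroupPath = cgroupsPath
    return nil
}
```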
CPU cgroups are container specific, hence all containers, even the sandbox, should be able to create, delete and update their cgroups. The cgroup created matches the cgroup path passed by the container manager.

fixes kata-containers#1117
fixes kata-containers#1118
fixes kata-containers#1021

Signed-off-by: Julio Montes <[email protected]>
All containers run in different cgroups, even the sandbox. With this new implementation the sandbox CPU cgroup will be equal to the sum of all its containers, and the hypervisor process will be placed there, impacting the containers running in the sandbox (VM). The default number of vcpus is used when the sandbox has no constraints. For example, if default_vcpus is 2, then quota will be 200000 and period 100000.

**c-ray test** http://www.futuretech.blinkenlights.nl/c-ray.html

```
+=========+==================+================+
|         | 6 threads 6cpus  | 1 thread 1 cpu |
+=========+==================+================+
| current | 40 seconds       | 122 seconds    |
+=========+==================+================+
| new     | 37 seconds       | 124 seconds    |
+=========+==================+================+
```

current = current cgroups implementation
new = new cgroups implementation

**workload**

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: c-ray
  annotations:
    io.kubernetes.cri.untrusted-workload: "true"
spec:
  restartPolicy: Never
  containers:
  - name: c-ray-1
    image: docker.io/devimc/c-ray:latest
    imagePullPolicy: IfNotPresent
    args: ["-t", "6", "-s", "1600x1200", "-r", "8", "-i", "/c-ray-1.1/sphfract", "-o", "/tmp/output.ppm"]
    resources:
      limits:
        cpu: 6
  - name: c-ray-2
    image: docker.io/devimc/c-ray:latest
    imagePullPolicy: IfNotPresent
    args: ["-t", "1", "-s", "1600x1200", "-r", "8", "-i", "/c-ray-1.1/sphfract", "-o", "/tmp/output.ppm"]
    resources:
      limits:
        cpu: 1
```

fixes kata-containers#1153

Signed-off-by: Julio Montes <[email protected]>
If a container is killed by force, the container MUST change its state to stopped immediately to avoid leaving it in a bad state.

fixes kata-containers#1088

Signed-off-by: Julio Montes <[email protected]>
To avoid long timeouts, the runtime shouldn't try to talk with the proxy when it's not running.

Signed-off-by: Julio Montes <[email protected]>
Force-pushed from 107faaf to 58d2785
lgtm
/test
Oh no... I wanted to review this, and I was expecting more approvals for this PR, since it is a pretty important/big change...
// container was killed by force, container MUST change its state
// as soon as possible just in case one of below operations fail leaving
// the containers in a bad state.
if err := c.setContainerState(types.StateStopped); err != nil {
How is this related to the PR? I'm not saying it doesn't fix something, but it's been hidden inside a PR that is supposed to fix cgroups, not the lifecycle of a container. This is the type of change that can have a big impact and that would be good to review standalone in its own PR.
@@ -1569,6 +1569,13 @@ func (k *kataAgent) sendReq(request interface{}) (interface{}, error) {
    span.SetTag("request", request)
    defer span.Finish()

    if k.state.ProxyPid > 0 {
Same thing here: I don't think this is related to cgroups, and it should not be part of the same PR.
}

func (q *qemu) pid() int {
    data, err := ioutil.ReadFile(q.pidFile())
The PID is stored in its own file? Don't you think we could improve this by making the PID part of the qemu.state structure? It would be saved as part of the state.json file associated with the VM, and it would avoid creating yet another file.
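For reference, a sketch of the file-based approach being questioned (a standalone stand-in, not the qemu method itself):

```go
package vcexample

import (
    "io/ioutil"
    "strconv"
    "strings"
)

// pidFromFile reads the PID file written via QEMU's -pidfile option and
// parses it, returning 0 if the file is missing or malformed.
func pidFromFile(path string) int {
    data, err := ioutil.ReadFile(path)
    if err != nil {
        return 0
    }
    pid, err := strconv.Atoi(strings.TrimSpace(string(data)))
    if err != nil {
        return 0
    }
    return pid
}
```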
All containers run in different cgroups, even the sandbox. With this new
implementation the sandbox CPU cgroup will be equal to the sum of all its
containers, and the hypervisor process will be placed there, impacting the
containers running in the sandbox (VM). The default number of vcpus is
used when the sandbox has no constraints. For example, if default_vcpus
is 2, then quota will be 200000 and period 100000.
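A minimal sketch of the quota calculation stated above, assuming the standard 100 ms CFS period (names are illustrative, not the PR's code):

```go
package vcexample

// cfsPeriodUs is the default CFS period in microseconds (100 ms).
const cfsPeriodUs = 100000

// sandboxCPUQuota returns the CPU quota and period for a sandbox with no
// constraints: quota is the default vCPU count times the CFS period, so
// default_vcpus=2 gives quota=200000 and period=100000.
func sandboxCPUQuota(defaultVCPUs int64) (quota int64, period uint64) {
    return defaultVCPUs * cfsPeriodUs, cfsPeriodUs
}
```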
c-ray test
http://www.futuretech.blinkenlights.nl/c-ray.html
current = current cgroups implementation
new = new cgroups implementation
workload
fixes #1125
Signed-off-by: Julio Montes [email protected]