-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
runc 1.1.15 OOMs in Kubernetes e2e tests with containerd, cgroup v2, and cgroupfs driver #4427
Comments
I'm not sure why it works with containerd 1.7, but this OOM issue with Kubernetes e2e tests is something we know about. This is caused by #4392 (backport of #3931) which removed the code we added in #1984 to fix the same failing Kubernetes e2e test (reported in #1980). This code was removed because the bindfd logic is not as secure as using memfds and was causing significant performance problems on systems with lots of container churn. When I re-investigated the e2e issue (when working on #3931) I figured out that the OOM is happening because Kubernetes/kubelet/runc(?) is being run in a cgroup with a very low memory limit (this is not caused by the container memory limit as incorrectly claimed in #1980 because the extra memory used by memfds is never accounted by the container cgroup). My suggestion is to change the Kubernetes e2e test to run Kubernetes in a cgroup with a more reasonable memory limit. I'm not quite sure what the purpose of a test where you run the entire Kubernetes/kubelet stack with such a tiny memory limit is -- sure, containers with a small memory limit makes some sense, but why run the management code in such a constrained setup? |
Do you have a link to where containerd or runc is being constrained to a low cgroup memory limit? containerd is run as a normal systemd unit with no memory limit and is itself responsible for directly launching runc (so runc should start inside containerd's own cgroup). |
I'm not super familiar with Kubernetes e2e tests, but I believe func runOomKillerTest(f *framework.Framework, testCase testCase, kubeReservedMemory float64) {
ginkgo.Context(testCase.name, func() {
// Update KubeReservedMemory in KubeletConfig.
if kubeReservedMemory > 0 {
tempSetCurrentKubeletConfig(f, func(ctx context.Context, initialConfig *kubeletconfig.KubeletConfiguration) {
if initialConfig.KubeReserved == nil {
initialConfig.KubeReserved = map[string]string{}
}
// There's a race condition observed between system OOM and cgroup OOM if node alocatable
// memory is equal to node capacity. Hence, reserving a fraction of node's memory capacity for
// K8s components such that node allocatable memory is less than node capacity to
// observe OOM kills at cgroup level instead of system OOM kills.
initialConfig.KubeReserved["memory"] = fmt.Sprintf("%d", int(kubeReservedMemory*getLocalNode(context.TODO(), f).Status.Capacity.Memory().AsApproximateFloat64()))
})
}
/* ... */
} The issue here is not the container memory limit, because we know that cannot cause this error (the kernel allocates memory against memcgs based on the cgroup you were in when the memory was allocated -- our |
I don't believe that's correct. My understanding is that |
Yes, kube-reserved memory was used to calculate the node allocatable resources. Kubelet doesn't actually set the cgroup values for it. |
Yes, I saw your explanation before and that makes sense. I have also validated that our CI hosts are running cgroup v2. But I don't think we understand why it is failing; as far as I can tell there are no constraints on the cgroup runc is running in. |
It's EOD here and I'm just getting some progress with this. I'll expand more tomorrow, including the repro I found. It's still not 100% clear to me what is causing this. I can fix the k8s tests locally with this patch to the k8s repo (github.com/kubernetes/kubernetes): diff --git test/e2e_node/oomkiller_linux_test.go test/e2e_node/oomkiller_linux_test.go
index db84cb0f0cf..ebc99abfa19 100644
--- test/e2e_node/oomkiller_linux_test.go
+++ test/e2e_node/oomkiller_linux_test.go
@@ -108,7 +108,8 @@ func runOomKillerTest(f *framework.Framework, testCase testCase, kubeReservedMem
// memory is equal to node capacity. Hence, reserving a fraction of node's memory capacity for
// K8s components such that node allocatable memory is less than node capacity to
// observe OOM kills at cgroup level instead of system OOM kills.
- initialConfig.KubeReserved["memory"] = fmt.Sprintf("%d", int(kubeReservedMemory*getLocalNode(context.TODO(), f).Status.Capacity.Memory().AsApproximateFloat64()))
+ //initialConfig.KubeReserved["memory"] = fmt.Sprintf("%d", int(kubeReservedMemory*getLocalNode(context.TODO(), f).Status.Capacity.Memory().AsApproximateFloat64()))
+ initialConfig.KubeReserved["memory"] = "100Mi"
})
}
I can also make the tests pass changing the memory limits of them from 15Mi to 20Mi. I'm not sure why is that. I guess that the kubelet makes sure that the cgroup it is in has at least the mem limit and that is why that approach also works. However, it is not 100% clear to me yet what the issue is nor what the right fix is. I've also isolated the config.json created for the container (k8s uses two containers, the pause container and the real container, I'm just using the real container as the pause container doesn't have any limits nor anything interesting in the config) and then run it with runc. However, the container runs just fine when runnning it with runc. Due to that, I suspect the issue is as @cyphar was saying, something about the setup that allocates too little mem for the KubeReserved. If I allocate a static "100Mi" for it, I see it pass locally, and if I allocate too litle (like 10Mi), I see it fail for all tests. Please note, though, that when it fails it shows the same error than in the containerd issue comment:
Therefore, what is failing in the test is the exit code. It is expecting 137 but gets 128. When the kernel kills runc[init] it seems it gets 128, but when it kills the "dd" process running in the container, it gets the right exit code. Indeed, it seems the two test that fail locally are because runc[INIT] is being killed by a OOM in a cgroup. It correlates perfectly with what I see in the kernel logs locally:
So, two times the runc init has been killed by the cgroup OOM, the two tests that fail. The other OOM I see in my logs is for the "dd" process (that is the one allocating memory) and that is passing (killing dd with oom gets 137 and not 128 exit code, that makes the test pass). So, it seems the k8s test is setting-up a cgroup with very low mem for runc to run, this is probably due to the KubeReserved thingy, as changing that to a higher number makes the tests pass just fine here. The current allocation is a percentage of the node capacity, as @cyphar pointed out: https://github.com/kubernetes/kubernetes/blob/95ec69c16c76b5ee71fdbebc7a5dea2c39341eb3/test/e2e_node/oomkiller_linux_test.go#L111:
My gut feeling is that the VMs running the CI for containerd 1.6 have less mem than the ones for 1.7 and main, and this calculation gets too low with that VM and the new runc binary is slightly bigger and makes it fail when it seals its on-memory copy of the binary (to fix a CVE, we really want that :)). @samuelkarp some questions for you:
My gut feeling is that this calculation based on percentage is not very good, we should at least add a max between that number and 100Mi or so. But I'd check out later what the right fix might be for this, I'm too tired now :-D |
I found the relevant difference between the containerd 1.6 CI and 1.7 + main CI is which cgroup driver is in use: cgroupfs vs. systemd. When moving containerd 1.6 to use the systemd cgroup driver, the tests no longer fail. |
@samuelkarp +1 to use systemd cgroup driver consistently! |
@dims this isn't quite about consistency. When both Kubelet and containerd are configured to use cgroupfs, the test fails. When both Kubelet and containerd are configured to use the systemd cgroup driver, the test passes. I didn't try to test a mixed mode (I don't think we'd consider that to be supported). |
@samuelkarp ah that sounds bad. |
Using different driver on container runtime and kubelet will not work, as we have seen errors in kubernetes tests when done like that. |
If you have time to share the information you found, that would very much be appreciated 😄.
All the jobs are running on GCE e2-medium VMs with 1 vCPU (2 fractional vCPUs) and 4 GiB of memory.
I'll see if we can find answers for these; I'm a containerd maintainer not a Kubernetes maintainer so I'm less familiar with how these tests are run (through the Kubernetes test infrastructure, and Kubernetes test source code). But the VM sizes are the same for 1.6, 1.7, and main, so they should all have the same kubeReserved calculation. I'm not sure if you saw above, but I did find a difference:
We don't currently have cgroup v1 test coverage in the Kubernetes e2e/containerd CI jobs; fixing that is a separate issue. Does runc have different codepaths depending on whether cgroupfs or the systemd cgroup driver is in use? I see a
Yep, I am familiar and agree that the CVE should remain mitigated. I also am familiar with the impact to systemd for the bindfd mitigation, so I understand why that's undesirable as well. |
We should join the cgroup after the initial setup finished, but before runc init clone new children processes. (opencontainers#4427) Because we should try our best to reduce the influence of memory cgroup accounting from all runc init processes before we start the container init process. Signed-off-by: lifubang <[email protected]>
Yes, if this can be changed in kubernetes e2e test, it will be better. The core reason is that the memory accounting should not include Please consider whether this is worth to draft a new release 1.1.16 or not? |
We should join the cgroup after the initial setup finished, but before runc init clone new children processes. (opencontainers#4427) Because we should try our best to reduce the influence of memory cgroup accounting from all runc init processes before we start the container init process. Signed-off-by: lifubang <[email protected]> (cherry picked from commit 4f6000e) Signed-off-by: lifubang <[email protected]>
Could you help to test #4439 in containerd 1.6? I have test it in my machine, it works fine:
root@acmcoder:/opt/ubuntu# cat config.json | jq .process.args
[
"cat",
"/sys/fs/cgroup/memory.peak",
"/sys/fs/cgroup/memory.max",
"/sys/fs/cgroup/cgroup.procs"
]
root@acmcoder:/opt/ubuntu# cat config.json | jq .linux.resources.memory
{
"limit": 10485760
} So, the first line of output from the container is the peak usage of memory,
root@acmcoder:/opt/ubuntu# ./runc.1.1.15 -v
runc version 1.1.15
commit: v1.1.15-0-gbc20cb44
spec: 1.0.2-dev
go: go1.22.3
libseccomp: 2.5.5
root@acmcoder:/opt/ubuntu# ./runc.1.1.15 run test
5681152
10485760
1
root@acmcoder:/opt/ubuntu# ./runc.1.1.15 run test
2572288
10485760
1
root@acmcoder:/opt/ubuntu# ./runc.1.1.15 run test
8728576
10485760
1
root@acmcoder:/opt/ubuntu# ./runc.1.1.15 run test
5181440
10485760
1
root@acmcoder:/opt/ubuntu# ./runc.1.1.15 run test
11657216
10485760
1
# peak > max
root@acmcoder:/opt/ubuntu# ./runc.1.1.15 run test
2629632
10485760
1
root@acmcoder:/opt/ubuntu# ./runc.1.1.15 run test
11231232
10485760
1
# peak > max
root@acmcoder:/opt/ubuntu# ./runc.1.1-dev -v
runc version 1.1.15+dev
commit: v1.1.13-40-gea5f8e04
spec: 1.0.2-dev
go: go1.23.2
libseccomp: 2.5.5
root@acmcoder:/opt/ubuntu# ./runc.1.1-dev run test
2789376
10485760
1
root@acmcoder:/opt/ubuntu# ./runc.1.1-dev run test
2916352
10485760
1
root@acmcoder:/opt/ubuntu# ./runc.1.1-dev run test
3239936
10485760
1
root@acmcoder:/opt/ubuntu# ./runc.1.1-dev run test
2711552
10485760
1
root@acmcoder:/opt/ubuntu# ./runc.1.1-dev run test
2662400
10485760
1
root@acmcoder:/opt/ubuntu# ./runc.1.1-dev run test
3051520
10485760
1
root@acmcoder:/opt/ubuntu# ./runc.1.1-dev run test
2797568
10485760
1
root@acmcoder:/opt/ubuntu# ./runc.1.1-dev run test
3293184
10485760
1
root@acmcoder:/opt/ubuntu# ./runc.1.1-dev run test
2789376
10485760
1 |
And I think there is no this situation in the main branch, because the binary clone operation has been moved from |
The test with the #4439 fix has passed for both cgroupfs and systemd |
@samuelkarp @akhilerm @lifubang Great findings, thanks! I think I found a bug in the PR @lifubang created (commented there). But when trying to fix it, I think I might have found another bug that was there for a while. More details and links below :) Repro the k8s test failure locallyBut first things first. Let me share my setup to run the kubernetes tests locally, as it is definitely not trivial to find (or it wasn't for me :D). Checkout the containerd repo, start containerd running as root. I'm just doing this:
The config doesn't seem to be special. I'm using version 2 and just changed from the default (generated with
I tried chaning the systemd_cgroup to true/false, but it didn't seem to make a difference. My config is here, just in case: https://paste.debian.net/1331999/ I'm using containerd in this commit 906c23218c21373961bae0665a1dec6b89852fc2. Although I don't think this is relevant either. Then, checkout the kubernetes repo (github.com/kubernetes/kuberentes). I'm using this commit, although it doesn't seem to matter either: d9c46d8ecb1ede9be30545c9803e17682fcc4b50. Then, do this from the repo root:
Here it doesn't fail reliably (but it never takes more than 3 tries to hit the fail) and that is what confused me in the previous message. In the CI does fail/pass reliably, but not locally for me, which I didn't expect. Also, while debugging the issue locally, I created a bash script to wrap runc. I printed the content of The runc-wrapper script is just this simple script, just in case:
Fixing this issueIn my previous message I thought the kubelet was creating a cgroup to run runc and that was failing. It seems the kubelet can do that (for the kubeReserved, etc.) but it is not doing it in the tests that fail. The fact that fails reliably on CI and not locally, plus that it was kind of late here, confused me the other day. Let's see if this time it is better, as it's also getting late now :D. @lifubang great findings! I think that fix sounds reasonable at first sight, but it might be introducing a bug (explained it here). I've experimented with a way to fix it here: https://github.com/rata/runc/commits/runc-1.1-mem-issue/. EDIT: The race doesn't seem possible, because nsexec does another read (a lot of them!) before exec'ing into the user process. So this is safe, the comment in the code is about applying the cgroup before sending the bootstrapData, so we sync on that. Sorry for the noise, I leave below what race I thought there was. While doing it, I realized we are not waiting to join the cgroup for the setns case (we were only doing it for the initProcess case). In other words, on the setns case, the nsexec.c might continue and create the process before we apply it to the cgroup (we are not syncing on that, although it is unliklely to win that race in practice). My branch there changes that too. It seems to me this lack of sync was on purpose, but I can't see why now. Any thoughts? In fact, I'd need to check if in (*initProcess).start(), the var If that is the case, though, my branch is fixing it in any case, as we are doing a read on nsexec now (that will block until there is something to read). My branch is a work in progress, not ready for a PR yet. But enough to play and experiment further on this. Am I missing something? @cyphar can you take a look at my branch regarding the nsexec races I explained above? It's not clear to me yet why on CI only the cgroupfs driver seems to be affected, if anyone have any thoughts on this, please let me know :) Some TODOs so I don't forget:
|
Unfortunately, #4399 will influence runc's performance, please see #4439 (comment) 100 containers test result: But we should know that patch 4399 has the same start time usage with In another words, we should also have to find a solution to reduce the runc start time usage for |
I strongly recommend to change the Kubernetes e2e test to run Kubernetes in a cgroup with a more reasonable memory limit. For example: 20M. |
@rata is |
@tianon great catch! It's a typo, I adjusted to remove some other stuff I have, sorry :D |
@lifubang How did you test the container time? Just a bash for loop and the time it takes to do all of them? Can you share, just in case? Also, 1100ms difference on 100 containers is just 11ms increase time for each creation. And if it's on-par with main, is this really a big issue? Or am I again too tired now? :D |
Yes.
I can share it tomorrow, because it's in my another PC. My step to test:
#!/bin/bash
ver=$1
for i in {1..100}; do
./runc-$ver run test
done
|
The bump to runc 1.1.15 caused this test to fail consistently. There is an existing issue about k8s tests also failing, with the suggestion that the memory limit be increased: opencontainers/runc#4427 In addition, a similar commit happened ~6 years ago: e4df8fb We may be able to revert this change once more is understood about the runc issue.
We should join the cgroup after the initial setup finished, but before runc init clone new children processes. (opencontainers#4427) Because we should try our best to reduce the influence of memory cgroup accounting from all runc init processes before we start the container init process. Signed-off-by: lifubang <[email protected]>
I am hitting this issue randomly in my local setup, when using vagrant boxes with systemd while running CRI tests. Tried using the exact same steps as used in the containerd CI. |
This seems to also affect other containerd branches, not just 1.6 with cgroupfs. See this from main (main is not using cgroupfs):
IMHO, we should release 1.1.16 as soon as we merge #4439 None situation is ideal, either without this fix or with this fix. I think these test's mem limits are kind of artificial and probably not super relevant in real life, but it is breaking the CI for users, so it is a real problem. On the other hand, with #4439, the memory of the clone will go to the main runc process (so outside the container cgroup) and maybe real setups might have an issue with that (like maybe people using kube-reserved and a lot of churn? or kubernetes clusters with readiness probes that use exec + using kube-reserved?). For now the latter is a theretical idea that we don't know it will happen or not, so IMHO let's just fix the real problems that we have, as #4439 does, and if this is a problem for users, of course we can revert and ask projects to adjusts the CI tests. |
This reverts commit 113a9f1. runc 1.1.15 appears to have incresed chances for causing OOMs for containers with small memory limits. Revert the change in containerd 1.7 to unblock releases while the upstream runc issue is resolved. Dependency-issue: opencontainers/runc#4427 Signed-off-by: Samuel Karp <[email protected]>
This reverts commit f0f1bfc. runc 1.1.15 appears to have incresed chances for causing OOMs for containers with small memory limits. Revert the change in containerd to unblock CI while the upstream runc issue is resolved. Dependency-issue: opencontainers/runc#4427 Signed-off-by: Samuel Karp <[email protected]>
This reverts commit f0f1bfc. runc 1.1.15 appears to have incresed chances for causing OOMs for containers with small memory limits. Revert the change in containerd to unblock CI while the upstream runc issue is resolved. Dependency-issue: opencontainers/runc#4427 Signed-off-by: Samuel Karp <[email protected]>
I see runc 1.2.0 was released. A few questions:
|
@samuelkarp thanks, I wanted to update here.
|
Apologies for the delay; I was away for a few days.
|
Description
In containerd/containerd#10795 we're attempting to update containerd 1.6 to use runc 1.1.15. This triggers a Kubernetes e2e CI run, which is consistently failing. The same update to runc 1.1.15 succeeds with containerd 1.7 and
main
(which will be 2.0).Steps to reproduce the issue
We're currently attempting to narrow down differences between the test environments used in containerd's various branches, since it seems unlikely the actual containerd version will be the cause of the difference. So far, containerd 1.6's tests are using the cgroupfs driver for managing cgroups as opposed to systemd, while the 1.7 and main branch tests use systemd to manage cgroups.
All tests are running on COS M117 and appear to be using cgroup v2 as of now.
Describe the results you received and expected
What version of runc are you using?
1.1.15
Host OS information
No response
Host kernel information
No response
The text was updated successfully, but these errors were encountered: