Cgroup leaking, no space left on /sys/fs/cgroup #70324
@kubernetes/sig-node-bugs |
@c-nuro: Reiterating the mentions to trigger a notification: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Which docker version? |
Have both docker 17.03.2ce and 18.03, and it happened on both. I do see cadvisor uses system.slice/run-*.scope-like cgroups and monitors them, but I'm not sure who the creator is. If I manually delete these cgroups, the only relevant syslog is from cadvisor complaining the file is not found. If I run a container with bare docker 17.03.2, I don't see such a pattern of cgroup created; the process runs under cgroup /docker/c9e9cd94a31fef655d56680796631b075922dbcd514f4f6e67667e203b591b5f |
Hi, I have the same issue. It leads to k8s nodes going NotReady and pods not running properly. Restarting docker did not help me, but after a server restart it works fine. It has happened several times. Does anyone have good ideas to help me fix this issue? Thanks in advance. |
I noticed this cgroup pattern is used for mounting pod volumes. The error from rpcbind is unrelated to this issue, but its output shows it is running under this pattern of cgroup. Can someone who works on volume mounts take a look?
|
Have the same problem. K8S 12.1: raw.go:146] Failed to watch directory "/sys/fs/cgroup/devices/system.slice/grub-common.service": inotify_add_watch /sys/fs/cgroup/devices/system.slice/grub-common.service: no space left on device |
the same issue. |
I have the same question. |
Is there a real workaround for this? My cloud provider is telling me the only cure is to restart each cluster node on a daily basis. Any help appreciated! |
This issue might be connected to google/cadvisor#1581. If you take a closer look, you can verify that the problem is inside the inotify_add_watch function. The default inotify limit on Ubuntu is 8192, which can be the limiting factor here. $ sudo sysctl fs.inotify.max_user_watches=524288 After that I kept watching journalctl -f; in my case the error messages disappeared. @c-nuro Can you test it on your system? EDIT:
|
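For what it's worth, a minimal sketch of making the workaround above survive reboots, assuming a distro that reads drop-in files from /etc/sysctl.d (the file name 99-inotify.conf is just an example):
# Persist the raised inotify watch limit and apply it immediately
echo 'fs.inotify.max_user_watches=524288' | sudo tee /etc/sysctl.d/99-inotify.conf
sudo sysctl --system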
If you happen to have a flexvolume plugin that has a PVC mounted under /var/lib/kubelet/plugin, #74669 might be what is exhausting watchers. |
Setting sysctl fs.inotify.max_user_watches=524288 seems to have solved the issue for now for me. We use flexVolume. Any news on a permanent fix for this? |
So, be careful: I guess in some setups the solution above of increasing the watch allotment can be a band-aid that, ironically, might cause #64137 to occur. Hence I'm cross-referencing these issues to each other, as they are closely related (that is, I think certain types of cgroup leaking are closely related to kubelet CPU hogging)... For specs, I'm seeing this on 40-core CentOS hardware. |
+1, we hit this issue the other day. Having just 8192 inotify watches (Kubernetes 1.12.5 on Azure AKS, Ubuntu 16.04) seems extremely low. The only viable option here is the DaemonSet as per @jeff1985, although it would help to understand what exactly is eating up the watches. |
@mariojacobo so what I did in the end is to implement a stateful set with a simple sleep inside, which executes the job and then waits until the next interval occurs; a rough sketch is below. This seems to be friendlier to the cluster than having it spin up a new container each time. |
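A rough sketch of that sleep loop, assuming the container runs privileged so the sysctl reaches the host kernel (the one-hour interval is arbitrary):
#!/usr/bin/env bash
# Re-apply the raised inotify limit on a fixed interval instead of letting
# the cluster schedule a fresh container for every run
while true; do
    sysctl -w fs.inotify.max_user_watches=524288
    sleep 3600
done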
@jeff1985 we are using a DaemonSet, not a Cronjob. I thought about Cronjobs initially, but they don't get scheduled on all the nodes, especially with autoscaling enabled. I tweaked your yaml file a bit and it's working just fine. Our main concern is not being able to tell what's eating up all the inotify watches. |
So, we're also having this problem, and via another post there's an inotify_watcher.sh script:
#!/usr/bin/env bash
#
# Copyright 2018 (c) Yousong Zhou
#
# This script can be used to debug "no space left on device due to inotify
# "max_user_watches" limit". It will output processes using inotify methods
# for watching file system activities, along with HOW MANY directories each
# inotify fd watches
#
# A temporary way of working around the issue above is to tune up the limit.
# It's a per-user limit:
#
# sudo sysctl fs.inotify.max_user_watches=81920
#
# In case you also wonder why "sudo systemctl restart sshd" triggers inotify
# errors, it's from systemd-tty-ask-password-agent
#
# execve("/usr/bin/systemd-tty-ask-password-agent", ["/usr/bin/systemd-tty-ask-passwor"..., "--watch"], [/* 16 vars */]) = 0
# inotify_init1(O_CLOEXEC) = 4
# inotify_add_watch(4, "/run/systemd/ask-password", IN_CLOSE_WRITE|IN_MOVED_TO) = -1 ENOSPC (No space left on device)
#
# Sample output
#
# [yunion@titan yousong]$ sudo bash a.sh | column -t
# systemd /usr/lib/systemd/systemd 1 /proc/1/fdinfo/10 1
# systemd /usr/lib/systemd/systemd 1 /proc/1/fdinfo/14 4
# systemd /usr/lib/systemd/systemd 1 /proc/1/fdinfo/20 4
# systemd-udevd /usr/lib/systemd/systemd-udevd 689 /proc/689/fdinfo/7 4
# NetworkManager /usr/sbin/NetworkManager 914 /proc/914/fdinfo/10 5
# NetworkManager /usr/sbin/NetworkManager 914 /proc/914/fdinfo/11 4
# crond /usr/sbin/crond 939 /proc/939/fdinfo/5 3
# rsyslogd /usr/sbin/rsyslogd 1212 /proc/1212/fdinfo/3 2
# kube-controller /usr/bin/kube-controller-manager 4934 /proc/4934/fdinfo/8 1
# kubelet /usr/bin/kubelet 4955 /proc/4955/fdinfo/12 0
# kubelet /usr/bin/kubelet 4955 /proc/4955/fdinfo/17 1
# kubelet /usr/bin/kubelet 4955 /proc/4955/fdinfo/26 51494
# journalctl /usr/bin/journalctl 13151 /proc/13151/fdinfo/3 2
# sdnagent /opt/yunion/bin/sdnagent 20558 /proc/20558/fdinfo/7 90
# systemd-udevd /usr/lib/systemd/systemd-udevd 46019 /proc/46019/fdinfo/7 4
# systemd-udevd /usr/lib/systemd/systemd-udevd 46020 /proc/46020/fdinfo/7 4
#
# The script is adapted from https://stackoverflow.com/questions/13758877/how-do-i-find-out-what-inotify-watches-have-been-registered/48938640#48938640
#
set -o errexit
set -o pipefail
lsof +c 0 -n -P -u root \
| awk '/inotify$/ { gsub(/[urw]$/,"",$4); print $1" "$2" "$4 }' \
| while read name pid fd; do
    exe="$(readlink -f /proc/$pid/exe || echo n/a)"
    fdinfo="/proc/$pid/fdinfo/$fd"
    count="$(grep -c inotify "$fdinfo" || true)"
    echo "$name $exe $pid $fdinfo $count"
  done
Output of a system which is experiencing this issue is...
So something in kubelet is watching a lot of files, almost 72k to be exact! As a comparison, another host which is behaving is at < 1k.
What I did notice is that after kubelet successfully restarted (after I increased fs.inotify.max_user_watches=524288, which is very excessive IMO) and I restarted a pod which was in a bad state, over time the watches decreased significantly. This is the same output ~10 mins later:
sh inotify_watchers.sh
systemd /usr/lib/systemd/systemd 1 /proc/1/fdinfo/10 1
systemd /usr/lib/systemd/systemd 1 /proc/1/fdinfo/14 4
systemd /usr/lib/systemd/systemd 1 /proc/1/fdinfo/20 4
systemd-udevd /usr/lib/systemd/systemd-udevd 5029 /proc/5029/fdinfo/7 9
NetworkManager /usr/sbin/NetworkManager 9874 /proc/9874/fdinfo/10 5
NetworkManager /usr/sbin/NetworkManager 9874 /proc/9874/fdinfo/11 4
crond /usr/sbin/crond 9909 /proc/9909/fdinfo/5 3
rsyslogd /usr/sbin/rsyslogd 10275 /proc/10275/fdinfo/3 2
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/49 1
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/51 1
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/53 1
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/55 1
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/57 1
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/59 1
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/61 1
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/63 1
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/65 1
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/67 1
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/69 1
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/80 1
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/83 2
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/84 2
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/118 2
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/120 2
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/154 2
kubelet /usr/local/bin/kubelet 42383 /proc/42383/fdinfo/6 1
kubelet /usr/local/bin/kubelet 42383 /proc/42383/fdinfo/7 0
kubelet /usr/local/bin/kubelet 42383 /proc/42383/fdinfo/10 352
kubelet /usr/local/bin/kubelet 42383 /proc/42383/fdinfo/16 1
What I don't know how to trace is what caused the huge spike in the kubelet pid's inotify watchers? |
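One way to answer that question is to resolve the inode numbers recorded in the fd's fdinfo entry back to paths; a sketch assuming the interesting watches are cgroup directories, with the pid/fd values below being placeholders taken from the output above (needs root):
#!/usr/bin/env bash
# List the paths watched by one inotify fd by resolving the ino: fields
# from /proc/<pid>/fdinfo/<fd>; restrict the search to /sys/fs/cgroup to
# avoid a full filesystem scan
pid=42383   # substitute the kubelet pid from the watcher output
fd=10       # substitute the inotify fd to inspect
awk '/^inotify/ { for (i = 1; i <= NF; i++) if ($i ~ /^ino:/) { sub(/^ino:/, "", $i); print $i } }' \
    "/proc/$pid/fdinfo/$fd" \
| while read -r ino; do
    # fdinfo records inode numbers in hex; find expects decimal
    find /sys/fs/cgroup -inum "$((16#$ino))" 2>/dev/null
done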
@reaperes's systemd cgroup cleanup code from #64137 seemed the cleanest and most surgical of all the workarounds that I've found documented for this and the related issues, so I've converted it into a DaemonSet that runs the fix hourly on every node in a cluster. You could set any interval that you like, of course, but the script isn't very resource intensive and hourly seemed reasonable. It actually takes about a day or so for the CPU loading to become noticeable in my cluster and a week or so for it to crash a node. I've been running this for a few days now in my staging cluster and it appears to keep the CPU loading under control. |
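For reference, a stripped-down sketch of the same idea (not the exact code from that issue): drop systemd-run scope cgroups that no longer contain any pids, assuming the leaked scopes follow the run-r*.scope pattern reported here and that an empty cgroup.procs means the cgroup is unused. Run it as root on each node.
#!/usr/bin/env bash
# Remove leftover systemd-run scope cgroups under system.slice that no
# longer have any process attached
for cg in /sys/fs/cgroup/*/system.slice/run-r*.scope; do
    [ -d "$cg" ] || continue
    # cgroup.procs is empty when nothing is attached to this cgroup
    if [ ! -s "$cg/cgroup.procs" ]; then
        rmdir "$cg" 2>/dev/null || true
    fi
done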
Issues go stale after 90d of inactivity. If this issue is safe to close now, please do so. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
The daemonset workaround with the surgical cleanup that I posted didn't end up being enough. There's still a small leak that it doesn't take care of, which eventually takes the system down anyhow. We've taken to rebooting the servers nightly to work around this issue. Rather a medieval solution for such a sophisticated tool. |
/remove-lifecycle stale |
This change implements a cleaner that scans for cgroups created by systemd-run --scope that do not have any pids assigned, indicating that the cgroup is unused and should be cleaned up. On some systems, either due to systemd or the kernel, the scope is not cleaned up when the pids within the scope have completed execution, leading to an eventual memory leak. Kubernetes uses systemd-run --scope when creating mount points that may require drivers to be loaded/running in a separate context from kubelet, which allows the above leak to occur. kubernetes/kubernetes#70324 kubernetes/kubernetes#64137 gravitational/gravity#1219
* Implement workaround to clean up leaking cgroups. This change implements a cleaner that scans for cgroups created by systemd-run --scope that do not have any pids assigned, indicating that the cgroup is unused and should be cleaned up. On some systems, either due to systemd or the kernel, the scope is not cleaned up when the pids within the scope have completed execution, leading to an eventual memory leak. Kubernetes uses systemd-run --scope when creating mount points that may require drivers to be loaded/running in a separate context from kubelet, which allows the above leak to occur. kubernetes/kubernetes#70324 kubernetes/kubernetes#64137 gravitational/gravity#1219
* change logging level for cgroup cleanup
* address review feedback
* address review feedback
Issues go stale after 90d of inactivity. If this issue is safe to close now, please do so. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
/remove-lifecycle stale |
Issues go stale after 90d of inactivity. If this issue is safe to close now, please do so. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
Stale issues rot after 30d of inactivity. If this issue is safe to close now, please do so. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
@fejta-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/reopen |
@d3hof: You can't reopen an issue/PR unless you authored it or you are a collaborator. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
What happened:
CGroup leaking, and running out of kernel memory.
There are many CGroups matching the pattern system.slice/run-r${SOMEID}.scope for different categories, and these never seem to get cleaned up.
Eventually, these leaking cgroups cause all types of instabilities, including but not limited to: kubectl logs -f reporting "no space left".
What you expected to happen:
Such CGroups should be cleaned up after use.
How to reproduce it (as minimally and precisely as possible):
It happens to all of our on-prem kubernetes nodes.
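A quick way to check whether a node is affected, assuming the run-r*.scope pattern described above: count the matching scope cgroups and see whether the number keeps growing over time.
# Count systemd-run scope cgroups across all cgroup controllers
find /sys/fs/cgroup/*/system.slice -maxdepth 1 -type d -name 'run-r*.scope' 2>/dev/null | wc -l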
Anything else we need to know?:
Environment:
Kubernetes version (use kubectl version): Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.2", GitCommit:"bb9ffb1654d4a729bb4cec18ff088eacc153c239", GitTreeState:"clean", BuildDate:"2018-08-07T23:08:19Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
Cloud provider or hardware configuration:
OS (e.g. from /etc/os-release):
NAME="Ubuntu"
VERSION="16.04.2 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.2 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial
Kernel (e.g. uname -a): Linux 4.4.0-62-generic #83-Ubuntu SMP Wed Jan 18 14:10:15 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
(we have various kernel versions)
Install tools:
Others:
/kind bug