
Cgroup leaking, no space left on /sys/fs/cgroup #70324

Closed
c-nuro opened this issue Oct 27, 2018 · 34 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@c-nuro

c-nuro commented Oct 27, 2018

What happened:
Cgroups are leaking, and we run out of kernel memory.

Oct 26 19:07:41  kubelet[1606]: W1026 19:07:41.543128    1606 raw.go:87] Error while processing event ("/sys/fs/cgroup/devices/system.slice/run-r18291965551b44e2bcfc7076348375a5.scope": 0x40000100 == IN_CREATE|IN_ISDIR): inotify_add_watch /sys/fs/cgroup/devices/system.slice/run-r18291965551b44e2bcfc7076348375a5.scope: no space left on device

ls /sys/fs/cgroup/devices/system.slice/run-r* -d | wc
   5920    5920  473577

There are many cgroups matching the pattern system.slice/run-r${SOMEID}.scope under the different controllers, and they never seem to get cleaned up.
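A quick way to gauge the extent across all controllers (a rough check, assuming the standard cgroup v1 layout under /sys/fs/cgroup):

# count leaked run-r*.scope cgroups in every controller's system.slice
for c in /sys/fs/cgroup/*/system.slice; do
  printf '%s %s\n' "$c" "$(ls -d "$c"/run-r*.scope 2>/dev/null | wc -l)"
done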

Eventually, these leaked cgroups cause all kinds of instability, including but not limited to:

  • kubectl logs -f reporting no space left on device
  • pod networking being interrupted or unreachable

What you expected to happen:
Such cgroups should be cleaned up after use.

How to reproduce it (as minimally and precisely as possible):
It happens on all of our on-prem Kubernetes nodes.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.2", GitCommit:"bb9ffb1654d4a729bb4cec18ff088eacc153c239", GitTreeState:"clean", BuildDate:"2018-08-07T23:08:19Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}

  • Cloud provider or hardware configuration:

  • OS (e.g. from /etc/os-release):
    NAME="Ubuntu"
    VERSION="16.04.2 LTS (Xenial Xerus)"
    ID=ubuntu
    ID_LIKE=debian
    PRETTY_NAME="Ubuntu 16.04.2 LTS"
    VERSION_ID="16.04"
    HOME_URL="http://www.ubuntu.com/"
    SUPPORT_URL="http://help.ubuntu.com/"
    BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
    VERSION_CODENAME=xenial
    UBUNTU_CODENAME=xenial

  • Kernel (e.g. uname -a):
    Linux 4.4.0-62-generic #83-Ubuntu SMP Wed Jan 18 14:10:15 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
    (we have various kernel versions)

  • Install tools:

  • Others:

/kind bug

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 27, 2018
@c-nuro
Author

c-nuro commented Oct 27, 2018

@kubernetes/sig-node-bugs

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 27, 2018
@k8s-ci-robot
Contributor

@c-nuro: Reiterating the mentions to trigger a notification:
@kubernetes/sig-node-bugs

In response to this:

@kubernetes/sig-node-bugs

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@vsxen

vsxen commented Oct 27, 2018

What Docker version are you running? Have a look at these:
#61937
moby/moby#29638

@c-nuro
Author

c-nuro commented Oct 27, 2018

We have both Docker 17.03.2-ce and 18.03, and this happens on both.
And my pod process runs under cgroup /kubepods/burstable/podcf89fac7-d97b-11e8-940c-ac1f6b407d68/e49ce88d70174384c7d03790f2cd448e0de5158299f0cad62e9533123639035c

I do see cadvisor using system.slice/run-*.scope-style cgroups and monitoring them, but I am not sure who creates them. If I manually delete these cgroups, the only relevant syslog entries are from cadvisor complaining that the files are not found.

If I run a container with bare Docker 17.03.2, I don't see this pattern of cgroup being created. The process runs under cgroup /docker/c9e9cd94a31fef655d56680796631b075922dbcd514f4f6e67667e203b591b5f
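For reference, one way to check which cgroups a given container's main process is actually placed in (the container ID below is a placeholder):

# resolve the container's main PID and print its cgroup membership
pid=$(docker inspect --format '{{.State.Pid}}' <container-id>)
cat /proc/$pid/cgroup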

@majinghe

majinghe commented Nov 1, 2018

Hi, I have the same issue.
Docker version: 17.03.2-ce
kubectl version: v1.10.4
kernel version: 3.10.0-862.14.4.el7.x86_64
find /sys/fs/cgroup/memory -type d | wc -l returns 2509.
kubelet logs:
Failed to watch directory "/sys/fs/cgroup/memory/kubepods/pod1dfbf2b1-d8eb-11e8-8cfa-060565585076/44251feead9fd539b735333fd9b76b7d3d28d23593681a71642085d727de5b26": inotify_add_watch /sys/fs/cgroup/memory/kubepods/pod1dfbf2b1-d8eb-11e8-8cfa-060565585076/44251feead9fd539b735333fd9b76b7d3d28d23593681a71642085d727de5b26: no space left on device.

The issue causes k8s nodes to go NotReady and pods to stop running properly. Restarting Docker did not help; only a server reboot brings things back, and it has happened several times. Does anyone have a good idea how to fix this? Thanks in advance.

@c-nuro
Author

c-nuro commented Nov 20, 2018

I noticed this pattern of cgroup is used for mounting pod volumes.

The error from rpcbind below is unrelated to this issue, but its output shows rpcbind running under this pattern of cgroup.

Can someone who works on volume mounts take a look?

Mount failed: Mount issued for NFS V3 but unable to run rpcbind:
 Output: rpcbind: another rpcbind is already running. Aborting
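As a side note, if these scopes were created via systemd-run --scope, systemd should still list them as transient units; one way to check (assuming systemd is PID 1):

# list the transient run-r*.scope units systemd still tracks
systemctl list-units --all 'run-r*.scope' | head -n 20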

@oronboni

Having the same problem. K8S 12.1

raw.go:146] Failed to watch directory "/sys/fs/cgroup/devices/system.slice/grub-common.service": inotify_add_watch /sys/fs/cgroup/devices/system.slice/grub-common.service: no space left on device

@xigang
Contributor

xigang commented Dec 30, 2018

the same issue.

@Rabbit222

I have the same problem.
kubectl version: v1.12.1
kernel version: 3.10.0-862.14.4.el7.x86_64
I found one explanation for it: kernel memory accounting is enabled, but the kernel is not able to handle it properly.
Please fix this as soon as possible.

@zouyee
Member

zouyee commented Feb 11, 2019

opencontainers/runc#1921

@jeff1985

Is there a real workaround for this? My cloud provider is telling me the only cure is to restart each cluster node on a daily basis. Any help appreciated!

@jeff1985

jeff1985 commented Feb 21, 2019

This issue might be connected to google/cadvisor#1581

If you take a closer look, you can verify that the problem is inside the function inotify_add_watch.

The default inotify limit on Ubuntu is 8192, which can be the limiting factor here.
So I decided to test increasing the limit. I ran this command on one of my cluster nodes:

$ sudo sysctl fs.inotify.max_user_watches=524288

After that I kept watching journalctl -f

In my case the error messages disappeared.

@c-nuro Can you test it on your system?

EDIT:
I deployed the following DaemonSet to my Kubernetes cluster and the problem is gone.

apiVersion: "extensions/v1beta1"
kind: "DaemonSet"
metadata:
  name: "sysctl"
  namespace: "default"
spec:
  template:
    metadata:
      labels:
        app: "sysctl"
    spec:
      containers:
        - name: "sysctl"
          image: "busybox:latest"
          resources:
            limits:
              cpu: "10m"
              memory: "8Mi"
            requests:
              cpu: "10m"
              memory: "8Mi"
          securityContext:
            privileged: true
          command:
            - "/bin/sh"
            - "-c"
            - |
              set -o errexit
              set -o xtrace
              while sysctl -w fs.inotify.max_user_watches=525000
              do
                sleep 60s
              done
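To verify that the new limit took effect on a node, and to get a rough idea of how many watches kubelet actually holds, something like this works (a rough check; the process name kubelet is assumed):

# current inotify watch limit
cat /proc/sys/fs/inotify/max_user_watches
# number of inotify watch entries per kubelet fd (non-zero counts only)
pid=$(pgrep -o kubelet)
grep -c '^inotify' /proc/$pid/fdinfo/* 2>/dev/null | grep -v ':0$'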

@clkao
Contributor

clkao commented Mar 6, 2019

If you happen to have a flexvolume plugin that has a PVC mounted under /var/lib/kubelet/plugin, #74669 might be what is exhausting the watches.

@joberget

Setting sysctl fs.inotify.max_user_watches=524288 seems to have solved the issue for me for now. We use flexVolume. Any news on a permanent fix for this?
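For what it's worth, to make that setting survive reboots you can drop it into a sysctl config file on each node (standard sysctl configuration, independent of Kubernetes; the file name is arbitrary):

# persist the inotify watch limit across reboots
echo 'fs.inotify.max_user_watches=524288' | sudo tee /etc/sysctl.d/99-inotify.conf
sudo sysctl --system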

@jayunit100
Member

jayunit100 commented May 3, 2019

So, be careful: I guess the solution above ^ of increasing the watch allotment can be a band-aid that, ironically, might cause #64137 to occur. Hence I'm cross-referencing these issues to each other, as they are closely related (that is, I think certain types of cgroup leaking are closely related to kubelet CPU hogging). For specs: I'm seeing this on 40-core CentOS hardware.

@mariojacobo

mariojacobo commented Jun 5, 2019

+1, we hit this issue the other day. Having just 8192 inotify watches (Kubernetes 1.12.5 on Azure AKS, Ubuntu 16.04) seems extremely low. The only viable option here is the DaemonSet as per @jeff1985, although it would help to understand what exactly is eating up the watches.

@jeff1985

jeff1985 commented Jun 5, 2019

@mariojacobo
My understanding is that the CronJob resource keeps a long history of executed jobs. Depending on your execution period, you can end up accumulating a lot more pod history than you actually expect. I had multiple CronJobs that needed to run every 5 minutes.

So what I did in the end was to implement a StatefulSet with a simple sleep loop inside, which executes the job and then waits until the next interval occurs; see the sketch below. This seems to be friendlier to the cluster than having it spin up a new container each time.
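For illustration, the container command boils down to a loop like this (run-job.sh stands in for the actual job, and 300 seconds for the old 5-minute schedule):

# run the job, then sleep until the next interval, instead of spawning a new pod each time
while true; do
  /run-job.sh || echo "job run failed, retrying at next interval"
  sleep 300
done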

@mariojacobo

@jeff1985 we are using a DaemonSet, not a CronJob. I thought about CronJobs initially, but they don't get scheduled on all the nodes, especially with autoscaling enabled. I tweaked your YAML file a bit and it's working just fine. Our main concern is not being able to tell what's eating up all the inotify watches.

@DLV111

DLV111 commented Jun 11, 2019

So, we're also having this problem, and via another post there's an inotify_watchers.sh script:

#!/usr/bin/env bash
#
# Copyright 2018 (c) Yousong Zhou
#
# This script can be used to debug "no space left on device due to inotify
# "max_user_watches" limit".  It will output processes using inotify methods
# for watching file system activities, along with HOW MANY directories each
# inotify fd watches
#
# A temporary method of working around the said issue above, tune up the limit.
# It's a per-user limit
#
#       sudo sysctl fs.inotify.max_user_watches=81920
#
# In case you also wonder why "sudo systemctl restart sshd" notifies inotify
# errors, it's from systemd-tty-ask-password-agent
#
#       execve("/usr/bin/systemd-tty-ask-password-agent", ["/usr/bin/systemd-tty-ask-passwor"..., "--watch"], [/* 16 vars */]) = 0
#       inotify_init1(O_CLOEXEC)                = 4
#       inotify_add_watch(4, "/run/systemd/ask-password", IN_CLOSE_WRITE|IN_MOVED_TO) = -1 ENOSPC (No space left on device)
#
# Sample output
#
#       [yunion@titan yousong]$ sudo bash a.sh  | column -t
#       systemd          /usr/lib/systemd/systemd          1      /proc/1/fdinfo/10     1
#       systemd          /usr/lib/systemd/systemd          1      /proc/1/fdinfo/14     4
#       systemd          /usr/lib/systemd/systemd          1      /proc/1/fdinfo/20     4
#       systemd-udevd    /usr/lib/systemd/systemd-udevd    689    /proc/689/fdinfo/7    4
#       NetworkManager   /usr/sbin/NetworkManager          914    /proc/914/fdinfo/10   5
#       NetworkManager   /usr/sbin/NetworkManager          914    /proc/914/fdinfo/11   4
#       crond            /usr/sbin/crond                   939    /proc/939/fdinfo/5    3
#       rsyslogd         /usr/sbin/rsyslogd                1212   /proc/1212/fdinfo/3   2
#       kube-controller  /usr/bin/kube-controller-manager  4934   /proc/4934/fdinfo/8   1
#       kubelet          /usr/bin/kubelet                  4955   /proc/4955/fdinfo/12  0
#       kubelet          /usr/bin/kubelet                  4955   /proc/4955/fdinfo/17  1
#       kubelet          /usr/bin/kubelet                  4955   /proc/4955/fdinfo/26  51494
#       journalctl       /usr/bin/journalctl               13151  /proc/13151/fdinfo/3  2
#       sdnagent         /opt/yunion/bin/sdnagent          20558  /proc/20558/fdinfo/7  90
#       systemd-udevd    /usr/lib/systemd/systemd-udevd    46019  /proc/46019/fdinfo/7  4
#       systemd-udevd    /usr/lib/systemd/systemd-udevd    46020  /proc/46020/fdinfo/7  4
#
# The script is adapted from https://stackoverflow.com/questions/13758877/how-do-i-find-out-what-inotify-watches-have-been-registered/48938640#48938640
#
set -o errexit
set -o pipefail
lsof +c 0 -n -P -u root \
        | awk '/inotify$/ { gsub(/[urw]$/,"",$4); print $1" "$2" "$4 }' \
        | while read name pid fd; do \
                exe="$(readlink -f /proc/$pid/exe || echo n/a)"; \
                fdinfo="/proc/$pid/fdinfo/$fd" ; \
                count="$(grep -c inotify "$fdinfo" || true)"; \
                echo "$name $exe $pid $fdinfo $count"; \
        done

Output from a system which is experiencing this issue:

# sh inotify_watchers.sh
systemd /usr/lib/systemd/systemd 1 /proc/1/fdinfo/10 1
systemd /usr/lib/systemd/systemd 1 /proc/1/fdinfo/14 4
systemd /usr/lib/systemd/systemd 1 /proc/1/fdinfo/20 4
systemd-udevd /usr/lib/systemd/systemd-udevd 5029 /proc/5029/fdinfo/7 9
NetworkManager /usr/sbin/NetworkManager 9874 /proc/9874/fdinfo/10 5
NetworkManager /usr/sbin/NetworkManager 9874 /proc/9874/fdinfo/11 4
crond /usr/sbin/crond 9909 /proc/9909/fdinfo/5 3
rsyslogd /usr/sbin/rsyslogd 10275 /proc/10275/fdinfo/3 2
kubelet /usr/local/bin/kubelet 27818 /proc/27818/fdinfo/6 1
kubelet /usr/local/bin/kubelet 27818 /proc/27818/fdinfo/11 0
kubelet /usr/local/bin/kubelet 27818 /proc/27818/fdinfo/15 1
kubelet /usr/local/bin/kubelet 27818 /proc/27818/fdinfo/20 71987

So something in kubelet is watching a lot of files: almost 72k, to be exact!

For comparison, here is another host that is behaving normally; it's < 1k:

# sh inotify_watchers.sh
systemd /usr/lib/systemd/systemd 1 /proc/1/fdinfo/10 1
systemd /usr/lib/systemd/systemd 1 /proc/1/fdinfo/15 4
systemd /usr/lib/systemd/systemd 1 /proc/1/fdinfo/16 4
systemd-udevd /usr/lib/systemd/systemd-udevd 1900 /proc/1900/fdinfo/7 3
rsyslogd /usr/sbin/rsyslogd 4053 /proc/4053/fdinfo/3 2
crond /usr/sbin/crond 4134 /proc/4134/fdinfo/5 3
grafana-watcher /usr/bin/grafana-watcher 27945 /proc/27945/fdinfo/3 1
kubelet /usr/local/bin/kubelet 31305 /proc/31305/fdinfo/5 1
kubelet /usr/local/bin/kubelet 31305 /proc/31305/fdinfo/10 0
kubelet /usr/local/bin/kubelet 31305 /proc/31305/fdinfo/16 1
kubelet /usr/local/bin/kubelet 31305 /proc/31305/fdinfo/24 1
kubelet /usr/local/bin/kubelet 31305 /proc/31305/fdinfo/30 780

What I did notice is that after kubelet successfully restarted (after I increased fs.inotify.max_user_watches=524288, which is very excessive IMO) and I restarted a pod that was in a bad state, the watch count decreased significantly over time. This is the same output ~10 minutes later:

sh inotify_watchers.sh
systemd /usr/lib/systemd/systemd 1 /proc/1/fdinfo/10 1
systemd /usr/lib/systemd/systemd 1 /proc/1/fdinfo/14 4
systemd /usr/lib/systemd/systemd 1 /proc/1/fdinfo/20 4
systemd-udevd /usr/lib/systemd/systemd-udevd 5029 /proc/5029/fdinfo/7 9
NetworkManager /usr/sbin/NetworkManager 9874 /proc/9874/fdinfo/10 5
NetworkManager /usr/sbin/NetworkManager 9874 /proc/9874/fdinfo/11 4
crond /usr/sbin/crond 9909 /proc/9909/fdinfo/5 3
rsyslogd /usr/sbin/rsyslogd 10275 /proc/10275/fdinfo/3 2
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/49 1
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/51 1
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/53 1
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/55 1
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/57 1
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/59 1
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/61 1
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/63 1
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/65 1
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/67 1
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/69 1
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/80 1
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/83 2
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/84 2
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/118 2
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/120 2
fluentd /usr/bin/ruby2.3 30908 /proc/30908/fdinfo/154 2
kubelet /usr/local/bin/kubelet 42383 /proc/42383/fdinfo/6 1
kubelet /usr/local/bin/kubelet 42383 /proc/42383/fdinfo/7 0
kubelet /usr/local/bin/kubelet 42383 /proc/42383/fdinfo/10 352
kubelet /usr/local/bin/kubelet 42383 /proc/42383/fdinfo/16 1

What I don't know is how to trace what caused the huge spike in the kubelet process's inotify watch count.
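One rough way to narrow it down: since cadvisor (inside kubelet) watches cgroup directories, comparing the directory count per cgroup controller against the fdinfo counts above shows whether the watches are tracking a ballooning cgroup hierarchy (an assumption based on this thread, not a definitive trace):

# count directories per cgroup controller
for d in /sys/fs/cgroup/*/; do
  printf '%s %s\n' "$d" "$(find "$d" -type d 2>/dev/null | wc -l)"
done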

@derekrprice

@reaperes's systemd cgroup cleanup code from #64137 seemed the cleanest and most surgical of all the workarounds that I've found documented for this and the related issues, so I've converted it into a DaemonSet that runs the fix hourly on every node in a cluster. You could set any interval you like, of course, but the script isn't very resource intensive and hourly seemed reasonable. It actually takes about a day or so for the CPU load to become noticeable in my cluster and a week or so for it to crash a node. I've been running this for a few days now in my staging cluster and it appears to keep the CPU load under control.
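For reference, a minimal sketch of that kind of surgical cleanup, assuming the leaked scopes match system.slice/run-r*.scope and that an empty cgroup.procs means the scope is no longer used (an illustration of the approach, not the exact script from #64137):

#!/usr/bin/env bash
# remove leaked run-r*.scope cgroups that no longer contain any processes
set -o errexit
for scope in /sys/fs/cgroup/*/system.slice/run-r*.scope; do
  [ -d "$scope" ] || continue
  # an empty cgroup.procs means no process is assigned to this cgroup
  if [ ! -s "$scope/cgroup.procs" ]; then
    rmdir "$scope" 2>/dev/null || true
  fi
done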

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 6, 2019
@derekrprice

derekrprice commented Mar 5, 2020

The DaemonSet workaround with the surgical cleanup that I posted didn't end up being enough. There's still a small leak it doesn't take care of, and that eventually takes the system down anyway. We've taken to rebooting the servers nightly to work around this issue. Rather a medieval solution for such a sophisticated tool.

@derekrprice

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 5, 2020
knisbet pushed a commit to gravitational/planet that referenced this issue Mar 11, 2020
This change implements a cleaner, that scans for cgroups created by
systemd-run --scope that do not have any pids assigned, indicating
that the cgroup is unused and should be cleaned up. On some systems
either due to systemd or the kernel, the scope is not being cleaned
up when the pids within the scope have completed execution, leading
to an eventual memory leak.

Kubernetes uses systemd-run --scope when creating mount points,
that may require drivers to be loaded/running in a separate context
from kubelet, which allows the above leak to occur.

kubernetes/kubernetes#70324
kubernetes/kubernetes#64137
gravitational/gravity#1219
knisbet pushed a commit to gravitational/planet that referenced this issue Mar 13, 2020
* Implement workaround to clean up leaking cgroups

This change implements a cleaner, that scans for cgroups created by
systemd-run --scope that do not have any pids assigned, indicating
that the cgroup is unused and should be cleaned up. On some systems
either due to systemd or the kernel, the scope is not being cleaned
up when the pids within the scope have completed execution, leading
to an eventual memory leak.

Kubernetes uses systemd-run --scope when creating mount points,
that may require drivers to be loaded/running in a separate context
from kubelet, which allows the above leak to occur.

kubernetes/kubernetes#70324
kubernetes/kubernetes#64137
gravitational/gravity#1219

* change logging level for cgroup cleanup

* address review feedback

* address review feedback
knisbet pushed 4 further commits to gravitational/planet that referenced this issue Mar 17, 2020, each carrying the same change ("Implement workaround to clean up leaking cgroups", cherry picked from commit 00ed8e6).
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 3, 2020
@a6s5

a6s5 commented Jun 26, 2020

/remove-lifecycle stale

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 24, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 24, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dimitrihof

/reopen

@k8s-ci-robot
Contributor

@d3hof: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
