Several pods do not start, encounter "too many open files" error #2087

Closed

jimthompson5802 opened this issue Dec 13, 2021 · 13 comments

@jimthompson5802

While setting up a Kubeflow cluster from the master branch at commit 3dad839f, four pods encounter a "too many open files" error.

For the k8s cluster, I'm using a local k3d cluster on macOS (11.6.1): https://k3d.io

At the end of the Kubeflow deployment, these are the statuses of the four failing pods:

kubectl get pod -A | grep -v Run | grep -v NAME
kubeflow           ml-pipeline-8c4b99589-gcvmz                              1/2     CrashLoopBackOff   15         63m
kubeflow           kfserving-controller-manager-0                           1/2     CrashLoopBackOff   15         63m
kubeflow           profiles-deployment-89f7d88b-hp697                       1/2     CrashLoopBackOff   15         63m
kubeflow           katib-controller-68c47fbf8b-d6mpj                        0/1     CrashLoopBackOff   16         63m

The cluster has been torn down and rebuilt several times. Each time, the same four pods encounter the "too many open files" error; all other pods successfully reach Running status.

According to ulimit -n, the nodes have a very high setting for that limit: 1048576. Since this runs on macOS, I also used launchctl to increase maxfiles from 256 to 524288.
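
For reference, a minimal sketch of that macOS change (values from above; as far as I understand launchctl, this affects the current boot only, and a LaunchDaemon plist is needed to make it persistent):

sudo launchctl limit maxfiles 524288 524288   # set soft and hard per-process file limits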

I'm new to kubeflow, so any guidance offered will be appreciated.

The following diagnostic data was collected:

Log extract from failed pods

kubectl logs ml-pipeline-8c4b99589-gcvmz
Using deprecated annotation `kubectl.kubernetes.io/default-logs-container` in pod/ml-pipeline-8c4b99589-gcvmz. Please use `kubectl.kubernetes.io/default-container` instead
2021/12/11 13:01:59 too many open files


kubectl logs kfserving-controller-manager-0 -c manager
<<<< deleted info level messages>>>>
{"level":"error","ts":1639227716.1910038,"logger":"controller-runtime.manager","msg":"error received after stop sequence was engaged","error":"Timeout: failed waiting for *v1beta1.InferenceService Informer to sync","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:132\nsigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/internal.go:547"}
{"level":"error","ts":1639227716.1911373,"logger":"controller-runtime.manager","msg":"error received after stop sequence was engaged","error":"Timeout: failed waiting for *v1alpha1.TrainedModel Informer to sync","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:132\nsigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/internal.go:547"}
{"level":"error","ts":1639227716.1912212,"logger":"entrypoint","msg":"unable to run the manager","error":"too many open files","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:132\nmain.main\n\t/go/src/github.com/kubeflow/kfserving/cmd/manager/main.go:183\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203"}


kubectl logs profiles-deployment-89f7d88b-hp697 -c manager
I1211 13:02:40.188855       1 request.go:645] Throttling request took 1.036224909s, request: GET:https://10.43.0.1:443/apis/flows.knative.dev/v1?timeout=32s
2021-12-11T13:02:41.646Z	INFO	controller-runtime.metrics	metrics server is starting to listen	{"addr": ":8080"}
2021-12-11T13:02:41.646Z	ERROR	setup	unable to create controller	{"controller": "Profile", "error": "Failed to start file watcher: too many open files", "errorVerbose": "too many open files\nFailed to start file watcher\ngithub.com/kubeflow/kubeflow/components/profile-controller/controllers.(*ProfileReconciler).SetupWithManager\n\t/workspace/controllers/profile_controller.go:381\nmain.main\n\t/workspace/main.go:93\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:204\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1374"}
runtime.main
	/usr/local/go/src/runtime/proc.go:204


kubectl logs katib-controller-68c47fbf8b-d6mpj
<<<<<<<< removed info level messages >>>>>>>>>>>>
{"level":"error","ts":1639227826.322595,"logger":"controller-runtime.manager","msg":"error received after stop sequence was engaged","error":"Timeout: failed waiting for *v1beta1.Suggestion Informer to sync","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:132\nsigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/internal.go:529"}
{"level":"error","ts":1639227826.3227415,"logger":"controller-runtime.manager","msg":"error received after stop sequence was engaged","error":"Timeout: failed waiting for *v1beta1.Experiment Informer to sync","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:132\nsigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/internal.go:529"}
{"level":"error","ts":1639227826.32281,"logger":"entrypoint","msg":"Unable to run the manager","error":"too many open files","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:132\nmain.main\n\t/go/src/github.com/kubeflow/katib/cmd/katib-controller/v1beta1/main.go:128\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:255"}

Kubeflow was deployed by running kustomize build ${component} | kubectl apply -f -
on each of the following components, in the order shown (a scripted sketch follows the list):

# cert manager
common/cert-manager/cert-manager/base \
common/cert-manager/kubeflow-issuer/base \

# istio
common/istio-1-9/istio-crds/base \
common/istio-1-9/istio-namespace/base \
common/istio-1-9/istio-install/base \

#DEX
common/dex/overlays/istio \

# OIDC Auth Service
common/oidc-authservice/base \

# knative serving
common/knative/knative-serving/base \
common/istio-1-9/cluster-local-gateway/base \

# inference event logging
common/knative/knative-eventing/base \

# kubeflow namespace
common/kubeflow-namespace/base \

# kubeflow roles
common/kubeflow-roles/base \

# kubeflow istio resources
common/istio-1-9/kubeflow-istio-resources/base \

# kubeflow pipelines
apps/pipeline/upstream/env/platform-agnostic-multi-user-pns \

# KFServing
apps/kfserving/upstream/overlays/kubeflow \

# Katib
apps/katib/upstream/installs/katib-with-kubeflow \

# Central Dashboard
apps/centraldashboard/upstream/overlays/istio \

# Admission Controller
apps/admission-webhook/upstream/overlays/cert-manager \

# Notebooks
apps/jupyter/notebook-controller/upstream/overlays/kubeflow \

# Jupyter web app
apps/jupyter/jupyter-web-app/upstream/overlays/istio \

# Profiles + KFAM
apps/profiles/upstream/overlays/kubeflow \

# Volumes Web app
apps/volumes-web-app/upstream/overlays/istio \

# Tensorboard
apps/tensorboard/tensorboards-web-app/upstream/overlays/istio \

# Training Operator
apps/training-operator/upstream/overlays/kubeflow \

# User Namespace
common/user-namespace/base \
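
As mentioned above, a minimal sketch of scripting this sequence (assuming the kubeflow/manifests repository root as the working directory; the retry loop is an assumption added here to ride out CRD-registration races, not part of the original commands):

#!/bin/bash
set -u

# Components in the order listed above (abbreviated here).
COMPONENTS=(
  common/cert-manager/cert-manager/base
  common/cert-manager/kubeflow-issuer/base
  # ... the remaining components, in the same order ...
  common/user-namespace/base
)

for component in "${COMPONENTS[@]}"; do
  # Retry until the apply succeeds: CRDs created by an earlier component
  # can take a few seconds before the API server starts serving them.
  until kustomize build "$component" | kubectl apply -f -; do
    echo "Retrying ${component} ..."
    sleep 10
  done
done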

Platform

macOS: 11.6.1
MacBookPro 2019 (Intel), 16GB RAM

Software Versions:

k3d version
k3d version v5.1.0
k3s version v1.21.5-k3s2 (default)


docker version
Client:
 Cloud integration: v1.0.22
 Version:           20.10.11
 API version:       1.41
 Go version:        go1.16.10
 Git commit:        dea9396
 Built:             Thu Nov 18 00:36:09 2021
 OS/Arch:           darwin/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.11
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.16.9
  Git commit:       847da18
  Built:            Thu Nov 18 00:35:39 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.12
  GitCommit:        7b11cfaabd73bb80907dd23182b9347b4245eb5d
 runc:
  Version:          1.0.2
  GitCommit:        v1.0.2-0-g52b36a2
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.4", GitCommit:"b695d79d4f967c403a96986f1750a35eb75e75f1", GitTreeState:"clean", BuildDate:"2021-11-17T15:48:33Z", GoVersion:"go1.16.10", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.5+k3s2", GitCommit:"724ef700bab896ff252a75e2be996d5f4ff1b842", GitTreeState:"clean", BuildDate:"2021-10-05T19:59:14Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}


kustomize version
Version: {KustomizeVersion:3.2.0 GitCommit:a3103f1e62ddb5b696daa3fd359bb6f2e8333b49 BuildDate:2019-09-18T16:26:36Z GoOs:darwin GoArch:amd64}

k3d cluster nodes

kubectl get node -o wide
NAME                    STATUS   ROLES                  AGE   VERSION        INTERNAL-IP   EXTERNAL-IP   OS-IMAGE   KERNEL-VERSION     CONTAINER-RUNTIME
k3d-kubeflow-server-0   Ready    control-plane,master   79m   v1.21.5+k3s2   172.19.0.2    <none>        Unknown    5.10.76-linuxkit   containerd://1.4.11-k3s1
k3d-kubeflow-agent-0    Ready    <none>                 79m   v1.21.5+k3s2   172.19.0.3    <none>        Unknown    5.10.76-linuxkit   containerd://1.4.11-k3s1

ulimit settings on the two nodes:

ulimit -a    # on server node

core file size (blocks)         (-c) 0
data seg size (kb)              (-d) unlimited
scheduling priority             (-e) 0
file size (blocks)              (-f) unlimited
pending signals                 (-i) 51481
max locked memory (kb)          (-l) 64
max memory size (kb)            (-m) unlimited
open files                      (-n) 1048576
POSIX message queues (bytes)    (-q) 819200
real-time priority              (-r) 0
stack size (kb)                 (-s) 8192
cpu time (seconds)              (-t) unlimited
max user processes              (-u) unlimited
virtual memory (kb)             (-v) unlimited
file locks                      (-x) unlimited

ulimit -a   # on worker node
core file size (blocks)         (-c) 0
data seg size (kb)              (-d) unlimited
scheduling priority             (-e) 0
file size (blocks)              (-f) unlimited
pending signals                 (-i) 51481
max locked memory (kb)          (-l) 64
max memory size (kb)            (-m) unlimited
open files                      (-n) 1048576
POSIX message queues (bytes)    (-q) 819200
real-time priority              (-r) 0
stack size (kb)                 (-s) 8192
cpu time (seconds)              (-t) unlimited
max user processes              (-u) unlimited
virtual memory (kb)             (-v) unlimited
file locks                      (-x) unlimited
@kimwnasptd
Member

@jimthompson5802 I've also seen this happening in a KinD cluster I had, for the same Deployments. In my case I mitigated these errors by increasing my laptop's fs.inotify.max_user_{watches,instances} settings.

Not sure if this will also work for k3s though.
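
For context on why this helps even when ulimit -n is huge: when a process exhausts fs.inotify.max_user_instances, inotify_init fails with EMFILE, which Go programs report as "too many open files". On a Linux host, inspecting and raising the limits looks roughly like this (the values are illustrative, 10x common defaults):

sysctl fs.inotify.max_user_instances fs.inotify.max_user_watches   # inspect current values
sudo sysctl -w fs.inotify.max_user_instances=1280
sudo sysctl -w fs.inotify.max_user_watches=655360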

@jimthompson5802
Author

@kimwnasptd thank you for the suggestion.

> In my case I mitigated these errors by increasing my laptop's fs.inotify.max_user_{watches,instances} settings.

I just want to confirm that the parameter names you cited are from Linux. If that's correct, then I believe the equivalent parameters on macOS are these:

$ sysctl -a | grep "kern.maxfiles"
kern.maxfiles: 16777216
kern.maxfilesperproc: 524288

My belief about the parameter names comes from this posting.

If this is the case, then the change does not seem to have worked. What values did you use to get KinD to work?

Again, thank you for taking the time to respond to my question.
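
A note on the macOS side of this: fs.inotify.* are Linux kernel settings with no direct macOS equivalent (macOS file watching uses kqueue/FSEvents), and kern.maxfiles governs file descriptors, a different resource. With k3d on macOS, the k3s nodes share the Linux kernel of Docker Desktop's VM, so a hedged workaround is to raise the limits inside the node containers themselves (node names taken from the kubectl get node output above; values illustrative; this relies on k3d's node containers running privileged, which they do by default):

docker exec k3d-kubeflow-server-0 sh -c 'echo 1280 > /proc/sys/fs/inotify/max_user_instances && echo 655360 > /proc/sys/fs/inotify/max_user_watches'
docker exec k3d-kubeflow-agent-0 sh -c 'echo 1280 > /proc/sys/fs/inotify/max_user_instances && echo 655360 > /proc/sys/fs/inotify/max_user_watches'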

@skothawa-tibco

I am also facing a similar issue on KinD. Some of the pods are going into CrashLoopBackOff state. The error is as below:
Error starting filewatcher: 'too many open files'. Configuration changes will not be detected!

@kimwnasptd Can you please share the equivalent settings (fs.inotify.max_user_{watches,instances}) for the Mac?
Thanks.

@bartgras

@jimthompson5802 @skothawa-tibco
I'm also using a Mac, Docker Desktop, and k3d.
What worked for me was to open Docker Desktop's Preferences -> Docker Engine and add this to the config:

  "default-ulimits": {
    "nofile": {
      "Soft": 640000,
      "Hard": 640000,
      "Name": "nofile"
    }
  },

This is simply 10x the defaults from the Docker daemon configuration.

Then restart Docker.

After killing all the crashing pods, they were recreated successfully.

Not sure if this is related, because I also made another change before restarting Docker: check whether your mysql pod is failing due to too many open files or something else. Mine was crashing with a `--initialize specified but the data directory has files in it` error; I simply deleted both the mysql PV and PVC and recreated them from the manifests.
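
A quick hedged check that the new limit actually reached the nodes after the restart (the container name here is from earlier in this thread; substitute your own):

docker exec k3d-kubeflow-server-0 sh -c 'ulimit -n'   # should print the new nofile value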

@skothawa-tibco

I am using a Mac, Docker Desktop, and KinD. @jimthompson5802 I tried the above settings but had no luck. The settings were updated both on the host machine and in the Docker daemon configuration, but the issue remains the same.
Can someone please look into this?

  "default-ulimits": {
    "nofile": {
      "Soft": 640000,
      "Hard": 640000,
      "Name": "nofile"
    }
  },
ulimit -a
-t: cpu time (seconds)              unlimited
-f: file size (blocks)              unlimited
-d: data seg size (kbytes)          unlimited
-s: stack size (kbytes)             8192
-c: core file size (blocks)         0
-v: address space (kbytes)          unlimited
-l: locked-in-memory size (kbytes)  unlimited
-u: processes                       2048
-n: file descriptors                524288

@bartgras

@skothawa-tibco
Out of curiosity, could you run launchctl limit maxfiles 200000 in a terminal, restart Docker, kill the failing containers, and see if that helps?

@skothawa-tibco

On the host machine's terminal we see the values below:

launchctl limit maxfiles
	maxfiles    524288         5242880

After exec-ing into the worker node, I get the error below:

docker exec -it 85754ed34564 bash
bash-5.0# launchctl limit maxfiles
bash: launchctl: command not found
bash-5.0#

Below are the running KinD containers:
docker ps -a
CONTAINER ID   IMAGE                                  COMMAND                  CREATED          STATUS          PORTS                                                                NAMES

85754ed34564   kindest/node:v1.20.7                   "/usr/local/bin/entr…"   32 minutes ago   Up 32 minutes                                                                        worker
f072d9bc99b1   kindest/node:v1.20.7                   "/usr/local/bin/entr…"   32 minutes ago   Up 32 minutes   0.0.0.0:80->80/tcp, 0.0.0.0:443->443/tcp, 127.0.0.1:6443->6443/tcp   control-plane
5800621e01fd   rpardini/docker-registry-proxy:0.6.3   "/entrypoint.sh"         32 minutes ago   Up 32 minutes   80/tcp, 3128/tcp, 8081-8082/tcp                                      registry-proxy

The ulimit values inside the worker node:

docker exec -it 85754ed34564 bash
root@tibco-cic-worker:/# ulimit -a
real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) 0
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 95734
max locked memory           (kbytes, -l) 64
max memory size             (kbytes, -m) unlimited
open files                          (-n) 640000
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) unlimited
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited

OS details: macOS Monterey 12.1

@bartgras We already have values greater than the ones you suggested. Let me know if there are any other pointers to try.

@stale

stale bot commented Apr 16, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in one week if no further activity occurs. Thank you for your contributions.

@mstopa

mstopa commented Apr 18, 2022

> @jimthompson5802 I've also seen this happening in a KinD cluster I had, for the same Deployments. In my case I mitigated these errors by increasing my laptop's fs.inotify.max_user_{watches,instances} settings.
>
> Not sure if this will also work for k3s though.

This worked for me too:

sudo sysctl fs.inotify.max_user_instances=1280
sudo sysctl fs.inotify.max_user_watches=655360

(10x the previous values) solved this problem on a k0s instance.
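
Note that sysctl changes made this way do not survive a reboot; a hedged sketch for persisting them on a systemd-based Linux host (the file name is conventional, not prescribed):

cat <<'EOF' | sudo tee /etc/sysctl.d/99-inotify.conf
fs.inotify.max_user_instances = 1280
fs.inotify.max_user_watches = 655360
EOF
sudo sysctl --system   # reload settings from all sysctl configuration files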

The stale bot removed the lifecycle/stale label on Apr 18, 2022.
@minchang

> > @jimthompson5802 I've also seen this happening in a KinD cluster I had, for the same Deployments. In my case I mitigated these errors by increasing my laptop's fs.inotify.max_user_{watches,instances} settings.
> > Not sure if this will also work for k3s though.
>
> This worked for me too:
>
> sudo sysctl fs.inotify.max_user_instances=1280
> sudo sysctl fs.inotify.max_user_watches=655360
>
> (10x the previous values) solved this problem on a k0s instance.

Thanks for saving me time. I solved my issue with the above commands.

@DomFleischmann
Contributor

I've hit this issue today while testing 1.6 on MicroK8s. The pods affected were katib-controller, kubeflow-profiles, kfp-api, and kfp-persistence.

@mstopa's workaround did fix it, but I'm wondering if we are doing something wrong in these components for this to occur. Could we be more efficient in the way we lease API watchers?

@juliusvonkohout
Member

/close

There has been no activity for a long time. Please reopen if necessary.

@google-oss-prow

@juliusvonkohout: Closing this issue.

In response to this:

> /close
>
> There has been no activity for a long time. Please reopen if necessary.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
