too many open files - Error #1791

Closed · mrkwtz opened this issue Mar 31, 2022 · 26 comments
Labels: bug (Something isn't working), pinned

@mrkwtz commented Mar 31, 2022

Describe the bug
When a Sensor starts, it always throws the following error and then exits with ExitCode 1:

{"level":"info","ts":1648741781.4416764,"logger":"argo-events.sensor","caller":"cmd/start.go:73","msg":"starting sensor server","sensorName":"kafka","version":"v1.6.0"}                                          
{"level":"info","ts":1648741781.4422603,"logger":"argo-events.sensor","caller":"metrics/metrics.go:172","msg":"starting metrics server","sensorName":"kafka"}                                                     
2022/03/31 15:49:41 too many open files   

Unfortunately there is no additional information. I already looked at the nodes it's running on for file descriptor exhaustion, but everything looks good there.

When the Sensor runs on a fresh node, though, it works fine. But we can't always start fresh nodes, and the affected nodes are fine regarding overall resource utilization.

To Reproduce
Steps to reproduce the behavior:

  1. Start any sensor with a Kafka EventSource (I did not test whether it also happens with other sources)

Expected behavior
It starts up normally.

Environment (please complete the following information):

  • Kubernetes: [e.g. v1.19.15-eks-9c63c4]
  • Argo: v3.2.9
  • Argo Events: 1.6.0

Message from the maintainers:

If you wish to see this enhancement implemented please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.

@mrkwtz added the bug label Mar 31, 2022
@alexec (Contributor) commented Mar 31, 2022

Can you please help us by identifying the open file handles? This will help us diagnose this issue. I think you can run lsof to list open file handles. You will need to use an ephemeral container to do this. I do not know if EKS supports this yet (AWS is slow on upgrades).
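
Something like the following might work for that, assuming the main container is named "main" and using an image that ships lsof (e.g. nicolaka/netshoot) - untested sketch, adjust pod and namespace as needed:

# attach an ephemeral debug container that shares the target container's process namespace
kubectl debug -it <sensor-pod> -n <namespace> --image=nicolaka/netshoot --target=main -- sh
# then, inside the debug container, list and count the open file handles
lsof | wc -l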

@alexec (Contributor) commented Mar 31, 2022

Note to self: it looks like it does not start up correctly; the metrics server starts and then we get "too many open files".

@whynowy (Member) commented Mar 31, 2022

@mrkwtz - Can you check the node's maximum allowed open files (cat /proc/sys/fs/file-max), and also the current number of open files (lsof | wc -l)?

@mrkwtz (Author) commented Apr 1, 2022

Yeah, it seems like even the latest EKS version doesn't support this feature yet. Do you have another idea how to debug that?

@whynowy (Member) commented Apr 4, 2022

Is there a way to log in to the EKS node?

@github-actions (bot) commented Jun 4, 2022

This issue has been automatically marked as stale because it has not had
any activity in the last 60 days. It will be closed if no further activity
occurs. Thank you for your contributions.

@github-actions bot added the stale label Jun 4, 2022
@vadimgusev-codefresh

@whynowy @alexec I've experienced this bug when I tried to switch our EKS nodes from AL2 to Bottlerocket.
Exec'ing into a random pod and running ulimit -n gives me:

  • 1048576 on AL2 instance
  • 65536 on Bottlerocket instance

If I dive into the Bottlerocket instance and run containerd config dump, I see this part:

  [plugins."io.containerd.grpc.v1.cri"]
    ...
    process_rlimit_no_file_hard = 1048576
    process_rlimit_no_file_soft = 65536

So as I understand this, a process inside a pod can raise its soft limit up to the hard limit, and the sensor pod just doesn't do that.
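
For reference, a rough way to inspect (and bump) the soft limit of an already-running process from the node, assuming util-linux's prlimit is available (sketch only, adjust the PID and values):

# show the current soft/hard RLIMIT_NOFILE of a process
prlimit --pid <pid> --nofile
# raise the soft limit up to the existing hard limit (needs sufficient privileges)
prlimit --pid <pid> --nofile=1048576:1048576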

@github-actions bot removed the stale label Jun 10, 2022
@whynowy (Member) commented Jun 10, 2022

@vadimgusev-codefresh - Is this something that can be done in the Dockerfile?

@drpebcak

I am also hitting this issue. I see this on sensors and event sources. Rescheduling onto a 'fresh' box does seem to help for a while, but it does eventually start failing. I see this with sqs and webhook (the two that I have tried).
The log:

{"level":"info","ts":1659067244.7226589,"logger":"argo-events.eventsource","caller":"cmd/start.go:63","msg":"starting eventsource server","eventSourceName":"sqs-argo-event-source","version":"v1.7.1"}
{"level":"info","ts":1659067244.7227285,"logger":"argo-events.eventsource","caller":"eventsources/eventing.go:403","msg":"Starting event source server...","eventSourceName":"sqs-argo-event-source"}
{"level":"info","ts":1659067244.7228265,"logger":"argo-events.eventsource","caller":"metrics/metrics.go:172","msg":"starting metrics server","eventSourceName":"sqs-argo-event-source"}
2022/07/29 04:00:44 too many open files

From the node:

root@box:/# lsof |wc -l
534122
root@box:/# cat /proc/sys/fs/file-max
9223372036854775807

I currently cannot attach an ephemeral container because the pod is in 'Error' - it cannot start the main container... just keeps crashing. After forcing it to reschedule on a different node, here is what the ephemeral container says once it has started up:

/ # cat /proc/sys/fs/file-max
9223372036854775807
/ # lsof |wc -l
21

@whynowy (Member) commented Jul 29, 2022

Thanks for sharing all of these!

@whynowy (Member) commented Jul 29, 2022

In the EventSource and Sensor specs, there's a way to configure the pod SecurityContext:

spec:
  template:
    securityContext:
      sysctls:
      - name: fs.file-max
        value: "your-value"

Try it out.

@drpebcak

To my knowledge, fs.file-max is not namespaced, and so cannot be set this way.

@whynowy added the pinned label Sep 26, 2022
@mrkwtz (Author) commented Oct 28, 2022

We're finally able to SSH into our nodes. Unfortunately, because we're out of inotify user watches / instances (we don't know which yet), we're not :D

It turns out the fs.inotify.max_user_instances setting is pretty low on our nodes, although it should be higher. You can take a look here: awslabs/amazon-eks-ami#1065. So first we're going to investigate why that's the case, and if the problem still persists I'll post again.
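
For anyone else checking this, something along these lines (run directly on the node, untested sketch) should show the configured limits and how many inotify instances are currently in use:

# configured inotify limits on the node
sysctl fs.inotify.max_user_instances fs.inotify.max_user_watches
# rough count of inotify instances currently held across all processes
find /proc/*/fd -lname 'anon_inode:inotify' 2>/dev/null | wc -l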

@mrkwtz (Author) commented Nov 2, 2022

We found the culprit. The privileged promtail pods (<= v3.0.3) are setting fs.inotify.max_user_instances to 128. We'll upgrade to at least v3.0.4 and then that error should be gone. Closed for now.

Thank you for your patience with us :)

@mrkwtz closed this as completed Nov 2, 2022
@vadimgusev-codefresh

@mrkwtz that's an interesting finding! We are also using loki/promtail, so it might affect us as well; we'll investigate!

@whynowy (Member) commented Nov 3, 2022

Nice! @mrkwtz

@nice-pink commented Jun 14, 2023

The issue still exists for us. We're running argo-events 1.8.0, argo-cd 2.6.7, and Promtail helm chart 6.11.3 (app version 2.8.2).
The nodes look healthy and the issues appear randomly in sensors and/or event sources.

{"level":"info","ts":1686750556.961388,"logger":"argo-events.eventsource","caller":"cmd/start.go:63","msg":"starting eventsource server","eventSourceName":"web-apps","version":"v1.8.0"}
2
{"level":"info","ts":1686750556.9620998,"logger":"argo-events.eventsource","caller":"metrics/metrics.go:175","msg":"starting metrics server","eventSourceName":"web-apps"}
1
2023/06/14 13:49:16 too many open files

From the node I get:

/ # cat /proc/sys/fs/file-max
9223372036854775807
/ # lsof |wc -l
5410
/ # lsof |wc -l
5411
/ # lsof |wc -l
5412
/ # lsof |wc -l
5410

The issue appeared some weeks ago all of a sudden and now randomly comes and goes.

@nice-pink commented Jun 15, 2023

I added a fresh node pool and moved about 30 sensors to the empty node, and got the same error in some of the sensors. Also, the sensor pod never auto-recovers. After deleting the pod multiple times it may start properly, but often it doesn't.

If multiple sensors are created at the same time, the chance for them to fail seems higher. Also, if there are more sensors on the same node, the chance to fail seems higher.

The logs differ sometimes. Occasionally it is:

2023/06/15 08:20:39 too many open files
{"level":"info","ts":1686817239.5714653,"logger":"argo-events.sensor","caller":"cmd/start.go:84","msg":"starting sensor server","sensorName":"pengine-test","version":"v1.8.0"}
Stream closed EOF for argo/pengine-test-sensor-jt82s-7b84f7b8c7-z28hs (main)

So the metrics server isn't necessarily started before the error occurs.

I think this ticket should be reopened; in this state Argo Events isn't working reliably at all.

@nice-pink

I finally found a way to fix it.

Execute on node(s) where sensors/event sources are deployed:

sysctl fs.inotify.max_user_instances=1280
sysctl fs.inotify.max_user_watches=655360

kubeflow/manifests#2087 (comment)
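
Note that sysctl on its own only changes the running kernel; to make it survive a reboot, something like this should work (same values as above, assuming an /etc/sysctl.d-style setup, adjust as needed):

# persist the settings and reload them
printf 'fs.inotify.max_user_instances=1280\nfs.inotify.max_user_watches=655360\n' | sudo tee /etc/sysctl.d/99-inotify.conf
sudo sysctl --system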

@jameshearttech commented Aug 4, 2023

I used the inotify-consumers script to check the max_user_instances and max_user_watches.

The default values on Ubuntu 22.04.2 LTS at the time of this writing are:

  • fs.inotify.max_user_instances=128
  • fs.inotify.max_user_watches=249493

The result of the script shows user_watches < 1000, so we are nowhere near that limit, but user_instances is 127 for root. I increased this to 256 by running sudo sysctl fs.inotify.max_user_instances=256 on all worker nodes and confirmed the issue is resolved after restarting the sensor ReplicaSet.
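
In case it helps others debugging this, a rough way to see which processes are holding inotify instances (untested one-liner, run on the node):

# count inotify instances per process name
find /proc/*/fd -lname 'anon_inode:inotify' 2>/dev/null | cut -d/ -f3 | xargs -r -I{} cat /proc/{}/comm | sort | uniq -c | sort -rn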

@AlexanderWiechert

Why is this closed? The issue still exists with argo-events 1.8.1. None of the above ideas for checking the nodes were successful.

{"level":"info","ts":1695890455.5424418,"logger":"argo-events.eventsource","caller":"cmd/start.go:63","msg":"starting eventsource server","eventSourceName":"webhook","version":"v1.8.1"}
{"level":"info","ts":1695890455.5425568,"logger":"argo-events.eventsource","caller":"eventsources/eventing.go:443","msg":"Starting event source server...","eventSourceName":"webhook"}
ERROR 2023/09/28 08:40:55 failed to create watcher: too many open files

@nice-pink

Further, even if it works, it has to be set manually on the nodes, which still makes this issue a big pain.

@charles-horel-rogers

+1
How would anyone be able to consider using this app in a KaaS environment where they have little to no control over the node environment?

This should be reopened and fixed, OR it should be documented in the Argo Events requirements (not just the K8s requirements), as this will lead to people wasting time trying to implement this in an environment that does not support it.

@freeo commented Aug 9, 2024

Using bare-metal Ubuntu 22.04 LTS server with k3s, I can confirm that this default value is the bottleneck for me as well:
fs.inotify.max_user_instances=128

For now I've added this to my bootstrapping script for cluster nodes:

    sh -c 'echo "fs.inotify.max_user_instances=1024" >> /etc/sysctl.conf'

But I expect this issue to come back again in the future.

Can argo-events even fix this, or is it just out-of-scope? What would a sensible default value be for large scale argo-events deployments?

@thomas-dussouillez commented Jan 22, 2025

Hello,
I have the same problem in a GKE environment; I cannot update all nodes manually every time the cluster scales new nodes up or down.
Can't argo-events come up with an actual solution?
I don't think increasing max_user_instances will be a durable solution.
