too many open files - Error #1791

Closed · mrkwtz opened this issue Mar 31, 2022 · 26 comments
Labels: bug (Something isn't working), pinned

@mrkwtz commented Mar 31, 2022

Describe the bug
When a Sensor starts, it always throws the following error and then exits with ExitCode 1:

{"level":"info","ts":1648741781.4416764,"logger":"argo-events.sensor","caller":"cmd/start.go:73","msg":"starting sensor server","sensorName":"kafka","version":"v1.6.0"}                                          
{"level":"info","ts":1648741781.4422603,"logger":"argo-events.sensor","caller":"metrics/metrics.go:172","msg":"starting metrics server","sensorName":"kafka"}                                                     
2022/03/31 15:49:41 too many open files   

Unfortunately there is no additional information. I already looked at the nodes it's running on for file descriptor exhaustion, but everything looks good there.

When the Sensor runs on a fresh node, though, it works fine. But we can't always start fresh nodes, and the affected nodes are fine regarding overall resource utilization.

To Reproduce
Steps to reproduce the behavior:

  1. Start any sensor with a Kafka EventSource (I did not test whether it also happens with other sources)

Expected behavior
It starts up normally.

Environment (please complete the following information):

  • Kubernetes: [e.g. v1.19.15-eks-9c63c4]
  • Argo: v3.2.9
  • Argo Events: 1.6.0

Message from the maintainers:

If you wish to see this enhancement implemented please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.

@mrkwtz added the bug label Mar 31, 2022
@alexec (Contributor) commented Mar 31, 2022

Can you please help us by identifying the open file handles? This will help us diagnose this issue. I think you can run lsof to list open file handles. You will need to use an ephemeral container to do this. I do not know if EKS supports this yet (AWS is slow on upgrades).
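
Something like the following might work for that, assuming the main container is named "main" and using an image that ships lsof (e.g. nicolaka/netshoot) - untested sketch, adjust pod and namespace as needed:

# attach an ephemeral debug container that shares the target container's process namespace
kubectl debug -it <sensor-pod> -n <namespace> --image=nicolaka/netshoot --target=main -- sh
# then, inside the debug container, list and count the open file handles
lsof | wc -l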

@alexec (Contributor) commented Mar 31, 2022

Note to self: it looks like it does not start up correctly; the metrics server starts and then we get "too many open files".

@whynowy (Member) commented Mar 31, 2022

@mrkwtz - Can you check the node's maximum allowed open files (cat /proc/sys/fs/file-max), and also the current number of open files (lsof | wc -l)?

@mrkwtz (Author) commented Apr 1, 2022

Yeah, it seems like even the latest EKS version doesn't support this feature yet. Do you have another idea how to debug that?

@whynowy (Member) commented Apr 4, 2022

Is there a way to log in to the EKS node?

@github-actions (bot) commented Jun 4, 2022

This issue has been automatically marked as stale because it has not had
any activity in the last 60 days. It will be closed if no further activity
occurs. Thank you for your contributions.

@github-actions bot added the stale label Jun 4, 2022
@vadimgusev-codefresh

@whynowy @alexec I've experienced this bug when I tried to switch our EKS nodes from AL2 to Bottlerocket.
Exec'ing into a random pod and running ulimit -n gives me:

  • 1048576 on AL2 instance
  • 65536 on Bottlerocket instance

If I dive into the Bottlerocket instance and run containerd config dump, I see this part:

  [plugins."io.containerd.grpc.v1.cri"]
    ...
    process_rlimit_no_file_hard = 1048576
    process_rlimit_no_file_soft = 65536

So as I understand this, a process inside a pod can raise its soft limit up to the hard limit, and the sensor pod just doesn't do that.
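
For reference, a rough way to inspect (and bump) the soft limit of an already-running process from the node, assuming util-linux's prlimit is available (sketch only, adjust the PID and values):

# show the current soft/hard RLIMIT_NOFILE of a process
prlimit --pid <pid> --nofile
# raise the soft limit up to the existing hard limit (needs sufficient privileges)
prlimit --pid <pid> --nofile=1048576:1048576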

@github-actions bot removed the stale label Jun 10, 2022
@whynowy (Member) commented Jun 10, 2022

@vadimgusev-codefresh - Is this something that can be done in the Dockerfile?

@drpebcak

I am also hitting this issue. I see this on sensors and event sources. Rescheduling onto a 'fresh' box does seem to help for a while, but it does eventually start failing. I see this with sqs and webhook (the two that I have tried).
The log:

{"level":"info","ts":1659067244.7226589,"logger":"argo-events.eventsource","caller":"cmd/start.go:63","msg":"starting eventsource server","eventSourceName":"sqs-argo-event-source","version":"v1.7.1"}
{"level":"info","ts":1659067244.7227285,"logger":"argo-events.eventsource","caller":"eventsources/eventing.go:403","msg":"Starting event source server...","eventSourceName":"sqs-argo-event-source"}
{"level":"info","ts":1659067244.7228265,"logger":"argo-events.eventsource","caller":"metrics/metrics.go:172","msg":"starting metrics server","eventSourceName":"sqs-argo-event-source"}
2022/07/29 04:00:44 too many open files

From the node:

root@box:/# lsof |wc -l
534122
root@box:/# cat /proc/sys/fs/file-max
9223372036854775807

I currently cannot attach an ephemeral container because the pod is in 'Error' - it cannot start the main container... just keeps crashing. After forcing it to reschedule on a different node, here is what the ephemeral container says once it has started up:

/ # cat /proc/sys/fs/file-max
9223372036854775807
/ # lsof |wc -l
21

@whynowy (Member) commented Jul 29, 2022

Thanks for sharing all of these!

@whynowy (Member) commented Jul 29, 2022

In the EventSource and Sensor specs, there's a way to configure the pod SecurityContext:

spec:
  template:
    securityContext:
      sysctls:
      - name: fs.file-max
        value: "your-value"

Try it out.

@drpebcak

To my knowledge, fs.file-max is not namespaced, and so cannot be set this way.

@whynowy added the pinned label Sep 26, 2022
@mrkwtz (Author) commented Oct 28, 2022

We're finally able to SSH into our nodes. Unfortunately, because we're out of inotify user watches / instances (we don't know which yet), we're not :D

It turns out the fs.inotify.max_user_instances setting is pretty low on our nodes, although it should be higher. You can take a look here: awslabs/amazon-eks-ami#1065. So first we're going to investigate why that's the case, and if the problem still persists I'll post again.
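
For anyone else checking this, something along these lines (run directly on the node, untested sketch) should show the configured limits and how many inotify instances are currently in use:

# configured inotify limits on the node
sysctl fs.inotify.max_user_instances fs.inotify.max_user_watches
# rough count of inotify instances currently held across all processes
find /proc/*/fd -lname 'anon_inode:inotify' 2>/dev/null | wc -l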

@mrkwtz (Author) commented Nov 2, 2022

We found the culprit. The privileged promtail pods (<= v3.0.3) are setting fs.inotify.max_user_instances to 128. We'll upgrade to at least v3.0.4 and then that error should be gone. Closed for now.

Thank you for your patience with us :)

@mrkwtz closed this as completed Nov 2, 2022
@vadimgusev-codefresh

@mrkwtz that's an interesting finding! We are also using loki/promtail, so it might affect us as well; we'll investigate!

@whynowy (Member) commented Nov 3, 2022

Nice! @mrkwtz

@nice-pink commented Jun 14, 2023

The issue still exists for us. We're running argo-events 1.8.0, argo-cd 2.6.7, and Promtail helm chart 6.11.3 (app version 2.8.2).
The nodes look healthy and the issues appear randomly in sensors and/or event sources.

{"level":"info","ts":1686750556.961388,"logger":"argo-events.eventsource","caller":"cmd/start.go:63","msg":"starting eventsource server","eventSourceName":"web-apps","version":"v1.8.0"}
2
{"level":"info","ts":1686750556.9620998,"logger":"argo-events.eventsource","caller":"metrics/metrics.go:175","msg":"starting metrics server","eventSourceName":"web-apps"}
1
2023/06/14 13:49:16 too many open files

From the node I get:

/ # cat /proc/sys/fs/file-max
9223372036854775807
/ # lsof |wc -l
5410
/ # lsof |wc -l
5411
/ # lsof |wc -l
5412
/ # lsof |wc -l
5410

The issue appeared some weeks ago all of a sudden and now randomly comes and goes.

@nice-pink commented Jun 15, 2023

I added a fresh node pool and moved about 30 sensors to the empty node, and got the same error in some of the sensors. Also, the sensor pod never auto-recovers. After deleting the pod multiple times it may start properly, but often it doesn't.

If multiple sensors are created at the same time, the chance for them to fail seems higher. Also, if there are more sensors on the same node, the chance to fail seems higher.

The logs differ sometimes. Occasionally it is:

2023/06/15 08:20:39 too many open files
{"level":"info","ts":1686817239.5714653,"logger":"argo-events.sensor","caller":"cmd/start.go:84","msg":"starting sensor server","sensorName":"pengine-test","version":"v1.8.0"}
Stream closed EOF for argo/pengine-test-sensor-jt82s-7b84f7b8c7-z28hs (main)

So the metrics server isn't necessarily started before the error occurs.

I think this ticket should be reopened; in this state Argo Events isn't working reliably at all.

@nice-pink

I finally found a way to fix it.

Execute on node(s) where sensors/event sources are deployed:

sysctl fs.inotify.max_user_instances=1280
sysctl fs.inotify.max_user_watches=655360

kubeflow/manifests#2087 (comment)
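
Note that sysctl on its own only changes the running kernel; to make it survive a reboot, something like this should work (same values as above, assuming an /etc/sysctl.d-style setup, adjust as needed):

# persist the settings and reload them
printf 'fs.inotify.max_user_instances=1280\nfs.inotify.max_user_watches=655360\n' | sudo tee /etc/sysctl.d/99-inotify.conf
sudo sysctl --system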

@jameshearttech commented Aug 4, 2023

I used the inotify-consumers script to check the max_user_instances and max_user_watches.

The default values on Ubuntu 22.04.2 LTS at the time of this writing are:

  • fs.inotify.max_user_instances=128
  • fs.inotify.max_user_watches=249493

The result of the script shows user_watches < 1000, so we are nowhere near that limit, but user_instances is 127 for root. I increased this to 256 by running sudo sysctl fs.inotify.max_user_instances=256 on all worker nodes and confirmed the issue is resolved after restarting the sensor ReplicaSet.
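
In case it helps others debugging this, a rough way to see which processes are holding inotify instances (untested one-liner, run on the node):

# count inotify instances per process name
find /proc/*/fd -lname 'anon_inode:inotify' 2>/dev/null | cut -d/ -f3 | xargs -r -I{} cat /proc/{}/comm | sort | uniq -c | sort -rn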

@AlexanderWiechert

Why is this closed? The issue still exists with argo-events 1.8.1. None of the above ideas for checking the nodes were successful.

{"level":"info","ts":1695890455.5424418,"logger":"argo-events.eventsource","caller":"cmd/start.go:63","msg":"starting eventsource server","eventSourceName":"webhook","version":"v1.8.1"}
{"level":"info","ts":1695890455.5425568,"logger":"argo-events.eventsource","caller":"eventsources/eventing.go:443","msg":"Starting event source server...","eventSourceName":"webhook"}
ERROR 2023/09/28 08:40:55 failed to create watcher: too many open files

@nice-pink

Further, even if it works, it has to be set manually on the nodes, which still makes this issue a big pain.

@charles-horel-rogers

+1
How would anyone be able to consider using this app in a KaaS environment where they have little to no control over the node environment?

This should be reopened and fixed, OR it should be documented in the Argo Events requirements (not just the K8s requirements), as this will lead to people wasting time trying to implement this in an environment that does not support it.

@freeo commented Aug 9, 2024

Using bare-metal Ubuntu 22.04 LTS server with k3s, I can confirm that this default value is the bottleneck for me as well:
fs.inotify.max_user_instances=128

For now I've added this to my bootstrapping script for cluster nodes:

    sh -c 'echo "fs.inotify.max_user_instances=1024" >> /etc/sysctl.conf'

But I expect this issue to come back again in the future.

Can argo-events even fix this, or is it just out-of-scope? What would a sensible default value be for large scale argo-events deployments?

@thomas-dussouillez commented Jan 22, 2025

Hello,
I have the same problem in a GKE environment; I cannot update all nodes manually every time the cluster scales new nodes up or down.
Can't argo-events come up with an actual solution?
I don't think increasing max_user_instances will be a durable solution.
