too many open files - Error #1791
Comments
Can you please help us by identifying the open file handles? This will help us diagnose the issue. I think you can run
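The command referenced above did not survive extraction. As a rough sketch (not necessarily what the maintainer had in mind), one way to count the open file descriptors of the sensor process, assuming shell access to the pod or the node, is:

```bash
# Count open file descriptors of PID 1 inside the sensor container
# (namespace and pod name are placeholders)
kubectl exec -n argo-events <sensor-pod> -- sh -c 'ls /proc/1/fd | wc -l'

# Or, from the node, for a given process ID (requires lsof on the node):
lsof -p <pid> | wc -l
```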
Note to self - it looks like it does not start up correctly: the metrics server starts and then we get "too many open files".
@mrkwtz - Can you check the node's maximum allowed open files?
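As a sketch of that check, run directly on the node, the following should show the relevant limits (illustrative, not the exact command being requested):

```bash
# System-wide maximum number of open file handles
sysctl fs.file-max

# Allocated, unused, and maximum file handles
cat /proc/sys/fs/file-nr

# Per-process open-file limit for the current shell
ulimit -n
```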
Yeah, it seems like even the latest EKS version doesn't support this feature yet. Do you have another idea for how to debug this?
Is there a way to log in to the EKS node?
This issue has been automatically marked as stale because it has not had recent activity.
@whynowy @alexec I've experienced this bug when I tried to switch our EKS nodes from
If I dive into the
So as I understand this, the process inside a pod can extend its
@vadimgusev-codefresh - Is this something that can be done in the Dockerfile?
I am also hitting this issue. I see this on sensors and event sources. Rescheduling onto a 'fresh' box does seem to help for a while, but it does eventually start failing. I see this with sqs and webhook (the two that I have tried).
From the node:
I currently cannot attach an ephemeral container because the pod is in 'Error' - it cannot start the main container... just keeps crashing. After forcing it to reschedule on a different node, here is what the ephemeral container says once it has started up:
Thanks for sharing all of these!
In the spec:

```yaml
spec:
  template:
    securityContext:
      sysctls:
        - name: fs.file-max
          value: "your-value"
```

Try it out.
To my knowledge, fs.file-max is not namespaced, and so cannot be set this way.
We are finally able to SSH into our nodes. Unfortunately, because we're out of inotify user watches / instances (we don't know which yet), we're not :D It turns out the fs.inotify.max_user_instances setting is pretty low on our nodes, although it should be higher. You can take a look here: awslabs/amazon-eks-ami#1065. So first we're going to investigate why that's the case, and if the problem still persists I'll post again.
We found the culprit. The privileged promtail pods (<= v3.0.3) were setting fs.inotify.max_user_instances to 128. We'll upgrade to at least v3.0.4, and then that error should be gone. Closed for now. Thank you for your patience with us :)
@mrkwtz that's an interesting finding! We are also using Loki/promtail, so it might affect us as well; will investigate!
Nice! @mrkwtz
The issue still exists for us. We're running
From the node I get:
The issue appeared some weeks ago all of a sudden and now randomly comes and goes.
I added a fresh node pool and moved about 30 sensors to the empty node, and got the same error in some of the sensors. The sensor pod also never auto-recovers. After deleting the pod multiple times it may start properly, but often it doesn't. If multiple sensors are created at the same time, the chance that they fail seems higher. The chance of failure also seems higher when there are more sensors on the same node. The logs differ sometimes. Occasionally it is:
So the metrics server isn't necessarily started. I think this ticket should be reopened; in this state Argo Events isn't working reliably at all.
I finally found a way to fix it. Execute on node(s) where sensors/event sources are deployed:
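The exact commands did not survive extraction. Based on the inotify limits discussed above, a plausible illustration (the values are placeholders, not the poster's actual numbers) would be:

```bash
# Raise the per-user inotify limits at runtime (values are illustrative)
sudo sysctl -w fs.inotify.max_user_instances=512
sudo sysctl -w fs.inotify.max_user_watches=524288
```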
I used the inotify-consumers script to check max_user_instances and max_user_watches, which were at the default values for Ubuntu 22.04.2 LTS at the time of this writing.
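The inotify-consumers script is a community helper; as a rough equivalent, assuming shell access to the node, the limits and per-user instance counts can be inspected like this (a sketch, not the script itself):

```bash
# Current inotify limits
sysctl fs.inotify.max_user_instances fs.inotify.max_user_watches

# Count inotify instances currently in use, grouped by owning user
# (each anon_inode:inotify file descriptor is one instance)
sudo find /proc/*/fd -lname 'anon_inode:inotify' 2>/dev/null \
  | cut -d/ -f3 \
  | xargs -r -I{} stat -c '%U' /proc/{} \
  | sort | uniq -c | sort -rn
```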
The result of the script shows user_watches < 1000, so we are nowhere near that limit, but user_instances is 127 for root. I increased this to 256 and confirmed the issue is resolved after restarting the sensor ReplicaSet. I ran
Why is this closed? The issue still exists with argo-events 1.8.1. All of the above ideas for checking the nodes were not successful.
Furthermore, even if it works, it has to be set manually on the nodes, which makes this issue still a big pain.
+1 This should be reopened and fixed, OR it should be documented in the Argo Events requirements (not just the K8s ones), as this will lead to people wasting time trying to implement this in an environment that does not support it.
Using bare-metal Ubuntu 22.04 server LTS with k3s, I can confirm that this default value is the bottleneck for me as well. For now I've added this to my bootstrapping script for cluster nodes:
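The snippet itself was lost in extraction; a minimal sketch of a persistent version of the fix, assuming the same inotify limits as discussed above (file name and values are illustrative), could look like:

```bash
# Persist higher inotify limits across reboots (values are illustrative)
cat <<'EOF' | sudo tee /etc/sysctl.d/99-inotify.conf
fs.inotify.max_user_instances = 512
fs.inotify.max_user_watches = 524288
EOF
sudo sysctl --system
```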
But I expect this issue to come back again in the future. Can argo-events even fix this, or is it just out of scope? What would a sensible default value be for large-scale argo-events deployments?
Hello,
Describe the bug
When a Sensor starts, it always throws the following error and then exits with ExitCode 1.
Unfortunately there is no additional information. I already looked at the nodes it's running on for file descriptor exhaustion, but everything looks good there.
When the Sensor runs on a fresh node, though, it works fine. But we can't always start fresh nodes, and the affected nodes are fine in terms of overall resource utilization.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
It starts up normally.
Environment (please complete the following information):
Message from the maintainers:
If you wish to see this enhancement implemented please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.