Runner container fails to terminate when runner fails to register with GitHub #169
Comments
@ZacharyBenamram Hey! Thanks for reporting. Which version of the controller are you using? This is likely due to recent changes in the controller (I'm trying to improve the pre-release testing process to prevent things like this in the near future, but it isn't in place yet).
@mumoshu - I am running an older version of the controller, v0.10.0. I think this may be a persistent issue because the runsvc doesn't exit in the above case. And because it doesn't exit, this condition in runner_controller.go isn't triggered: I think we need to adjust the Docker container to exit on the condition where the runner is not configured.
since it's treated as a retryable error: I'm not sure what the best approach is. Do we not want to exit in this case? I think it might be a good thing to reconsider exiting. Or even better, let's create a retryable count limit. Will open a PR to address.
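To make the retryable-count-limit idea concrete, here is a minimal sketch in Go (the controller's language). `configureRunner`, `maxConfigRetries`, and the fixed backoff are hypothetical stand-ins rather than anything in actions-runner-controller or the runner image (whose real entrypoint is a shell script); it only illustrates bounding the retries and exiting non-zero so the container terminates instead of looping forever:

```go
package main

import (
	"fmt"
	"log"
	"os"
	"time"
)

// Hypothetical limit; a real implementation would likely make this configurable.
const maxConfigRetries = 5

// configureRunner stands in for whatever step registers the runner with GitHub
// (e.g. the config.sh invocation). It returns an error when registration fails.
func configureRunner() error {
	// ... invoke the actual registration step here ...
	return fmt.Errorf("runner is not configured")
}

func main() {
	for attempt := 1; attempt <= maxConfigRetries; attempt++ {
		if err := configureRunner(); err != nil {
			log.Printf("registration attempt %d/%d failed: %v", attempt, maxConfigRetries, err)
			time.Sleep(10 * time.Second) // simple fixed backoff between attempts
			continue
		}
		log.Print("runner registered; starting the service")
		return
	}
	// Exiting non-zero lets Kubernetes and the controller observe the failure
	// and recycle the pod, rather than the container retrying forever.
	os.Exit(1)
}
```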
@ZacharyBenamram Thanks! Before settling on a direction, may I ask something: do you have any insight into why your runner is saying "Not configured"? Would there be anything the controller could do (additional validation?) to prevent it from getting into that state?
It may have been a result of an artifact from the previous broken release, since I'm unable to reproduce it when testing for failed pod scheduling.
I'm running into the same issue with the latest version of summerwind/actions-runner-dind:v2.274.1.
@eb-trigo Hey! Are you sure you've set up all the Runner config properly? Like setting …
I've been seeing this failure a lot; it seems to happen due to an error in the ./config.sh step that the service assumes will be corrected with manual intervention. So far I've seen the config fail due to expired GitHub tokens and the GitHub API being unresponsive. There's an upstream request to change the behaviour so that the service eventually exits if it is not corrected (actions/runner#879), but pending that, an option might be to exit the entrypoint script if the config step does not report success?
We've been seeing this happen pretty frequently as well. The issue in our case seems to be expired GitHub tokens. If we leave the runner listening for too long, the tokens expire and are never refreshed, so we're never going to be able to successfully reconnect. I think @ZacharyBenamram had the right idea with retrying a few times to make sure it isn't something temporary, and then stopping things. @ZacharyBenamram - I'm happy to open a similar PR if you're busy with other things.
Related: #248 (comment). If the config.sh failure was due to an expired registration token, it shouldn't happen, as the controller should update the token and recreate the pod with the new token once it detects expiration. Maybe that reg token update/pod recreation code isn't working as we intended? 🤔
Ah yeah, that's probably what's happening if there is something in place to try and update the token and recreate the pod. Would it still make sense to have a max limit on the number of retries? Spinning up new pods in Kubernetes or OpenShift is pretty low cost, so maybe after some number of failures the controller just kills the pod and lets a new one come up in its place?
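As a sketch of that suggestion (the annotation key, helper name, and threshold below are hypothetical, not the controller's actual code), the reconcile path could track a per-pod registration-failure count and delete the pod once it crosses a threshold, letting the normal reconciliation create a fresh pod with a fresh token:

```go
package controllers

import (
	"context"
	"strconv"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

const (
	// Hypothetical annotation the controller could bump each time it sees the
	// "Not configured" failure for this pod.
	failureCountAnnotation  = "example.com/registration-failures"
	maxRegistrationFailures = 5
)

// recyclePodIfTooManyFailures deletes the pod once the recorded failure count
// reaches the threshold; the replacement pod is created by the controller's
// normal reconciliation and picks up a refreshed registration token.
func recyclePodIfTooManyFailures(ctx context.Context, c client.Client, pod *corev1.Pod) error {
	failures, _ := strconv.Atoi(pod.Annotations[failureCountAnnotation])
	if failures < maxRegistrationFailures {
		return nil
	}
	// Pod deletion is cheap on Kubernetes/OpenShift, so recycling after a few
	// failures is preferable to letting the pod spin indefinitely.
	return c.Delete(ctx, pod)
}
```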
Thanks for the suggestion! If that works, your pod should come up fine after you delete the pod today. Is that the case?
Yup, that's exactly what happens. The pod gets stuck in this retry loop, and then when I terminate it, the controller creates a new pod which connects successfully. If I leave it unused for about a day, it goes back into the retry loop, but otherwise it executes whatever job it picks up and terminates successfully.
@arjunmadan Thanks for the info! Whoa, interesting. Then the token might have been updated correctly, but the runner-controller might have been unable to recreate the pod with the new token without your manual intervention. Would you mind giving me the logs from your actions-runner-controller's …
Did some digging, and it's possible we're running into quota issues when the controller tries getting this pod back up after it refreshes the tokens. Have some logs below that show the token being refreshed. We're going to make those quota changes and see if the problem goes away. Will report back early next week.
Yup, so it turns out it was quota issues after all. After we got that sorted out, things seem to be working fine, and I can see the runner reconnects to GitHub after about a day of being unused.
I think I might be running into this issue as well. I set up the runners on Tuesday, and today, Thursday, they were all spewing out that "Not Configured" error. I deleted all the pods and they all recovered. Which quota did you change, and how, @arjunmadan?
Our issue turned out to be with the quotas we gave the namespace. We were maxing out memory, so the pod didn't have enough memory available to cleanly restart. Upping the namespace quotas where we were hosting the runner pods solved our issue.
Hey - circling back here. I ran into this issue again, and it turned out to be a different flavor of what @arjunmadan noted above. In my case, the EC2 machine we were running on did not have enough available memory to support the number of runner containers plus infrastructure containers I was running on the GitHub Actions nodes. Once I bumped the memory of the EC2 node, the problem disappeared. Interesting error, though... it seems to encapsulate a lot of failure cases. I think we should look back into limiting the number of restarts available.
This prevents the pod from ever recycling and re-registering.
The above can come about when the runner pod resource has been created but the containers are never created, because the control plane fails to schedule the pod onto a node.