Runner not ready after timeout during startup-authentication #288
Comments
@hdoinfodation Hey! Thanks for reporting. The surfacing error seems to be the same one we saw before in #169. AFAICS, we have seen several variants of this failure so far, and I believe errors like this are the other side of the same problem.

So how should we deal with it? To clarify, I believe we have no way to resolve the root causes, like a resource shortage, an expired token, or an API failure, on the controller side. The controller should not automatically replace the node to scale up. The controller has no way to reliably determine whether the runner process within the runner pod is failing, no way to replace an already-expired token installed onto the pod without a restart, and so on.

But what if we enhanced the controller to automatically recreate the pod with a new token whenever the runner fails to register itself with GitHub within some period? Then you would no longer need to manually recreate the runner pod to fix it. Say the period was 10 minutes: the controller could periodically check whether each runner has registered, and recreate the pod of any runner that has not done so within the deadline.

If that makes sense, I would definitely like to add a patch to the runner controller for this. WDYT?
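For illustration, here is a rough, out-of-cluster shell approximation of the proposed behavior; it is a sketch, not the controller's actual implementation. It deletes runner pods older than 10 minutes whose names never appeared in GitHub's runner list. The repository name, the label selector, and the assumption that a runner's name matches its pod name are all hypothetical.

```bash
#!/usr/bin/env bash
# Approximation of the proposed controller behavior (sketch only):
# delete runner pods that are older than 10 minutes but never
# registered with GitHub, so fresh pods get created in their place.
set -euo pipefail

REPO="my-org/my-repo"         # assumed repository the runners register to
TIMEOUT=$((10 * 60))          # the proposed 10-minute registration deadline
NOW=$(date +%s)

# Runner names GitHub currently knows about (real endpoint; needs a
# suitably scoped token in $GITHUB_TOKEN).
registered=$(curl -s -H "Authorization: token ${GITHUB_TOKEN}" \
  "https://api.github.com/repos/${REPO}/actions/runners" | jq -r '.runners[].name')

# Walk pods carrying an assumed runner label; delete stale, unregistered ones.
kubectl get pods -l app=actions-runner -o json \
  | jq -r '.items[] | "\(.metadata.name) \(.metadata.creationTimestamp)"' \
  | while read -r name created; do
      age=$((NOW - $(date -d "$created" +%s)))   # GNU date assumed
      if [ "$age" -gt "$TIMEOUT" ] && ! grep -qx "$name" <<<"$registered"; then
        echo "pod $name not registered after ${age}s; deleting"
        kubectl delete pod "$name"
      fi
    done
```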
I've looked into this a bit further. This error happens when the /runner/.runner config file does not exist in the runner container. Could you exec into the runner container and confirm? The error is thrown here: https://github.com/actions/runner/blob/8a4cb7650850a5b74882be714353362e86f74789/src/Runner.Common/ConfigurationStore.cs#L168

If you look into the /runner/_diag folder and cat the first log file, you might see an exception being thrown, which is most likely the root cause. If not, I would dig around the _diag logs to find the exception that's thrown but is not setting the config file appropriately. I am currently following this thread, which may point to a possible root cause: actions/runner-images#668
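If it helps anyone reproduce these checks, the inspection steps might look roughly like this with kubectl; the pod name `example-runner-abcde` and container name `runner` are placeholders, not actual resource names from this setup.

```bash
# Does the registration config exist? The config step writes /runner/.runner
# on success, so its absence means registration never completed.
kubectl exec example-runner-abcde -c runner -- ls -l /runner/.runner

# If it is missing, list the diagnostic logs and cat the first one;
# the registration exception, if any, is usually in there.
kubectl exec example-runner-abcde -c runner -- ls /runner/_diag
kubectl exec example-runner-abcde -c runner -- \
  sh -c 'cat "$(ls /runner/_diag/*.log | head -n 1)"'
```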
This enhances the controller to recreate the runner pod if the corresponding runner has failed to register itself with GitHub within 10 minutes (currently hard-coded). This should alleviate #288 in case the root cause is some kind of transient failure (network unreliability, GitHub being down, a temporary compute resource shortage, etc.). Formerly you had to manually detect and remove those pods to unblock the controller. With this enhancement, the controller does it automatically 10 minutes after pod creation. Ref #288
@ZacharyBenamram In my case the runner config file did not exist, as the step that should generate it at https://github.com/summerwind/actions-runner-controller/blob/ab1c39de5732d449fe129b64b95b317aab11a6bf/runner/entrypoint.sh#L59 was failing. While it might not resolve the root cause, it might be worthwhile to have the container exit if that step fails, since the runner logic will otherwise keep trying and failing to register without any error state being visible at the k8s level.
@BrendanGalloway That sounds like a good idea. So, would it be a matter of just adding an exit on failure there? Anyway, even for your case I believe #297 would help, by detecting the config.sh failure 10 minutes after pod creation and recreating the pod. Could you confirm?
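For readers landing here later, a minimal sketch of what that entrypoint change might look like. The config.sh flags shown are abridged and the environment variable names are assumptions, not the actual contents of entrypoint.sh.

```bash
# Sketch: fail fast instead of looping forever when registration fails,
# so the pod surfaces an error state that Kubernetes (and the controller)
# can see. Flags and variable names are illustrative.
./config.sh --unattended \
  --url "${RUNNER_REGISTRATION_URL}" \
  --token "${RUNNER_TOKEN}" \
  --name "${RUNNER_NAME}" || {
    echo "runner registration failed; exiting" >&2
    exit 1
  }
```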
This enhances the controller to recreate the runner pod if the corresponding runner has failed to register itself with GitHub within 10 minutes (currently hard-coded). It should alleviate #288 in case the root cause is some kind of transient failure (network unreliability, GitHub being down, a temporary compute resource shortage, etc.). Formerly you had to manually detect and delete such pods, or even force-delete the corresponding runners, to unblock the controller. With this enhancement, the controller performs the pod deletion automatically 10 minutes after pod creation, which results in the controller creating another pod that might work. Ref #288
Hey folks! So glad I found this discussion here. I'm having the very same issue, and I could confirm that the /runner/.runner config file is missing.
Can we close this? This error turned out to be caused by various transient errors, which should be handled automatically in actions-runner-controller 0.18.0 or greater.
Not sure why the stale bot isn't picking this up. The error was caused by transient issues, and improvements have been made to the runner to better handle them, so I am closing this.
Currently we are running the actions-runner-controller in a somewhat unstable network environment, and the following issue has bitten us quite a bit:

The runner pod can get stuck in a loop whenever the runner times out connecting to GitHub during the authentication phase of startup. If it times out during this step, the runner is stuck in an infinite loop while it is not registered with GitHub. A manual restart/delete action is required to make it work again. This also affects autoscaling, since the pod is stuck in an infinite loop.

I would have expected the runner pod to crash itself whenever it times out during the authentication phase, forcing the runner to re-authenticate, or for there to be some kind of retry mechanism after a timeout. Note that under normal conditions, when a runner has authenticated successfully and is listening for jobs, a later timeout causes it to properly reconnect and re-register itself with the GitHub API.