misleading error message: job shell exec error on ...: /usr/libexec/flux/flux-imp: No such file or directory
#6568
grondo added a commit to grondo/flux-core that referenced this issue on Jan 24, 2025
Problem: When the job-exec module detects an exec error for a job shell it emits a confusing error message that includes either the path to the job shell or the IMP (if a multiuser job), and only the result of `strerror()` for the errno returned from libsubprocess. When using sdexec, this errno is always `ENOENT`, resulting in a confusing error message that seems to indicate that `flux-imp` was not found.

It is unhelpful to include `argv[0]` in this error message. It will always be the job shell or the IMP and we all know it. Drop this from the log message.

Also, sdexec will provide extra information in the subprocess error string available from `flux_subprocess_fail_error (p)`. Log this instead of `strerror (errno)`.

Fixes flux-framework#6568
This error message has been seen occasionally, and it is misleading because: 1) the `ENOENT` does not apply to the `flux-imp` process (job-exec is just printing the first argument of the command it ran), and 2) in all the cases investigated, the job shell actually started and was only later terminated with this error, e.g. in this case:

This error message is printed by the job-exec module when the libsubprocess `error_cb` is triggered. I do notice that sdexec sends `ENOENT` as a catch-all when the unit has failed for any reason (in this case, I didn't see any sdexec/sdbus errors logged on the affected node, so I'm not sure what the reason would be). The exec eventlog also shows that the job shell was definitely started:

It appears sdexec sets an error message in the error response, which job-exec is currently ignoring. We should amend job-exec to print the subprocess error message when available in `error_cb`.
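The proposed change could look roughly like the following. This is a minimal, standalone sketch and not the actual job-exec code: `struct exec_failure`, `exec_error_string()`, and `format_exec_error()` are hypothetical names, and the struct only stands in for the errno and error string that libsubprocess would report for a failed subprocess. The idea is to prefer the detailed error string when one is set, fall back to `strerror()` otherwise, and leave `argv[0]` out of the message.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the failure state of a subprocess.
 * In job-exec this information would come from libsubprocess.
 */
struct exec_failure {
    int fail_errno;          /* errno reported for the failure */
    const char *fail_error;  /* optional detailed message, may be NULL */
};

/* Return the most specific error text available: the detailed
 * message if one was provided, otherwise strerror() of the errno.
 */
static const char *exec_error_string (const struct exec_failure *f)
{
    if (f->fail_error && f->fail_error[0] != '\0')
        return f->fail_error;
    return strerror (f->fail_errno);
}

/* Format the log line without argv[0], since the command is always
 * the job shell or the IMP. 'where' is a placeholder for whatever
 * location text the real message carries.
 */
static int format_exec_error (char *buf,
                              size_t len,
                              const char *where,
                              const struct exec_failure *f)
{
    return snprintf (buf,
                     len,
                     "job shell exec error on %s: %s",
                     where,
                     exec_error_string (f));
}
```

With this shape, a catch-all `ENOENT` only produces "No such file or directory" when no better description was supplied, and the message no longer points at `flux-imp`.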