misleading error message: job shell exec error on ...: /usr/libexec/flux/flux-imp: No such file or directory #6568

Closed
grondo opened this issue Jan 22, 2025 · 0 comments · Fixed by #6579

grondo commented Jan 22, 2025

This error message has been seen occasionally, and it is misleading for two reasons: 1) the ENOENT does not apply to the flux-imp process (job-exec is just printing the first argument of the command it ran), and 2) in all the cases investigated, the job shell actually started and was only later terminated with this error, e.g. in this case:

[Jan21 10:14] submit userid=1234 urgency=16 flags=0 version=1
[  +0.023719] jobspec-update attributes.system.bank="guests" attributes.system.project="*"
[  +0.023796] validate
[  +0.039603] depend
[  +0.039660] priority priority=9033
[  +0.049091] alloc
[  +0.049275] prolog-start description="job-manager.prolog"
[  +0.049304] prolog-start description="cray-pals-port-distributor"
[  +0.051457] prolog-finish description="cray-pals-port-distributor" status=0
[  +2.519918] prolog-finish description="job-manager.prolog" status=0
[  +2.532791] start
[  +4.395899] exception type="exec" severity=0 userid=763 note="job shell exec error on broker tioga20 (rank 9): /usr/libexec/flux/flux-imp: No such file or directory"
[  +4.396044] finish status=32512
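
To illustrate how a note like the one above can arise, here is a minimal sketch, assuming the message is built from the first argument of the command plus `strerror()` of the reported errno. The function name `misleading_exec_note` is hypothetical and this is not the actual job-exec source; it only shows why the text reads as if `flux-imp` itself were missing.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical reconstruction (not flux-core code): join argv[0] with the
 * strerror() text for the errno reported by the subprocess layer.
 * Returns the snprintf(3) result.
 */
int misleading_exec_note (char *buf, size_t len, const char *argv0, int errnum)
{
    return snprintf (buf, len, "%s: %s", argv0, strerror (errnum));
}
```

Called with `"/usr/libexec/flux/flux-imp"` and `ENOENT`, this yields the same "path: No such file or directory" shape seen in the exception note, even though the errno says nothing about whether that path exists.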

This error message is printed by the job-exec module when the libsubprocess error_cb is triggered. I do notice that sdexec sends ENOENT as a catch-all when the unit has failed for any reason (in this case, I didn't see any sdexec/sdbus errors logged on the affected node, so I'm not sure what the reason would be). The exec eventlog also clearly shows that the job shell was started:

[Jan21 10:14] init
[  +0.006209] starting
[  +0.294787] shell.init service="66112-shell-f3vLp8GrgcQP" leader-rank=9 size=1
[  +0.299757] shell.start taskmap={"version":1,"map":[[0,1,1,1]]}
[  +1.866375] shell.task-exit localid=0 rank=0 state="Exited" pid=4071306 wait_status=65280 signaled=0 exitcode=255
[  +1.874942] complete status=32512
[  +1.874969] done
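
The `status` and `wait_status` values in these eventlogs are raw wait(2) statuses rather than exit codes. A small sketch of decoding them with the standard `<sys/wait.h>` macros follows; the helper name `exit_code_from_wait_status` is an assumption made for illustration.

```c
#include <sys/wait.h>

/* Hypothetical helper: return the exit code encoded in a raw wait(2)
 * status, or -1 if the process did not exit normally (e.g. it was
 * killed by a signal).
 */
int exit_code_from_wait_status (int status)
{
    if (WIFEXITED (status))
        return WEXITSTATUS (status);
    return -1;
}
```

On Linux, `wait_status=65280` for the task decodes to exit code 255, and `status=32512` decodes to exit code 127. By shell convention 127 often means a command could not be found or executed, but the logs here do not establish which component produced that value.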

It appears sdexec sets an error message in the error response, which job-exec is currently ignoring. We should amend job-exec to print the subprocess error message when available in error_cb.

        /* If there was an exec error, fail with ENOENT.
         * N.B. we have no way of discerning which exec(2) error occurred,
         * so guess ENOENT.  It could actually be EPERM, for example.
         */
        if (sdexec_unit_has_failed (proc->unit)) {
            flux_error_t error;
            errprintf (&error,
                       "unit process could not be started (systemd error %d)",
                       sdexec_unit_systemd_error (proc->unit));
            exec_respond_error (proc, ENOENT, error.text);
        }
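
The proposed change can be sketched as follows: prefer the error text supplied with the failure, and fall back to `strerror()` only when none is available. This is a standalone illustration under stated assumptions, not the actual patch. `exec_error_text` and `format_exec_exception` are hypothetical names, and the `fail_error` argument stands in for whatever string the subprocess layer provides.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not flux-core code): pick the most informative
 * text for an exec failure.  Use the subprocess-provided message when it
 * is non-empty, otherwise fall back to the generic errno description.
 */
const char *exec_error_text (const char *fail_error, int errnum)
{
    if (fail_error && fail_error[0] != '\0')
        return fail_error;
    return strerror (errnum);
}

/* Hypothetical helper: build the exception note without argv[0], so the
 * message no longer appears to blame a specific executable path.
 */
int format_exec_exception (char *buf,
                           size_t len,
                           const char *host,
                           int rank,
                           const char *fail_error,
                           int errnum)
{
    return snprintf (buf,
                     len,
                     "job shell exec error on broker %s (rank %d): %s",
                     host,
                     rank,
                     exec_error_text (fail_error, errnum));
}
```

With a message such as the "unit process could not be started (systemd error %d)" text built by sdexec above, the note would carry that text instead of "No such file or directory". With no message at all, the output degrades to the plain errno description.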
grondo added a commit to grondo/flux-core that referenced this issue Jan 24, 2025
Problem: When the job-exec module detects an exec error for a job
shell it emits a confusing error message that includes either
the path to the job shell or the IMP (if a multiuser job), and
only the result of `strerror()` for the errno returned from
libsubprocess. When using sdexec, this errno is always `ENOENT`,
resulting in a confusing error message that seems to indicate
that `flux-imp` was not found.

It is unhelpful to include `argv[0]` in this error message. It will
always be the job shell or the IMP and we all know it. Drop this
from the log message.

Also, sdexec will provide extra information in the subprocess error
string available from `flux_subprocess_fail_error (p)`. Log this
instead of `strerror (errno)`.

Fixes flux-framework#6568
@mergify mergify bot closed this as completed in #6579 Jan 24, 2025
@mergify mergify bot closed this as completed in 0b12b76 Jan 24, 2025