Allow job template to say jobs should fail if their parent(s) fail. #28

Open
weshinsley opened this issue May 18, 2023 · 2 comments
Labels
enhancement New feature or request

weshinsley commented May 18, 2023

(Relates to HPC Pack 2019, 6.1.7531.0 and probably earlier)

Feature Request Description

  • Using job submit on the command line, we can set /parentjobids and have a job wait in the queue until the parent job(s) finish. This is really useful.

  • By default, though, if the parent job fails, the child job remains in the queue and never executes, so we have to cancel those jobs manually.

  • I'm not sure whether /faildependenttasks fixes this, since I am mainly talking about jobs rather than tasks. Perhaps it does.

  • If it does, then it would be good if /faildependenttasks could be set to true by default, at the job template level.

Describe Preferred Solution

An option to select Fail Dependent Tasks (or jobs?) in HPC Cluster Manager, under Configuration -> Job Templates -> Job Template Editor -> Add (property) drop-down. We already have "Fail on Task Failure", but not "Fail if parent tasks/jobs fail".

Describe Alternatives Considered

Alternatively - I cannot really see a reason why you wouldn't want /faildependenttasks to be on all the time. Presumably it makes no difference if there are no dependent jobs, but I think it's reasonable that all child jobs fail by default if the parent fails.

weshinsley (Author) commented

To follow up: I think /faildependenttasks does not do what I hoped it would, so I may be requesting additional functionality, perhaps a /faildependentjobs option, under which a job fails if any of the jobs listed in its /parentjobids fails.
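To make the requested behaviour concrete, here is a minimal sketch in Python. It does not call HPC Pack at all: the `states` and `parents` dictionaries are hypothetical snapshots of the queue, the state names are illustrative strings, and `jobs_to_fail` is a made-up helper name. It only shows which queued jobs a /faildependentjobs-style option would presumably mark as failed, including grandchildren.

```python
from collections import deque

# Illustrative state names; assumed, not taken from the HPC Pack API.
FAILED_STATES = {"Failed", "Canceled"}


def jobs_to_fail(states, parents):
    """Return sorted IDs of queued jobs that depend on a failed or canceled job.

    states:  dict job_id -> state string (hypothetical queue snapshot)
    parents: dict job_id -> list of parent job IDs (as passed to /parentjobids)
    """
    # Invert the parent map so we can walk from a failed parent to its children.
    children = {}
    for job_id, parent_ids in parents.items():
        for parent_id in parent_ids:
            children.setdefault(parent_id, []).append(job_id)

    doomed = set()
    pending = deque(j for j, s in states.items() if s in FAILED_STATES)
    while pending:
        parent_id = pending.popleft()
        for child_id in children.get(parent_id, []):
            if child_id not in doomed and states.get(child_id) == "Queued":
                doomed.add(child_id)
                pending.append(child_id)  # grandchildren can never run either
    return sorted(doomed)


if __name__ == "__main__":
    states = {100: "Failed", 101: "Finished", 102: "Queued", 103: "Queued", 104: "Queued"}
    parents = {102: [100, 101], 103: [102], 104: [101]}
    print(jobs_to_fail(states, parents))  # [102, 103]
```

Job 104 is left alone because its only parent finished. Until such an option exists, a list like this is what we currently have to work out and cancel by hand.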

YutongSun added the enhancement label Aug 9, 2023
YutongSun (Contributor) commented

@weshinsley, thanks for the feedback. The original design keeps the child jobs in the active Queued state when any of the parent jobs is canceled or fails. Since the canceled or failed parent job can be requeued, the child jobs will run after the requeued parent job completes successfully. If the canceled or failed parent job is deleted from the database after a long period, the queued child jobs are set to the Failed state. I agree we may provide another option to cancel or fail the child jobs immediately after any of the parent jobs is canceled or fails.
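The design described above can be summarised as a small decision function. This is only a sketch of my reading of that description, not HPC Pack code: `child_state`, the `fail_immediately` flag, and the state strings (including "Deleted" for a failed or canceled parent that has been removed from the database, and "Runnable" for a child whose parents have all finished) are hypothetical names chosen for illustration. The proposed option could equally cancel the child; the sketch arbitrarily uses "Failed".

```python
def child_state(parent_states, fail_immediately=False):
    """Decide what happens to a queued child job, given its parents' states.

    parent_states:    list of state strings, one per parent job (assumed names)
    fail_immediately: hypothetical flag modelling the proposed new option
    """
    # A failed/canceled parent that was later deleted fails the child.
    if any(s == "Deleted" for s in parent_states):
        return "Failed"
    if any(s in ("Failed", "Canceled") for s in parent_states):
        # Current design: stay queued, because the parent may be requeued.
        return "Failed" if fail_immediately else "Queued"
    if all(s == "Finished" for s in parent_states):
        return "Runnable"
    return "Queued"  # some parent is still queued or running


if __name__ == "__main__":
    print(child_state(["Finished", "Failed"]))                         # Queued
    print(child_state(["Finished", "Failed"], fail_immediately=True))  # Failed
    print(child_state(["Finished", "Finished"]))                       # Runnable
```

With the flag off, the function mirrors the existing behaviour; with it on, a failed or canceled parent fails the child straight away instead of leaving it queued.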
