-
Notifications
You must be signed in to change notification settings - Fork 108
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure aborted/force-complete workflows leave no elements behind #11113
Conversation
Jenkins results:
|
Jenkins results:
|
Jenkins results:
|
Jenkins results:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find very confusing which naming convention style should be used for states. On one side you have all lower case states like, aborted
, running-open
, on another within the code you use Camel case, like Done
, Running
. I think I understand that lower case used in WQ DB records, while Camel case within the code, but such convention can lead to simple mistake when somewhere in a code someone will use different style.
Apart from that the changes are fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @amaltaro for providing those changes. I have left two minor comments inline. You may want to take a look. Other than that the code looks good to me.
|
||
# Workflow state transition automatically controlled by ReqMgr2 | ||
# Specific to workflows either aborted or force-completed | ||
CANCEL_AUTO_TRANSITION = {"aborted": "aborted-completed", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't those "allowed" status transitions (here aborted-completed
) be enclosed in a list instead of pure strings. I'd say a list here would be better at least for staying consistent with the rest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could, but there is no real need to have a list because it's a single transition (instead of chained transition for some of the previous ones). This makes the source code slightly more clear as well, so I think it's a reasonable tradeoff.
@@ -37,11 +37,15 @@ def convertWQElementsStatusToWFStatus(elementsStatusSet): | |||
running = set(["Running"]) | |||
runningOpen = set(["Available", "Negotiating", "Acquired"]) | |||
runningClosed = set(["Running", "Done", "Canceled"]) | |||
aborted = set(["CancelRequested", "Done", "Canceled", "Failed"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am wondering if putting the CancelRequested
status under the set of statuses covered by the aborted
does not make any difference in the note from above:
NOTE: CancelRequested status is a transient status and it should not trigger
any request status transition (thus, None gets returned).
I mean it could now trigger a request status transition (IINM) because of line 48 from bellow, which says:
if elementsStatusSet == forceCompleted: # all WQEs in CancelRequested
return "aborted"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! I definitely need to update the docstring.
This aborted
state here should not trigger any status transition, as implemented:
https://github.com/dmwm/WMCore/pull/11113/files#diff-b42637227c688d15bfc68058c1005999390006897e0abc07b1a8b6989039afb1R18
and
https://github.com/dmwm/WMCore/pull/11113/files#diff-b42637227c688d15bfc68058c1005999390006897e0abc07b1a8b6989039afb1R62
but I had to implement it to be able to differentiate options that were returning a None
value.
|
||
# get configuration file path | ||
if "WMAGENT_CONFIG" not in os.environ: | ||
os.environ['WMAGENT_CONFIG'] = '/data/srv/wmagent/current/config/wmagent/config.py' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well this is just a comment here, no need to be changed with this PR (because it is in bin
and supposedly the script should be used for manual workflow cancellation). But in any case, I think we should avoid hard coding a global path inside those scripts. We should rather stick to relative paths starting from the root of the deployment destination, which may change in the future (e.g if/once we move the wmagent
deployment procedure under a virtual environment with the freedom to deploy anywhere but not just at /data/srv
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The WMAgent ROOT directory would have to be defined within the WMAgent/WMCore setup as well, and I think if WMAGENT_CONFIG isn't set, changes that the ROOT_DIR (or whatever name that is) would not be set as well.
This is a fallback though, so it should be good enough until we come back and update all of these bin/ scripts (I am pretty sure we have a few with such hard-coded path). Please keep it in mind as we get closer to the virtual environment setup ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please keep it in mind as we get closer to the virtual environment setup ;)
Indeed, this was the reason I mentioned it. No change needed right now.
@vkuznet yes, I completely second your comment. As you noted, we have many structs/objects in WMCore and they have a different set of status (sometimes called states) and some of those were implemented following a different convention. Yes, WorkQueue element properties and status are all upper CamelCase. This code is basically deciding what status should the request be set, according to its aggregation of workqueue element statuses. Todor, Valentin, the |
Jenkins results:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the follow up @amaltaro
the PR looks good to me.
… workflows CancelRequested plus final state should be considered as aborted renamed constant for work cancelation rename workqueue aggregate status from aborted to canceled
update unit tests
Thanks for the re-review Todor. This needs to be further tested, but instead of running the whole deployment/workflow in my VMs, I will hand this over to Imran for a testbed deployment tomorrow. |
Jenkins results:
|
Fixes #11112
Status
not-tested
Description
First commit contains a script to be used to cancel WQEs at the global workqueue level. It can be used like:
Second commit contains changes to how aborted/force-complete workflows are dealt with in the ReqMgr2 CherryPy thread responsible for status transition. Including:
cancelWorkflow
request is triggered once againaborted
status, which in this context means that all the WQEs are sitting inCancelRequested
stateCancelRequested
to the same stateNOTE: the logic to avoid pulling work down for aborted workflows would demand a substantial amount of changes, so I will skip it and rely only on the cherrypy thread mechanism.
Is it backward compatible (if not, which system it affects?)
YES
Related PRs
None
External dependencies / deployment changes
None