-
Notifications
You must be signed in to change notification settings - Fork 108
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MSTransferor] Lock the whole input container for growing workflows #11141
Conversation
Jenkins results:
|
Jenkins results:
|
Change how dids are defined and improve logging fix lazy logging
Jenkins results:
|
@todor-ivanov @vkuznet if you find any critical problems with this PR, then please leave it open and I will address it once I get up. Otherwise, please merge it and make a new WMCore release (apparently 2.0.3.pre7). If you have minor suggestions or anything else that is not a blocker, please leave those in a review and I will get those fixed as I work on #10975, which I plan to get done tomorrow as well. I have just updated this GitLab request with all the configuration changes that we need. It is still missing the usual cmsdist PR. Once you have that ready, please update the ticket such that Imran can proceed with the testbed upgrade. Thanks a lot! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alan, Thanks for making those changes. I have left few comments bellow. Honestly initially I was thinking, given the urgency of creating the next tag for the validation process of the new release, to approve the PR and upon that we resolve the naming and other little concerns later with the other related Changes. But upon rethinking, since there will also be no workflows configured to use the variables in question during this validation, I actually decided to have the discussion resolved before we merge. Sorry for the extra delay. But I really find this misleading.
Retrieve the OpenRunningTimeout parameter for this workflow | ||
:return: an integer with the amount of secs | ||
""" | ||
return self.data.get("OpenRunningTimeout", 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a single question here. How should the default value of 0 seconds be interpreted from the caller side. should it be: Timeout Reached
, or No Timeout set at all
. Maybe also putting this in the docstring would be helpful together with the exact meaning of the OpenRuningTimeout
itself.
@@ -477,3 +477,10 @@ def getWorkflowGroup(self): | |||
if self.isRelVal(): | |||
return "relval" | |||
return "production" | |||
|
|||
def getOpenRunningTimeout(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way how this method is called (and the relative workflow and ms-configuration variables) is quite confusing. It gives the feeling that it is related to a time spent in running-open
. I myself have fallen into that trap when I first read the PR here. I know - MSTransferor does not deal with running-open
states, but just from reading the configuration even, one is leaned towards searching for actual meaning of those variables in another component.
if subLevel == "container": | ||
grouping = "ALL" | ||
dids = [dataIn['name']] | ||
elif wflow.getOpenRunningTimeout() > self.msConfig["openRunning"]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned bellow (in the actual method definition), just reordering of Open
and Running
in the naming of this method here in order to not match or resemble another state in the system, makes it confusing. Same for the configuration variable name. It gives the totally wrong impression how this mechanism is triggered.
And why actually do we set a Timeout
in the workflow configuration and then we compare it with a statically configured variable for the whole service just to trigger a given behaviour. Having something called Timeout in the workflow description makes me think this is a maximum timeout for a workflow to be in a given state, upon which an action would be taken. While, IIUC, the way how it is used here is not like that. It is more like - if the one who configured the workflow have set that it will have some tolerance of not 'exactly' triggering a service behavior..... which tolerance would be shortened or prolonged by a comparison between the workflow and the the service configuration variable. Honestly I do not understand why wouldn't it be just a Bool flag in the workflow.
Actually I am missing the part that this is the final bit of resolving #11048 which we really need to have it in the system this month. |
That's a good comment, Todor. Originally, this request spec attribute With time and migration to DBS3, we started slowly deprecating that functionality - since DBS3 only contains closed blocks! - and the request status was also moved to a cherrypy thread (I think this is the moment that we lost this coupling). So there is a minor change of meaning for this |
Fixes #11130
Status
ready
Description
With this PR, we have:
openRunning
, used to define when to consider a workflow with growing input dataset or not (default value has been set to 7 days)Is it backward compatible (if not, which system it affects?)
YES
Related PRs
None
External dependencies / deployment changes
Configuration changes in:
Prod: https://gitlab.cern.ch/cmsweb-k8s/services_config/-/merge_requests/145
Preprod: https://gitlab.cern.ch/cmsweb-k8s/services_config/-/merge_requests/146