[MSTransferor] Lock the whole input container for growing workflows #11141

Merged
todor-ivanov merged 1 commit into dmwm:master on May 12, 2022

Conversation

@amaltaro (Contributor) commented May 12, 2022

Fixes #11130

Status

ready

Description

With this PR, we have:

  • a new microservice configuration parameter called openRunning, used to define whether a workflow should be considered to have a growing input dataset (the default value has been set to 7 days)
  • when an open running workflow is found, a Rucio rule is created for the whole input container with grouping DATASET and a single copy, against a logical OR of all the RSEs matching the workflow + campaign + quota (see the sketch right after this list)
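For illustration only, a minimal sketch of how such a container-level rule could be created with the standard Rucio client; the scope, container name and RSE names below are made up, and the call does not claim to mirror the exact MSTransferor code:

    from rucio.client.ruleclient import RuleClient

    ruleClient = RuleClient()

    # whole input container of the growing workflow (hypothetical DID)
    dids = [{"scope": "cms", "name": "/PrimaryDS/CampaignX-ProcStr-v1/AOD"}]

    # logical OR of all the RSEs matching workflow + campaign + quota (made-up names)
    rseExpression = "T1_US_FNAL_Disk|T2_CH_CERN|T2_DE_DESY"

    # single copy with DATASET grouping: each block (Rucio dataset) stays together
    # on one RSE, while different blocks can be spread over the RSEs in the expression
    ruleClient.add_replication_rule(dids=dids,
                                    copies=1,
                                    rse_expression=rseExpression,
                                    grouping="DATASET",
                                    comment="growing input container (sketch)")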

Is it backward compatible (if not, which system does it affect)?

YES

Related PRs

None

External dependencies / deployment changes

Configuration changes in:
Prod: https://gitlab.cern.ch/cmsweb-k8s/services_config/-/merge_requests/145
Preprod: https://gitlab.cern.ch/cmsweb-k8s/services_config/-/merge_requests/146
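For context, the new option would show up in the MSTransferor service configuration roughly as sketched below; the surrounding structure is simplified and the unit (seconds) is an assumption, based on the parameter being compared against Workflow.getOpenRunningTimeout(), which is documented in seconds:

    # simplified, hypothetical excerpt of the MSTransferor configuration
    msConfig = {
        # ... other MSTransferor settings ...
        "openRunning": 7 * 24 * 60 * 60,  # 7 days; threshold to treat the input dataset as growing
    }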

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests deleted
    • 11 tests added
    • 3 changes in unstable tests
  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 5 warnings
    • 73 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 22 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13201/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests deleted
    • 11 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 5 warnings
    • 77 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 22 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13202/artifact/artifacts/PullRequestReport.html

Change how dids are defined and improve logging

fix lazy logging
@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests no longer failing
    • 7 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 5 warnings
    • 76 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 22 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13203/artifact/artifacts/PullRequestReport.html

@amaltaro amaltaro requested review from todor-ivanov and vkuznet May 12, 2022 02:52
@amaltaro (Contributor, Author)

@todor-ivanov @vkuznet if you find any critical problems with this PR, please leave it open and I will address them once I am up. Otherwise, please merge it and make a new WMCore release (apparently 2.0.3.pre7).

If you have minor suggestions or anything else that is not a blocker, please leave them in a review and I will get them fixed as I work on #10975, which I also plan to get done tomorrow.

I have just updated the GitLab request with all the configuration changes that we need. It is still missing the usual cmsdist PR. Once you have that ready, please update the ticket so that Imran can proceed with the testbed upgrade. Thanks a lot!

@todor-ivanov (Contributor) left a comment


Alan, thanks for making those changes. I have left a few comments below. Honestly, given the urgency of creating the next tag for the validation process of the new release, I was initially thinking of approving the PR and resolving the naming and other little concerns later, together with the other related changes. But on second thought, since there will also be no workflows configured to use the variables in question during this validation, I decided to have the discussion resolved before we merge. Sorry for the extra delay, but I really find this misleading.

        Retrieve the OpenRunningTimeout parameter for this workflow
        :return: an integer with the amount of secs
        """
        return self.data.get("OpenRunningTimeout", 0)

Just a single question here: how should the default value of 0 seconds be interpreted on the caller side? Should it be "timeout reached", or "no timeout set at all"? Putting this in the docstring, together with the exact meaning of OpenRunningTimeout itself, would also be helpful.
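For reference, a possible docstring along the lines requested above; it assumes that a return value of 0 means no open-running period was requested for the workflow, which is consistent with how the comparison in MSTransferor behaves (0 can never exceed a positive openRunning threshold):

    def getOpenRunningTimeout(self):
        """
        Retrieve the OpenRunningTimeout parameter for this workflow, i.e. the
        period (in seconds) during which the workflow input is still expected
        to grow (new blocks may keep being added to it).

        :return: an integer with the amount of seconds; 0 (the default) means
                 that no open-running period was requested, so the input is
                 treated as static (assumption, not taken from the actual PR)
        """
        return self.data.get("OpenRunningTimeout", 0)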

@@ -477,3 +477,10 @@ def getWorkflowGroup(self):
        if self.isRelVal():
            return "relval"
        return "production"

    def getOpenRunningTimeout(self):

The way this method is called (together with the related workflow and ms-configuration variables) is quite confusing. It gives the feeling that it is related to the time spent in running-open. I fell into that trap myself when I first read this PR. I know MSTransferor does not deal with running-open states, but just from reading the configuration one is led to search for the actual meaning of those variables in another component.

if subLevel == "container":
    grouping = "ALL"
    dids = [dataIn['name']]
elif wflow.getOpenRunningTimeout() > self.msConfig["openRunning"]:
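For readers of this excerpt, the branch above plausibly continues along the lines sketched below, based only on the PR description (a single container-level rule with DATASET grouping for growing inputs); this is a reconstruction, not the actual diff:

    elif wflow.getOpenRunningTimeout() > self.msConfig["openRunning"]:
        # growing input dataset: lock the whole container with one rule;
        # DATASET grouping keeps each block on a single RSE while allowing
        # different blocks to land on different RSEs of the expression
        grouping = "DATASET"
        dids = [dataIn['name']]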

As mentioned below (in the actual method definition), merely reordering "Open" and "Running" in the name of this method, so that it does not match or resemble another state in the system, makes it confusing. The same goes for the configuration variable name. It gives a totally wrong impression of how this mechanism is triggered.

And why do we actually set a Timeout in the workflow configuration and then compare it with a statically configured variable for the whole service, just to trigger a given behaviour? Having something called Timeout in the workflow description makes me think it is the maximum time a workflow may spend in a given state, after which an action is taken. But, IIUC, that is not how it is used here. It is more like: whoever configured the workflow has set some tolerance for not "exactly" triggering a service behaviour, and that tolerance is then shortened or prolonged by the comparison between the workflow and the service configuration variables. Honestly, I do not understand why this would not just be a Boolean flag in the workflow.

@todor-ivanov (Contributor) commented May 12, 2022

Actually, I was missing the fact that this is the final bit of resolving #11048, which we really need to have in the system this month.
@amaltaro I am merging it now and creating the new tag. Please follow up on the comments I have left with a new PR or issue, if needed. @vkuznet please do leave your comments as well.

@todor-ivanov todor-ivanov merged commit 84272e1 into dmwm:master May 12, 2022
@amaltaro (Contributor, Author)

That's a good comment, Todor.

Originally, this request spec attribute OpenRunningTimeout was implemented to keep workflows in the running-open status (a workflow status) for that period of time, so that new files could still be added to workqueue elements with open DBS blocks (DBS2 era).

With time and the migration to DBS3, we slowly started deprecating that functionality - since DBS3 only contains closed blocks! - and the request status handling was also moved to a CherryPy thread (I think this is the moment we lost this coupling).

So there is a minor change of meaning for this OpenRunningTimeout attribute: it is no longer restricted to the running-open status and it no longer deals with open blocks in DBS. Instead, as long as OpenRunningTimeout has not been reached for the global workqueue inbox element (measured from the time the last block was added to the global workqueue elements), global workqueue will keep running input data discovery and will create new workqueue elements whenever a new block is inserted into Rucio/DBS.
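In other words, the global workqueue decision can be pictured roughly as in the snippet below; the function name and arguments are illustrative only and do not come from the actual WorkQueue code:

    import time

    def shouldKeepDiscoveringInput(lastBlockAddedTime, openRunningTimeout):
        """
        Illustrative sketch: keep input data discovery open for a workqueue
        inbox element while the time since the last block was added is still
        within the workflow's OpenRunningTimeout (both values in seconds).
        """
        if openRunningTimeout <= 0:
            # no open-running period requested: the input is treated as static
            return False
        return (time.time() - lastBlockAddedTime) < openRunningTimeout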
