[MSTransferor] Create rule for each pileup location #11143
Conversation
Jenkins results:
|
MSTransferor-level tests are looking okay, but I am letting those workflows go through the system to see how it goes until the very end. I think review can start, though. @haozturk could you please have a look at this as well? I tried to highlight all the important changes in the initial description, but if you see any wrong/missing use case, please let me know.
I have to admit that reviewing such a PR requires detailed knowledge of the data management system, which I don't have. Therefore, I can't comment on why so much has been removed from, and then added to, MSTransferor. Moreover, it seems that the entire structure of MSTransferor needs refactoring around the data types it handles. Right now the code handles everything via if/else structures, while I would expect the code itself to be very abstract: here is the data, and here are the rules (methods) to call over the data. Concrete implementations for the different data types should be kept separate. If you have classes for "Neutrino", "DQMHarvesting", "primary", etc. (whatever data types data management needs to handle) and let these classes implement common methods, the code would be much easier to maintain. In this design, MSTransferor would only ask the input data "what is your type?", and then call the appropriate class methods associated with that data type. This keeps the MSTransferor logic abstract and generic, while still allowing different rules/methods for different types of data. That said, I don't have concrete comments on the proposed changes, as I honestly do not understand the logic of the data transfers.
Here is how I envision the decideDataPlacement method should look:
def __init__(self):
    self.transferorObjects = {
        'Neutrino': NeutrinoTransferor(),
        'DQMHarvest': DQMHarvestTransferor(),
        ....
    }

def decideDataPlacement(self, wflow, dataIn):
    dataType, grouping = self.getDataType(wflow)
    return self.transferorObjects[dataType].decideDataPlacement(wflow, dataIn, grouping)

def getDataType(self, wflow):
    if wflow.getReqType() == "DQMHarvest":
        return "DQMHarvest", "ALL"
    if wflow.getOpenRunningTimeout() > self.msConfig["openRunning"]:
        return "OpenRunning", "DATASET"
Then, within this MS area (WMCore/MicroServices/MSTransferor) you would add DQMHarvestTransferor.py, NeutrinoTransferor.py, etc., modules which keep the implementation for the concrete use cases. This makes the code more maintainable and extensible: when a new use case pops up, you only need to add a new implementation. And if the logic for a specific use case requires changes, you only change that class, while the MSTransferor logic stays very generic.
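The dispatch pattern described above can be fleshed out with a small base class. This is a hedged sketch only: the class names (BaseTransferor, DQMHarvestTransferor, NeutrinoTransferor) and the returned dict shape are illustrative assumptions, not the actual WMCore API.

```python
# Hypothetical sketch of the per-data-type transferor classes suggested
# above; class and method names are illustrative, not the real WMCore API.

class BaseTransferor:
    """Common interface that every concrete data-type transferor implements."""
    def decideDataPlacement(self, wflow, dataIn, grouping):
        raise NotImplementedError

class DQMHarvestTransferor(BaseTransferor):
    def decideDataPlacement(self, wflow, dataIn, grouping):
        # DQMHarvest workflows pin all data together, so force grouping "ALL"
        return {"dids": dataIn, "grouping": "ALL"}

class NeutrinoTransferor(BaseTransferor):
    def decideDataPlacement(self, wflow, dataIn, grouping):
        # other workflow types keep whatever grouping getDataType() decided
        return {"dids": dataIn, "grouping": grouping}

# dispatch table, as in the pseudocode above
transferors = {"DQMHarvest": DQMHarvestTransferor(),
               "Neutrino": NeutrinoTransferor()}

print(transferors["DQMHarvest"].decideDataPlacement(None, ["blockA"], "DATASET"))
# → {'dids': ['blockA'], 'grouping': 'ALL'}
```

With this layout, MSTransferor itself never branches on the workflow type; adding a new use case means adding one subclass and one dispatch-table entry.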
Valentin, thanks for these comments! I will have to think about whether this implementation is feasible and how to do it. Regarding the large code removal, I mentioned it in the initial PR review. It was possible because much of the MSTransferor logic was still based on how we used to work with PhEDEx, while now we can delegate much of the data management logic to the actual data management system :)
Thanks @amaltaro for these huge changes. In general the implementation looks good. Besides the few minor comments I have left inline, I am having the impression that:
- With these changes, a big portion of the previously automated logic is now completely dropped, and the data placement relies entirely on the campaign-level configuration.
- The input data placement becomes much more static than it was before.
- We loosen the service-side constraints on what goes where much more than before.
There are a lot of changes in the current PR that are not strictly related to reorganizing the secondary location logic, but rather to code refactoring. I know the best moment to work on something like that is when you are actually reiterating through the code, but it could be good to have a separate issue describing that refactoring, to be resolved with the current PR as well. If you think that would be too much effort, though, just skip this request of mine.
We would also benefit from a well-described policy of the current service behavior, so that we can update the relevant section in the documentation. I believe we need to review it together with the P&R team.
@@ -357,7 +357,7 @@ def getChunkBlocks(self, numChunks=1):
         thisChunk.update(list(self.getParentBlocks()))
         thisChunkSize += sum([blockInfo['blockSize'] for blockInfo in viewvalues(self.getParentBlocks())])
         # keep same data structure as multiple chunks, so list of lists
-        return [thisChunk], [thisChunkSize]
+        return thisChunk, thisChunkSize
If the types to be returned from here are supposed to be a list and an integer, as stated in the docstring above, this line sounds like it should be:
return list(thisChunk), thisChunkSize
because thisChunk is defined as a set, I think. If what is returned here is correct, then please fix the docstring instead.
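The type concern above can be shown with a toy example. This is not the real getChunkBlocks() code, only the set-vs-list behaviour under discussion; the variable names mirror the snippet for readability.

```python
# Toy illustration of the set-vs-list concern above; this is not the
# real getChunkBlocks() implementation, just the type behaviour at issue.
thisChunk = set()
thisChunk.update(["block1", "block2"])
thisChunkSize = 100

# returning the set directly yields a set, not the list the docstring promises
print(type(thisChunk).__name__)        # → set
# the explicit cast produces the documented list type
print(type(list(thisChunk)).__name__)  # → list
```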
        :return: a string with the final pileup destination PNN
        """
        # FIXME: workflows should be marked as failed if there is no common
        # site between SiteWhitelist and secondary location
This, I believe, still holds, regardless of where the PU location comes from - either the campaign-level configuration or the workflow description.
They are not marked as failed, but there is a Prometheus alert. This way people can react and get the workflow through the system.
        campSecLocations = campConfig['Secondaries'].get(dsetName, [])
        campBasedLocation.append(set(campSecLocations))

        if not campSecLocations:
It seems we now rely completely on the campaign-level secondaries configuration. I'd say this may lead to human errors.
This is how it was before, but now we have N replicas of the pileup, where N is the number of locations in the campaign. Before, we had a single replica enforced (though from discussions with Hasan, they were manually creating replicas for all the other locations). So I'd say we are in the same state, just with more automation now :-D
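The "N replicas for N campaign locations" behaviour could be sketched as follows. This is a hypothetical helper, not the actual MSTransferor code; the function name, the rule-spec dict keys, the account name, and the dataset/RSE names are all illustrative assumptions.

```python
# Hypothetical sketch, not the actual MSTransferor code: build one rule
# specification per campaign-defined pileup location, mirroring the
# "N replicas for N locations" behaviour described above.
def buildPileupRules(dsetName, campSecLocations, account="wmcore_transferor"):
    """Return one Rucio-style rule spec per RSE from the campaign config."""
    return [{"did": dsetName,
             "rse_expression": rse,
             "copies": 1,
             "grouping": "ALL",
             "account": account}
            for rse in campSecLocations]

# illustrative dataset and site names
rules = buildPileupRules("/MinBias/Era-v1/PREMIX",
                         ["T1_US_FNAL_Disk", "T2_CH_CERN"])
print(len(rules))  # → 2
```

Each campaign location gets its own single-copy rule, so removing a site from the campaign configuration naturally removes its replica on the next pass.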
        if not campSecLocations:
            msg = "Workflow has been incorrectly assigned: %s. The secondary dataset: %s, "
            msg += "belongs to the campaign: %s, with does not define the secondary "
typo: with -> which
@@ -352,221 +330,80 @@ def checkDataLocation(self, wflow):
        self.logger.info("Request %s has %d final blocks from %s",
                         wflow.getName(), len(getattr(wflow, methodName)()), methodName)

    def _checkPrimaryDataVolume(self, wflow, wflowPnns):
I see this volume estimation method is no longer called, but I am not sure it was completely useless. The only issue that made it somewhat constrained was the single PNN returned. It could have returned a weighted list of all the PNNs holding any piece of the dataset, using a normalized value of the volume as a weight. But anyway, this was a side comment; such an approach would completely change many other things.
In short, we provide a list of RSEs to Rucio and let Rucio do the data management job.
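Handing Rucio a list of candidate sites typically means joining them into one RSE expression with the union operator "|". The helper below is a hypothetical sketch (toRseExpression and the site names are assumptions); only the "|" union syntax comes from Rucio's RSE expression language.

```python
# Sketch, not MSTransferor code: Rucio RSE expressions support a union
# operator "|", so a list of candidate PNNs can be handed to Rucio as a
# single expression, and Rucio picks the actual destination(s) itself.
def toRseExpression(pnns):
    """Join candidate PNNs into one RSE expression (hypothetical helper)."""
    # sort for a deterministic expression regardless of input order
    return "|".join(sorted(pnns))

print(toRseExpression({"T2_CH_CERN", "T1_US_FNAL_Disk"}))
# → T1_US_FNAL_Disk|T2_CH_CERN
```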
        # figure out to configure the rucio rule
        dids, didsSize, grouping = self.decideDataPlacement(wflow, dataIn)
        if not dids and dataIn["type"] == "primary":
            # no valid files in any blocks, it will likely fail in global workqueue
If you are changing blocks to dids in the code, please do the equivalent in the comment.
            listBlockSets, listSetsSize = wflow.getChunkBlocks()
            msg = f"Placing {len(listBlockSets)} blocks ({gigaBytes(listSetsSize)} GB), "
            msg += f"with grouping: {grouping} for DQMHarvest workflow."
        elif wflow.getOpenRunningTimeout() > self.msConfig["openRunning"]:
I am still not comfortable with this openRunningTimeout naming. If we know all the places where it is used, we'd better change it to something that reflects its real purpose.
@vkuznet Valentin, could you please have a quick look at my 3rd commit? It's still very raw, but I wanted to make it available to see whether it is more or less what you were suggesting above. If so, I can proceed with it. With that code in mind, each class would behave slightly differently and also provide the final setup for the Rucio rules. There are still a few cases where we have all sorts of data in the same workflow, and in those cases it's hard to classify anything!
Thanks for your review, Todor. I apologize again for providing so many changes in something that was supposed to be small, but I thought small changes would only make things more complex and ugly, which is why I made this substantial refactoring.
Yes, I think it is a proper architecture, where each Workflow has a set of common methods and the individual details are delegated to specific classes. I only found one issue with a return statement and what type it should return.
Thanks for the confirmation, I just wanted to make sure I was going in the right direction. I will work on the remaining implementation by Monday or so.
@vkuznet I think I have addressed the service structure concern that you had. It made the MSTransferor code even smaller and hopefully more maintainable. Todor, Valentin, I still have to run some real tests with this code in, but I think it's ready for another code review. Note that none of the logic has been (intentionally) changed, and all the features/changes mentioned in the PR description are still valid.
@todor-ivanov @vkuznet I think I addressed most of your comments, other than making the grouping setting configurable. I'd rather leave that one for the future (if it's really needed). The last 2 commits have those recent changes.
Thanks @amaltaro
It looks good to me.
I took the opportunity to change a few logging levels here and there, and finally disabled the verbose logs.
Commits:
- fix data returned from getChunkBlocks method
- do not make it a list of a set
- special case for secondaries in RelVals
- another bugfix for the secondary relval case
- remove confusing log about location used
Unit tests are failing while contacting DBS, e.g.:
@vkuznet I know you are working on a new release; pinging you just so you are aware of this.
test this please
Sorry, I didn't update the configuration file (doing it now). Things should be back to normal now.
test this please
Commits:
- complete code reorganization
- fix getInputSecData and GrowingWorkflow
- remove dual space from RequestInfo
- remove useless init method from sub-classes
- remove duplicate Workflow instantiation in RequestInfo
- apply Todor and Valentin suggestions
- change a few log level lines from debug to info
- remove unused import
Commits:
- more unit tests
- unit tests with the Workflow class relocation
- bunch of new unit tests for the new templates
- update unit tests with the getChunkBlocks removal
- remove getChunkBlocks from unit tests as well
Thank you very much for the review and the great suggestions, Valentin and Todor.
Fixes #10975
Status
ready
Description
The initial objective of this PR was the following:
However, I decided to refactor many things that were in place to deal with the old PhEDEx-based data placement. A summary of those changes is:
Is it backward compatible (if not, which system it affects?)
YES
Related PRs
Given that it modifies a recent feature introduced with #11141, it needs to be properly tested again.
External dependencies / deployment changes
Service configuration disabling verbose logs:
https://gitlab.cern.ch/cmsweb-k8s/services_config/-/merge_requests/147
and
https://gitlab.cern.ch/cmsweb-k8s/services_config/-/merge_requests/148