[MSTransferor] Create rule for each pileup location #11143
Conversation
Jenkins results:
|
MSTransferor-level tests are looking okay, but I am letting those workflows go through the system to see how it goes until the very end. I think review can start, though. @haozturk could you please have a look at this as well? I tried to highlight all the important changes in the initial description, but if you see any wrong/missing use case, please let me know.
I have to admit that reviewing such a PR requires detailed knowledge of the data management system, which I don't have. Therefore, I can't comment on why so much has been removed from, and then added to, MSTransferor. Moreover, it seems that the entire structure of MSTransferor needs refactoring around the data types it handles. Right now the code handles everything via if/else structures, while I would expect the code itself to be very abstract: here is the data, and here are the rules (methods) to call over the data. Concrete implementations for the different data types should be kept separate. If you have classes for "Neutrino", "DQMHarvesting", "primary", etc. (whatever data types data management needs to handle) and let these classes implement common methods, the code would be much easier to maintain. In this design, MSTransferor would only ask the input data "what is your type?", and then call the appropriate class methods associated with that data type. This keeps the MSTransferor logic abstract and generic, while still allowing different rules/methods for different types of data. That said, I don't have concrete comments on the proposed changes, as I honestly do not understand the logic of the data transfers.
Here is how I envision the decideDataPlacement method should look:
def __init__(self):
    self.transferorObjects = {
        'Neutrino': NeutrinoTransferor(),
        'DQMHarvest': DQMHarvestTransferor(),
        ....
    }

def decideDataPlacement(self, wflow, dataIn):
    dataType, grouping = self.getDataType(wflow)
    return self.transferorObjects[dataType].decideDataPlacement(wflow, dataIn, grouping)

def getDataType(self, wflow):
    if wflow.getReqType() == "DQMHarvest":
        return "DQMHarvest", "ALL"
    if wflow.getOpenRunningTimeout() > self.msConfig["openRunning"]:
        return "OpenRunning", "DATASET"
Then, within this MS area (WMCore/MicroServices/MSTransferor) you would add DQMHarvestTransferor.py, NeutrinoTransferor.py, etc., modules which keep the implementation for the concrete use cases. This makes the code more maintainable and extensible: when a new use case pops up, you only need to add a new implementation. And if the logic for a specific use case requires changes, you only change that class, while the MSTransferor logic stays very generic.
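The dispatch pattern described above can be fleshed out with a small base class. This is a hedged sketch only: the class names (BaseTransferor, DQMHarvestTransferor, NeutrinoTransferor) and the returned dict shape are illustrative assumptions, not the actual WMCore API.

```python
# Hypothetical sketch of the per-data-type transferor classes suggested
# above; class and method names are illustrative, not the real WMCore API.

class BaseTransferor:
    """Common interface that every concrete data-type transferor implements."""
    def decideDataPlacement(self, wflow, dataIn, grouping):
        raise NotImplementedError

class DQMHarvestTransferor(BaseTransferor):
    def decideDataPlacement(self, wflow, dataIn, grouping):
        # DQMHarvest workflows pin all data together, so force grouping "ALL"
        return {"dids": dataIn, "grouping": "ALL"}

class NeutrinoTransferor(BaseTransferor):
    def decideDataPlacement(self, wflow, dataIn, grouping):
        # other workflow types keep whatever grouping getDataType() decided
        return {"dids": dataIn, "grouping": grouping}

# dispatch table, as in the pseudocode above
transferors = {"DQMHarvest": DQMHarvestTransferor(),
               "Neutrino": NeutrinoTransferor()}

print(transferors["DQMHarvest"].decideDataPlacement(None, ["blockA"], "DATASET"))
# → {'dids': ['blockA'], 'grouping': 'ALL'}
```

With this layout, MSTransferor itself never branches on the workflow type; adding a new use case means adding one subclass and one dispatch-table entry.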
Valentin, thanks for these comments! I will have to think about whether this implementation is feasible and how to do it. Regarding the large code removal, I mentioned it in the initial PR review. It was possible because much of the MSTransferor logic was still based on how we used to work with PhEDEx, while now we can delegate much of the data management logic to the actual data management system :)
Thanks @amaltaro for these huge changes. In general the implementation looks good. Besides the few minor comments I have left inline, I am having the impression that:
- With these changes, a big portion of the previously automated logic is now completely dropped, and the data placement relies entirely on the campaign-level configuration.
- The input data placement becomes much more static than it was before.
- We loosen the service-side constraints on what goes where much more than before.
There are a lot of changes in the current PR that are not strictly related to reorganizing the secondary location logic, but rather to code refactoring. I know the best moment to work on something like that is when you are actually reiterating through the code, but it could be good to have a separate issue describing that refactoring, to be resolved with the current PR as well. If you think that would be too much effort, though, just skip this request of mine.
We would also benefit from a well-described policy of the current service behavior, so that we can update the relevant section in the documentation. I believe we need to review it together with the P&R team.
@@ -357,7 +357,7 @@ def getChunkBlocks(self, numChunks=1):
         thisChunk.update(list(self.getParentBlocks()))
         thisChunkSize += sum([blockInfo['blockSize'] for blockInfo in viewvalues(self.getParentBlocks())])
         # keep same data structure as multiple chunks, so list of lists
-        return [thisChunk], [thisChunkSize]
+        return thisChunk, thisChunkSize
If the types to be returned from here are supposed to be a list and an integer, as stated in the docstring above, this line sounds like it should be:
return list(thisChunk), thisChunkSize
because thisChunk is defined as a set, I think. If what is returned here is correct, then please fix the docstring instead.
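The type concern above can be shown with a toy example. This is not the real getChunkBlocks() code, only the set-vs-list behaviour under discussion; the variable names mirror the snippet for readability.

```python
# Toy illustration of the set-vs-list concern above; this is not the
# real getChunkBlocks() implementation, just the type behaviour at issue.
thisChunk = set()
thisChunk.update(["block1", "block2"])
thisChunkSize = 100

# returning the set directly yields a set, not the list the docstring promises
print(type(thisChunk).__name__)        # → set
# the explicit cast produces the documented list type
print(type(list(thisChunk)).__name__)  # → list
```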
        :return: a string with the final pileup destination PNN
        """
        # FIXME: workflows should be marked as failed if there is no common
        # site between SiteWhitelist and secondary location
This, I believe, still holds, regardless of where the PU location comes from - either the campaign-level configuration or the workflow description.
They are not marked as failed, but there is a Prometheus alert. This way people can react and get the workflow through the system.
        campSecLocations = campConfig['Secondaries'].get(dsetName, [])
        campBasedLocation.append(set(campSecLocations))

        if not campSecLocations:
It seems we now rely completely on the campaign-level secondaries configuration. I'd say this may lead to human errors.
This is how it was before, but now we have N replicas of the pileup, where N is the number of locations in the campaign. Before, we had a single replica enforced (though from discussions with Hasan, they were manually creating replicas for all the other locations). So I'd say we are in the same state, just with more automation now :-D
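The "N replicas for N campaign locations" behaviour could be sketched as follows. This is a hypothetical helper, not the actual MSTransferor code; the function name, the rule-spec dict keys, the account name, and the dataset/RSE names are all illustrative assumptions.

```python
# Hypothetical sketch, not the actual MSTransferor code: build one rule
# specification per campaign-defined pileup location, mirroring the
# "N replicas for N locations" behaviour described above.
def buildPileupRules(dsetName, campSecLocations, account="wmcore_transferor"):
    """Return one Rucio-style rule spec per RSE from the campaign config."""
    return [{"did": dsetName,
             "rse_expression": rse,
             "copies": 1,
             "grouping": "ALL",
             "account": account}
            for rse in campSecLocations]

# illustrative dataset and site names
rules = buildPileupRules("/MinBias/Era-v1/PREMIX",
                         ["T1_US_FNAL_Disk", "T2_CH_CERN"])
print(len(rules))  # → 2
```

Each campaign location gets its own single-copy rule, so removing a site from the campaign configuration naturally removes its replica on the next pass.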
        if not campSecLocations:
            msg = "Workflow has been incorrectly assigned: %s. The secondary dataset: %s, "
            msg += "belongs to the campaign: %s, with does not define the secondary "
typo: with -> which
@@ -352,221 +330,80 @@ def checkDataLocation(self, wflow):
        self.logger.info("Request %s has %d final blocks from %s",
                         wflow.getName(), len(getattr(wflow, methodName)()), methodName)

    def _checkPrimaryDataVolume(self, wflow, wflowPnns):
I see this volume estimation method is no longer called, but I am not sure it was completely useless. The only issue that made it somewhat constrained was the single PNN returned. It could have returned a weighted list of all the PNNs holding any piece of the dataset, using a normalized value of the volume as a weight. But anyway, this was a side comment; such an approach would completely change many other things.
In short, we provide a list of RSEs to Rucio and let Rucio do the data management job.
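Handing Rucio a list of candidate sites typically means joining them into one RSE expression with the union operator "|". The helper below is a hypothetical sketch (toRseExpression and the site names are assumptions); only the "|" union syntax comes from Rucio's RSE expression language.

```python
# Sketch, not MSTransferor code: Rucio RSE expressions support a union
# operator "|", so a list of candidate PNNs can be handed to Rucio as a
# single expression, and Rucio picks the actual destination(s) itself.
def toRseExpression(pnns):
    """Join candidate PNNs into one RSE expression (hypothetical helper)."""
    # sort for a deterministic expression regardless of input order
    return "|".join(sorted(pnns))

print(toRseExpression({"T2_CH_CERN", "T1_US_FNAL_Disk"}))
# → T1_US_FNAL_Disk|T2_CH_CERN
```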
        # figure out to configure the rucio rule
        dids, didsSize, grouping = self.decideDataPlacement(wflow, dataIn)
        if not dids and dataIn["type"] == "primary":
            # no valid files in any blocks, it will likely fail in global workqueue
If you are changing blocks to dids in the code, please do the equivalent in the comment.
            listBlockSets, listSetsSize = wflow.getChunkBlocks()
            msg = f"Placing {len(listBlockSets)} blocks ({gigaBytes(listSetsSize)} GB), "
            msg += f"with grouping: {grouping} for DQMHarvest workflow."
        elif wflow.getOpenRunningTimeout() > self.msConfig["openRunning"]:
I am still not comfortable with this openRunningTimeout naming. If we know all the places where it is used, we'd better change it to something that reflects its real purpose.
@vkuznet Valentin, could you please have a quick look at my 3rd commit? It's still very raw, but I wanted to make it available to see whether it is more or less what you were suggesting above. If so, I can proceed with it. With that code in mind, each class would behave slightly differently and also provide the final setup for the Rucio rules. There are still a few cases where we have all sorts of data in the same workflow, and in those cases it's hard to classify anything!
Thanks for your review, Todor. I apologize again for providing so many changes in something that was supposed to be small, but I thought small changes would only make things more complex and ugly, which is why I made this substantial refactoring.
Yes, I think it is a proper architecture, where each Workflow has a set of common methods and the individual details are delegated to specific classes. I only found one issue with a return statement and what type it should return.
Thanks for the confirmation, I just wanted to make sure I was going in the right direction. I will work on the remaining implementation by Monday or so.
@vkuznet I think I have addressed the service structure concern that you had. It made the MSTransferor code even smaller and hopefully more maintainable. Todor, Valentin, I still have to run some real tests with this code in, but I think it's ready for another code review. Note that none of the logic has been (intentionally) changed, and all the features/changes mentioned in the PR description are still valid.
@todor-ivanov @vkuznet I think I addressed most of your comments, other than making the grouping setting configurable. I'd rather leave that one for the future (if it's really needed). The last 2 commits have those recent changes.
Thanks @amaltaro
It looks good to me.
I took the opportunity to change a few logging levels here and there, and finally disabled the verbose logs.
Commits:
- fix data returned from getChunkBlocks method
- do not make it a list of a set
- special case for secondaries in RelVals
- another bugfix for the secondary relval case
- remove confusing log about location used
Unit tests are failing while contacting DBS, e.g.:
@vkuznet I know you are working on a new release; pinging you just so you are aware of this.
test this please
Sorry, I didn't update the configuration file (doing it now). Things should be back to normal now.
test this please
Commits:
- complete code reorganization
- fix getInputSecData and GrowingWorkflow
- remove dual space from RequestInfo
- remove useless init method from sub-classes
- remove duplicate Workflow instantiation in RequestInfo
- apply Todor and Valentin suggestions
- change a few log level lines from debug to info
- remove unused import
Commits:
- more unit tests
- unit tests with the Workflow class relocation
- bunch of new unit tests for the new templates
- update unit tests with the getChunkBlocks removal
- remove getChunkBlocks from unit tests as well
Thank you very much for the review and the great suggestions, Valentin and Todor.
Fixes #10975
Status
ready
Description
The initial objective of this PR was the following:
However, I decided to refactor many things that were in place to deal with the old PhEDEx-based data placement. A summary of those changes is:
Is it backward compatible (if not, which system it affects?)
YES
Related PRs
Given that it modifies a recent feature introduced with #11141, it needs to be properly tested again.
External dependencies / deployment changes
Service configuration disabling verbose logs:
https://gitlab.cern.ch/cmsweb-k8s/services_config/-/merge_requests/147
and
https://gitlab.cern.ch/cmsweb-k8s/services_config/-/merge_requests/148