[MSTransferor] Lock the whole input container for growing workflows #11141

Merged
todor-ivanov merged 1 commit into dmwm:master on May 12, 2022

Conversation

@amaltaro (Contributor) commented May 12, 2022

Fixes #11130

Status

ready

Description

With this PR, we have:

  • a new microservice configuration parameter called openRunning, used to define whether a workflow should be considered to have a growing input dataset (the default value has been set to 7 days)
  • when an open running workflow is found, a Rucio rule is created for the whole input container with grouping DATASET and a single copy, against a logical OR of all the RSEs matching the workflow + campaign + quota (see the sketch right after this list)
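For illustration only, a minimal sketch of how such a container-level rule could be created with the standard Rucio client; the scope, container name and RSE names below are made up, and the call does not claim to mirror the exact MSTransferor code:

    from rucio.client.ruleclient import RuleClient

    ruleClient = RuleClient()

    # whole input container of the growing workflow (hypothetical DID)
    dids = [{"scope": "cms", "name": "/PrimaryDS/CampaignX-ProcStr-v1/AOD"}]

    # logical OR of all the RSEs matching workflow + campaign + quota (made-up names)
    rseExpression = "T1_US_FNAL_Disk|T2_CH_CERN|T2_DE_DESY"

    # single copy with DATASET grouping: each block (Rucio dataset) stays together
    # on one RSE, while different blocks can be spread over the RSEs in the expression
    ruleClient.add_replication_rule(dids=dids,
                                    copies=1,
                                    rse_expression=rseExpression,
                                    grouping="DATASET",
                                    comment="growing input container (sketch)")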

Is it backward compatible (if not, which system does it affect)?

YES

Related PRs

None

External dependencies / deployment changes

Configuration changes in:
Prod: https://gitlab.cern.ch/cmsweb-k8s/services_config/-/merge_requests/145
Preprod: https://gitlab.cern.ch/cmsweb-k8s/services_config/-/merge_requests/146
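For context, the new option would show up in the MSTransferor service configuration roughly as sketched below; the surrounding structure is simplified and the unit (seconds) is an assumption, based on the parameter being compared against Workflow.getOpenRunningTimeout(), which is documented in seconds:

    # simplified, hypothetical excerpt of the MSTransferor configuration
    msConfig = {
        # ... other MSTransferor settings ...
        "openRunning": 7 * 24 * 60 * 60,  # 7 days; threshold to treat the input dataset as growing
    }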

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests deleted
    • 11 tests added
    • 3 changes in unstable tests
  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 5 warnings
    • 73 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 22 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13201/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests deleted
    • 11 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 5 warnings
    • 77 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 22 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13202/artifact/artifacts/PullRequestReport.html

Change how dids are defined and improve logging

fix lazy logging
@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests no longer failing
    • 7 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 5 warnings
    • 76 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 22 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13203/artifact/artifacts/PullRequestReport.html

@amaltaro amaltaro requested review from todor-ivanov and vkuznet May 12, 2022 02:52
@amaltaro (Contributor, Author)

@todor-ivanov @vkuznet if you find any critical problems with this PR, please leave it open and I will address them once I am up. Otherwise, please merge it and make a new WMCore release (apparently 2.0.3.pre7).

If you have minor suggestions or anything else that is not a blocker, please leave them in a review and I will get them fixed as I work on #10975, which I also plan to get done tomorrow.

I have just updated the GitLab request with all the configuration changes that we need. It is still missing the usual cmsdist PR. Once you have that ready, please update the ticket so that Imran can proceed with the testbed upgrade. Thanks a lot!

@todor-ivanov (Contributor) left a comment


Alan, thanks for making those changes. I have left a few comments below. Honestly, given the urgency of creating the next tag for the validation process of the new release, I was initially thinking of approving the PR and resolving the naming and other little concerns later, together with the other related changes. But on second thought, since there will also be no workflows configured to use the variables in question during this validation, I decided to have the discussion resolved before we merge. Sorry for the extra delay, but I really find this misleading.

        Retrieve the OpenRunningTimeout parameter for this workflow
        :return: an integer with the amount of secs
        """
        return self.data.get("OpenRunningTimeout", 0)

Just a single question here: how should the default value of 0 seconds be interpreted on the caller side? Should it be "timeout reached", or "no timeout set at all"? Putting this in the docstring, together with the exact meaning of OpenRunningTimeout itself, would also be helpful.
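For reference, a possible docstring along the lines requested above; it assumes that a return value of 0 means no open-running period was requested for the workflow, which is consistent with how the comparison in MSTransferor behaves (0 can never exceed a positive openRunning threshold):

    def getOpenRunningTimeout(self):
        """
        Retrieve the OpenRunningTimeout parameter for this workflow, i.e. the
        period (in seconds) during which the workflow input is still expected
        to grow (new blocks may keep being added to it).

        :return: an integer with the amount of seconds; 0 (the default) means
                 that no open-running period was requested, so the input is
                 treated as static (assumption, not taken from the actual PR)
        """
        return self.data.get("OpenRunningTimeout", 0)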

@@ -477,3 +477,10 @@ def getWorkflowGroup(self):
        if self.isRelVal():
            return "relval"
        return "production"

    def getOpenRunningTimeout(self):

The way this method is called (together with the related workflow and ms-configuration variables) is quite confusing. It gives the feeling that it is related to the time spent in running-open. I fell into that trap myself when I first read this PR. I know MSTransferor does not deal with running-open states, but just from reading the configuration one is led to search for the actual meaning of those variables in another component.

if subLevel == "container":
    grouping = "ALL"
    dids = [dataIn['name']]
elif wflow.getOpenRunningTimeout() > self.msConfig["openRunning"]:
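For readers of this excerpt, the branch above plausibly continues along the lines sketched below, based only on the PR description (a single container-level rule with DATASET grouping for growing inputs); this is a reconstruction, not the actual diff:

    elif wflow.getOpenRunningTimeout() > self.msConfig["openRunning"]:
        # growing input dataset: lock the whole container with one rule;
        # DATASET grouping keeps each block on a single RSE while allowing
        # different blocks to land on different RSEs of the expression
        grouping = "DATASET"
        dids = [dataIn['name']]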

As mentioned below (in the actual method definition), merely reordering "Open" and "Running" in the name of this method, so that it does not match or resemble another state in the system, makes it confusing. The same goes for the configuration variable name. It gives a totally wrong impression of how this mechanism is triggered.

And why do we actually set a Timeout in the workflow configuration and then compare it with a statically configured variable for the whole service, just to trigger a given behaviour? Having something called Timeout in the workflow description makes me think it is the maximum time a workflow may spend in a given state, after which an action is taken. But, IIUC, that is not how it is used here. It is more like: whoever configured the workflow has set some tolerance for not "exactly" triggering a service behaviour, and that tolerance is then shortened or prolonged by the comparison between the workflow and the service configuration variables. Honestly, I do not understand why this would not just be a Boolean flag in the workflow.

@todor-ivanov (Contributor) commented May 12, 2022

Actually, I was missing the fact that this is the final bit of resolving #11048, which we really need to have in the system this month.
@amaltaro I am merging it now and creating the new tag. Please follow up on the comments I have left with a new PR or issue, if needed. @vkuznet please do leave your comments as well.

@todor-ivanov todor-ivanov merged commit 84272e1 into dmwm:master May 12, 2022
@amaltaro (Contributor, Author)

That's a good comment, Todor.

Originally, this request spec attribute OpenRunningTimeout was implemented to keep workflows in the running-open status (a workflow status) for that period of time, so that new files could still be added to workqueue elements with open DBS blocks (DBS2 era).

With time and the migration to DBS3, we slowly started deprecating that functionality - since DBS3 only contains closed blocks! - and the request status handling was also moved to a CherryPy thread (I think this is the moment we lost this coupling).

So there is a minor change of meaning for this OpenRunningTimeout attribute: it is no longer restricted to the running-open status and it no longer deals with open blocks in DBS. Instead, as long as OpenRunningTimeout has not been reached for the global workqueue inbox element (measured from the time the last block was added to the global workqueue elements), global workqueue will keep running input data discovery and will create new workqueue elements whenever a new block is inserted into Rucio/DBS.
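In other words, the global workqueue decision can be pictured roughly as in the snippet below; the function name and arguments are illustrative only and do not come from the actual WorkQueue code:

    import time

    def shouldKeepDiscoveringInput(lastBlockAddedTime, openRunningTimeout):
        """
        Illustrative sketch: keep input data discovery open for a workqueue
        inbox element while the time since the last block was added is still
        within the workflow's OpenRunningTimeout (both values in seconds).
        """
        if openRunningTimeout <= 0:
            # no open-running period requested: the input is treated as static
            return False
        return (time.time() - lastBlockAddedTime) < openRunningTimeout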
