Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Skip Custodial and make generic Disk NonCustodial data placement in RucioInjector #9989

Merged
merged 1 commit into from
Oct 14, 2020

Conversation

amaltaro
Copy link
Contributor

@amaltaro amaltaro commented Oct 13, 2020

Fixes #9639

Status

ready

Description

Summary of changes is:

  • created two new RucioInjector component configuration parameters:
    • containerDiskRuleParams: additional parameters overriding the default container-level rule attributes, only for Disk data placement
    • containerDiskRuleRSEExpr: very generic RSE expression to be used for container-level Disk rules
  • if it's a T0 agent, then proceed with standard output data placement; otherwise, proceed with central production data placement, which has the following peculiarities:
    • Custodial data placement is bypassed and the container is marked as subscribed in the relational database;
    • NonCustodial data placement overrides settings at the workflow spec, where destination is set to any T1+T2 real Disk endpoint; number of copies is set to 2, with a grouping=DATASET, and a rule weight is added as well (with "ddm_quota" value).

Is it backward compatible (if not, which system it affects?)

yes

Related PRs

To be merged after: #9988

External dependencies / deployment changes

None

@cmsdmwmbot
Copy link

Jenkins results:

  • Unit tests: succeeded
  • Pylint check: failed
    • 10 warnings and errors that must be fixed
    • 6 warnings
    • 7 comments to review
  • Pycodestyle check: succeeded
    • 3 comments to review
  • Python3 compatibility checks: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/10523/artifact/artifacts/PullRequestReport.html

@amaltaro
Copy link
Contributor Author

Tests on the testbed agent went fine.

@amaltaro
Copy link
Contributor Author

Backported and merged in 1.4.1.patch2

Copy link
Contributor

@todor-ivanov todor-ivanov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amaltaro, I made a single comment in the code. But given that the change already went into the wmagent branch the question may easily be left behind. All the rest looks good to me.

try:
resp = self.rucio.createReplicationRule(container,
rseExpression=rseExpr, **kwargs)
except Exception as exc:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the code of createReplicationRule function, it seems that the only Exception that can be raised there and captured here is AccessDenied from [1]. All the rest are masked in [2], including the DuplicateRule few lines above. So the question is should we retry indefinitely/infinitely here expecting something to happen all by itself on the Rucio side to fix the error and give the access in question or it is worth alarming the WMCore team after few retries here that, something is wrong.

[1]

except AccessDenied as ex:

[2]

except Exception as ex:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is now only dealing with Disk data placement, through a very generic rse expression. So if we hit an AccessDenied, it should happen only once in life, until the CMS Rucio team fixes it. Given that the impact would be very clear too - not in the form of a WMAgent alert thoguh - I decided not to deal with that specific exception in this code (it remains in the T0 logic though).

That's a good question though, and ideally, we should eventually retry actions a few times (over a few different cycles) until a hard failure is triggered (possibly triggering a notification/alert too). There are many places that this behaviour would be very helpful.

@cmsdmwmbot
Copy link

Jenkins results:

  • Unit tests: succeeded
  • Pylint check: failed
    • 10 warnings and errors that must be fixed
    • 6 warnings
    • 7 comments to review
  • Pycodestyle check: succeeded
    • 3 comments to review
  • Python3 compatibility checks: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/10525/artifact/artifacts/PullRequestReport.html

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Repurpose workflow output data placement settings to deal with Rucio rules
3 participants