Skip Custodial and make generic Disk NonCustodial data placement in RucioInjector #9989
Tests on the testbed agent went fine.
Backported and merged in 1.4.1.patch2
@amaltaro, I made a single comment in the code. But given that the change already went into the wmagent branch the question may easily be left behind. All the rest looks good to me.
try:
    resp = self.rucio.createReplicationRule(container,
                                            rseExpression=rseExpr, **kwargs)
except Exception as exc:
Looking at the code of the createReplicationRule function, it seems that the only exception that can be raised there and captured here is AccessDenied, from [1]. All the rest are masked in [2], including the DuplicateRule a few lines above. So the question is: should we retry indefinitely here, expecting something on the Rucio side to fix the error by itself and grant the access in question, or is it worth alerting the WMCore team after a few retries that something is wrong?
[1] except AccessDenied as ex:
[2] except Exception as ex:
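A minimal sketch of the exception flow described in this comment, for readers without the wrapper code at hand. The exception classes are local stand-ins, `create_rule_sketch` is a hypothetical name, and returning an empty list for masked errors is an assumption; this is not the actual WMCore implementation.

```python
class AccessDenied(Exception):
    """Local stand-in for the Rucio client exception of the same name."""


class DuplicateRule(Exception):
    """Local stand-in for the Rucio client exception of the same name."""


def create_rule_sketch(client_call):
    """Hypothetical wrapper in which only AccessDenied reaches the caller."""
    try:
        return client_call()
    except DuplicateRule:
        # Handled inside the wrapper (assumed: nothing new to report).
        return []
    except AccessDenied:
        # Like [1]: the one exception that propagates to the caller.
        raise
    except Exception:
        # Like [2]: every other error is masked from the caller.
        return []
```

With this shape, a broad `except Exception` at the call site can in practice only ever see `AccessDenied`.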
This code is now only dealing with Disk data placement, through a very generic RSE expression. So if we hit an AccessDenied, it should happen only once in a lifetime, until the CMS Rucio team fixes it. Given that the impact would be very clear too (though not in the form of a WMAgent alert), I decided not to deal with that specific exception in this code (it remains in the T0 logic though).
That's a good question though, and ideally we should eventually retry actions a few times (over a few different cycles) until a hard failure is triggered (possibly triggering a notification/alert too). There are many places where this behaviour would be very helpful.
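The "retry over a few cycles, then hard-fail and alert" idea could look roughly like the sketch below. `RetryTracker`, its method names, and the `alert` callback are all hypothetical; nothing here exists in WMCore, and the threshold of 3 is an arbitrary assumption.

```python
class RetryTracker:
    """Counts consecutive failures per key across polling cycles."""

    def __init__(self, max_attempts=3):
        self.max_attempts = max_attempts
        self.failures = {}  # key -> number of consecutive failed cycles

    def run(self, key, action, alert):
        """Try `action` once in this cycle; return 'ok', 'retry' or 'failed'."""
        if self.failures.get(key, 0) >= self.max_attempts:
            # Hard failure was already reported; do not retry again.
            return "failed"
        try:
            action()
        except Exception as exc:
            self.failures[key] = self.failures.get(key, 0) + 1
            if self.failures[key] >= self.max_attempts:
                # Threshold reached: notify once and stop retrying.
                alert(key, exc)
                return "failed"
            return "retry"
        # Success clears any earlier failures for this key.
        self.failures.pop(key, None)
        return "ok"
```

Each component cycle would call `run()` once per container; keeping the counter outside the cycle is what lets transient errors recover while persistent ones eventually surface as an alert.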
Fixes #9639
Status
ready
Description
Summary of changes is:
containerDiskRuleParams: additional parameters overriding the default container-level rule attributes, only for Disk data placement
containerDiskRuleRSEExpr: very generic RSE expression to be used for container-level Disk rules
Is it backward compatible (if not, which system it affects?)
yes
Related PRs
To be merged after: #9988
External dependencies / deployment changes
None