Add alert to MSRuleCleaner for not archived workflows. #11373
test this please
3021f4e to 87630b3
Typo Fix alarm description.
87630b3 to 940e5b4
Hi @amaltaro, you may proceed with your review. The code is ready, and I did force the |
@todor-ivanov I think we need to find the implementation with the best cost-benefit (i.e., something good enough).
This PR provides an abstract alarm, just saying "go and figure out why it's stuck". In addition to that, it brings in a bunch of useless information (like the whole RequestTransition data structure).
The other PR indeed has many more LOC, but it does provide a clue of what is wrong, e.g.:
- tape transfers pending
- parentage has not been resolved
- general exception; etc
I would be in favor of following up on #11299 and making sure we have a "good enough" error message in the alert. It does not need to be complete, but it should at least give a high-level cause for why the workflow is stuck. In addition, there are still quite a few comments in there that need to be worked on, so we might be able to shorten that development.
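To illustrate what a "good enough" alert message could look like, here is a minimal sketch that attaches one high-level cause to a short summary instead of dumping the whole RequestTransition structure. The function name, the cause keys, and the message wording are hypothetical and are not taken from either PR; the cause texts are the examples listed above.

```python
# Hypothetical cause labels; the texts mirror the examples given in the review comment
CAUSES = {
    'tape': 'tape transfers pending',
    'parentage': 'parentage has not been resolved',
    'exception': 'general exception',
}


def buildAlertSummary(wflowName, hoursStuck, causeKey):
    """
    Build a one-line alert summary carrying a high-level cause.
    Unrecognized cause keys fall back to 'unknown'.
    """
    cause = CAUSES.get(causeKey, 'unknown')
    return "Workflow %s not archived after %d hours: %s" % (wflowName, hoursStuck, cause)
```

For example, `buildAlertSummary('wf1', 48, 'tape')` would yield a summary ending in "tape transfers pending", which already tells the operator where to look.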
Given that the other PR implementation has been merged, I am closing this one out.
Fixes #11094
Supersedes #11299
Status
Ready
Description
With the current PR, an alarm is added to MSRuleCleaner to raise an alert for workflows that are stuck and have not been archived for more than a configurable amount of time. The configurable parameter should be read from msConfig['archiveAlarmHours']. The check should be performed only for workflows sitting in announced.
Is it backward compatible (if not, which system it affects?)
YES
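The check outlined in the Description can be sketched roughly as follows. This is only an illustration, not the code in this PR: the function name is hypothetical, and the workflow fields used here (RequestName, RequestStatus, and a RequestTransition list whose entries carry an UpdateTime in epoch seconds) are assumptions about the shape of the workflow records. Only msConfig['archiveAlarmHours'] and the announced status come from the description itself.

```python
import time


def findStuckWorkflows(workflows, msConfig, now=None):
    """
    Return the names of workflows that have been sitting in 'announced'
    for longer than msConfig['archiveAlarmHours'].

    Assumed record layout (not confirmed by the PR): each workflow is a dict with
    'RequestName', 'RequestStatus' and a non-empty 'RequestTransition' list whose
    last entry holds the latest transition and its 'UpdateTime' (epoch seconds).
    """
    now = time.time() if now is None else now
    thresholdSecs = msConfig['archiveAlarmHours'] * 3600
    stuck = []
    for wflow in workflows:
        # the check applies only to workflows sitting in 'announced'
        if wflow['RequestStatus'] != 'announced':
            continue
        lastUpdate = wflow['RequestTransition'][-1]['UpdateTime']
        if now - lastUpdate > thresholdSecs:
            stuck.append(wflow['RequestName'])
    return stuck
```

Passing `now` explicitly keeps the function easy to test; in a service loop it would simply default to the current time.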
Related PRs
service_config related changes:
https://gitlab.cern.ch/cmsweb-k8s/services_config/-/merge_requests/174
https://gitlab.cern.ch/cmsweb-k8s/services_config/-/merge_requests/175
https://gitlab.cern.ch/cmsweb-k8s/services_config/-/merge_requests/176
External dependencies / deployment changes
None