-
Notifications
You must be signed in to change notification settings - Fork 108
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add checks for StatusAdvanceTimeout expiration and send alarms. #11299
Add checks for StatusAdvanceTimeout expiration and send alarms. #11299
Conversation
Jenkins results:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Todor, I left some comments and questions along the code.
statusAdvanceTimeout = self.msConfig['statusAdvanceTimeout'] * 3600 | ||
|
||
if status == None: | ||
status = wflow['RequestTransition'][-1]['Status'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be safer to wrap this with a try/except block.
alertName = "{}: Stale Workflow: {}".format(self.alertServiceName, wflow['RequestName']) | ||
alertSeverity = "high" | ||
alertSummary = "[MSRuleCleaner] Found a stale workflow." | ||
alertDescription = "\nWorkflow: {} has exceeded the Status Advance Timeout of: {} for status: {}.".format(wflow['RequestName']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Error, too few arguments.
And for the line below, I am not sure whether we have limit of number of characters for these alerts. The additionalInfo
might be too much information for the alert.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Error, too few arguments.
I'll fix that.
The additionalInfo might be too much information for the alert.
Well, I think it is actually important. It contains the bit allowing you to distinguwish all the possible usecases which could trigger the alarm. this would shorten the debugging process significantly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, but please check with Valentin and/or the CMS Monitoring on whether there are any limitations in the alert size (or any of their attributes, e.g. summary, description, etc). I remember hitting it back in the days of 100s of rule numbers in one of these alert attributes.
@@ -545,12 +597,18 @@ def archive(self, wflow): | |||
# Make all the needed checks before trying to archive | |||
if not (wflow['IsClean'] or wflow['ForceArchive']): | |||
msg = "Not properly cleaned workflow: %s" % wflow['RequestName'] | |||
if self._isStatusAdvanceExpired(wflow, wflow['RequestStatus']): | |||
self.alertStatusAdvanceExpired(wflow, msg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we have this check being done only once in the method upstream (after catching these exceptions)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking the same at the beginning, but then we loose the additional info provided by the message. Which contains really important information about why the workflow is sitting stuck and not progressing, Meaning at which point it was ejected out of the pipeline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You raise an exception and you provide the error message. So you could catch it upstream and make it part of the error message (as we are discussing in a different comment).
bdf7c93
to
d2925d4
Compare
Jenkins results:
|
Jenkins results:
|
Jenkins results:
|
Jenkins results:
|
51466e6
to
8ed856a
Compare
Hi @amaltaro, I did address your comments. The alarm was again manually tested with one old workflow. You must have received it in your mailbox as well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Todor, please find a partial review along the code.
2e8f3c4
to
79d3f1f
Compare
@amaltaro Please find the changes addressing your comments in my latest commit (already squashed, actually). |
test this please |
Jenkins results:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@todor-ivanov I won't be able to review it for another hour, but I wanted to leave a strong recommendation here to avoid code duplication.
Instead of using:
if self._isStatusAdvanceExpired(wflow):
self.alertStatusAdvanceExpired(wflow, additionalInfo=msg)
I would suggest to actually call _isStatusAdvanceExpired
from inside the alertStatusAdvanceExpired
method, at the very beginning.
This would give a fair simplification of these changes.
In addition to that, we could use of some basic unit tests for the two new methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition to my comments above, I left a couple of extra comments along the code.
Todor, as we discussed yesterday, it would be useful to have the ability to enable/disable these alerts via configuration. This is the parameter we use for the other microservices (but you can default it to True such that no updates are required to the configuration file):
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/MicroService/MSOutput/MSOutput.py#L98
6f20ae2
to
e98ff86
Compare
Jenkins results:
|
Jenkins results:
|
@amaltaro please take another look |
Jenkins results:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Todor, please see comments inside the alert methods.
Rename SatusAdvanceTimeout && Change _isStatusAdvanceExpired signature && Fix log messages. Fix Alert call. Fix _isStatusAdvanceExpired call. Typo Query only for last status transition time && Fix alarm text. Move the call to _isStatusAdvanceExpired to a single place. Add configuration flag for enabling alarms from msConfig. Rename _getStatusTransitionTime Move SendNotification flag to the end of the alert mothods.
Unit tests - pylint fixes
f7060c6
to
6499215
Compare
Jenkins results:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unit test failures are all unstable. Thanks for these changes Todor, they look good to me.
Fixes #11094
Status
Ready
Description
Due to many reasons, some workflows may get stuck in some of the
MSRulecleaner
initial statuses i.e.announced
orrejected
,completed-rejected
for long periods. One being the reason from the original issue - Tape Rucio rules not satisfied.With the current change we add the proper checks for stale workflows and if a configurable timeout has expired an alarm is sent.
Is it backward compatible (if not, which system it affects?)
YES
Related PRs
services_config
related changes:https://gitlab.cern.ch/cmsweb-k8s/services_config/-/merge_requests/174
https://gitlab.cern.ch/cmsweb-k8s/services_config/-/merge_requests/175
https://gitlab.cern.ch/cmsweb-k8s/services_config/-/merge_requests/176
External dependencies / deployment changes
None