Add checks for StatusAdvanceTimeout expiration and send alarms. #11299

todor-ivanov · 2022-09-29T17:45:28Z

Status

Ready

Description

For various reasons, some workflows may stay stuck for long periods in one of the MSRuleCleaner initial statuses, i.e. announced, rejected, or completed-rejected. One such reason, from the original issue, is Tape Rucio rules not being satisfied.

This change adds checks for stale workflows: if a configurable timeout has expired, an alarm is sent.
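
A minimal sketch of the kind of check described above, not the code from this PR. It assumes each entry in `RequestTransition` carries an `UpdateTime` in epoch seconds (an assumed field name; only `Status` appears in the diff) and that the timeout is configured in hours, as the `* 3600` conversion in the diff suggests. The function name and the `now` parameter are hypothetical.

```python
import time


def is_status_advance_expired(wflow, timeout_hours, now=None):
    """Return True if the workflow has sat in its last status longer than the timeout."""
    now = time.time() if now is None else now
    # 'UpdateTime' (epoch seconds) is an assumed field name for this sketch
    last_transition = wflow["RequestTransition"][-1]
    elapsed = now - last_transition["UpdateTime"]
    # The timeout is given in hours, so convert it to seconds before comparing
    return elapsed > timeout_hours * 3600
```

Passing `now` explicitly keeps the check deterministic, which makes it easy to exercise in a unit test.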

Is it backward compatible (if not, which system it affects?)

YES

Related PRs

services_config related changes:

https://gitlab.cern.ch/cmsweb-k8s/services_config/-/merge_requests/174
https://gitlab.cern.ch/cmsweb-k8s/services_config/-/merge_requests/175
https://gitlab.cern.ch/cmsweb-k8s/services_config/-/merge_requests/176

External dependencies / deployment changes

None

cmsdmwmbot · 2022-09-29T17:54:20Z

Jenkins results:

Python3 Unit tests: failed
- 2 new failures
- 1 tests no longer failing
- 1 changes in unstable tests
Python3 Pylint check: failed
- 2 warnings and errors that must be fixed
- 7 warnings
- 53 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 2 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13605/artifact/artifacts/PullRequestReport.html

amaltaro

Todor, I left some comments and questions along the code.

src/python/WMCore/MicroService/MSRuleCleaner/MSRuleCleaner.py

amaltaro · 2022-09-30T01:11:09Z

src/python/WMCore/MicroService/MSRuleCleaner/MSRuleCleaner.py

+        statusAdvanceTimeout = self.msConfig['statusAdvanceTimeout'] * 3600
+
+        if status == None:
+            status = wflow['RequestTransition'][-1]['Status']


Might be safer to wrap this with a try/except block.
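
One way the suggested try/except could look, as a sketch only. The helper name is hypothetical, and returning `None` on failure is an assumption about how the caller would want to handle a malformed workflow record.

```python
def get_current_status(wflow):
    """Return the status of the last transition, or None if it cannot be read."""
    try:
        return wflow["RequestTransition"][-1]["Status"]
    except (KeyError, IndexError, TypeError):
        # KeyError: a key is missing; IndexError: the transition list is empty;
        # TypeError: the record is malformed (e.g. RequestTransition is None)
        return None
```

Catching only these three exception types avoids hiding unrelated errors behind a bare `except`.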

src/python/WMCore/MicroService/MSRuleCleaner/MSRuleCleaner.py

amaltaro · 2022-09-30T01:25:08Z

src/python/WMCore/MicroService/MSRuleCleaner/MSRuleCleaner.py

+        alertName = "{}: Stale Workflow: {}".format(self.alertServiceName, wflow['RequestName'])
+        alertSeverity = "high"
+        alertSummary = "[MSRuleCleaner] Found a stale workflow."
+        alertDescription = "\nWorkflow: {} has exceeded the Status Advance Timeout of: {} for status: {}.".format(wflow['RequestName'])


Error, too few arguments.
And for the line below, I am not sure whether we have a limit on the number of characters for these alerts. The additionalInfo might be too much information for the alert.

> Error, too few arguments.

I'll fix that.
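
For reference, a sketch of the "too few arguments" problem: the template has three `{}` placeholders, so `str.format` needs three positional arguments. The helper name and parameter names are hypothetical; the template text is taken from the diff above.

```python
ALERT_TEMPLATE = ("\nWorkflow: {} has exceeded the Status Advance Timeout "
                  "of: {} for status: {}.")


def build_alert_description(request_name, timeout_hours, status):
    """Fill all three placeholders of the alert description template."""
    return ALERT_TEMPLATE.format(request_name, timeout_hours, status)


def format_with_one_argument(request_name):
    """Reproduce the bug: one argument for three placeholders raises IndexError."""
    try:
        return ALERT_TEMPLATE.format(request_name)
    except IndexError as ex:
        return "IndexError: %s" % ex
```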

> The additionalInfo might be too much information for the alert.

Well, I think it is actually important. It contains the bit that lets you distinguish all the possible use cases which could trigger the alarm. This would shorten the debugging process significantly.

Ok, but please check with Valentin and/or the CMS Monitoring on whether there are any limitations in the alert size (or any of their attributes, e.g. summary, description, etc). I remember hitting it back in the days of 100s of rule numbers in one of these alert attributes.
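
If a size limit does turn out to exist, one defensive option is to truncate long attributes before sending. This is only a sketch: the limit of 1024 characters is a made-up placeholder, not a documented value, and the helper name is hypothetical.

```python
MAX_FIELD_LEN = 1024  # hypothetical limit; the real value (if any) must be confirmed


def truncate_field(text, max_len=MAX_FIELD_LEN, marker="... [truncated]"):
    """Shorten text to at most max_len characters, marking where it was cut."""
    if len(text) <= max_len:
        return text
    # Reserve room for the marker so the result never exceeds max_len
    # (assumes max_len is at least as long as the marker)
    return text[:max_len - len(marker)] + marker
```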

src/python/WMCore/MicroService/MSRuleCleaner/MSRuleCleaner.py

amaltaro · 2022-09-30T01:30:49Z

src/python/WMCore/MicroService/MSRuleCleaner/MSRuleCleaner.py

@@ -545,12 +597,18 @@ def archive(self, wflow):
        # Make all the needed checks before trying to archive
        if not (wflow['IsClean'] or wflow['ForceArchive']):
            msg = "Not properly cleaned workflow: %s" % wflow['RequestName']
+            if self._isStatusAdvanceExpired(wflow, wflow['RequestStatus']):
+                self.alertStatusAdvanceExpired(wflow, msg)


Can't we do this check only once, in the upstream method (after catching these exceptions)?

I was thinking the same at the beginning, but then we lose the additional info provided by the message, which contains really important information about why the workflow is stuck and not progressing, meaning at which point it was ejected from the pipeline.

You raise an exception and you provide the error message. So you could catch it upstream and make it part of the error message (as we are discussing in a different comment).
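
A sketch of the pattern being proposed here: the lower-level method raises with the message, and the caller catches it once and forwards the message to the alert. The exception class, the `process` function, and the callback parameters are all hypothetical stand-ins, not names from the PR.

```python
class NotCleanError(Exception):
    """Hypothetical exception carrying the reason a workflow was skipped."""


def archive(wflow):
    # Same precondition as in the diff above, but the reason travels in the exception
    if not (wflow["IsClean"] or wflow["ForceArchive"]):
        raise NotCleanError("Not properly cleaned workflow: %s" % wflow["RequestName"])
    return "archived"


def process(wflow, is_expired, send_alert):
    """Single upstream place where the expiry check and the alert happen."""
    try:
        return archive(wflow)
    except NotCleanError as ex:
        if is_expired(wflow):
            # The exception message becomes the alert's additional info
            send_alert(wflow, additional_info=str(ex))
        return "skipped"
```

With this shape the message is not lost, and the expiry check lives in one place instead of next to every `raise`.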

cmsdmwmbot · 2022-12-01T11:38:13Z

Jenkins results:

Python3 Unit tests: failed
- 2 new failures
- 1 tests no longer failing
- 1 changes in unstable tests
Python3 Pylint check: failed
- 10 warnings and errors that must be fixed
- 7 warnings
- 51 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 2 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13779/artifact/artifacts/PullRequestReport.html

cmsdmwmbot · 2022-12-01T11:45:29Z

Jenkins results:

Python3 Unit tests: failed
- 2 new failures
- 1 changes in unstable tests
Python3 Pylint check: failed
- 10 warnings and errors that must be fixed
- 7 warnings
- 51 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 2 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13781/artifact/artifacts/PullRequestReport.html

cmsdmwmbot · 2022-12-01T11:50:41Z

Jenkins results:

Python3 Unit tests: failed
- 3 new failures
- 1 tests no longer failing
- 2 changes in unstable tests
Python3 Pylint check: failed
- 10 warnings and errors that must be fixed
- 7 warnings
- 51 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 2 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13780/artifact/artifacts/PullRequestReport.html

cmsdmwmbot · 2022-12-01T12:21:41Z

Jenkins results:

Python3 Unit tests: failed
- 3 new failures
- 1 tests no longer failing
Python3 Pylint check: failed
- 9 warnings and errors that must be fixed
- 7 warnings
- 54 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 1 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13782/artifact/artifacts/PullRequestReport.html

todor-ivanov · 2022-12-01T13:32:44Z

Hi @amaltaro,

I did address your comments. The alarm was again manually tested with one old workflow. You must have received it in your mailbox as well.

amaltaro

Todor, please find a partial review along the code.

src/python/WMCore/MicroService/MSRuleCleaner/MSRuleCleaner.py

todor-ivanov · 2022-12-01T15:23:05Z

@amaltaro Please find the changes addressing your comments in my latest commit (already squashed, actually).

amaltaro · 2022-12-01T19:12:53Z

test this please

cmsdmwmbot · 2022-12-01T19:33:14Z

Jenkins results:

Python3 Unit tests: failed
- 1 new failures
- 4 changes in unstable tests
Python3 Pylint check: succeeded
- 7 warnings
- 53 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13816/artifact/artifacts/PullRequestReport.html

amaltaro

@todor-ivanov I won't be able to review it for another hour, but I wanted to leave a strong recommendation here to avoid code duplication.

Instead of using:

            if self._isStatusAdvanceExpired(wflow):
                self.alertStatusAdvanceExpired(wflow, additionalInfo=msg)

I would suggest actually calling _isStatusAdvanceExpired from inside the alertStatusAdvanceExpired method, at the very beginning.

This would give a fair simplification of these changes.

In addition to that, we could use some basic unit tests for the two new methods.
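
A sketch of both suggestions together, under stated assumptions: `RuleCleanerSketch` is a stand-in class, not the real MSRuleCleaner; `UpdateTime` is an assumed field name; the injectable `clock` and the `sentAlerts` list exist only to make the tests deterministic. Only the two method names come from this PR.

```python
import time
import unittest


class RuleCleanerSketch:
    """Stand-in for the service class; not the real MSRuleCleaner."""

    def __init__(self, timeoutHours, clock=time.time):
        self.timeoutSecs = timeoutHours * 3600
        self.clock = clock      # injectable clock keeps the tests deterministic
        self.sentAlerts = []    # records alerts instead of sending them

    def _isStatusAdvanceExpired(self, wflow):
        # 'UpdateTime' (epoch seconds) is an assumed field name
        lastUpdate = wflow["RequestTransition"][-1]["UpdateTime"]
        return self.clock() - lastUpdate > self.timeoutSecs

    def alertStatusAdvanceExpired(self, wflow, additionalInfo=None):
        # The expiry check runs first, so callers no longer need their own `if`
        if not self._isStatusAdvanceExpired(wflow):
            return False
        self.sentAlerts.append((wflow["RequestName"], additionalInfo))
        return True


class StatusAdvanceTest(unittest.TestCase):
    def setUp(self):
        self.wflow = {"RequestName": "wf1",
                      "RequestTransition": [{"Status": "announced", "UpdateTime": 0}]}

    def testAlertSentWhenExpired(self):
        svc = RuleCleanerSketch(timeoutHours=1, clock=lambda: 7200)
        self.assertTrue(svc.alertStatusAdvanceExpired(self.wflow, additionalInfo="msg"))
        self.assertEqual(svc.sentAlerts, [("wf1", "msg")])

    def testNoAlertBeforeTimeout(self):
        svc = RuleCleanerSketch(timeoutHours=1, clock=lambda: 1800)
        self.assertFalse(svc.alertStatusAdvanceExpired(self.wflow))
        self.assertEqual(svc.sentAlerts, [])
```

Saved as a module, the tests can be run with `python -m unittest <module>`.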

amaltaro

In addition to my comments above, I left a couple of extra comments along the code.

Todor, as we discussed yesterday, it would be useful to have the ability to enable/disable these alerts via configuration. This is the parameter we use for the other microservices (but you can default it to True such that no updates are required to the configuration file):
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/MicroService/MSOutput/MSOutput.py#L98
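
A sketch of how such a flag could default to enabled without touching the configuration file. The key name `sendNotification` is an assumption based on the flag mentioned in this PR's commit messages; the helper functions are hypothetical.

```python
def apply_defaults(ms_config):
    """Fill in the notification flag only when the configuration omits it."""
    # 'sendNotification' is an assumed key name; defaulting to True means
    # existing configuration files keep working unchanged
    ms_config.setdefault("sendNotification", True)
    return ms_config


def maybe_send_alert(ms_config, send_func, *args):
    """Call send_func only when notifications are enabled in the configuration."""
    if not ms_config["sendNotification"]:
        return False
    send_func(*args)
    return True
```

`dict.setdefault` leaves an explicitly configured value alone, so an operator can still disable the alerts by setting the flag to `False`.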

src/python/WMCore/MicroService/MSRuleCleaner/MSRuleCleaner.py

cmsdmwmbot · 2022-12-02T11:02:08Z

Jenkins results:

Python3 Unit tests: failed
- 1 new failures
- 1 tests no longer failing
- 2 changes in unstable tests
Python3 Pylint check: succeeded
- 7 warnings
- 53 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13821/artifact/artifacts/PullRequestReport.html

cmsdmwmbot · 2022-12-02T11:05:18Z

Jenkins results:

Python3 Unit tests: failed
- 1 new failures
- 1 tests no longer failing
- 2 changes in unstable tests
Python3 Pylint check: succeeded
- 7 warnings
- 53 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13820/artifact/artifacts/PullRequestReport.html

todor-ivanov · 2022-12-02T11:31:36Z

@amaltaro please take another look

cmsdmwmbot · 2022-12-02T11:40:21Z

Jenkins results:

Python3 Unit tests: succeeded
- 1 tests no longer failing
- 2 tests added
- 1 changes in unstable tests
Python3 Pylint check: failed
- 337 warnings and errors that must be fixed
- 7 warnings
- 65 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13822/artifact/artifacts/PullRequestReport.html

amaltaro

Todor, please see comments inside the alert methods.

src/python/WMCore/MicroService/MSRuleCleaner/MSRuleCleaner.py

Rename SatusAdvanceTimeout && Change _isStatusAdvanceExpired signature && Fix log messages.
Fix Alert call.
Fix _isStatusAdvanceExpired call.
Typo
Query only for last status transition time && Fix alarm text.
Move the call to _isStatusAdvanceExpired to a single place.
Add configuration flag for enabling alarms from msConfig.
Rename _getStatusTransitionTime
Move SendNotification flag to the end of the alert methods.

Unit tests - pylint fixes

cmsdmwmbot · 2022-12-02T12:41:29Z

Jenkins results:

Python3 Unit tests: failed
- 2 new failures
- 1 tests no longer failing
- 2 tests added
- 1 changes in unstable tests
Python3 Pylint check: succeeded
- 7 warnings
- 65 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13823/artifact/artifacts/PullRequestReport.html

amaltaro

Unit test failures are all unstable. Thanks for these changes Todor, they look good to me.

todor-ivanov requested a review from amaltaro September 29, 2022 17:45

amaltaro requested changes Sep 30, 2022

amaltaro added the PR: Do not merge yet label Sep 30, 2022

todor-ivanov mentioned this pull request Nov 29, 2022

Add alert to MSRuleCleaner for not archived workflows. #11373

Closed

todor-ivanov force-pushed the bugfix_MSRuleCleaner_WorkflowsStuck_fix-11094 branch from bdf7c93 to d2925d4 Compare December 1, 2022 11:29

todor-ivanov removed the PR: Do not merge yet label Dec 1, 2022

todor-ivanov force-pushed the bugfix_MSRuleCleaner_WorkflowsStuck_fix-11094 branch from 51466e6 to 8ed856a Compare December 1, 2022 13:29

todor-ivanov requested a review from amaltaro December 1, 2022 13:31

amaltaro requested changes Dec 1, 2022

todor-ivanov force-pushed the bugfix_MSRuleCleaner_WorkflowsStuck_fix-11094 branch from 2e8f3c4 to 79d3f1f Compare December 1, 2022 15:19

todor-ivanov requested a review from amaltaro December 1, 2022 15:19

amaltaro requested changes Dec 1, 2022

amaltaro requested changes Dec 2, 2022

todor-ivanov force-pushed the bugfix_MSRuleCleaner_WorkflowsStuck_fix-11094 branch from 6f20ae2 to e98ff86 Compare December 2, 2022 10:49

todor-ivanov requested a review from amaltaro December 2, 2022 11:31

amaltaro requested changes Dec 2, 2022

Unit tests

6499215

Unit tests - pylint fixes

todor-ivanov force-pushed the bugfix_MSRuleCleaner_WorkflowsStuck_fix-11094 branch from f7060c6 to 6499215 Compare December 2, 2022 12:22

todor-ivanov requested a review from amaltaro December 2, 2022 12:23

amaltaro approved these changes Dec 2, 2022

amaltaro merged commit 00ce5cd into dmwm:master Dec 2, 2022
