Add checks for StatusAdvanceTimeout expiration and send alarms. #11299

@todor-ivanov (Contributor) commented Sep 29, 2022

Fixes #11094

Status

Ready

Description

For a variety of reasons, some workflows may remain stuck for long periods in one of the initial MSRuleCleaner statuses, i.e. announced or rejected, completed-rejected. One such reason is the one from the original issue: Tape Rucio rules not being satisfied.

This change adds checks for stale workflows and sends an alarm once a configurable timeout has expired.
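
A minimal sketch of the kind of staleness check described here, not the PR's actual code. It assumes each workflow document carries a `RequestTransition` list whose entries hold a `Status` and an `UpdateTime` in epoch seconds, and that the timeout is configured in hours. The function name, the `now` parameter, and the sample workflow are illustrative only.

```python
import time


def is_status_advance_expired(wflow, timeout_hours, now=None):
    """Return True if the workflow has sat in its last status longer than the timeout."""
    now = time.time() if now is None else now
    # Assumption: the last entry of RequestTransition describes the current
    # status and UpdateTime is a Unix timestamp in seconds.
    last_transition = wflow['RequestTransition'][-1]
    elapsed = now - last_transition['UpdateTime']
    return elapsed > timeout_hours * 3600


# Hypothetical workflow document, for illustration only
wflow = {
    'RequestName': 'example_workflow',
    'RequestTransition': [
        {'Status': 'completed', 'UpdateTime': 1_000_000},
        {'Status': 'announced', 'UpdateTime': 1_003_600},
    ],
}
```

Passing `now` explicitly keeps the check deterministic, which makes it easy to cover with unit tests.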

Is it backward compatible (if not, which system it affects?)

YES

Related PRs

services_config related changes:

https://gitlab.cern.ch/cmsweb-k8s/services_config/-/merge_requests/174
https://gitlab.cern.ch/cmsweb-k8s/services_config/-/merge_requests/175
https://gitlab.cern.ch/cmsweb-k8s/services_config/-/merge_requests/176

External dependencies / deployment changes

None

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 2 new failures
    • 1 tests no longer failing
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 7 warnings
    • 53 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 2 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13605/artifact/artifacts/PullRequestReport.html

@amaltaro (Contributor) left a comment

Todor, I left some comments and questions along the code.

statusAdvanceTimeout = self.msConfig['statusAdvanceTimeout'] * 3600

if status == None:
    status = wflow['RequestTransition'][-1]['Status']
Contributor

Might be safer to wrap this with a try/except block.
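
One way the suggested guard could look, as a sketch only. The helper name is hypothetical, and returning `None` on failure is an assumption about how the caller would want to handle a malformed record.

```python
def get_last_status(wflow):
    """Return the most recent status, or None if the transition record is unusable."""
    try:
        return wflow['RequestTransition'][-1]['Status']
    except (KeyError, IndexError, TypeError):
        # KeyError: a key is missing; IndexError: the transition list is empty;
        # TypeError: the field is not a list/dict (e.g. None)
        return None
```

Catching only these three exception types keeps unrelated errors visible instead of silently swallowing them.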

alertName = "{}: Stale Workflow: {}".format(self.alertServiceName, wflow['RequestName'])
alertSeverity = "high"
alertSummary = "[MSRuleCleaner] Found a stale workflow."
alertDescription = "\nWorkflow: {} has exceeded the Status Advance Timeout of: {} for status: {}.".format(wflow['RequestName'])
Contributor

Error, too few arguments.
And for the line below, I am not sure whether we have a limit on the number of characters for these alerts. The additionalInfo might be too much information for the alert.
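
The "too few arguments" failure can be reproduced in isolation. The sketch below reuses the template from the reviewed line; the workflow name, timeout value, and status are made-up sample values.

```python
template = "\nWorkflow: {} has exceeded the Status Advance Timeout of: {} for status: {}."

error = None
try:
    # One argument for three placeholders: str.format raises IndexError
    template.format("example_workflow")
except IndexError as exc:
    error = exc

# Supplying one value per placeholder (sample values, for illustration only)
description = template.format("example_workflow", 7 * 24 * 3600, "announced")
```

Since the error only surfaces when the alert is actually built, a unit test that exercises the alert path would catch it before deployment.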

Contributor Author

Error, too few arguments.

I'll fix that.

The additionalInfo might be too much information for the alert.

Well, I think it is actually important. It contains the bit that lets you distinguish between all the possible use cases which could trigger the alarm. This would shorten the debugging process significantly.

Contributor

Ok, but please check with Valentin and/or the CMS Monitoring team on whether there are any limitations on the alert size (or on any of its attributes, e.g. summary, description, etc.). I remember hitting such a limit back in the days when we had 100s of rule numbers in one of these alert attributes.
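
If a size limit does turn out to exist, one defensive option is to cap the text before sending it. This is a sketch only: the helper is hypothetical and the 1000-character default is a placeholder, not a known limit of the alerting system.

```python
def truncate_for_alert(text, max_chars=1000, marker=" [truncated]"):
    """Shorten text to at most max_chars characters, ending with a marker if it was cut.

    Assumes max_chars is at least len(marker).
    """
    if len(text) <= max_chars:
        return text
    return text[:max_chars - len(marker)] + marker
```

The marker makes it obvious to whoever reads the alert that details were dropped and the service logs should be consulted.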

@@ -545,12 +597,18 @@ def archive(self, wflow):
        # Make all the needed checks before trying to archive
        if not (wflow['IsClean'] or wflow['ForceArchive']):
            msg = "Not properly cleaned workflow: %s" % wflow['RequestName']
            if self._isStatusAdvanceExpired(wflow, wflow['RequestStatus']):
                self.alertStatusAdvanceExpired(wflow, msg)
Contributor

Can't we do this check only once, in the upstream method (after catching these exceptions)?

Contributor Author

I was thinking the same at the beginning, but then we lose the additional info provided by the message, which contains really important information about why the workflow is stuck and not progressing, namely at which point it was ejected from the pipeline.

Contributor

You raise an exception and you provide the error message. So you could catch it upstream and make it part of the error message (as we are discussing in a different comment).
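
A sketch of the pattern being proposed: the archival step raises with its message, and a single upstream caller catches the exception and forwards the message to the alert. `ArchiveError`, `process`, and the injected `is_expired` / `send_alert` callables are hypothetical names used only for illustration.

```python
class ArchiveError(Exception):
    """Hypothetical exception raised when a workflow cannot be archived."""


def archive(wflow):
    # Make all the needed checks before trying to archive
    if not (wflow['IsClean'] or wflow['ForceArchive']):
        raise ArchiveError("Not properly cleaned workflow: %s" % wflow['RequestName'])
    return True


def process(wflow, is_expired, send_alert):
    """Upstream caller: the only place that decides whether to alert."""
    try:
        return archive(wflow)
    except ArchiveError as exc:
        # The exception message carries the reason the workflow left the pipeline
        if is_expired(wflow):
            send_alert(wflow, additional_info=str(exc))
        return False
```

With this layout the expiry check appears once, while the alert still reports the exact point at which the workflow was rejected.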

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 2 new failures
    • 1 tests no longer failing
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 10 warnings and errors that must be fixed
    • 7 warnings
    • 51 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 2 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13779/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 2 new failures
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 10 warnings and errors that must be fixed
    • 7 warnings
    • 51 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 2 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13781/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 3 new failures
    • 1 tests no longer failing
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 10 warnings and errors that must be fixed
    • 7 warnings
    • 51 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 2 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13780/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 3 new failures
    • 1 tests no longer failing
  • Python3 Pylint check: failed
    • 9 warnings and errors that must be fixed
    • 7 warnings
    • 54 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 1 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13782/artifact/artifacts/PullRequestReport.html

@todor-ivanov todor-ivanov force-pushed the bugfix_MSRuleCleaner_WorkflowsStuck_fix-11094 branch from 51466e6 to 8ed856a Compare December 1, 2022 13:29
@todor-ivanov todor-ivanov requested a review from amaltaro December 1, 2022 13:31
@todor-ivanov
Contributor Author

Hi @amaltaro,

I have addressed your comments. The alarm was again tested manually with one old workflow. You should have received it in your mailbox as well.

@amaltaro (Contributor) left a comment

Todor, please find a partial review along the code.

@todor-ivanov todor-ivanov force-pushed the bugfix_MSRuleCleaner_WorkflowsStuck_fix-11094 branch from 2e8f3c4 to 79d3f1f Compare December 1, 2022 15:19
@todor-ivanov todor-ivanov requested a review from amaltaro December 1, 2022 15:19
@todor-ivanov
Contributor Author

@amaltaro Please find the changes addressing your comments in my latest commit (already squashed, actually).

@amaltaro
Contributor

amaltaro commented Dec 1, 2022

test this please

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 4 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 7 warnings
    • 53 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13816/artifact/artifacts/PullRequestReport.html

@amaltaro (Contributor) left a comment

@todor-ivanov I won't be able to review it for another hour, but I wanted to leave a strong recommendation here to avoid code duplication.

Instead of using:

            if self._isStatusAdvanceExpired(wflow):
                self.alertStatusAdvanceExpired(wflow, additionalInfo=msg)

I would suggest actually calling _isStatusAdvanceExpired from inside the alertStatusAdvanceExpired method, at the very beginning.

This would give a fair simplification of these changes.

In addition to that, we could use some basic unit tests for the two new methods.
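
A sketch of what that simplification could look like. `RuleCleanerSketch` is a hypothetical stand-in, not the real MSRuleCleaner class: it collects alerts in a list instead of talking to an alert service, and it assumes `UpdateTime` is an epoch timestamp in seconds. The `now` parameter is added here only to make the example testable.

```python
import time


class RuleCleanerSketch:
    """Hypothetical stand-in for the service class, for illustration only."""

    def __init__(self, timeout_secs):
        self.timeout_secs = timeout_secs
        self.sent = []  # collected alerts, in place of a real alert client

    def _isStatusAdvanceExpired(self, wflow, now=None):
        now = time.time() if now is None else now
        last_update = wflow['RequestTransition'][-1]['UpdateTime']
        return now - last_update > self.timeout_secs

    def alertStatusAdvanceExpired(self, wflow, additionalInfo=None, now=None):
        # Guard clause: callers no longer need their own expiry check
        if not self._isStatusAdvanceExpired(wflow, now=now):
            return False
        self.sent.append((wflow['RequestName'], additionalInfo))
        return True
```

Every call site then reduces to a single `self.alertStatusAdvanceExpired(wflow, additionalInfo=msg)` line, and both methods can be unit tested without any external service.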

@amaltaro (Contributor) left a comment

In addition to my comments above, I left a couple of extra comments along the code.

Todor, as we discussed yesterday, it would be useful to have the ability to enable/disable these alerts via configuration. This is the parameter we use for the other microservices (but you can default it to True such that no updates are required to the configuration file):
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/MicroService/MSOutput/MSOutput.py#L98
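
A minimal sketch of such a switch, assuming the flag is called `sendNotification` (the actual key used by the other microservices may differ) and that the service configuration behaves like a plain dictionary. Both helper functions are hypothetical.

```python
def apply_defaults(ms_config):
    """Fill in optional settings without overriding values already present."""
    # Assumption: the flag name is 'sendNotification'; defaulting to True means
    # existing configuration files need no update.
    ms_config.setdefault('sendNotification', True)
    return ms_config


def maybe_send_alert(ms_config, send, payload):
    """Call send(payload) only when notifications are enabled."""
    if not ms_config['sendNotification']:
        return False
    send(payload)
    return True
```

Using `setdefault` keeps an explicit `False` in the configuration file intact, so operators can still silence the alerts per deployment.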

@todor-ivanov todor-ivanov force-pushed the bugfix_MSRuleCleaner_WorkflowsStuck_fix-11094 branch from 6f20ae2 to e98ff86 Compare December 2, 2022 10:49
@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 tests no longer failing
    • 2 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 7 warnings
    • 53 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13821/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 tests no longer failing
    • 2 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 7 warnings
    • 53 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13820/artifact/artifacts/PullRequestReport.html

@todor-ivanov todor-ivanov requested a review from amaltaro December 2, 2022 11:31
@todor-ivanov
Contributor Author

@amaltaro please take another look

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests no longer failing
    • 2 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 337 warnings and errors that must be fixed
    • 7 warnings
    • 65 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13822/artifact/artifacts/PullRequestReport.html

@amaltaro (Contributor) left a comment

Todor, please see comments inside the alert methods.

Rename SatusAdvanceTimeout && Change _isStatusAdvanceExpired signature && Fix log messages.

Fix Alert call.

Fix _isStatusAdvanceExpired call.

Typo

Query only for last status transition time && Fix alarm text.

Move the call to _isStatusAdvanceExpired to a single place.

Add configuration flag for enabling alarms from msConfig.

Rename _getStatusTransitionTime

Move SendNotification flag to the end of the alert methods.
Unit tests - pylint fixes
@todor-ivanov todor-ivanov force-pushed the bugfix_MSRuleCleaner_WorkflowsStuck_fix-11094 branch from f7060c6 to 6499215 Compare December 2, 2022 12:22
@todor-ivanov todor-ivanov requested a review from amaltaro December 2, 2022 12:23
@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 2 new failures
    • 1 tests no longer failing
    • 2 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 7 warnings
    • 65 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13823/artifact/artifacts/PullRequestReport.html

@amaltaro (Contributor) left a comment

Unit test failures are all unstable. Thanks for these changes Todor, they look good to me.

@amaltaro amaltaro merged commit 00ce5cd into dmwm:master Dec 2, 2022

Successfully merging this pull request may close these issues.

Workflows stuck in announced status for a year
3 participants