Add alert to MSRuleCleaner for not archived workflows. #11373


todor-ivanov
Contributor

@todor-ivanov todor-ivanov commented Nov 28, 2022

Fixes #11094
Supersedes #11299

Status

Ready

Description

This PR adds an alarm to MSRuleCleaner that raises an alert for workflows that have been stuck and not archived for longer than a configurable amount of time. The threshold is read from msConfig['archiveAlarmHours']. The check is performed only for workflows sitting in the announced status.
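For illustration only, the check described above could be sketched roughly as follows. This is not the code from this PR: the function names are hypothetical, and the sketch assumes that a workflow is a dictionary carrying a `RequestStatus` string and a `RequestTransition` list of `{"Status", "UpdateTime"}` records with `UpdateTime` in epoch seconds.

```python
import time

ANNOUNCED = "announced"


def hours_in_announced(workflow, now=None):
    """Return the hours since the workflow last entered 'announced'.

    Returns None if the workflow is not currently in 'announced' or has no
    matching transition record. The dictionary layout is an assumption.
    """
    now = time.time() if now is None else now
    if workflow.get("RequestStatus") != ANNOUNCED:
        return None
    stamps = [tr["UpdateTime"]
              for tr in workflow.get("RequestTransition", [])
              if tr.get("Status") == ANNOUNCED]
    if not stamps:
        return None
    # Use the most recent transition into 'announced'.
    return (now - max(stamps)) / 3600.0


def needs_archive_alert(workflow, ms_config, now=None):
    """True if the workflow has sat in 'announced' past the configured limit."""
    hours = hours_in_announced(workflow, now=now)
    return hours is not None and hours > ms_config["archiveAlarmHours"]
```

A caller would pass the service configuration, e.g. `needs_archive_alert(wflow, {"archiveAlarmHours": 24})`, and send the alert only when the result is true.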

Is it backward compatible (if not, which system it affects?)

YES

Related PRs

service_config related changes:

https://gitlab.cern.ch/cmsweb-k8s/services_config/-/merge_requests/174
https://gitlab.cern.ch/cmsweb-k8s/services_config/-/merge_requests/175
https://gitlab.cern.ch/cmsweb-k8s/services_config/-/merge_requests/176

External dependencies / deployment changes

None

@todor-ivanov
Contributor Author

test this please

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 2 tests no longer failing
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 7 warnings
    • 42 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13758/artifact/artifacts/PullRequestReport.html

@todor-ivanov todor-ivanov force-pushed the bugfix_WorflowsStuckAnnounedAlarm_fix-11094 branch from 3021f4e to 87630b3 Compare November 29, 2022 13:52
@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 tests no longer failing
    • 1 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 7 warnings
    • 42 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13759/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 2 new failures
    • 2 tests no longer failing
    • 2 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 7 warnings
    • 42 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13760/artifact/artifacts/PullRequestReport.html

@todor-ivanov
Contributor Author

Hi @amaltaro,
Could you take a look at this PR? You may like this one better than #11299. It is definitely shorter and simpler, hence easier to maintain. There are only two drawbacks/differences from the previous one:

  • We do not have detailed log messages, either in the alarm or in the logs, explaining why a workflow is stuck and not archived. Instead, we print the whole MSRuleCleaner workflow description in the alarm and let the user analyze the issue and draw their own conclusion. That should be fine, I think.
  • We cannot cover the cases where the workflow is stuck because of a constant failure at the step of being cleaned from LogDB. But this should be OK as well.

@todor-ivanov todor-ivanov force-pushed the bugfix_WorflowsStuckAnnounedAlarm_fix-11094 branch from 87630b3 to 940e5b4 Compare November 29, 2022 16:20
@todor-ivanov
Contributor Author

todor-ivanov commented Nov 29, 2022

Hi @amaltaro, you may proceed with your review. The code is ready. I forced the archive pipeline to run on one stale workflow from testbed, and the proper alarm reached its final destination. I think you should be able to find it in your mail as well.

Contributor

@amaltaro amaltaro left a comment


@todor-ivanov I think we need to find the implementation with the best cost-benefit ratio (aka something good enough).
This PR provides an abstract alarm, just saying "go and figure out why it's stuck". In addition to that, it brings in a bunch of useless information (like the whole RequestTransition data structure).

The other PR indeed has many more LOC, but it does provide a clue of what is wrong, e.g.:

  • tape transfers pending
  • parentage has not been resolved
  • general exception, etc.

I would be in favor of following up on #11299 and making sure we have a "good enough" error message in the alert. It does not need to be complete, but it should at least give a high-level cause for the workflow being stuck. In addition, there are still quite a few comments in there that need to be worked on, so we might be able to shorten that development.
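As a rough sketch of what such a "good enough" message could look like (not the implementation in #11299), the alert text can be reduced to one high-level cause instead of a full dump of the workflow record. The function names are hypothetical, and the `ParentageResolved`, `TransferTape` and `RequestName` keys are assumptions about the workflow description used here for illustration.

```python
def stuck_reason(wflow):
    """Return a short, high-level reason why a workflow is not archived.

    The flag names are assumptions for illustration; a missing flag is
    treated as 'not done yet'.
    """
    if not wflow.get("ParentageResolved", False):
        return "parentage has not been resolved"
    if not wflow.get("TransferTape", False):
        return "tape transfers pending"
    return "unknown cause, check the service logs"


def alert_summary(wflow, hours_stuck):
    """Build a one-line alert message with the high-level cause."""
    name = wflow.get("RequestName", "<unknown>")
    return "Workflow %s not archived after %.0f hours: %s" % (
        name, hours_stuck, stuck_reason(wflow))
```

The point of the design is that the alert carries a single readable cause, while the detailed record stays in the logs for whoever follows up.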

@amaltaro
Contributor

Given that the other PR implementation (#11299) has been merged, I am closing this one out.

@amaltaro amaltaro closed this Dec 13, 2022

Successfully merging this pull request may close these issues.

Workflows stuck in announced status for a year