Create alert for pileup containers eligible for rule deletion #11264

Merged (1 commit) on Aug 31, 2022

Conversation

amaltaro
Contributor

@amaltaro amaltaro commented Aug 31, 2022

Fixes #11216

Status

not-tested

Description

This PR provides the following changes:

  • pileup container rules are no longer removed by MSRuleCleaner
  • instead, it creates a new alert, to be routed to Slack and email, stating that workflow AAA has a container BBB eligible for deletion and listing the rule ids. This alert is supposed to expire within 2 days.
  • in addition, the AlertManager object has been moved to the superclass (MSCore), avoiding code duplication in the microservices.
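
To illustrate the last point, here is a minimal sketch of the pattern: the base class owns the alert client and exposes a single sendAlert method, so each microservice only builds its own message. All class names and the recording client are hypothetical stand-ins, not the actual WMCore code.

```python
class RecordingAlertClient:
    """Hypothetical stand-in for an alert client: it only records alerts."""

    def __init__(self):
        self.sent = []

    def send(self, **alert):
        self.sent.append(alert)


class CoreService:
    """Hypothetical base class playing the role that MSCore plays in this PR."""

    def __init__(self, serviceName, alertClient=None):
        self.alertServiceName = serviceName
        # The alert client is created once here instead of in every subclass.
        self.alertClient = RecordingAlertClient() if alertClient is None else alertClient

    def sendAlert(self, name, severity, summary, description, endSecs):
        self.alertClient.send(name=name, severity=severity, summary=summary,
                              description=description,
                              service=self.alertServiceName, endSecs=endSecs)


class RuleCleanerService(CoreService):
    """Hypothetical subclass: it only needs to compose the alert text."""

    def alertDeletablePU(self, workflowName, containerName, ruleList):
        summary = "Pileup container eligible for rule deletion"
        description = "Workflow {} has pileup container {}. ".format(workflowName, containerName)
        description += "These rules\n{}\nare eligible for deletion.".format(ruleList)
        # 2 days, the lifetime mentioned in the description above
        self.sendAlert("deletablePileup", "high", summary, description,
                       endSecs=2 * 24 * 60 * 60)
```

With this layout, adding an alert to another microservice means writing one more small method like alertDeletablePU, without touching the client setup.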

Is it backward compatible (if not, which system it affects?)

Other than the new alert, yes.

Related PRs

None

External dependencies / deployment changes

Configuration services_config changes:
https://gitlab.cern.ch/cmsweb-k8s/services_config/-/merge_requests/155
https://gitlab.cern.ch/cmsweb-k8s/services_config/-/merge_requests/156
https://gitlab.cern.ch/cmsweb-k8s/services_config/-/merge_requests/157

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 2 new failures
    • 1 tests no longer failing
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 13 warnings and errors that must be fixed
    • 30 warnings
    • 209 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 23 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13561/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor Author

test this please

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests no longer failing
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 13 warnings and errors that must be fixed
    • 30 warnings
    • 209 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 23 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13562/artifact/artifacts/PullRequestReport.html

@amaltaro amaltaro requested review from todor-ivanov, khurtado and vkuznet and removed request for khurtado August 31, 2022 15:27
if ruleIds and dataType in ("MCPileup", "DataPileup"):
    msg = "Pileup container %s has the following container-level rules to be removed: %s."
    msg += " However, this component is no longer removing pileup rules."
    self.logger.info(msg, dataCont, rule['id'])
Contributor

If I read the code correctly, rule['id'] is not defined here. You have ruleIds but not rule['id'] in this set of changes.

Contributor Author

Good catch, Valentin. I am fixing it here and in your comment below.

    self.logger.info(msg, dataCont, rule['id'])
    self.alertDeletablePU(wflow['RequestName'], dataCont, ruleIds)
elif ruleIds:
    wflow['RulesToClean'][currPline].extend(rule['id'])
Contributor

And rule['id'] is not defined here either, since you removed the for loop.

elif ruleIds:
    wflow['RulesToClean'][currPline].extend(rule['id'])
    msg = "Container %s has the following container-level rules to be removed: %s"
    self.logger.info(msg, dataCont, rule['id'])
Contributor

And not here either.
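
For reference, the three review comments above point at the same problem: once the for loop is gone, only the ruleIds list is in scope. A minimal sketch of the branch using ruleIds throughout could look like the following. The function name, its arguments and the alert callback are hypothetical; this is not the code that was eventually merged.

```python
import logging


def handleContainerRules(ruleIds, dataType, dataCont, rulesToClean, alertFunc, logger):
    """
    Hypothetical helper: pileup containers only trigger an alert,
    any other container has its rule ids queued for deletion.
    """
    if not ruleIds:
        return
    if dataType in ("MCPileup", "DataPileup"):
        msg = "Pileup container %s has the following container-level rules to be removed: %s."
        msg += " However, this component is no longer removing pileup rules."
        logger.info(msg, dataCont, ruleIds)
        alertFunc(dataCont, ruleIds)
    else:
        # ruleIds is a list, so extend() adds each rule id as one element
        rulesToClean.extend(ruleIds)
        msg = "Container %s has the following container-level rules to be removed: %s"
        logger.info(msg, dataCont, ruleIds)
```

Note that extend() only behaves as intended when given the list; passing a single rule id string would add its characters one by one.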

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 tests no longer failing
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 13 warnings and errors that must be fixed
    • 30 warnings
    • 209 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 23 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13563/artifact/artifacts/PullRequestReport.html

alertDescription += "These rules\n{}\nare eligible for deletion.".format(ruleList)
# alert to expire in 2 days
self.sendAlert(alertName, alertSeverity, alertSummary, alertDescription,
               service=self.alertServiceName, endSecs=2*24*60*60)
Contributor

Please do not hard-code the value for endSecs; make it configurable instead. That makes it easy to tune the alert lifetime, shrinking or extending it once you get more feedback from operations.
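
One way to follow this suggestion is to read the lifetime from the service configuration and fall back to the current 2-day value. The sketch below assumes the configuration is available as a dictionary; the key name alertExpirSecs and the helper itself are hypothetical.

```python
# 2 days, the value that was hard-coded in the diff above
DEFAULT_ALERT_EXPIRATION_SECS = 2 * 24 * 60 * 60


def getAlertExpiration(msConfig, key="alertExpirSecs"):
    """
    Return the alert lifetime in seconds from a configuration dictionary,
    using the 2-day default when the (hypothetical) key is not set.
    """
    value = int(msConfig.get(key, DEFAULT_ALERT_EXPIRATION_SECS))
    if value <= 0:
        raise ValueError("Alert expiration must be a positive number of seconds")
    return value
```

The call site would then pass endSecs=getAlertExpiration(config) rather than a literal, so operations can change the lifetime without a code release.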

@amaltaro
Contributor Author

@vkuznet Valentin, I made the expiration time configurable. However, it won't be of much help if we start adding multiple alerts to this microservice.

Can you please remind me whether anything needs to change in either CMSKubernetes or CMSMonitoring to get these alerts posted to Slack and/or email?

@amaltaro amaltaro requested a review from vkuznet August 31, 2022 17:13
fix Valentins comments

Make alert expiration configurable

changed argument name
@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 3 new failures
    • 1 tests no longer failing
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 13 warnings and errors that must be fixed
    • 30 warnings
    • 209 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 20 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13564/artifact/artifacts/PullRequestReport.html

@vkuznet
Contributor

vkuznet commented Aug 31, 2022

@amaltaro if you need to deal with multiple alerts within different module(s), then you should pass the expiration through the API and let the caller provide it.

Regarding configuration, nothing needs to be changed on the Prometheus side, but you should define rules for AlertManager, which are defined here (in general, you can find them by visiting cms-monitoring.cern.ch, clicking on Documentation, finding AlertManager and following the provided link to the configuration page). That said, you need to define a few things to get the alert propagated to your channel(s):

  • define routes to match your alert tag, e.g. the wmcore tag will redirect the alert to the dmwm-admins receiver
    • please note that we have different routes, one for general alerts and another for cmsweb; you should decide where your alert falls, and I think it belongs in the general category
  • define the receiver of the alert, which can contain emails or a Slack channel
  • if you intend to use a Slack channel, you need to define it within the CMSMonitoring Slack group (send a Jira request to CMSMonitoring to create a new channel)
    • for some channels we provide the send_resolved: true flag, which sends a notification when the conditions are cleared. Since you create alerts manually (outside of Prometheus metrics), you DO NOT need this flag.

Finally, once the AM configuration is adjusted (this should be done by a CMS Monitoring team member), the new configuration should be deployed to the CMS monitoring clusters and the AM pods should be restarted. For that, I suggest you create a Jira ticket. You also need to decide which HTTP URL you'll send the alert to. This is defined in your alertManagerUrl. Please note that we have 3 clusters, the main one and two HA ones. Each cluster has Prometheus and AM. As such, you may use the default one or, if you want HA mode, send the alert to the HA URLs as well.
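
As a rough illustration of the last paragraph, the sketch below builds one alert in the JSON shape accepted by the Alertmanager v2 endpoint (/api/v2/alerts) and prepares the same request body for several clusters. The helper names, the label names (including tag) and the example URLs are assumptions, not the CMS Monitoring setup; the actual HTTP POST is left out.

```python
import json
from datetime import datetime, timedelta, timezone


def buildAlert(name, severity, tag, summary, description, service, endSecs, now=None):
    """Build one alert dictionary with RFC 3339 start and end timestamps."""
    if now is None:
        now = datetime.now(timezone.utc)
    return {
        # labels are what the AlertManager routes are matched against
        "labels": {"alertname": name, "severity": severity,
                   "tag": tag, "service": service},
        "annotations": {"summary": summary, "description": description},
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(seconds=endSecs)).isoformat(),
    }


def alertRequests(alert, baseUrls):
    """Return (url, jsonBody) pairs, one per AlertManager instance."""
    body = json.dumps([alert])  # the endpoint expects a JSON list of alerts
    return [(url.rstrip("/") + "/api/v2/alerts", body) for url in baseUrls]
```

For example, alertRequests(alert, ["https://am-main.example.org", "https://am-ha1.example.org"]) yields two requests with identical bodies, one for the default cluster and one for an HA cluster.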

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 2 new failures
    • 1 tests no longer failing
    • 3 changes in unstable tests
  • Python3 Pylint check: failed
    • 13 warnings and errors that must be fixed
    • 30 warnings
    • 209 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 20 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13565/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor Author

test this please

@amaltaro
Contributor Author

Thank you for this information, Valentin. I have added some of it to this wiki page (feel free to improve it if you want):
https://github.com/dmwm/WMCore/wiki/Prometheus-AlertManager-wrapper-APIs#alert-routing

Looking at the rules, I think this new notification is already covered by the general slack/email notifications created in WMCore.

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests no longer failing
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 13 warnings and errors that must be fixed
    • 30 warnings
    • 209 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 20 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13566/artifact/artifacts/PullRequestReport.html

Development

Successfully merging this pull request may close these issues.

MSRuleCleaner shouldn't delete pileup input rules during active campaign period
3 participants