-
Notifications
You must be signed in to change notification settings - Fork 108
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create alert for pileup containers eligible for rule deletion #11264
Conversation
Jenkins results:
|
test this please |
Jenkins results:
|
if ruleIds and dataType in ("MCPileup", "DataPileup"): | ||
msg = "Pileup container %s has the following container-level rules to be removed: %s." | ||
msg += " However, this component is no longer removing pileup rules." | ||
self.logger.info(msg, dataCont, rule['id']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if I read code correctly, the rule['id']
is not defined here. You have ruleIds
but not rule['id']
in this set of changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, Valentin. I am fixing it here and in your comment below.
self.logger.info(msg, dataCont, rule['id']) | ||
self.alertDeletablePU(wflow['RequestName'], dataCont, ruleIds) | ||
elif ruleIds: | ||
wflow['RulesToClean'][currPline].extend(rule['id']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and rule['id']
is not defined here either as you removed for loop
elif ruleIds: | ||
wflow['RulesToClean'][currPline].extend(rule['id']) | ||
msg = "Container %s has the following container-level rules to be removed: %s" | ||
self.logger.info(msg, dataCont, rule['id']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and neither here
Jenkins results:
|
alertDescription += "These rules\n{}\nare eligible for deletion.".format(ruleList) | ||
# alert to expiry in 2 days | ||
self.sendAlert(alertName, alertSeverity, alertSummary, alertDescription, | ||
service=self.alertServiceName, endSecs=2*24*60*60) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please do not hard-code values for endSecs
and make them configurable. This will allow to easily tune up alert lifetime, and either shrink or expand it once you get more feedback from operations.
@vkuznet Valentin, I made the expiration time configurable. However, it won't be much helpful if we start adding multiple alerts to this microservice. Can you please remind me if anything needs to change either in CMSKubernetes or CMSMonitoring to get these alerts getting posted to slack and/or email? |
fix Valentins comments Make alert expiration configurable changed argument name
Jenkins results:
|
@amaltaro if you will need to deal with multiple alerts within different module(s) then you should pass expiration through the API and let caller to provide the expiration. Regarding configuration, nothing need to be changed on Prometheus side, but you should define rules for AlertManager which are defined here (in general you can find it by visiting cms-monitoring.cern.ch, then click on Documentation, then find AlertManager and follow provided link to configuration page). Said that, you need to define few stuff to make alert propagated to your channel(s):
Finally, when AM configuration is adjusted (it should be done by CMS Monitoring team member), the new configuration should be deployed to CMS monitoring clusters and AM pods should be restarted. For that I suggest you create Jira ticket. You also need to decide to which HTTP URL you'll send the alert. This is defined in your |
Jenkins results:
|
test this please |
Thank you for this information, Valentin. I have added some of this information in this wiki page (feel free to improve it if you want): Looking at the rules, I think this new notification is already covered by the general slack/email notifications created in WMCore. |
Jenkins results:
|
Fixes #11216
Status
not-tested
Description
This PR provides the following changes:
Is it backward compatible (if not, which system it affects?)
Other than the new alert, yes.
Related PRs
None
External dependencies / deployment changes
Configuration services_config changes:
https://gitlab.cern.ch/cmsweb-k8s/services_config/-/merge_requests/155
https://gitlab.cern.ch/cmsweb-k8s/services_config/-/merge_requests/156
https://gitlab.cern.ch/cmsweb-k8s/services_config/-/merge_requests/157