
MSRuleCleaner shouldn't delete pileup input rules during active campaign period #11216

Closed
haozturk opened this issue Jul 14, 2022 · 7 comments · Fixed by #11264

Comments

@haozturk

Impact of the new feature
MSRuleCleaner

Is your feature request related to a problem? Please describe.

Yes. MSRuleCleaner deletes input rules when there is no active workflow using them. This logic isn't suitable for pileup input rules: pileup placements are often quite challenging due to their size and the limited set of sites that can host them, and their wmcore_transferor rules often get deleted by MSRuleCleaner when the campaign is on pause, i.e. it has no workflow at a given time, so we need to place them again when the campaign resumes. We experienced this 4 or 5 times during the recent PhaseII campaign, which required considerable manual effort.

Describe the solution you'd like
MSRuleCleaner shouldn't delete pileup input rules during the active campaign period. I define the active campaign period as follows:

The time from the campaign creation to campaign deletion

We shouldn't confuse campaign creation and deletion with enabling and disabling a campaign. Enabling and disabling a campaign is managed by the go parameter of the unified campaign configuration, and it controls workflow assignment. We might need a pileup while the campaign is disabled. For instance, during the campaign testing period (pilot period), the pileup needs to be on disk for pilot testing, and the campaign stays disabled until the pilot succeeds. For this reason, I suggested campaign creation and deletion, which simply means creating and deleting the WMCore campaign config, e.g. [1]. P&R is responsible for creating and deleting campaigns, and we should be able to control pileup lifetime through campaigns.

Note that we haven't been giving enough attention to deleting campaigns from the WMCore DB. So, if this feature request gets implemented, we'll need to be very attentive towards campaign deletions.

[1] https://cmsweb.cern.ch/reqmgr2/data/campaignconfig/Commissioning2020
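
The proposed lifecycle check could be sketched as follows. This is a minimal sketch under the assumptions above; the function name `is_pileup_rule_deletable` and its arguments are hypothetical, and a real implementation would look up the campaigns via the WMCore campaign config (e.g. the reqmgr2 campaignconfig endpoint in [1]):

```python
def is_pileup_rule_deletable(pileup_campaigns, existing_campaigns):
    """Hypothetical check for MSRuleCleaner: keep a pileup input rule
    alive as long as any campaign using that pileup still exists in
    WMCore, regardless of whether the campaign is enabled ('go')."""
    return not any(camp in existing_campaigns for camp in pileup_campaigns)
```

The key point of the sketch is that only campaign deletion (removal of the WMCore campaign document), not the go flag, makes the pileup rule eligible for cleanup.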

Describe alternatives you've considered
None so far; happy to hear alternative ideas.

Additional context
You can read this ticket to understand the recent trouble that we had: https://its.cern.ch/jira/browse/CMSCOMPPR-25865

@amaltaro
Contributor

Thank you for creating this issue, Hasan.

As I read it, I have the impression that we could end up with pileup rules remaining in the system forever, since a campaign is not removed while there are active workflows in the system. In other words, pileup would only be eligible for deletion once all the workflows are out of the system.

We might have to put something else in front of this development, and that would be a redesign of the campaign document stored in WMCore (also including an "enabled: true/false" flag). We must also consider this ticket: #11034

@haozturk
Author

Yes, perhaps an alert that triggers when there is no active workflow reading the PU can help. Something like:

There is no active workflow in the system reading pileup X under campaigns Y, Z. Please consider deleting these campaigns unless they need to be resumed.

@amaltaro
Contributor

As discussed over the last few days, we are considering stopping pileup rule cleanup in MSRuleCleaner. Long-term development is discussed in a couple of other tickets, involving a campaign redesign and probably a new microservice.

For this issue, I would suggest to:

  • set up alerts/routing within CMS Prometheus
  • whenever we find pileup data (aka secondary data) eligible to be removed, skip it and instead make a Prometheus notification (email and Slack), as is done in other microservices.
  • the notification should contain:
    • the workflow name
    • the pileup name
    • if easy, a list of rule ids
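
The skip-and-notify flow above could be sketched roughly like this. The function and field names are hypothetical (this is not the actual WMCore alert wrapper API); it only illustrates a payload carrying the three requested fields:

```python
def build_pileup_alert(workflow, pileup, rule_ids):
    """Hypothetical notification payload with the fields requested
    above: the workflow name, the pileup name, and a list of rule ids."""
    return {
        "name": "ms-rulecleaner-pileup-%s" % pileup,
        "severity": "warning",
        "summary": "Pileup %s is eligible for rule deletion" % pileup,
        "description": ("Workflow %s was the last active workflow reading "
                        "pileup %s. Rules kept: %s"
                        % (workflow, pileup, ", ".join(rule_ids))),
    }
```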

@vkuznet Valentin, if you are still working tomorrow, do you think you could try to work on this before you leave on holidays?

@vkuznet
Contributor

vkuznet commented Aug 30, 2022

@amaltaro , I still have tasks (logging and monitoring) to finish with the DBSMigrate/Migration servers, and I would rather spend time on them. This issue is new to me in all aspects, as I do not have enough knowledge of where things should be implemented, under which conditions, how to test them, etc. Plus, your plan requires careful review, as things can go wild if the creation of alerts scales up with the number of workflows. As such, the alert system may be overloaded (not in terms of resources, which we can scale horizontally, but rather in terms of the volume of alerts). It is unclear to me that Prometheus is the right tool for that. For instance, alerts in Prometheus cannot be removed; they can only expire, and there is no API to delete them, as they are assigned a specific timestamp and duration. So far, neither the description nor your proposal gives me enough information to understand the whole flow and the details needed to decide whether Prometheus is the right choice. Instead, the NATS publish/subscribe model may be more suitable for this use case. For example, some workflows can publish the data, while others can subscribe to it and perform a certain action (e.g. send a notification). Therefore, even though the issue is marked as High Priority, I think proper thought about the correct design and architecture is required here.

@amaltaro
Contributor

@vkuznet Valentin, if I understand these alerts well, a new alert does not get created if one already exists in the system (alert name + severity + service + tag(?)). How long an alert lives depends on the expiration time; in this case it could be set to a day or two to avoid triggering another alert.

Workflows go through the MSRuleCleaner algorithm only once - in most cases - so we shouldn't be getting more than 1 alert per workflow, at most!
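
The deduplication-with-expiration behavior described above could be sketched as a small in-memory model. This is a hypothetical illustration of the idea (suppress an identical alert until the previous one expires), not the actual Alertmanager or WMCore implementation:

```python
import time

class AlertDeduper:
    """Hypothetical sketch: a new alert keyed by (name, severity) is
    suppressed while an identical, not-yet-expired one exists."""

    def __init__(self, expiration_seconds=2 * 24 * 3600):
        self.expiration = expiration_seconds
        self._last_sent = {}  # (name, severity) -> timestamp of last alert

    def should_send(self, name, severity, now=None):
        now = time.time() if now is None else now
        key = (name, severity)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.expiration:
            return False  # identical alert still alive; do not re-fire
        self._last_sent[key] = now
        return True
```

With a one- or two-day expiration, a workflow that is processed once would at most produce a single notification, matching the estimate above.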

The alert wrapper API has been implemented in this module:

and here is an example of how we use it in MSTransferor:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/MicroService/MSTransferor/MSTransferor.py#L572-L584

NATS is not the way forward here! We haven't integrated NATS in WMCore and I do not want to have yet another monitoring tool within WMCore.

With this information in hand, it should be pretty straightforward to implement. We might have to create a routing rule within CMSMonitoring though, which will likely take longer than implementing these changes in WMCore. If you still need more information, please let me know.

@vkuznet
Contributor

vkuznet commented Aug 30, 2022

Alan, you do not need to explain to me how to create alerts; in the end it is a REST API, and WMCore just wraps it. What I asked is different. Let me break it down into individual topics:

  • data volume, e.g. the number of alerts per time unit
    • how many per time interval, the frequency, etc.
    • the expected total number of alerts in the system, e.g. O(100), O(10K)
    • based on historical usage of the system, the approximate range of workflows at any given time
  • alert policies
    • you mention that the time can be a day or two, but we need a more precise definition
    • naming conventions, e.g. what is a list of rule ids (which rules, what are those ids, how many)?
  • routing rules, i.e. where alerts should be routed, e.g. regexp patterns to use for alert routing, etc.
  • where alerts should be defined: please clearly state which WMCore module and how
    • how such a module should read info about workflows
    • the workflow data structure to parse to create an alert, i.e. the schema of the data

None of this information is described in the ticket (it may be well known to you, but not to me or maybe others). Once this information is clearly provided, we can discuss whether it is suitable for Prometheus or whether we should use another system. Maybe we need to set up a dedicated Prometheus server and a dedicated AlertManager for this, to distinguish it from the CMS Monitoring metrics, etc.

@amaltaro
Contributor

MSRuleCleaner is a microservice that processes workflows in status closed-out or announced. Its implementation is available here:
https://github.com/dmwm/WMCore/tree/master/src/python/WMCore/MicroService/MSRuleCleaner

I think you can abstract the data volume; if it would create a large volume of notifications, I would not request it to be done this way. To answer it anyhow: to find a pileup that is eligible for deletion, this microservice needs to be processing the last active workflow that was using that pileup. So, worst case scenario, we would have a handful of notifications per week.
