MSRuleCleaner shouldn't delete pileup input rules during active campaign period #11216
Comments
Thank you for creating this issue, Hasan. As I read it, I have the impression that we could end up with pileup rules remaining in the system forever, since a campaign is not removed while there are active workflows in the system. In other words, pileup would only be eligible for deletion once all the workflows are out of the system. We might have to add something else ahead of this development, namely a redesign of the campaign document stored in WMCore (also including an "enabled: true/false" flag). We must also consider this ticket: #11034
Yes, perhaps an alert can help which will trigger when there is no active workflow reading the PU. Something like:
As has been discussed over the last few days, we are considering stopping pileup rule cleanup in MSRuleCleaner. Long-term development is discussed in a couple of other tickets, involving a campaign redesign and probably a new microservice. For this issue, I would suggest to:
@vkuznet Valentin, if you are still working tomorrow, do you think you could try to work on this before you leave on holidays?
@amaltaro, I still have tasks (logging and monitoring) to finish with the DBSMigrate/Migration servers, and I would rather spend time on them. This issue is new to me in all aspects, as I do not have enough knowledge of where things should be implemented, under which conditions, how to test them, etc. Plus, your plan requires a careful review, as things can go wild if the creation of alerts scales up with the number of workflows. In that case the alert system may be overloaded (not in terms of resources, which we can scale up horizontally, but rather in terms of the volume of alerts). It is unclear to me that Prometheus is the right tool for that. For instance, alerts in Prometheus cannot be removed, they can only expire; there is no API to delete them, as they are assigned a specific timestamp and duration. So far neither the description nor your proposal gives me enough information to understand the whole flow and the details needed to decide if Prometheus is the right choice. Instead, the NATS publication/subscription model may be more suitable for this use case. For example, some workflows can publish the data, while others can subscribe to it and perform a certain action (e.g. send a notification). Therefore, even though the issue is marked as High Priority, I think proper thought about the correct design and architecture is required here.
@vkuznet Valentin, if I understand these alerts well, a new alert does not get created if an identical alert already exists in the system (alert name + severity + service + tag(?)). How long an alert lives depends on its expiration time; in this case it could be set to a day or two to avoid triggering another alert. Workflows go through the MSRuleCleaner algorithm only once, in most cases, so we shouldn't be getting more than 1 alert per workflow, at most! The alert wrapper API has been implemented in this module:
and here is an example of how we use it in MSTransferor:
NATS is not the way forward here! We haven't integrated NATS in WMCore and I do not want to have yet another monitoring tool within WMCore. With this information in hand, it should be pretty straightforward to implement. We might have to create a routing rule within CMSMonitoring though, which is likely going to take longer than implementing these changes in WMCore. If you still need more information, please let me know.
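To make the deduplication argument above concrete, here is a minimal in-memory sketch of the behaviour being described: an alert is identified by (name, severity, service, tag), and a repeat within the expiration window does not create a new alert. This is only a toy model under those assumptions; `AlertStore` and its `send` method are hypothetical names and are not the WMCore alert wrapper or the Alertmanager API.

```python
import time
from dataclasses import dataclass, field


@dataclass
class AlertStore:
    """Toy model of alert deduplication (NOT the real Alertmanager)."""
    # Maps an alert key to the timestamp at which that alert expires.
    _expiry: dict = field(default_factory=dict)

    def send(self, name, severity, service, tag, ttl_secs, now=None):
        """Record an alert unless an identical, unexpired one exists.

        Returns True if a new alert was created, False if it was
        deduplicated against a live alert with the same key.
        """
        now = time.time() if now is None else now
        key = (name, severity, service, tag)  # assumed identity of an alert
        if self._expiry.get(key, 0) > now:
            return False
        self._expiry[key] = now + ttl_secs
        return True


# Example with hypothetical label values and a two-day expiration.
TWO_DAYS = 2 * 24 * 3600
store = AlertStore()
first = store.send("pileup-unused", "medium", "ms-rulecleaner", "wmcore", TWO_DAYS, now=0)
repeat = store.send("pileup-unused", "medium", "ms-rulecleaner", "wmcore", TWO_DAYS, now=3600)
after_expiry = store.send("pileup-unused", "medium", "ms-rulecleaner", "wmcore", TWO_DAYS, now=TWO_DAYS + 1)
```

With an expiration of a day or two, repeated passes over the same condition collapse into a single live alert, which is the property the comment relies on.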
Alan, you do not need to explain to me how to create alerts; in the end it is a REST API, and WMCore just wraps it. What I asked is different. Let me break it down into individual topics:
None of this information is described in the ticket (it may be well known to you, but not to me or maybe others). Once this information is clearly provided, we can discuss whether it is suitable for Prometheus or whether we should use another system. Maybe we need to set up a dedicated Prometheus server and a dedicated AM for this, to keep it separate from the CMS Monitoring metrics, etc.
MSRuleCleaner is a microservice that processes workflows in status I think you can abstract the data volume. If it would create a large volume of notifications, I would not request it to be done this way. Answering this though, to find a pileup that is eligible for deletion, this microservice needs to be processing the last active workflow that was using that pileup. So in the worst-case scenario, we would have a handful of notifications per week.
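The "last active workflow" condition can be sketched as follows. This is an illustration only: the workflow records are plain dictionaries with hypothetical `name` and `pileups` keys, not the actual MSRuleCleaner data structures, and the workflow and dataset names are made up.

```python
def pileup_still_in_use(pileup, current_workflow, active_workflows):
    """Return True if any OTHER active workflow still reads `pileup`.

    `active_workflows` is assumed to be a list of dicts with the
    hypothetical keys "name" (str) and "pileups" (collection of
    dataset names).
    """
    return any(
        pileup in wf["pileups"]
        for wf in active_workflows
        if wf["name"] != current_workflow
    )


# Made-up example: wf_A is being cleaned while wf_B still reads the pileup.
active = [
    {"name": "wf_A", "pileups": {"/PU/Example/PREMIX"}},
    {"name": "wf_B", "pileups": {"/PU/Example/PREMIX"}},
]
in_use_while_b_active = pileup_still_in_use("/PU/Example/PREMIX", "wf_A", active)
in_use_when_last = pileup_still_in_use("/PU/Example/PREMIX", "wf_A", active[:1])
```

A notification would only be raised in the second case, when the workflow being processed is the last one using the pileup, which is why the expected volume is low.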
Impact of the new feature
MSRuleCleaner
Is your feature request related to a problem? Please describe.
Yes. MSRuleCleaner deletes the input rules if there is no active workflow using them. This logic isn't suitable for pileup input rules, because pileup placements are often quite challenging due to their size and the limited set of sites that can host them, and oftentimes their wmcore_transferor rules get deleted by MSRuleCleaner when the campaign is on pause, i.e. it doesn't have any workflow at a given time, and we need to place them again when the campaign resumes. We experienced this 4 or 5 times during the recent PhaseII campaign, which required quite a lot of manual effort.
Describe the solution you'd like
MSRuleCleaner shouldn't delete pileup input rules during the active campaign period. I define the active campaign period as follows:
We shouldn't confuse campaign creation and deletion with enabling and disabling a campaign. Enabling and disabling a campaign is managed by the go parameter of the unified campaign configuration, and it controls workflow assignment. We might need a pileup while the campaign is disabled. For instance, during the campaign testing period (pilot period), the pileup needs to be on disk for pilot testing and the campaign stays disabled until the pilot succeeds. For this reason, I suggested campaign creation and deletion, which simply means creating and deleting the WMCore campaign config, e.g. [1]. P&R is responsible for creating and deleting campaigns, and we should be able to control the pileup lifetime through campaigns.
Note that we haven't been paying enough attention to deleting campaigns from the WMCore DB. So, if this feature request gets implemented, we'll need to be very attentive to campaign deletions.
[1] https://cmsweb.cern.ch/reqmgr2/data/campaignconfig/Commissioning2020
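As a rough sketch of the requested behaviour, assuming that "active campaign period" means that a WMCore campaign config for the campaign still exists: the function and argument names below are hypothetical and are not part of MSRuleCleaner, and `existing_campaigns` stands for whatever set of campaign names would be read from the campaign config records.

```python
def should_delete_input_rule(is_pileup, campaign, existing_campaigns,
                             has_active_workflows):
    """Decide whether an input rule may be deleted (illustrative only).

    - Any rule is kept while an active workflow still uses the data
      (the existing behaviour described above).
    - A pileup rule is additionally kept for as long as the campaign
      config exists, even if the campaign is paused or disabled.
    """
    if has_active_workflows:
        return False
    if is_pileup and campaign in existing_campaigns:
        return False
    return True


# Made-up campaign names: one config still exists, one has been deleted.
campaigns = {"ExampleCampaign"}
paused_pileup = should_delete_input_rule(True, "ExampleCampaign", campaigns, False)
paused_primary = should_delete_input_rule(False, "ExampleCampaign", campaigns, False)
deleted_campaign_pileup = should_delete_input_rule(True, "RetiredCampaign", campaigns, False)
```

Under this rule a paused campaign keeps its pileup on disk, and deleting the campaign config is what releases the pileup for cleanup, which is why timely campaign deletion becomes important.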
Describe alternatives you've considered
No, happy to hear alternative ideas.
Additional context
You can read this ticket to understand the recent trouble that we had: https://its.cern.ch/jira/browse/CMSCOMPPR-25865