Add try/except block for fetching PartialCopy parameter from campaign configuration. #11387

Conversation

todor-ivanov
Contributor

Fixes #11386

Status

ready

Description

When the campaign configuration for a given workflow transfer is missing at ReqMgr, MSMonitor aborts the current polling cycle, although the service itself keeps running. To prevent a single transfer error from causing all the workflows in that polling cycle to be skipped, this PR adds a try/except block around the step at which the campaign parameters for the transfer are obtained.
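
For illustration only, the idea can be sketched as below. This is not the actual diff of this PR: the function `select_transfers`, the `workflowName` key and the returned tuple layout are hypothetical, while `campaignName`, `PartialCopy` and the log message wording are taken from the PR title and the testbed logs.

```python
import logging

logger = logging.getLogger("MSMonitor")


def select_transfers(transfers, campaigns):
    """Split transfer records into usable and skipped ones.

    `transfers` is a list of dicts and `campaigns` maps a campaign name
    to its configuration dict (both shapes are assumptions of this sketch).
    """
    usable, skipped = [], []
    for transfer in transfers:
        try:
            # The lookup that used to abort the whole polling cycle
            cdict = campaigns[transfer['campaignName']]
            partial_copy = cdict['PartialCopy']
        except KeyError as exc:
            # Log with traceback and move on to the next transfer record
            logger.exception(
                "Missing or broken campaign configuration at ReqMgr "
                "for request: %s. Error: %s",
                transfer.get('workflowName'), exc)
            skipped.append(transfer)
            continue
        usable.append((transfer, partial_copy))
    return usable, skipped
```

With this shape, a record whose campaign is unknown is only logged and skipped, and the remaining records in the same cycle are still processed.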

Is it backward compatible (if not, which system it affects?)

YES

Related PRs

None

External dependencies / deployment changes

None

@todor-ivanov todor-ivanov requested a review from amaltaro December 7, 2022 09:19
@todor-ivanov
Contributor Author

I have just patched the running service in testbed, and the logs make it immediately obvious which workflows were affected by the missing campaign: [1]. While those were skipped, the rest of the workflows did manage to advance their status to staged: [2]

[1]

2022-12-07 10:23:43,930:ERROR:MSMonitor: Missing or broken campaign configuration at ReqMgr for request: amaltaro_TC_Nano_Agent212_Val_220902_230557_2309. Error: 'RunIIAutumn18NanoAODv5'
Traceback (most recent call last):
  File "/data/srv/HG2212c/sw/slc7_amd64_gcc630/cms/reqmgr2ms/1.1.5rc4/lib/python3.8/site-packages/WMCore/MicroService/MSMonitor/MSMonitor.py", line 234, in getCompletedWorkflows
    cdict = campaigns[transfer['campaignName']]
KeyError: 'RunIIAutumn18NanoAODv5'
2022-12-07 10:23:43,931:ERROR:MSMonitor: Missing or broken campaign configuration at ReqMgr for request: amaltaro_TC_Nano_Agent212_Val_220907_124517_9912. Error: 'RunIIAutumn18NanoAODv5'
Traceback (most recent call last):
  File "/data/srv/HG2212c/sw/slc7_amd64_gcc630/cms/reqmgr2ms/1.1.5rc4/lib/python3.8/site-packages/WMCore/MicroService/MSMonitor/MSMonitor.py", line 234, in getCompletedWorkflows
    cdict = campaigns[transfer['campaignName']]
KeyError: 'RunIIAutumn18NanoAODv5'
2022-12-07 10:23:43,931:ERROR:MSMonitor: Missing or broken campaign configuration at ReqMgr for request: amaltaro_TC_Nano_Agent212_Val_220908_221955_6505. Error: 'RunIIAutumn18NanoAODv5'
Traceback (most recent call last):
  File "/data/srv/HG2212c/sw/slc7_amd64_gcc630/cms/reqmgr2ms/1.1.5rc4/lib/python3.8/site-packages/WMCore/MicroService/MSMonitor/MSMonitor.py", line 234, in getCompletedWorkflows
    cdict = campaigns[transfer['campaignName']]
KeyError: 'RunIIAutumn18NanoAODv5'
...
2022-12-07 10:23:43,931:WARNING:MSMonitor: Not updating transfer record in CouchDB for: amaltaro_TC_Nano_Agent212_Val_220902_230557_2309
2022-12-07 10:23:43,931:WARNING:MSMonitor: Not updating transfer record in CouchDB for: amaltaro_TC_Nano_Agent212_Val_220907_124517_9912
2022-12-07 10:23:43,932:WARNING:MSMonitor: Not updating transfer record in CouchDB for: amaltaro_TC_Nano_Agent212_Val_220908_221955_6505
...

[2]

...
2022-12-07 10:23:45,910:INFO:ReqMgrAux: Update in-place: False for transfer doc: tivanov_ACDC_TC_PY3_TTbarPU_HG2212_Val_221205_085907_1259 was successful.
...
2022-12-07 10:23:48,016:INFO:MSCore: MSMonitor updating tivanov_ACDC_TC_PY3_TTbarPU_HG2212_Val_221205_085907_1259 status to: staged
2022-12-07 10:23:48,308:INFO:MSMonitor: MSMonitor processed 18 transfer records, where 13 completed their data transfers, 5 failed to contact the DM system and were skipped in this cycle and 0 failed to get their transfer documents updated in CouchDB.

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests no longer failing
    • 2 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 4 warnings
    • 9 comments to review
  • Pylint py3k check: failed
    • 1 warnings
  • Pycodestyle check: succeeded
    • 1 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13824/artifact/artifacts/PullRequestReport.html

@todor-ivanov todor-ivanov changed the title Add try/except block for fetching PartialCopy parameter from camapign configuration. Add try/except block for fetching PartialCopy parameter from campaign configuration. Dec 7, 2022
Contributor

@amaltaro amaltaro left a comment

Thanks for providing this fix, Todor. Ideally, it should never happen, but yes, mistakes can happen and we'd better have this protection in.

@amaltaro amaltaro merged commit 02b1a0c into dmwm:master Dec 7, 2022
@todor-ivanov
Contributor Author

thanks @amaltaro

Development

Successfully merging this pull request may close these issues.

MSMonitor breaks when finding an active transfer but the transfer's campaign is missing at ReqMgr