MSUnmerged fails to parse T1_UK_RAL_Disk stats - root_failed error #10893

Closed
amaltaro opened this issue Nov 8, 2021 · 0 comments · Fixed by #10894

amaltaro commented Nov 8, 2021

Impact of the bug
MSUnmerged

Describe the bug
The T1_UK_RAL_Disk RSE has been enabled in the MSUnmerged service configuration, which exposed a new issue [1] with the root_failed attribute that we currently use to check whether the scanner succeeded in checking the directories. Because there is no scanner for RAL, only a text file that needs to be downloaded, that key/value attribute does not exist in the WM/stats API output.

This issue has been discussed with Igor M. in the #cms-consistency channel, and he suggests using only the status attribute.

How to reproduce it
Check out the T1_UK_RAL_Disk output from the WM/stats API.
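The failure can also be imitated offline. The sketch below is only an illustration: the two records are hypothetical stand-ins for WM/stats output (the real records carry more fields), and `read_root_failed` is a made-up helper that mirrors the dictionary lookup shown in the traceback [1].

```python
# Hypothetical, trimmed-down stats records (assumption: the real WM/stats
# output contains many more fields than shown here).
regular_site = {"status": "done", "root_failed": False}
ral_like_site = {"status": "done"}  # no scanner ran, so no 'root_failed' key


def read_root_failed(record):
    """Mirror the direct key lookup that fails in consRecordAge."""
    return record["root_failed"]


print(read_root_failed(regular_site))

try:
    read_root_failed(ral_like_site)
except KeyError as exc:
    # Same exception type as in the production traceback.
    print("KeyError:", exc)
```

Any record without the `root_failed` key triggers the same exception, which is why the problem only surfaced once an RSE without a scanner was enabled.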

Expected behavior
Remove that check on root_failed:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/MicroService/MSUnmerged/MSUnmerged.py#L406

and keep checking the RSE dump status based only on the status attribute (as already coded 2 lines above).
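A minimal sketch of a status-only check, under stated assumptions: `is_dump_done` is a hypothetical helper and not the actual MSUnmerged code, and it assumes that "done" is the only status value that should let the pipeline proceed.

```python
def is_dump_done(rse_stats):
    """Return True only when the consistency record reports status 'done'.

    Assumption: 'started' and 'failed' (or a missing status) all mean the
    dump is not usable yet. No 'root_failed' lookup is performed, so records
    without that key (such as RAL's) are handled without a KeyError.
    """
    return rse_stats.get("status") == "done"


# Hypothetical records for illustration only.
print(is_dump_done({"status": "done"}))
print(is_dump_done({"status": "failed", "root_failed": True}))
print(is_dump_done({}))
```

Using `dict.get` instead of a direct key lookup keeps the check tolerant of records that omit optional attributes.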

Additional context and error message
Quoting a short summary of the information provided by Igor M.:
"""
The status attribute says whether the scanner failed or succeeded. The attribute value can be either “started” or “done” or “failed”.
- For regular sites, status="failed" is equivalent to root_failed=true
  - For RAL it means that the site dump download has failed for this run
- “started” means the scanner has not finished yet
  - For RAL, it means that the site dump is still being downloaded
- “done” means it scanned the whole tree, but in case of unmerged scanning, it does not necessarily mean that all subdirectories were scanned successfully.
  - For RAL it means that the site dump was downloaded successfully
"""
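The summary above can be captured as a small lookup table. This is a sketch for illustration only: `STATUS_MEANING` and `describe_status` are hypothetical names, and the `has_scanner` flag is an assumed way to distinguish regular sites from dump-only sites such as RAL.

```python
# Keys are (status, has_scanner); the wording follows the quoted summary.
STATUS_MEANING = {
    ("failed", True): "scanner failed (equivalent to root_failed=true)",
    ("failed", False): "site dump download has failed for this run",
    ("started", True): "scanner has not finished yet",
    ("started", False): "site dump is still being downloaded",
    ("done", True): "whole tree scanned; some subdirectories may still have failed",
    ("done", False): "site dump was downloaded successfully",
}


def describe_status(status, has_scanner=True):
    """Translate a status value into the meaning given in the summary."""
    try:
        return STATUS_MEANING[(status, has_scanner)]
    except KeyError:
        # Anything outside the three documented values is rejected.
        raise ValueError("unknown status: %r" % (status,))


print(describe_status("done", has_scanner=False))
```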

[1]

2021-11-08 01:54:59,148:ERROR:MSUnmerged: plineUnmerged: General error from pipeline. RSE: T1_UK_RAL_Disk. Error: 'root_failed' Will retry again in the next cycle.
Traceback (most recent call last):
  File "/data/srv/HG2111a/sw/slc7_amd64_gcc630/cms/reqmgr2ms/0.5.5.pre3/lib/python3.8/site-packages/WMCore/MicroService/MSUnmerged/MSUnmerged.py", line 235, in _execute
    pline.run(MSUnmergedRSE(rseName))
  File "/data/srv/HG2111a/sw/slc7_amd64_gcc630/cms/reqmgr2ms/0.5.5.pre3/lib/python3.8/site-packages/Utils/Pipeline.py", line 140, in run
    return reduce(lambda obj, functor: functor(obj), self.funcLine, obj)
  File "/data/srv/HG2111a/sw/slc7_amd64_gcc630/cms/reqmgr2ms/0.5.5.pre3/lib/python3.8/site-packages/Utils/Pipeline.py", line 140, in <lambda>
    return reduce(lambda obj, functor: functor(obj), self.funcLine, obj)
  File "/data/srv/HG2111a/sw/slc7_amd64_gcc630/cms/reqmgr2ms/0.5.5.pre3/lib/python3.8/site-packages/Utils/Pipeline.py", line 72, in __call__
    return self.run(obj)
  File "/data/srv/HG2111a/sw/slc7_amd64_gcc630/cms/reqmgr2ms/0.5.5.pre3/lib/python3.8/site-packages/Utils/Pipeline.py", line 75, in run
    return self.func(obj, *self.args, **self.kwargs)
  File "/data/srv/HG2111a/sw/slc7_amd64_gcc630/cms/reqmgr2ms/0.5.5.pre3/lib/python3.8/site-packages/WMCore/MicroService/MSUnmerged/MSUnmerged.py", line 406, in consRecordAge
    isRootFailed = self.rseConsStats[rseName]['root_failed']
KeyError: 'root_failed'