Adopt MSPileup data into PileupFetcher #12197

Merged
merged 4 commits into from
Dec 10, 2024

amaltaro
Contributor

@amaltaro amaltaro commented Dec 5, 2024

Fixes #12195

Status

ready

Description

Summary of changes:

  • provide a new utility function, getDomainName, to parse the cmsweb endpoint URL
  • adopt MSPileup data for pileup location in PileupFetcher (we missed it when unifying pileup location across WMCore!)
  • if a custom pileup is defined, use it to list all the blocks available in Rucio; otherwise use the standard pileup name
  • patch PileupFetcher with MockMSPileup
  • update some of the Rucio and MSPileup emulators
  • update some of the Rucio mocked data (partly by hand, given that some of those data are no longer in Rucio)
  • lastly, DBS blocks not present in Rucio are removed from the pileupconf.json file
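
To illustrate the first item, here is a minimal sketch of what a domain-name helper could look like. It is not the code from this PR: the name get_domain_name, the choice to return only the leading host label, and the example URL are assumptions made for illustration.

```python
from urllib.parse import urlparse


def get_domain_name(url_str):
    """
    Hypothetical stand-in for getDomainName (not the WMCore implementation).
    Returns the first label of the host name found in the URL, or an
    empty string when the URL carries no host.
    """
    host = urlparse(url_str).hostname or ""
    return host.split(".")[0]


# Example endpoint, used here only as an illustration
domain = get_domain_name("https://cmsweb-testbed.cern.ch/dbs/int/global/DBSReader")
```

A helper like this would let the caller tell a production endpoint apart from a testbed one without hard-coding full URLs.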

Is it backward compatible (if not, which system it affects?)

YES

Related PRs

None

External dependencies / deployment changes

None

@amaltaro amaltaro changed the title Fix 12195 Adopt MSPileup data into PileupFetcher Dec 5, 2024
@dmwm-bot

dmwm-bot commented Dec 5, 2024

Jenkins results:

  • Python3 Unit tests: failed
    • 318 new failures
    • 5 tests deleted
    • 3 tests no longer failing
    • 1 tests added
    • 13 changes in unstable tests
  • Python3 Pylint check: failed
    • 5 warnings and errors that must be fixed
    • 1 warnings
    • 26 comments to review
  • Pycodestyle check: succeeded
    • 5 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/151/artifact/artifacts/PullRequestReport.html

@dmwm-bot

dmwm-bot commented Dec 6, 2024

Jenkins results:

  • Python3 Unit tests: failed
    • 6 new failures
    • 5 tests no longer failing
    • 1 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 4 warnings and errors that must be fixed
    • 1 warnings
    • 26 comments to review
  • Pycodestyle check: succeeded
    • 5 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/152/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor Author

amaltaro commented Dec 6, 2024

test this please

@dmwm-bot

dmwm-bot commented Dec 6, 2024

Jenkins results:

  • Python3 Unit tests: failed
    • 6 new failures
    • 5 tests no longer failing
    • 1 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 4 warnings and errors that must be fixed
    • 1 warnings
    • 26 comments to review
  • Pycodestyle check: succeeded
    • 5 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/153/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor Author

amaltaro commented Dec 6, 2024

I am running the following testbed tests (in vocms0193)
amaltaro_SC_MultiPU_Agent237_Val_241206_045937_5318
amaltaro_SC_PU_5Steps_Agent237_Val_241206_050012_2817

and hopefully the failing unit tests have now been sorted as well.

I fear that to properly test it though, we will need to have an agent connected to production Rucio and MSPileup. @hassan11196 I might have to ping you tomorrow to inject the workflow that failed as a backfill (in a backfill agent..)

@dmwm-bot

dmwm-bot commented Dec 6, 2024

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 3 tests no longer failing
    • 1 tests added
    • 4 changes in unstable tests
  • Python3 Pylint check: failed
    • 11 warnings and errors that must be fixed
    • 5 warnings
    • 64 comments to review
  • Pycodestyle check: succeeded
    • 9 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/154/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor Author

amaltaro commented Dec 6, 2024

The unit test failing:

    WMCore_t.Storage_t.StageInMgr_t.StageInMgrUnitTest:testStageInMgr changed from success to error

seems to be unrelated (maybe CVMFS is no longer mounted on these Jenkins nodes?):

Unable to find site local config file:
/cvmfs/cms.cern.ch/SITECONF/T1_US_FNAL/JobConfig/site-local-config.xml

@d-ylee are you aware of any changes that might have impacted this? If I am not wrong, at some point we asked Shahzad to mount cvmfs in all Jenkins nodes, no?

Even though I have yet to check my tests in the agent, I think these changes are ready for review.

@amaltaro
Contributor Author

amaltaro commented Dec 6, 2024

Just pushed in a 5th commit which is supposed to resolve failures like:

2024-12-06 11:53:18,388:139638026835712:ERROR:WMBSHelper:Failed to create subscription. Error: <@========== WMException Start ==========@>
Exception Class: WMRucioDIDNotFoundException
Message: Data identifier not found in Rucio: /RelValMinBias_14TeV/CMSSW_11_2_0_pre8-112X_mcRun3_2024_realistic_v10_forTrk-v1/GEN-SIM-V4. Error: Data identifier not found.
Details: Data identifier 'cms:/RelValMinBias_14TeV/CMSSW_11_2_0_pre8-112X_mcRun3_2024_realistic_v10_forTrk-v1/GEN-SIM-V4' not found
        ClassName : None
        ModuleName : WMCore.Services.Rucio.Rucio
        MethodName : isContainer
        ClassInstance : None
        FileName : /usr/local/lib/python3.8/site-packages/WMCore/Services/Rucio/Rucio.py
        LineNumber : 804
        ErrorNr : 0

Traceback: 
  File "/usr/local/lib/python3.8/site-packages/WMCore/Services/Rucio/Rucio.py", line 801, in isContainer
    response = self.cli.get_did(scope=scope, name=didName)

  File "/usr/local/lib/python3.8/site-packages/rucio/client/didclient.py", line 424, in get_did
    raise exc_cls(exc_msg)

which happens because testbed/preprod MSPileup talks to Integration Rucio, meaning that the custom containers defined in MSPileup do not exist in Production Rucio.

Nonetheless, it also fails with the Integration database, maybe because the integration with Rucio Int in WM is very incomplete and fragile... We need to revisit this ASAP.

@dmwm-bot

dmwm-bot commented Dec 6, 2024

Jenkins results:

  • Python3 Unit tests: succeeded
    • 3 tests no longer failing
    • 1 tests added
    • 5 changes in unstable tests
  • Python3 Pylint check: failed
    • 11 warnings and errors that must be fixed
    • 5 warnings
    • 65 comments to review
  • Pycodestyle check: succeeded
    • 9 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/155/artifact/artifacts/PullRequestReport.html

@dmwm-bot

dmwm-bot commented Dec 6, 2024

Jenkins results:

  • Python3 Unit tests: failed
    • 5 new failures
    • 1 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 11 warnings and errors that must be fixed
    • 5 warnings
    • 65 comments to review
  • Pycodestyle check: succeeded
    • 9 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/160/artifact/artifacts/PullRequestReport.html

@dmwm-bot

dmwm-bot commented Dec 6, 2024

Jenkins results:

  • Python3 Unit tests: failed
    • 5 new failures
    • 1 tests added
  • Python3 Pylint check: failed
    • 11 warnings and errors that must be fixed
    • 5 warnings
    • 65 comments to review
  • Pycodestyle check: succeeded
    • 9 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/161/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor Author

amaltaro commented Dec 7, 2024

Together with Ahmed, we have been validating these changes in a backfill agent and the workflow injected is:

I still see a high failure rate, but that is because this workflow was injected with secondary AAA enabled, so it runs everywhere in the site list and not only at FNAL and CERN.

I have copied parts of this workflow above, from the WorkQueueManager cache area, under the following directory:

cmst1@vocms0254:amaltaro $ pwd
/data/srv/amaltaro

and indented the following 2 files (located at WMSandbox/SUS-RunIISpring21UL18FSGSPremixLLPBugFix-00065_0/cmsRun1/pileupconf.json):

  • inside_sandbox_pu.json: it contains the content of pileupconf.json from inside the Sandbox.tar.bz2 file
  • outside_sandbox_pu.json: it contains the content of the pileupconf.json from outside the sandbox tarball

My observations are:

  1. WorkQueueManager created a pileupconf.json file with all 35 Neutrino blocks, but only 10 of them have FNAL+CERN set under PhEDExNodeNames --> GOOD! This is the expected behavior.
  2. WorkflowUpdater noticed differences and updated the Sandbox tarball, which now contains only the 10 blocks that exist in the custom pileup, with FNAL+CERN set under PhEDExNodeNames --> This is OKAY when secondary AAA is disabled. It is BAD when secondary AAA is enabled though, as it now limits the files that can be read at runtime. Is it a valid use case though?
  3. WorkflowUpdater will (I am still waiting for the next cycle to confirm) update the sandbox again and again, as it compares MSPileup information against the pileupconf.json outside of the sandbox, which is never updated. --> BAD!!! It wastes resources and makes the component less robust.
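
The loop described in observation 3 can be reproduced with a small sketch. This is not the WorkflowUpdater code: the function needs_sandbox_update, the block names, and the simplified pileupconf.json layout (pileup type, then block name, then block details) are assumptions for illustration only.

```python
def needs_sandbox_update(mspileup_blocks, pileup_conf):
    """
    Hypothetical check: report a difference whenever the set of blocks
    known to MSPileup differs from the blocks listed in a pileupconf.json
    document (assumed layout: {pileup type: {block name: details}}).
    """
    conf_blocks = set()
    for blocks in pileup_conf.values():
        conf_blocks.update(blocks)
    return set(mspileup_blocks) != conf_blocks


# File outside the sandbox: written once with every DBS block, never refreshed
outside_conf = {"mc": {
    "/PU/DS/PREMIX#1": {"PhEDExNodeNames": ["T1_US_FNAL_Disk", "T2_CH_CERN"]},
    "/PU/DS/PREMIX#2": {"PhEDExNodeNames": []},
    "/PU/DS/PREMIX#3": {"PhEDExNodeNames": []},
}}
custom_blocks = ["/PU/DS/PREMIX#1"]

# After the first update, the copy inside the sandbox holds only the custom blocks
inside_conf = {"mc": {name: outside_conf["mc"][name] for name in custom_blocks}}

# Comparing against the stale outside file reports a difference on every cycle,
# while comparing against the refreshed sandbox copy would not
stale_check = needs_sandbox_update(custom_blocks, outside_conf)
fresh_check = needs_sandbox_update(custom_blocks, inside_conf)
```

Under these assumptions, either refreshing the outside file alongside the sandbox or comparing against the sandbox copy would stop the repeated updates; which option fits WorkflowUpdater is for the follow-up ticket to decide.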

Final summary: despite problems beyond the "scope" of this ticket, I think the changes provided in this PR bring us to the expected and healthy behavior.

TODO: create a new ticket for observation 3) above.

@dmwm-bot

dmwm-bot commented Dec 7, 2024

Jenkins results:

  • Python3 Unit tests: failed
    • 5 new failures
    • 1 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 11 warnings and errors that must be fixed
    • 5 warnings
    • 65 comments to review
  • Pycodestyle check: succeeded
    • 9 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/164/artifact/artifacts/PullRequestReport.html

@dmwm-bot

dmwm-bot commented Dec 7, 2024

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 11 warnings and errors that must be fixed
    • 5 warnings
    • 65 comments to review
  • Pycodestyle check: succeeded
    • 9 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/165/artifact/artifacts/PullRequestReport.html

@dmwm-bot

dmwm-bot commented Dec 9, 2024

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 11 warnings and errors that must be fixed
    • 5 warnings
    • 65 comments to review
  • Pycodestyle check: succeeded
    • 9 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/169/artifact/artifacts/PullRequestReport.html

@dmwm-bot

dmwm-bot commented Dec 9, 2024

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 11 warnings and errors that must be fixed
    • 5 warnings
    • 65 comments to review
  • Pycodestyle check: succeeded
    • 9 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/170/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor Author

amaltaro commented Dec 9, 2024

My previous commits only touched the mock data and emulators; none of them changed the actual logic.
With all unit tests succeeding now, feel free to proceed with code review.

@dmwm-bot

dmwm-bot commented Dec 9, 2024

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests no longer failing
    • 1 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 13 warnings and errors that must be fixed
    • 5 warnings
    • 65 comments to review
  • Pycodestyle check: succeeded
    • 9 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/177/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor Author

A snippet of the WorkQueueManager logs in the backfill agent (vocms0254) is:

2024-12-09 23:44:25,153:140168095241984:INFO:PileupFetcher:Found 35 blocks in DBS for dataset /Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX with 17208 files
2024-12-09 23:44:25,231:140168095241984:INFO:PileupFetcher:Pileup dataset /Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX with:
        custom name: /Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX-V1,
        current RSEs: ['T1_US_FNAL_Disk', 'T2_CH_CERN']
        and container fraction: 0.28
2024-12-09 23:44:25,286:140168095241984:INFO:PileupFetcher:Found 10 blocks in container /Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX-V1 for scope group.wmcore
...
2024-12-09 23:44:25,287:140168095241984:WARNING:PileupFetcher:Block /Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#cfd725d2-939b-4f38-97a3-e3b7c4770632 present in DBS but not in Rucio. Removing it.
...
2024-12-09 23:44:25,288:140168095241984:INFO:PileupFetcher:Final pileup dataset /Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX-V1 has a total of 10 blocks.

so it is all looking good to me.
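
The block filtering visible in these logs can be sketched as follows. This is a simplified illustration rather than the PileupFetcher code: the function name drop_blocks_missing_in_rucio and the sample block names are hypothetical, and the warning text simply mirrors the log line above.

```python
import logging


def drop_blocks_missing_in_rucio(dbs_blocks, rucio_blocks):
    """
    Hypothetical helper: keep only the DBS blocks that are also listed
    in the Rucio container, warning about every block that is dropped.

    :param dbs_blocks: dict of block name -> block details from DBS
    :param rucio_blocks: iterable of block names found in Rucio
    :return: dict with the surviving blocks
    """
    rucio_set = set(rucio_blocks)
    kept = {}
    for name, details in dbs_blocks.items():
        if name in rucio_set:
            kept[name] = details
        else:
            logging.warning("Block %s present in DBS but not in Rucio. Removing it.", name)
    return kept


dbs_blocks = {
    "/PU/DS/PREMIX#a": {"FileList": ["/store/a.root"]},
    "/PU/DS/PREMIX#b": {"FileList": ["/store/b.root"]},
}
final_blocks = drop_blocks_missing_in_rucio(dbs_blocks, ["/PU/DS/PREMIX#a"])
```

With the numbers from the logs, the same intersection would reduce the 35 DBS blocks to the 10 blocks present in the custom container.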

'filters': ['pileupName', 'customName', 'containerFraction', 'currentRSEs']}
doc = getPileupDocs(msPileupUrl, queryDict, method='POST')[0]
msg = f'Pileup dataset {doc["pileupName"]} with:\n\tcustom name: {doc["customName"]},'
msg += f'\n\tcurrent RSEs: {doc["currentRSEs"]}\n\tand container fraction: {doc["containerFraction"]}'
Contributor

@amaltaro do we need any translation like this one for currentRSEs in this case? Probably not, since your tests are successful, but better to document it here with this comment, for future reference

Contributor Author

That is a relevant question!
We do not need to perform any mapping here because MSPileup (and Rucio) are already returning a PhEDEx Node Name (aka RSE).

And what is expected during runtime is the PNN, as it is loaded from the Site Local Config:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMRuntime/Scripts/SetupCMSSWPset.py#L378

So we are good.
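
As a rough illustration of why no mapping is needed, the sketch below matches a site PNN directly against the PhEDExNodeNames stored for each block. It is a simplification and not the SetupCMSSWPset code: the function files_readable_at, the assumed pileupconf.json layout, and the sample values are hypothetical.

```python
def files_readable_at(pileup_conf, pu_type, site_pnn):
    """
    Hypothetical helper: collect the files of every block whose
    PhEDExNodeNames list contains the PNN of the site running the job.
    Assumed layout: {pileup type: {block name: {"FileList": [...],
    "PhEDExNodeNames": [...]}}}.
    """
    files = []
    for details in pileup_conf.get(pu_type, {}).values():
        if site_pnn in details.get("PhEDExNodeNames", []):
            files.extend(details.get("FileList", []))
    return files


pileup_conf = {"mc": {
    "/PU/DS/PREMIX#a": {"FileList": ["/store/a.root"],
                        "PhEDExNodeNames": ["T1_US_FNAL_Disk", "T2_CH_CERN"]},
    "/PU/DS/PREMIX#b": {"FileList": ["/store/b.root"],
                        "PhEDExNodeNames": []},
}}
fnal_files = files_readable_at(pileup_conf, "mc", "T1_US_FNAL_Disk")
```

Because the value written by PileupFetcher and the value read from the Site Local Config are both PNNs, a plain membership test like this is enough.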

@anpicci
Contributor

anpicci commented Dec 10, 2024

@amaltaro thank you for working on this. This looks great to me, I have a minor comment for your consideration. Then, I am fine with approving the PR

@amaltaro
Contributor Author

Thank you, Andrea. I will improve the docstring for getDomainName, which I just noticed to be incomplete (bad job from my IDE) and squash the commits accordingly.

fix import

Fix PileupFetcher logging and kwargs

Support different Rucio urls in PileupFetcher - according to the DBS instance

Custom pileup requires a custom scope as well

another fix for custom pileup name

Remove blocks available in DBS but not in Rucio

improve docstring
fix scope in MakeRucioMockFile.py
update Rucio mocked data

fix signature of getBlocksInContainer in Rucio data
more unit test fixes
@dmwm-bot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests no longer failing
    • 1 tests added
  • Python3 Pylint check: failed
    • 13 warnings and errors that must be fixed
    • 5 warnings
    • 65 comments to review
  • Pycodestyle check: succeeded
    • 9 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/188/artifact/artifacts/PullRequestReport.html

@amaltaro amaltaro merged commit 53491a6 into dmwm:master Dec 10, 2024
2 of 3 checks passed
Development

Successfully merging this pull request may close these issues.

Jobs reading trying to read Pileups files that are only on Tape.
3 participants