
Add support to RHEL8 workflows/jobs #11051

Closed
amaltaro opened this issue Mar 25, 2022 · 8 comments · Fixed by #11060, #11062 or #11077


@amaltaro
Contributor

amaltaro commented Mar 25, 2022

Impact of the new feature
ReqMgr2 / WMAgent

Is your feature request related to a problem? Please describe.
Following an email exchange with Shahzad, his team would like to start running tests with RHEL8 production workflows as soon as possible.

Describe the solution you'd like
In short, WMCore systems should support workflow creation and assignment as well as job creation and execution under RHEL8 OS (or variations of it, like EL8, Alma8, CS8, etc).

A non-exhaustive list of places that we need to verify and, if required, update:

  1. submit_py3.sh job wrapper
  2. the Scram.py module, which should have a map of ScramArch to OSes
  3. SimpleCondorPlugin

UPDATE: as provided by Shahzad below, the list of new archs to be supported is:

(el|cc|cs|alma)N_(amd64|ppc64le|aarch64)_gccV

(where N>=8 and V>=10).
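As a quick sanity check, the arch pattern above can be expressed as a regular expression. This is only a sketch: `ARCH_RE` and `is_supported_arch` are illustrative names, not actual WMCore code.

```python
import re

# Hypothetical helper matching the new ScramArch pattern described above:
# (el|cc|cs|alma)N_(amd64|ppc64le|aarch64)_gccV, with N>=8 and V>=10.
ARCH_RE = re.compile(r"^(el|cc|cs|alma)(\d+)_(amd64|ppc64le|aarch64)_gcc(\d+)$")

def is_supported_arch(scram_arch):
    """Return True if scram_arch matches the new EL8+ pattern."""
    m = ARCH_RE.match(scram_arch)
    if not m:
        return False
    os_major, gcc_version = int(m.group(2)), int(m.group(4))
    return os_major >= 8 and gcc_version >= 10

print(is_supported_arch("cs8_amd64_gcc10"))    # True
print(is_supported_arch("slc7_amd64_gcc700"))  # False (legacy slc prefix)
```

Note that the legacy `slc6`/`slc7` archs deliberately fail this check; they would still be handled by whatever existing mapping WMCore uses today.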

Describe alternatives you've considered
None

Additional context
In order to properly test it, we need to identify:

  • which CMSSW release we could use for this test. Apparently CMSSW_12_3_0_preX releases under cs8_amd64_gcc10
  • which ScramArch we are going to use for such workflows. Is it going to be only slc8_*, or should we consider the other variations?
  • and whether there are RHEL8 pilot containers that we could use (thus, being able to run such workflows anywhere)
@smuzaffar

@amaltaro , currently we have cs8_amd64_gcc10 releases available for tests. Soon we will build alma8_amd64_gcc10 release too.
Starting with EL8, we are no longer using slc in the operating system name string. The new archs are (cc|cs|alma)N_(amd64|ppc64le|aarch64)_gccV (where N>=8 and V>=10). There is already a rhel8 singularity container (just like rhel7 and rhel6) available on CVMFS.

@amaltaro
Contributor Author

Thanks, Shahzad!

And I heard back from Danilo: RHEL8 needs to be validated before the new HLT farm comes online (on April 15). So, this needs to be worked on in the coming week.

@amaltaro
Contributor Author

Update: the initial description has been updated to reflect the required changes.

@smuzaffar

thanks @amaltaro for adding the support. I noticed here that you have aarch64 and ppc64le mentioned. Does this mean we can now properly handle requests for these archs too?

@amaltaro
Contributor Author

Yes, but that architecture mapping only affects the job description and target grid resources for executing the payload.

To complete the loop, I think we would need to have some libraries (including python and python future) in CVMFS, such that the job wrapper can use libraries built in/for those architectures. @smuzaffar
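One way to picture that architecture mapping is a small lookup from the ScramArch OS prefix to the OS string placed in the job description. A minimal sketch under assumptions: `OS_BY_PREFIX` and `required_os` are hypothetical names, and the real mapping lives in WMCore's Scram.py module, not here.

```python
# Hypothetical prefix-to-OS table; values are illustrative, not the actual
# content of Scram.py.
OS_BY_PREFIX = {
    "slc6": "rhel6",
    "slc7": "rhel7",
    "cc8": "rhel8",
    "cs8": "rhel8",
    "alma8": "rhel8",
    "el8": "rhel8",
}

def required_os(scram_arch, default="any"):
    """Derive the target OS string for a job from its ScramArch."""
    prefix = scram_arch.split("_", 1)[0]
    return OS_BY_PREFIX.get(prefix, default)

print(required_os("cs8_amd64_gcc10"))      # rhel8
print(required_os("slc7_ppc64le_gcc630"))  # rhel7
```

The OS part is what drives the grid resource matching; as noted above, the CPU architecture part would additionally need matching libraries published in CVMFS.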

@smuzaffar

For ppc64le we already have /cvmfs/cms.cern.ch/COMP/slc7_ppc64le_gcc630 , so should I add a similar path for slc7_aarch64_gcc630 ? Do we need some updates here for new el8 archs?

@amaltaro
Contributor Author

@smuzaffar let me close a new release for CMSWEB first, and then I will come back to you on this, ok? I will also create a new GH issue to track it.

@amaltaro
Contributor Author

amaltaro commented Apr 9, 2022

Now that we have managed to run real EL8 workflows as a proper test, it looks like the submit_py3.sh script indeed needs to be changed for these new EL8 ScramArchs, apparently here:
https://github.com/dmwm/WMCore/blob/master/etc/submit_py3.sh#L141-L149

Here is a RelVal EL8 workflow that was pretty successful, but most of the merge jobs are somehow failing. Some logs can be found here:
https://eoscmsweb.cern.ch/eos/cms/store/logs/prod/recent/PRODUCTION/pdmvserv_RVCMSSW_12_4_0_pre2WToLNu_14TeV__TEST_alma_220408_083712_7096/GenSimFullMergeFEVTDEBUGoutput

and this part of the wmagentJob.log is suspicious (we had REQUIRED_OS='rhel8', landed on an EL7 node, but used EL6 python libraries):

2022-04-08 11:32:59,848:INFO:Startup:Python path: ['/srv/job', '/cvmfs/cms.cern.ch/slc6_amd64_gcc700/external/py2-future/0.16.0/lib/python3.6/site-packages', '/cvmfs/oasis.opensciencegrid.org/mis/osg-wn-client/3.5/3.5.60-2/el8-x86_64/usr/lib/python3.6/site-packages', '/cvmfs/oasis.opensciencegrid.org/mis/osg-wn-client/3.5/3.5.60-2/el8-x86_64/usr/lib64/python3.6/site-packages', '/srv/job/WMCore.zip', '/srv/job', '/cvmfs/cms.cern.ch/slc6_amd64_gcc700/external/python3/3.6.4/lib/python36.zip', '/cvmfs/cms.cern.ch/slc6_amd64_gcc700/external/python3/3.6.4/lib/python3.6', '/cvmfs/cms.cern.ch/slc6_amd64_gcc700/external/python3/3.6.4/lib/python3.6/lib-dynload', '/cvmfs/cms.cern.ch/slc6_amd64_gcc700/external/python3/3.6.4/lib/python3.6/site-packages']
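The wrapper fix presumably needs to key off the host's major release number when picking a CVMFS python area, so that an EL8 host never falls back to slc6 libraries as in the log above. A hedged sketch, assuming a redhat-release style string is available; `comp_python_area` and the el8 COMP path are hypothetical illustrations, not the actual submit_py3.sh logic.

```python
import re

def comp_python_area(release_string):
    """Pick a CVMFS COMP python area from an /etc/redhat-release style string.

    The path values below are assumptions for illustration; only the
    slc6/slc7 COMP areas are known to exist from the discussion above.
    """
    m = re.search(r"release (\d+)", release_string)
    major = int(m.group(1)) if m else 0
    if major >= 8:
        return "/cvmfs/cms.cern.ch/COMP/el8_amd64_gcc10"  # hypothetical path
    if major == 7:
        return "/cvmfs/cms.cern.ch/COMP/slc7_amd64_gcc630"
    return "/cvmfs/cms.cern.ch/COMP/slc6_amd64_gcc700"

print(comp_python_area("AlmaLinux release 8.5 (Arctic Sphynx)"))
```

The key property is that the selection is driven by the execution host's OS, not by the ScramArch requested at submission time, since the pilot may land the job in a container of a different flavor.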

Reopening this issue to address the job wrapper changes and proceed with further tests.
