Jobs trying to read pileup files that are only on Tape. #12195

Closed
hassan11196 opened this issue Dec 4, 2024 · 2 comments · Fixed by #12197

Comments

@hassan11196
Member

Impact of the bug
MS-Pileup, Workflow Updater

Describe the bug
We have recently been getting reports of workflows trying to read pileup files that are only on tape. Most recently, the following request failed 100% while trying to read files from tape.

https://cmsweb.cern.ch/reqmgr2/fetch?rid=cmsunified_task_SUS-RunIISpring21UL18FSGSPremixLLPBugFix-00065__v1_T_241203_102835_7150

The jobs were reading files from the following pileup:
https://cmsweb.cern.ch/ms-pileup/data/pileup?pileupName=/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX

I checked the pileupconf.json and it has 35 blocks, the same as the complete dataset listed on DAS, and the same number of files (17208) as listed on DAS. Is this expected behavior? Or should the pileupconf.json contain only the files that are on disk?
Here is the pileupconf.json,
pileupconf.json

This was a special request: we normally enforce a threshold on the ratio of pileup events on disk to the events to be produced by the workflow, and for this workflow we bypassed that threshold. However, this should not cause files to be read from tape. It seems that, because more jobs were reading fewer files, the issue became apparent.

I used the Rucio client to get the block locations for the pileup dataset blocks in the pileupconf.json:
rc.list_dataset_replicas(scope="cms",name=block)

Here are the block RSE locations:

{'/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#7c3119ff-8c5d-438e-8b35-eaf7e794e761': ['T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#e209adc9-de06-4c01-9ac8-04134291f040': ['T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#5ae882cb-91d0-4e98-ac84-c2cf5e71916f': ['T1_US_FNAL_Disk',
  'T1_US_FNAL_Tape',
  'T2_CH_CERN'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#5367850b-7437-47f4-8ebf-93f988c67a7f': ['T1_US_FNAL_Tape',
  'T2_CH_CERN',
  'T1_US_FNAL_Disk'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#9a758d1c-53e7-4ea4-ae17-f48f87a89252': ['T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#22d2fefb-7834-4207-9f32-fcf66bde5a1d': ['T2_CH_CERN',
  'T1_US_FNAL_Tape',
  'T1_US_FNAL_Disk'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#d6d6b41e-d7d8-4a78-a1f5-325777dc8888': ['T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#62868489-c980-4f1f-8279-32cd0b84d1a8': ['T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#79b77ee8-c14d-4b3e-a480-3e70ff8abc1c': ['T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#4260b42c-058e-44a5-9d45-de7a261af9ff': ['T2_CH_CERN',
  'T1_US_FNAL_Tape',
  'T1_US_FNAL_Disk'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#823b885a-e6fb-4518-bf80-b319c8d6f678': ['T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#7cc80640-9521-47da-bb2a-4d74db541546': ['T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#43e5d2f9-bbc0-4973-90fa-10f881fa2b02': ['T2_CH_CERN',
  'T1_US_FNAL_Disk',
  'T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#bc5ea84e-1f89-45ca-b691-5c307e222507': ['T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#fa509e64-cbd3-47f7-8f8d-058fd72dd53e': ['T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#8917333d-e8e1-448c-9650-abffd775e626': ['T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#8a9036fc-1233-4d2d-9376-d351a74b0fd5': ['T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#a9820ddb-baac-43ba-9877-4ca6db38d182': ['T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#5938bc73-4e07-4466-91c0-d20a55be2a7a': ['T1_US_FNAL_Tape',
  'T2_CH_CERN',
  'T1_US_FNAL_Disk'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#cfd725d2-939b-4f38-97a3-e3b7c4770632': ['T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#324f77d2-4890-4b7e-8eac-aa61fde5ac9e': ['T1_US_FNAL_Tape',
  'T1_US_FNAL_Disk',
  'T2_CH_CERN'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#77eada73-58b2-4a70-9da4-5bc0ecf85aa1': ['T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#ff8158bc-68f9-49e1-8fd2-0c8854e1b5f1': ['T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#1f703fed-12bb-4414-8d50-4622af323144': ['T2_CH_CERN',
  'T1_US_FNAL_Tape',
  'T1_US_FNAL_Disk'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#dc8d3348-d7bc-43e0-b4e5-edf596ce590b': ['T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#bc873935-db83-4f04-b171-b795e1214eb0': ['T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#a28bed4a-7312-4249-a873-3dbfaee57c3f': ['T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#edebb8be-5f5b-4430-9896-1d8e9b4558e0': ['T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#15ba9a31-6314-4a8f-b004-18e44113fbb8': ['T2_CH_CERN',
  'T1_US_FNAL_Disk',
  'T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#b9200d77-bf5c-45d1-906a-ab3a2e09e947': ['T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#a944c1ea-9db9-4e07-b1de-829a0cd55a31': ['T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#e6e35b94-98fb-4c81-b6f5-236fc9089a50': ['T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#951dff37-4893-4800-a6fe-28651dab236d': ['T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#559e5cc9-c84e-4227-84b0-ce3516941d18': ['T1_US_FNAL_Disk',
  'T1_US_FNAL_Tape',
  'T2_CH_CERN'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#eed5b1b2-1d52-4b85-a246-f3f460ed4aab': ['T1_US_FNAL_Tape']}

Out of the 35 total blocks, 25 are only on tape and 10 also have disk replicas. This roughly equals the pileup fraction on disk, i.e. 0.28 (10/35 ≈ 0.29).

Counter({('T1_US_FNAL_Tape',): 25,
         ('T1_US_FNAL_Disk', 'T1_US_FNAL_Tape', 'T2_CH_CERN'): 10})
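
The tally above can be reproduced from a block-to-RSE mapping with a few lines of Python. This is only a minimal sketch: the helper names `is_tape` and `summarize_block_locations` are hypothetical, the block names in `sample` are shortened placeholders, and it assumes that tape RSEs can be recognised by a `_Tape` suffix, as in the listing above.

```python
from collections import Counter


def is_tape(rse):
    # Assumption: tape RSEs end in "_Tape", as in the listing above.
    return rse.endswith("_Tape")


def summarize_block_locations(block_rses):
    """Split blocks into tape-only and disk-resident, and tally RSE combinations."""
    tape_only = sorted(
        block for block, rses in block_rses.items()
        if rses and all(is_tape(rse) for rse in rses)
    )
    on_disk = sorted(
        block for block, rses in block_rses.items()
        if any(not is_tape(rse) for rse in rses)
    )
    # Sort each RSE list so that the same set of sites always maps to one key.
    combos = Counter(tuple(sorted(rses)) for rses in block_rses.values())
    return tape_only, on_disk, combos


# Hypothetical, shortened block names standing in for the real ones.
sample = {
    "/PU/X/PREMIX#a": ["T1_US_FNAL_Tape"],
    "/PU/X/PREMIX#b": ["T2_CH_CERN", "T1_US_FNAL_Tape", "T1_US_FNAL_Disk"],
    "/PU/X/PREMIX#c": ["T1_US_FNAL_Tape"],
}

tape_only, on_disk, combos = summarize_block_locations(sample)
disk_fraction = len(on_disk) / len(sample)
print(len(tape_only), len(on_disk), round(disk_fraction, 2))
```

Feeding the full mapping from the listing into the same function should give the 25 / 10 split reported here.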

How to reproduce it
Steps to reproduce the behavior:
I would suspect that if we resubmit this workflow as a backfill we will run into the same situation.

Expected behavior
Jobs should read only pileup files that are on disk.

FYI @amaltaro @vkuznet

@vkuznet
Contributor

vkuznet commented Dec 4, 2024

@hassan11196, thank you for the comprehensive summary. But I think we should clarify the following items:

  • When a user specifies a fraction of a dataset, the current algorithm uses the Rucio client getBlocksInContainer API for the custom container, and I think Rucio does not differentiate files on disk from files on tape. Therefore, to resolve the issue we must define rules for the different use cases, i.e. when files from any location may be used versus when only files on disk should be used. How would you define such criteria?
  • Even if we define criteria to use only files on disk, we must define what a fraction means: is it a fraction of the files on disk or of the overall dataset? How do we define it for the different use cases?
  • We may revisit the usage of RSEs and remove those that list Tape, but again it comes down to the question of in which cases that is suitable.
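
To make the ambiguity about the fraction concrete, here is a small sketch of the two readings, using the block counts reported in this issue (35 blocks in total, 10 with a disk replica). The function `blocks_needed`, the example fraction of 0.5, and the choice to round up are all assumptions made for illustration; none of this reflects how MSPileup actually computes a fraction.

```python
import math


def blocks_needed(fraction, total_blocks, disk_blocks, relative_to_disk):
    """Return (blocks requested, blocks that cannot be served from disk).

    relative_to_disk=False: the fraction applies to the whole dataset.
    relative_to_disk=True:  the fraction applies only to blocks on disk.
    """
    base = disk_blocks if relative_to_disk else total_blocks
    needed = math.ceil(fraction * base)  # rounding up is an assumption
    shortfall = max(0, needed - disk_blocks)
    return needed, shortfall


# Reading 1: 50% of the overall dataset.
overall = blocks_needed(0.5, total_blocks=35, disk_blocks=10, relative_to_disk=False)

# Reading 2: 50% of what is currently on disk.
disk_only = blocks_needed(0.5, total_blocks=35, disk_blocks=10, relative_to_disk=True)

print(overall, disk_only)
```

Under the first reading, part of the requested blocks would have to come from tape-only blocks. Under the second, everything can be served from disk but the sample is much smaller.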

My point is that even though I understand your use case, I don't possess all the information on how to deal with every use case, and until we define such rules it is hard (almost impossible) to tweak the code to handle all of them. I bet that if we accommodate this use case, other use cases will break because of the change.

@amaltaro
Contributor

amaltaro commented Dec 5, 2024

After having a chat with Ahmed, we concluded that we have overlooked one use case for custom pileup containers.

In this example, the workflow was active in the agent for only about 6 hours, which wasn't enough time for it to go through WorkflowUpdater and receive an up-to-date list of pileup data and locations. That means the workflow was using the original pileup container, because that is what is used when a workflow is acquired in the agent.

In other words, when work is acquired for the first time between LQ and WMBS, the agent creates the workflow sandbox and the pileupconf.json. The sandbox creation process can call a set of fetchers:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMRuntime/SandboxCreator.py#L134

and PileupFetcher is one of them. It uses the pileup name from the workflow description.

What we have to do is integrate MSPileup into the PileupFetcher module, so that whenever a custom container is defined for a given original pileup name, we resolve location and data for the custom container instead of the original one.
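
A rough sketch of that lookup is shown below. It is not the actual PileupFetcher change: the function `resolve_pileup_container` is hypothetical, the document fields `pileupName` and `customName` are assumed to be how an MSPileup record links an original pileup to its custom container, and the container names are placeholders.

```python
def resolve_pileup_container(original_name, mspileup_docs):
    """Return the custom container for a pileup if one is defined, else the original name.

    mspileup_docs is assumed to be a list of dicts as returned by MSPileup,
    each carrying "pileupName" and, optionally, a non-empty "customName".
    """
    for doc in mspileup_docs:
        if doc.get("pileupName") == original_name and doc.get("customName"):
            return doc["customName"]
    return original_name


# Hypothetical MSPileup records with placeholder names.
docs = [
    {"pileupName": "/PU/X/PREMIX", "customName": "/PU/X/PREMIX-custom"},
    {"pileupName": "/PU/Y/PREMIX", "customName": ""},
]

print(resolve_pileup_container("/PU/X/PREMIX", docs))  # custom container is used
print(resolve_pileup_container("/PU/Y/PREMIX", docs))  # falls back to the original
```

Falling back to the original name keeps the behaviour unchanged for pileups that have no custom container defined.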

Thanks Ahmed for sharing all these details.
