Jobs reading trying to read Pileups files that are only on Tape. #12195

hassan11196 · 2024-12-04T17:46:00Z

Impact of the bug
MS-Pileup, Workflow Updater

Describe the bug
Recently we have been getting reports of workflows trying to read pileup files that are only on tape. The following request recently failed 100%, with its jobs trying to read files from tape.

https://cmsweb.cern.ch/reqmgr2/fetch?rid=cmsunified_task_SUS-RunIISpring21UL18FSGSPremixLLPBugFix-00065__v1_T_241203_102835_7150

The jobs were reading files from the following pileup.
https://cmsweb.cern.ch/ms-pileup/data/pileup?pileupName=/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX

I checked the pileupconf.json: it has 35 blocks, the same as the complete dataset listed on DAS, and the same number of files (17208) as listed on DAS. Is this expected behavior? Or should the pileupconf.json contain only the files that are on disk?
Here is the pileupconf.json:
pileupconf.json

This was a special request: we normally enforce a threshold on the ratio of pileup events on disk to the events to be produced by the workflow, and for this workflow we bypassed that threshold. However, this should not cause files to be read from tape. It seems that the issue became apparent because more jobs were reading fewer files.

I used the Rucio client to get the locations of the pileup dataset blocks listed in the pileupconf.json:
rc.list_dataset_replicas(scope="cms", name=block)

Here are the block RSE locations:

{'/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#7c3119ff-8c5d-438e-8b35-eaf7e794e761': ['T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#e209adc9-de06-4c01-9ac8-04134291f040': ['T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#5ae882cb-91d0-4e98-ac84-c2cf5e71916f': ['T1_US_FNAL_Disk',
  'T1_US_FNAL_Tape',
  'T2_CH_CERN'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#5367850b-7437-47f4-8ebf-93f988c67a7f': ['T1_US_FNAL_Tape',
  'T2_CH_CERN',
  'T1_US_FNAL_Disk'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#9a758d1c-53e7-4ea4-ae17-f48f87a89252': ['T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#22d2fefb-7834-4207-9f32-fcf66bde5a1d': ['T2_CH_CERN',
  'T1_US_FNAL_Tape',
  'T1_US_FNAL_Disk'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#d6d6b41e-d7d8-4a78-a1f5-325777dc8888': ['T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#62868489-c980-4f1f-8279-32cd0b84d1a8': ['T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#79b77ee8-c14d-4b3e-a480-3e70ff8abc1c': ['T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#4260b42c-058e-44a5-9d45-de7a261af9ff': ['T2_CH_CERN',
  'T1_US_FNAL_Tape',
  'T1_US_FNAL_Disk'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#823b885a-e6fb-4518-bf80-b319c8d6f678': ['T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#7cc80640-9521-47da-bb2a-4d74db541546': ['T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#43e5d2f9-bbc0-4973-90fa-10f881fa2b02': ['T2_CH_CERN',
  'T1_US_FNAL_Disk',
  'T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#bc5ea84e-1f89-45ca-b691-5c307e222507': ['T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#fa509e64-cbd3-47f7-8f8d-058fd72dd53e': ['T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#8917333d-e8e1-448c-9650-abffd775e626': ['T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#8a9036fc-1233-4d2d-9376-d351a74b0fd5': ['T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#a9820ddb-baac-43ba-9877-4ca6db38d182': ['T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#5938bc73-4e07-4466-91c0-d20a55be2a7a': ['T1_US_FNAL_Tape',
  'T2_CH_CERN',
  'T1_US_FNAL_Disk'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#cfd725d2-939b-4f38-97a3-e3b7c4770632': ['T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#324f77d2-4890-4b7e-8eac-aa61fde5ac9e': ['T1_US_FNAL_Tape',
  'T1_US_FNAL_Disk',
  'T2_CH_CERN'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#77eada73-58b2-4a70-9da4-5bc0ecf85aa1': ['T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#ff8158bc-68f9-49e1-8fd2-0c8854e1b5f1': ['T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#1f703fed-12bb-4414-8d50-4622af323144': ['T2_CH_CERN',
  'T1_US_FNAL_Tape',
  'T1_US_FNAL_Disk'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#dc8d3348-d7bc-43e0-b4e5-edf596ce590b': ['T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#bc873935-db83-4f04-b171-b795e1214eb0': ['T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#a28bed4a-7312-4249-a873-3dbfaee57c3f': ['T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#edebb8be-5f5b-4430-9896-1d8e9b4558e0': ['T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#15ba9a31-6314-4a8f-b004-18e44113fbb8': ['T2_CH_CERN',
  'T1_US_FNAL_Disk',
  'T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#b9200d77-bf5c-45d1-906a-ab3a2e09e947': ['T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#a944c1ea-9db9-4e07-b1de-829a0cd55a31': ['T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#e6e35b94-98fb-4c81-b6f5-236fc9089a50': ['T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#951dff37-4893-4800-a6fe-28651dab236d': ['T1_US_FNAL_Tape'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#559e5cc9-c84e-4227-84b0-ce3516941d18': ['T1_US_FNAL_Disk',
  'T1_US_FNAL_Tape',
  'T2_CH_CERN'],
 '/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX#eed5b1b2-1d52-4b85-a246-f3f460ed4aab': ['T1_US_FNAL_Tape']}

Of the 35 blocks, 25 are on tape only and 10 also have disk replicas. That is 10/35 ≈ 0.29, which roughly equals the pileup fraction on disk, i.e. 0.28.

Counter({('T1_US_FNAL_Tape',): 25,
         ('T1_US_FNAL_Disk', 'T1_US_FNAL_Tape', 'T2_CH_CERN'): 10})
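
The tally above can be reproduced roughly as follows. This is a minimal sketch, not the exact script used: `tally_block_locations` and `disk_fraction` are hypothetical helper names, `rc` is assumed to be a Rucio client (or any object exposing `list_dataset_replicas(scope, name)` that yields dicts with an `rse` key), and treating an RSE as tape when its name ends in `_Tape` is an assumption based on the RSE names listed here.

```python
from collections import Counter


def tally_block_locations(rc, blocks, scope="cms"):
    """Map each block to its sorted RSE names and count blocks per RSE set."""
    locations = {}
    for block in blocks:
        replicas = rc.list_dataset_replicas(scope=scope, name=block)
        # Keep only the RSE name of each replica; sort for stable Counter keys.
        locations[block] = sorted({rep["rse"] for rep in replicas})
    counts = Counter(tuple(rses) for rses in locations.values())
    return locations, counts


def disk_fraction(locations):
    """Fraction of blocks that have at least one non-tape replica.

    Assumption: tape RSEs are recognizable by a "_Tape" suffix.
    """
    if not locations:
        return 0.0
    on_disk = sum(
        1
        for rses in locations.values()
        if any(not rse.endswith("_Tape") for rse in rses)
    )
    return on_disk / len(locations)
```

For the 35 blocks listed above, `disk_fraction` would come out as 10/35, about 0.29.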

How to reproduce it
Steps to reproduce the behavior:
I suspect that if we resubmit this workflow as a backfill, we will run into the same situation.

Expected behavior
Jobs should read only pileup files that are on disk.

FYI @amaltaro @vkuznet

vkuznet · 2024-12-04T18:54:36Z

@hassan11196, thank you for the comprehensive summary. But I think we should clarify the following items:

- When a user specifies a fraction of a dataset, the current algorithm uses the rucio client getBlocksInContainer API for the custom container, and I think Rucio does not differentiate files on disk from files on tape. Therefore, to resolve the issue we must define rules for the different use cases, i.e. when files from any location may be used versus when only files on disk should be used. How will you define such criteria?
- Even if we define criteria to use only files on disk, we must define what a fraction means: is it a fraction of the files on disk or of the overall dataset, and how should it be defined for different use cases?
- We may revisit the usage of RSEs and remove those that list Tape, but again it comes down to the question of which cases this is suitable for.

My point is that even though I understand your use case, I do not possess all the information on how to deal with all use cases, and until we define such rules it is hard (almost impossible) to tweak the code to deal with all of them. I bet that if we accommodate this use case, other use cases will break because of such a change.
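
To make the question about the meaning of "fraction" concrete, here is a small sketch of the two readings. Everything in it is illustrative: `select_blocks` and `is_disk_rse` are hypothetical helpers, not WMCore or MSPileup code, blocks are taken in the order given rather than by any real selection policy, and the `_Tape` suffix test is an assumption about RSE naming.

```python
import math


def is_disk_rse(rse):
    """Assumption: tape RSEs are recognizable by a "_Tape" suffix."""
    return not rse.endswith("_Tape")


def select_blocks(locations, fraction, relative_to="all"):
    """Pick disk-resident blocks for a requested container fraction.

    relative_to="all":  target = ceil(fraction * number of all blocks)
    relative_to="disk": target = ceil(fraction * number of disk-resident blocks)

    Only disk-resident blocks are ever returned, so with "all" the result
    can be smaller than the target when too few blocks are on disk.
    """
    if relative_to not in ("all", "disk"):
        raise ValueError("relative_to must be 'all' or 'disk'")
    disk_blocks = [
        block
        for block, rses in locations.items()
        if any(is_disk_rse(rse) for rse in rses)
    ]
    base = len(locations) if relative_to == "all" else len(disk_blocks)
    target = math.ceil(fraction * base)
    return disk_blocks[:target]
```

With 35 blocks of which 10 are on disk, a fraction of 0.28 selects 10 blocks under the first reading and 3 under the second, which is why the definition matters.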

amaltaro · 2024-12-05T16:04:14Z

After having a chat with Ahmed, we concluded that we have overlooked one use case for custom pileup containers.

In this example, the workflow was active in the agent for only about 6h, which wasn't enough to get the workflow through WorkflowUpdater and pick up an up-to-date list of pileup data and locations. That means the workflow was using the original pileup container, because this is what is used when a workflow is acquired in the agent.

In other words, when work is acquired for the first time between LQ and WMBS, the agent creates the workflow sandbox and the pileupconf.json. The sandbox creation process can call a set of fetchers:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMRuntime/SandboxCreator.py#L134

and PileupFetcher is one of them:

WMCore/src/python/WMCore/WMSpec/Steps/Fetchers/PileupFetcher.py

Line 193 in 8e53d51

def __call__(self, wmTask):

where we can see it uses the pileup name from the workflow description.

What we need to do is integrate MSPileup into this PileupFetcher module, such that whenever a custom container is defined for a given original pileup name, we resolve location and data for the custom container instead of the original one.
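
A rough illustration of that lookup, under stated assumptions: `resolve_pileup_container` is a hypothetical helper and does not reproduce the change made in #12197; the document fields `pileupName` and `customName` are assumed names for the MSPileup record, and `ms_docs` stands in for whatever the MSPileup query returns.

```python
def resolve_pileup_container(pileup_name, ms_docs):
    """Return the container whose data and location should be resolved.

    ms_docs: iterable of dicts, assumed to carry "pileupName" and an
    optional "customName" (both field names are assumptions).
    Falls back to the original pileup name when no custom container is set.
    """
    for doc in ms_docs:
        if doc.get("pileupName") == pileup_name:
            custom = doc.get("customName")
            if custom:
                return custom
            break
    return pileup_name
```

The fallback keeps the current behavior for pileups without a custom container, so only workflows whose pileup has one would see a different block list.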

Thanks Ahmed for sharing all these details.

amaltaro self-assigned this Dec 5, 2024

amaltaro added BUG WMAgent MSPileup labels Dec 5, 2024

amaltaro added this to WMCore quarterly developments Dec 5, 2024

amaltaro moved this to In Progress in WMCore quarterly developments Dec 5, 2024

amaltaro mentioned this issue Dec 5, 2024

Adopt MSPileup data into PileupFetcher #12197

Merged

amaltaro closed this as completed in #12197 Dec 10, 2024

github-project-automation bot moved this from In Progress to Done in WMCore quarterly developments Dec 10, 2024
