Jobs trying to read pileup files that are only on Tape #12195
Comments
@hassan11196, thank you for the comprehensive summary. But I think we should clarify the following items:
My point is that even though I understand your use case, I don't possess all the information on how to deal with every use case, and until we define such rules it is hard (almost impossible) to tweak the code to cover all of them. I bet that if we accommodate this use case, other use cases will break because of such a change.
After having a chat with Ahmed, we concluded that we have overlooked one use case for custom pileup containers. In this example, the workflow was active in the agent for only about 6h, which wasn't enough for it to go through WorkflowUpdater and receive an up-to-date list of pileup data and locations. That means the workflow was still using the original pileup container, because that is what is used when a workflow is acquired in the agent. In other words, when work is acquired for the first time between LQ and WMBS, the agent creates the workflow sandbox and the pileupconf.json. The sandbox creation process can call a set of fetchers, and PileupFetcher is one of them, where we can see it uses the pileup name from the workflow description. What we have to do is integrate MSPileup into this PileupFetcher module, such that whenever there is a custom container defined for a given original pileup name, we resolve location and data for the custom container instead of the original one. Thanks Ahmed for sharing all these details.
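A minimal sketch of what that integration could look like, assuming MSPileup serves its documents through the `/ms-pileup/data/pileup?pileupName=...` endpoint quoted later in this issue, and assuming the document carries a field pointing to the custom container (here hypothetically called `customName`); the field name, the `result` wrapper, and the exact hook into PileupFetcher are all assumptions, not the actual implementation:

```python
import json
import urllib.parse
import urllib.request

MSPILEUP_URL = "https://cmsweb.cern.ch/ms-pileup/data/pileup"

def resolvePileupName(originalName):
    """
    Query MSPileup for the document of the original pileup name and, if a
    custom container is defined for it, return that name instead.
    NOTE: 'customName' is a hypothetical field used for illustration only,
    and real cmsweb access also needs a CMS grid certificate (omitted here).
    """
    url = MSPILEUP_URL + "?pileupName=" + urllib.parse.quote(originalName, safe="")
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    # Assumption: MSPileup wraps its documents in a top-level "result" list
    for doc in data.get("result", []):
        if doc.get("customName"):
            return doc["customName"]
    return originalName
```

PileupFetcher would then call something like `resolvePileupName()` on the pileup name taken from the workflow description, before resolving block data and locations, so the sandbox's pileupconf.json is built for the custom container rather than the original one.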
Impact of the bug
MS-Pileup, Workflow Updater
Describe the bug
We have recently been getting reports of workflows trying to read pileup files that are only on tape. The following request failed 100% of its jobs while trying to read files from Tape:
https://cmsweb.cern.ch/reqmgr2/fetch?rid=cmsunified_task_SUS-RunIISpring21UL18FSGSPremixLLPBugFix-00065__v1_T_241203_102835_7150
The jobs were reading files from the following pileup.
https://cmsweb.cern.ch/ms-pileup/data/pileup?pileupName=/Neutrino_E-10_gun/RunIIFall17FSPrePremix-PUFSUL18CP5_106X_upgrade2018_realistic_v16-v1/PREMIX
I checked the pileupconf.json and it has 35 blocks, the same as the complete dataset listed on DAS, and the same number of files (17208) as listed on DAS. Is this expected behavior? Or shouldn't the pileupconf.json contain only the files that are on disk?
Here is the pileupconf.json:
pileupconf.json
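For context, a sketch of how one could count the blocks and files in that attachment, assuming the layout that WMCore's PileupFetcher writes (a top-level pileup type, e.g. "mc", mapping block names to their file lists and locations); the field names below are my recollection of that format and may differ:

```python
import json

# Assumed pileupconf.json layout (per WMCore's PileupFetcher, to the best
# of my knowledge):
#   {pileupType: {blockName: {"FileList": [...],
#                             "PhEDExNodeNames": [...],
#                             "NumberOfEvents": N}}}
with open("pileupconf.json") as fd:
    conf = json.load(fd)

for puType, blocks in conf.items():
    nFiles = sum(len(info.get("FileList", [])) for info in blocks.values())
    print(f"{puType}: {len(blocks)} blocks, {nFiles} files")
```

For the dataset above this should report the 35 blocks and 17208 files mentioned earlier, i.e. the complete dataset as listed on DAS.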
This was a special request, in the sense that we have a threshold on the ratio of pileup events on disk to the events to be produced by the workflow; for this workflow we bypassed the threshold because it was a special request. However, that should not cause files to be read from tape. It seems that, since more jobs were reading fewer files, the issue simply became more apparent.
I used the Rucio client to get the block locations for the pileup dataset blocks in the pileupconf.json:
rc.list_dataset_replicas(scope="cms", name=block)
Here are the block rse locations:
Out of the 35 total blocks, 25 are on Tape and 10 on disk. This roughly matches the pileup fraction on disk, i.e. 10/35 ≈ 0.28.
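A self-contained version of that check, using the same `rc.list_dataset_replicas()` call as above, and assuming both the pileupconf.json layout sketched earlier and the usual CMS convention that tape RSE names end in `_Tape`:

```python
import json
from rucio.client import Client  # requires a configured Rucio environment

rc = Client()

with open("pileupconf.json") as fd:
    conf = json.load(fd)

diskBlocks, tapeOnlyBlocks = 0, 0
for blocks in conf.values():
    for block in blocks:
        # Collect the RSEs holding a replica of this block
        rses = {rep["rse"]
                for rep in rc.list_dataset_replicas(scope="cms", name=block)}
        # Assumption: CMS tape endpoints are named "*_Tape"
        if any(not rse.endswith("_Tape") for rse in rses):
            diskBlocks += 1
        else:
            tapeOnlyBlocks += 1

print(f"blocks with a disk replica: {diskBlocks}, tape-only: {tapeOnlyBlocks}")
```

For this dataset the result was 10 blocks with a disk replica and 25 tape-only, consistent with the 0.28 disk fraction quoted above.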
How to reproduce it
Steps to reproduce the behavior:
I would suspect that if we resubmit this workflow as a backfill, we will run into the same situation.
Expected behavior
Jobs should only read pileup files that are on disk.
FYI @amaltaro @vkuznet