Integrate all the partial pileup features together and run final pre-production tests #11736
Comments
Now we should have all the required development already merged and available in a given tag. For central services, the partial pileup feature is fully implemented starting in that tag.
Alan, before taking care of this issue, please provide additional context to supplement the description of the ticket. My understanding is that testing will require the following:
There is extra work to be done here, so I am reopening this ticket.
I successfully tested (using the 2.3.2rc5 release candidate) the transition of the following dataset:
The testing was done through the REST API by changing the pileup fraction. I will now repeat the tests using the 2.3.2rc6 release candidate and another dataset.
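For reference, a minimal sketch of what such a fraction change over the REST interface could look like; the endpoint URL, the containerFraction field name, and the certificate paths are assumptions for illustration, not details taken from this thread.

```python
# Hypothetical sketch of a pileup-fraction update via the MSPileup REST interface.
# Endpoint, payload fields and credentials below are assumptions, not the exact
# command used in this test.
import json
import requests

MSPILEUP_URL = "https://cmsweb-testbed.cern.ch/ms-pileup/data/pileup"   # assumed endpoint
CERT = ("/path/to/usercert.pem", "/path/to/userkey.pem")                # user grid credentials

payload = {
    "pileupName": "/Placeholder/PileupDataset-PU/PREMIX",   # made-up container name
    "containerFraction": 0.5,                               # assumed field carrying the fraction
}

resp = requests.put(MSPILEUP_URL,
                    data=json.dumps(payload),
                    headers={"Content-Type": "application/json"},
                    cert=CERT)
print(resp.status_code, resp.text)
```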
Update 1:
I hope these tests are sufficient to declare MSPileup/MSPileupTasks working. @amaltaro, please provide your feedback on this observation.
Yes, I think this is good enough for MSPileup. We still have to see how the global and local workqueue behave with these changes, though. For that, I would suggest assigning a workflow with pileup to a team name that does not exist, such that the workqueue elements will stay in Available status forever (that way we can test both the workqueue creation and the location update). We also need to hear back from @todor-ivanov on the WMAgent validation (of partial pileup).
Here is your workflow with a nonexistent team name: https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=tivanov_SC_MultiPU_Feb2024_Val_PartPU_v3_240330_073439_5439
Hi @amaltaro,
The workflow used for testing the Local WorkQueue was: https://cmsweb-testbed.cern.ch/reqmgr2/fetch?rid=tivanov_SC_MultiPU_Feb2024_Val_PartPU_v5_240415_132937_6883. The LWQ logs for injecting it into WMBS:
Agent's
Upon changing the pileup fraction of
I checked whether the workflow's data had been updated at the agent. Here are the logs from
So, as one can see from the log messages, the sandbox is updated. With that I can say we are good to go.
@vkuznet @amaltaro, just a minor question here before we enable this in production tomorrow. After checking all the logs and confirming that the logic is triggered upon a pileup fraction update, as explained in the documentation, I decided to also double-check the updated sandbox itself.
But what I can see in the sandbox is this:
Yet the workflow completed successfully. Could that be attributed to the fact that I edited the pileup fraction while the jobs were already running in Condor?
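For that kind of double check, here is a small sketch of how one could inspect the pileup payload inside a sandbox tarball; the per-task pileupconf.json layout and the tarball name are assumptions used for illustration, not taken from this thread.

```python
# Minimal sketch for inspecting pileup information inside a workflow sandbox
# tarball. The pileupconf.json file layout and the sandbox name are assumptions.
import json
import tarfile

def dump_pileup_configs(sandbox_path):
    """Print a short summary of every pileup JSON file found inside the sandbox."""
    with tarfile.open(sandbox_path, "r:*") as tar:
        for member in tar.getmembers():
            if member.name.endswith("pileupconf.json"):
                data = json.load(tar.extractfile(member))
                print(f"{member.name}: {len(data)} top-level entries")

# Hypothetical sandbox file name:
dump_pileup_configs("tivanov_SC_MultiPU_Feb2024_Val_PartPU_v5-Sandbox.tar.bz2")
```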
@todor-ivanov, there are several conditions for modification of that content. I'll leave the final call to @amaltaro; maybe we can rerun the test with a different workflow. In that case, I suggest providing details of the node, logs, etc., so that we can look at it.
And I was not wrong... even before I got my second test fully initiated, I did see the empty
And I think I know where the error comes from. I am now running this block of code in standalone mode: WMCore/src/python/WMComponent/WorkflowUpdater/WorkflowUpdaterPoller.py, lines 375 to 407 (commit 9e20b58).
And here is the sequence we enter:
While the list of pileup documents is obtained in the following way:
So, long story short, we are missing the required field. This leads us to the second part of the issue.
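As a rough illustration of what fetching pileup documents from MSPileup can look like (the endpoint, query payload shape, and response envelope below are assumptions, not the actual call used inside the poller):

```python
# Hedged sketch of querying MSPileup for pileup documents with a filter and a
# projection of returned fields; endpoint and payload structure are assumptions.
import json
import requests

MSPILEUP_URL = "https://cmsweb-testbed.cern.ch/ms-pileup/data/pileup"  # assumed endpoint

query = {
    "query": {"active": True},                        # assumed filter: only active pileups
    "filters": ["pileupName", "containerFraction"],   # assumed projection of returned fields
}

resp = requests.post(MSPILEUP_URL, data=json.dumps(query),
                     headers={"Content-Type": "application/json"},
                     cert=("/path/to/usercert.pem", "/path/to/userkey.pem"))
pileup_docs = resp.json().get("result", [])           # assumed response envelope
print(f"Fetched {len(pileup_docs)} pileup documents")
```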
My suggested fix is here: #11974. I would appreciate it if both of you, @amaltaro and @todor-ivanov, could have a look at it, in particular Todor's comment #11736 (comment) and my reply #11736 (comment).
Hi @vkuznet,
That was my initial idea for fixing the second part of the issue as well. But then I realized that this way we would change the structure of the returned object: we are about to pass to
And we are about to return an empty object. On the other hand, I am interested to know why we try to access a field in the pileup document which does not exist:
And I checked: in the current implementation this key is missing for all pileups known to the service.
Todor, the MSPileup document does not have blocks. I already pointed out that in the code we don't use this document per se, but rather create a different structure from it, which contains the blocks.
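The failure mode can be pictured with a tiny sketch: the MSPileup document itself carries no blocks, so a direct lookup of that key can only fail (the document layout below is simplified and the field names are assumptions).

```python
# Illustration of accessing a 'blocks' field that the MSPileup document never carries.
# The document content here is made up for the example.
mspileup_doc = {
    "pileupName": "/Placeholder/PileupDataset-PU/PREMIX",
    "containerFraction": 0.5,
}

try:
    blocks = mspileup_doc["blocks"]          # raises KeyError: the key is never present
except KeyError:
    blocks = []

# A defensive lookup avoids the exception but silently yields no block data,
# which is why the derived structure has to add and populate 'blocks' itself:
blocks = mspileup_doc.get("blocks", [])
print(blocks)                                # -> []
```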
Thanks @vkuznet and @amaltaro; actually, Alan and you both pointed to one and the same place:
So the blocks in the PU documents are created here:
and later populated here: WMCore/src/python/WMComponent/WorkflowUpdater/WorkflowUpdaterPoller.py, lines 303 to 304 (commit 9e20b58).
This was the part I was missing; instead, I was using the upper-level getPileupDocs() method, not the method defined within the WorkflowUpdaterPoller class. So this explains why in my tests I was missing the block fields. In this case your fix should work perfectly, Valya. I'll patch and test later tonight.
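To summarize the create-then-populate pattern described above, here is an illustrative sketch, not the actual WMCore code; the helper names and the structure layout are assumptions.

```python
# Illustrative sketch: the poller-side structure first gets an empty 'blocks'
# container per pileup dataset, and only a later step fills it from Rucio.
# build_pileup_map and the block-listing callable are placeholders, not WMCore APIs.

def build_pileup_map(pileup_docs):
    """Create the internal structure with an empty 'blocks' set per pileup dataset."""
    return {doc["pileupName"]: {"containerFraction": doc.get("containerFraction", 1.0),
                                "blocks": set()}
            for doc in pileup_docs}

def populate_blocks(pileup_map, list_blocks_in_container):
    """Second step: fill each 'blocks' set using a Rucio block-listing callable."""
    for pu_name, info in pileup_map.items():
        info["blocks"].update(list_blocks_in_container(pu_name))
    return pileup_map
```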
vocms0193 is now patched and the proposed fix from #11974 is under test. Here is the test workflow: SC_MultiPU_Feb2024_Val_PartPU_v10_240425_082716_7943
Hi @vkuznet @amaltaro, here are the results. I changed its fraction to 0.7 with the following command:
And here is the resultant record at MSPileup:
Then I restarted the
But with the current change, it did not update the sandbox tarball, because yet again it did not find any block information. Here is the log message from the blocked step:
And comparing the two tarballs (before and after the edit), I can see that we indeed protected ourselves from ending up with an empty
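The protective behaviour referred to here can be pictured with a small sketch (illustrative names only, not the actual component code): the sandbox is only rewritten when block information was actually resolved.

```python
# Sketch of the guard described above: skip the sandbox update when no block
# information is available, rather than shipping an empty pileup configuration.
import logging

def maybe_update_sandbox(workflow_name, pileup_blocks):
    """Return True only if there is block data worth writing into the sandbox."""
    if not pileup_blocks:
        logging.warning("No block information resolved for %s; leaving sandbox untouched",
                        workflow_name)
        return False
    # ... the real component would regenerate the pileup JSON and repack the tarball here
    return True
```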
Hi @vkuznet @amaltaro,
But if I get to change the
The same goes for the single-block PU as well:
And indeed, here is the shorter version of these messages printed from
Moving painfully slowly ahead:
And as can clearly be seen, it takes 0.0 seconds, the blocks for all of them are empty, and no error whatsoever has been thrown, so there is no indication of what happened in the background.
I think I know what the problem is.
Upon adding the following printout at this very line:
Here is what I see in the logs:
while the whole
OK, getting closer here. Here is how
Applying this line:
Here is what the child objects look like:
So it is obvious that the result is neither unique nor a flat list. And cross-checking everything manually (up to now it was all just live printouts from the code at the agent), we can see that there is no way to match a string value against a list of what looks like sets of string values:
While the whole object from inside the list does match. So we are actually never updating any block-level information from Rucio at all. I believe this is all due to the heavily nested structure coming from WMCore/src/python/WMCore/WMSpec/WMWorkload.py, lines 1428 to 1458 (commit 9e20b58).
WorkflowUpdaterPoller:
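The mismatch can be reproduced in isolation; the block names below are invented, but the data shapes mirror what the printouts show: a plain string can never match inside a list whose elements are sets of strings.

```python
# Standalone reproduction of the membership problem: a block name (a string) is
# tested against a list containing sets of strings. All names are made up.
block = "/PUDataset/Era-v1/PREMIX#aaaa-1111"

nested = [{"/PUDataset/Era-v1/PREMIX#aaaa-1111",
           "/PUDataset/Era-v1/PREMIX#bbbb-2222"}]
flat = {"/PUDataset/Era-v1/PREMIX#aaaa-1111",
        "/PUDataset/Era-v1/PREMIX#bbbb-2222"}

print(block in nested)   # False: a single name never matches a set inside the list
print({block, "/PUDataset/Era-v1/PREMIX#bbbb-2222"} in nested)   # True: the whole set matches
print(block in flat)     # True once the structure is flattened
```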
And indeed, the values from WMCore/src/python/WMCore/WMSpec/WMWorkload.py, line 1451 (commit 9e20b58):
And checking it manually confirms it:
So we are flattening this
@todor-ivanov, I think you are looking at the wrong place. This has all been resolved, and the place where we have a unique list of pileup dataset names is in this line: You are looking at earlier lines, which indeed have a different data structure than the one expected by
Well, I beg to disagree... it matters, and let me show you why. It matters because we are putting the values in the
And let me paste what I get out of this very line in:
And now, if I go and manually remove all this and make them be populated just as normal sets:
Find the difference ;):
I believe the second one is what the rest of the code expects. I am preparing a patch.
@amaltaro @vkuznet, here is the solution to our problems. All we need to do is to properly iterate through the whole view as returned by ``. If we do so, all our data structures fall into place (see the sketch below). Here are the printouts after applying my patch:
(BTW, this
And later in the process:
Then the
And finally the
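A minimal sketch of the fix being described, assuming the workload view maps each pileup type to a set of dataset names (the exact structure in WMCore may differ; all names below are invented):

```python
# Sketch of iterating through the whole pileup view to obtain one flat, unique
# collection of dataset names. The view layout and the names are assumptions.
pileup_view = {
    "MixingModule": {"/PUDatasetA/Era-v1/PREMIX", "/PUDatasetB/Era-v1/PREMIX"},
    "DataMixingModule": {"/PUDatasetB/Era-v1/PREMIX"},
}

# Problematic shape: a list whose elements are sets, so a dataset name cannot be
# matched with a simple membership test.
as_list_of_sets = list(pileup_view.values())

# Desired shape: iterate through every entry of the view and merge into one set.
unique_datasets = set()
for datasets in pileup_view.values():
    unique_datasets.update(datasets)

print("/PUDatasetA/Era-v1/PREMIX" in as_list_of_sets)   # False
print("/PUDatasetA/Era-v1/PREMIX" in unique_datasets)   # True
```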
Indeed, here is how this
instead of:
AAAnnnddd... attention, attention all... here is the golden fix ;) Tadaaaa:
After patching vocms0193, sending yet another test workflow, and changing the PU fraction, here is what we get:
@amaltaro @vkuznet, I think we are good to go. Alan, feel free to deploy to production at your convenience. Enabling and using the functionality is a different story, though; I remember noticing a message from P&R in this regard.
Great! Thank you for the validation and confirmation, Todor. I am now cutting a final release. If there is nothing pending for this ticket, shall we close this out?
Honor me with the pleasure of pulling that plug, please!
Impact of the new feature
WMCore in general
Is your feature request related to a problem? Please describe.
At this stage, we are supposed to have the whole partial pileup functionality in place, with dev-environment tests already done.
Describe the solution you'd like
Carefully run integration-level tests with all the partial pileup sub-features in place, ensuring that:
Describe alternatives you've considered
None
Additional context
This is part of the meta issue: #11537
**** UPDATE ****
The following pull requests have been provided to resolve issues identified during this integration and validation phase:
#11948
#11971
#11972
#11975