Add extra protection for T0 to prevent archival of workflows having blocks not yet deleted #11169
Conversation
This code represents a general issue with how WMCore handles database data. The self.dbi.processData call returns a list of records from the database; this is the original copy of the data, which is kept in memory. Then, on line 88, you convert the list to a dict, which is another copy of the same data. Then some code logic, e.g. lines 99-105, creates yet another copy of the data. Therefore, a single class requires 3 times more memory than the data it represents, since the entire process copies the same data over and over into different representations. I understand the legacy of this approach, but I want to point out, with a concrete example, how and why the memory footprint is increased in WMCore code handling database data. To avoid this overhead, a streaming approach may be adopted, but it requires the following steps (see the sketch after this list):
- the dbi interface should return a generator
- the upstream classes should not create intermediate objects, but rather form a processing pipeline which consumes the generator, processes it, and yields a generator again to upstream code
- I would assume that there is another class which consumes this class' data and yields it back to the client. Therefore, if this class provided generators, the data from ORACLE would be streamed to the client, and this and similar classes would only provide a processing pipeline. In such an approach, though, someone should mind the duration of the processing pipeline, since it holds the ORACLE connection. Here this is not an issue, since the data from dbi are converted to a list and therefore the ORACLE connection is closed, but that is exactly what leads to the larger memory footprint.
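To illustrate the idea, here is a minimal sketch of such a generator pipeline, assuming a standard DB-API cursor; the function names are hypothetical and not part of WMCore:

```python
def stream_rows(cursor, sql, binds=None):
    """Hypothetical streaming DAO layer: yield rows one at a time
    instead of materializing the whole result set as a list."""
    cursor.execute(sql, binds or {})
    for row in cursor:  # DB-API cursors are iterable row by row
        yield row

def to_records(rows):
    """Pipeline stage: consume a generator and yield transformed
    records, never holding more than one record in memory."""
    for name, count in rows:
        yield {"name": name, "count": count}

# The client drives the pipeline. Note the trade-off mentioned above:
# the database connection stays open until the generator is exhausted.
#   for record in to_records(stream_rows(cursor, sql)):
#       process(record)
```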
Here is how the tests are going. What we did with @germanfgv was:
Here is the output from
We can clearly see 3 repack workflows:
Out of these, two have no dependent workflows and had all their blocks deleted, while one of them is on
The first two repack workflows were deleted and archived, but the third one is still persisting in WMBS. !!! NOTE: !!! Upon restarting the PromptReco workflows, the picture from the DAO output changed to:
So what happens at this stage is that all the PromptReco workflows are
So, as can be seen, those are already considered deletable, while the parent
And here is the log message from
What we are about to do now is to decrease the
FYI: @amaltaro @klannon @vkuznet @drkovalskyi @germanfgv @khurtado
Todor, please see some comments in the code.
Doing some grep in the repository, I do not see any mention of dbsbuffer in the WMBS factory, so I'd rather have this new DAO under the DBS3Buffer package. I also have some strong suggestions for improvements to the DAO.
Valentin has a very good point on how we deal with the relational database data, but I'd rather track/address that in a different issue, in the future, when it comes up on our priority list.
Thanks @amaltaro and @vkuznet for your reviews. I am already working on some of your comments.
I did hesitate a lot about putting the DAO in the DBS3Buffer package, but then I had second thoughts. This DAO uses the workflow names as the main selector for the aggregations and essentially takes into account whether a workflow is or is not present in WMBS, which is a really important detail of this DAO. So in the end I put it in WMBS.
As for this one, I know very well what you are talking about. When working on it I went down the same path you suggest, and I was aware of the price we need to pay for doing part of the aggregations in Python. I tried to give a more detailed explanation in reply to your inline comment. Please take a look, and let's discuss offline if you want, so we can decide how to proceed. Both ways are fine with me, but with one of them it is much harder to convince myself that it gives proper results and to trust it.
@todor-ivanov I think this query will give the results that you need:
If the query providing blocknames is going to be used mostly for debugging, then I'd rather not use it in production (which runs this same query every couple of minutes). So feel free to provide it in this PR as well, such that the T0 team can run it if debugging is needed.
Thanks @amaltaro for looking again into this.
Sorry Alan, but I think what we would expect from these counts is not what would be counted in reality. The fact that we do not GROUP BY blockname only masks some important details. To prove myself correct here, let's just change the selection requirement to simply
Look at:
I would not expect this Express workflow to have 36 blocks. So let's then group by blockname:
Looking at:
This clearly shows that the actual number of blocks for this Express workflow is 7, and some of those entries simply get into the result more than once. I do not expect a single block to be deleted 6 or 5 times; this happens because of the number of files in the blocks and because we go through a JOIN with
Now let's go further and add the JOIN with
I still think the only way of doing this is through nested SELECT statements, as I tested in my previous comment. I know the query looks much more cumbersome and heavy, but at least it gives the results that we expect.
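To make the file-multiplicity issue concrete: joining blocks through their files yields one row per file, so a plain row count per workflow overstates the block count. A sketch of the difference follows, with hypothetical table and column names standing in for the real schema:

```python
# Hypothetical schema names; one row per file inflates the block count:
naiveSQL = """SELECT dw.name, COUNT(*) AS blocks
              FROM dbsbuffer_workflow dw
              INNER JOIN dbsbuffer_file df ON df.workflow = dw.id
              INNER JOIN dbsbuffer_block db ON db.id = df.block_id
              GROUP BY dw.name"""

# Counting distinct block names removes the per-file multiplicity:
dedupSQL = """SELECT dw.name, COUNT(DISTINCT db.blockname) AS blocks
              FROM dbsbuffer_workflow dw
              INNER JOIN dbsbuffer_file df ON df.workflow = dw.id
              INNER JOIN dbsbuffer_block db ON db.id = df.block_id
              GROUP BY dw.name"""
```

Whether the deduplication is expressed via COUNT(DISTINCT ...) or via the nested SELECTs discussed above depends on the concrete schema and query plan; the point is that it has to happen somewhere.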
After a long private discussion with @amaltaro about these queries, we agreed on minimizing them as much as possible. Which means:
So the final one we agreed upon is:
It was cross-checked by temporarily implementing the inner join with the
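For readers without the inline diff at hand, a WMCore-style DAO wrapping such a per-workflow count might look roughly like the sketch below; the class shape follows existing DBFormatter DAOs, but the SQL and names here are an illustration, not the merged code:

```python
from WMCore.Database.DBFormatter import DBFormatter

class CountUndeletedBlocksByWorkflow(DBFormatter):
    """Illustrative sketch: count the not-yet-deleted blocks per workflow."""
    sql = """SELECT dw.name, COUNT(DISTINCT db.blockname) AS count
             FROM dbsbuffer_block db
             INNER JOIN dbsbuffer_file df ON df.block_id = db.id
             INNER JOIN dbsbuffer_workflow dw ON dw.id = df.workflow
             WHERE db.deleted = 0
             GROUP BY dw.name"""

    def execute(self, conn=None, transaction=False):
        results = self.dbi.processData(self.sql, conn=conn, transaction=transaction)
        # formatDict turns the raw result set into a list of dictionaries
        return self.formatDict(results)
```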
Thanks for your reviews, @vkuznet @amaltaro. I tried to simplify the code change in this PR according to all the comments and offline discussions we had. Please feel free to take another look at the latest version at your convenience.
Test this please
@todor-ivanov, could you please resolve all conversations which are done, to simplify the review process.
@todor-ivanov those unit tests are likely failing due to a problem/intervention on the Rucio integration server, but please talk to Eric to see whether we can have those succeeding before this PR gets merged.
I also left a few comments along the code for your consideration.
@@ -380,6 +396,16 @@ def deleteWorkflowFromWMBSAndDisk(self):
    deletableWorkflowsDAO = self.daoFactory(classname="Workflow.GetDeletableWorkflows")
    deletablewfs = deletableWorkflowsDAO.execute()

    # For T0 subtract the workflows which are not having all their blocks deleted yet:
    if not self.useReqMgrForCompletionCheck:
        undeletedBlocksByWorkflowDAO = self.dbsDaoFactory(classname="CountUndeletedBlocksByWorkflow")
I'd suggest to make it an object instance and initialize it in the setup
method.
I do not see a reason to spread these changes across many places, since the rule of thumb for the whole CleanCouchPoller module (at least as far as I can see) is:
- create the DAO instance where (in the method) it is needed
- execute the DAO in place (right after instantiating it)
I know the practice is different in other modules. But we would not gain anything in terms of CPU or otherwise by moving that into an object instance, and this DAO is going to be used here, and here only. Keeping the whole thing grouped in one place seems more logical to me and makes it easier to see what is used where.
It does take a few extra CPU cycles to instantiate it every cycle, but I have no strong preference here.
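Putting this thread together: the guard amounts to subtracting, from the deletable workflows, those that still have undeleted blocks. A rough sketch of that shape, assuming deletablewfs is a dict keyed by workflow name and that the DAO returns name/count pairs (both assumptions, inferred from the diff above):

```python
# Inside deleteWorkflowFromWMBSAndDisk(), after fetching deletablewfs.
# T0-only guard; the DAO is instantiated in place, as discussed above:
if not self.useReqMgrForCompletionCheck:
    undeletedBlocksByWorkflowDAO = self.dbsDaoFactory(classname="CountUndeletedBlocksByWorkflow")
    undeletedBlocksByWorkflow = undeletedBlocksByWorkflowDAO.execute()
    wfsWithUndeletedBlocks = {row['name'] for row in undeletedBlocksByWorkflow}
    # Keep only the workflows whose blocks have all been deleted already:
    deletablewfs = {wf: info for wf, info in deletablewfs.items()
                    if wf not in wfsWithUndeletedBlocks}
```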
It looks good to me, Todor. Can you please squash these commits accordingly?
Thanks @amaltaro, I am squashing it now.
…locks not yet deleted:
- Add GetDeletedBlocksByWorkflow DAO to WMBS
- Aggregate all results in the DAO per workflowName
- Add extra protection for T0 at CleanCouchPoller
- Typo
- Update docstrings and log messages
- Remove redundant statements: range() start argument, pass statement, DISTINCT statement
- Typo
- Add CountUndeletedBlocksByWorkflow DAO && decrease execution complexity in the undeleted-blocks check
- Change log level to info
- Remove keynames remapping from GetDeletedBlocksByWorkflow DAO
- Pylint fixes
- Review fixes
Thanks Todor. Whenever it's ready, please do remove the "Do not merge" tag. I know it's ready, hence I removed it myself.
Thanks @amaltaro
Fixes #11154
Status
Ready
Description
With the current PR, an additional check of the `deleted` flag for all blocks per workflow is added to `CleanCouchPoller`, in order to protect against archiving workflows which still have undeleted blocks and would otherwise leave orphaned blocks behind. The additional check is applied only to T0 agents and does not affect the standard production system.
Is it backward compatible (if not, which system it affects?)
YES
Related PRs
None
External dependencies / deployment changes
None