Combines block records in single dict that includes all associated wfs #11245
Conversation
@germanfgv I am not sure I fully understand the problem you reported. However, looking into the line that you implemented, it makes me think that, regardless of whether we have your changes in place or not, unless completedWorkflows is always fully built before deletableWorkflows, then yes, in this case it would only evaluate to True once all the completedWorkflows are also deletableWorkflows.
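For clarity, a minimal sketch of the check being debated, assuming completedWorkflows and deletableWorkflows are collections of workflow names (the exact WMCore line is not quoted here, so names and values are illustrative only):

# Hypothetical illustration, not the actual WMCore code: the condition only
# evaluates to True once every completed workflow is also deletable.
completedWorkflows = {'wf_A', 'wf_B', 'wf_C'}
deletableWorkflows = {'wf_A', 'wf_B'}

print(completedWorkflows.issubset(deletableWorkflows))  # False: wf_C is not deletable yet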
Maybe this concern is addressed with my second commit. Before, we were disregarding incomplete workflows in the query due to the HAVING COUNT(*) = SUM(dbsbuffer_workflow.completed) clause. Because of this, you are right about the behavior you describe. By removing the completion check from the query in 0bf4285, we can now properly avoid deleting blocks that belong to incomplete workflows.
Hi @germanfgv, thanks for the suggested change.
I still have some concerns about this extra aggregation that is needed in the Python code. It is a sign that we are ignoring the uniqueness of the records as they come from the GROUP BY SQL statement, which to me seems dangerous.
In parallel to the already existing changes, which as far as I understood are already working on one of the T0 machines in production, I'd like to test a few of those SQL queries and DAO outcomes myself on one of the machines affected by this bug, if you do not mind.
@@ -49,7 +49,6 @@ class GetCompletedBlocks(DBFormatter):
              dbsbuffer_dataset_subscription.site,
              dbsbuffer_workflow.name,
              dbsbuffer_block.create_time
-         HAVING COUNT(*) = SUM(dbsbuffer_workflow.completed)
Why are you removing this protection here? IIRC this is meant to ensure that all the workflows associated with a block are in status completed.
That was useful in GetDeletableBlocks.py because that query was not grouping by workflow name. Here, if you exclude records with workflows that are not complete, you will still have other records with other workflows associated with the block, so the block will still end up in the results, even with uncompleted associated workflows.
Even worse, since we are excluding the uncompleted workflows from the query, later we do not take those uncompleted workflows into account when we compare against the deletableWfsDict list.
because that query was not grouping by workflow name
That is correct.
I have double-checked now that introducing dbsbuffer_workflow.name in the SELECT and GROUP BY conditions splits the grouping further down, creating multiple groups per block - one for every associated workflow. It then counts the completed workflows, but records enter the output every time a workflow completes... so if we have 1 block with 3 associated PromptReco requests, out of which only two are completed, we will have that block returned 2 times in the output. Once the third PromptReco is complete, the same block would enter the output result 3 times.
This makes the rest of the logic quite hard to follow indeed.
And since this is a quite subtle detail, difficult to understand and remember, let me put down some concrete examples.
Here is the output from the current SQL query with the counts per group included in the output (filtered for a single block):
SELECT count(*),
       dbsbuffer_block.blockname,
       dbsbuffer_location.pnn,
       dbsbuffer_dataset.path,
       dbsbuffer_dataset_subscription.site,
       dbsbuffer_workflow.name,
       dbsbuffer_block.create_time
FROM dbsbuffer_dataset_subscription
INNER JOIN dbsbuffer_dataset ON
    dbsbuffer_dataset.id = dbsbuffer_dataset_subscription.dataset_id
INNER JOIN dbsbuffer_block ON
    dbsbuffer_block.dataset_id = dbsbuffer_dataset_subscription.dataset_id
INNER JOIN dbsbuffer_file ON
    dbsbuffer_file.block_id = dbsbuffer_block.id
INNER JOIN dbsbuffer_workflow ON
    dbsbuffer_workflow.id = dbsbuffer_file.workflow
INNER JOIN dbsbuffer_location ON
    dbsbuffer_location.id = dbsbuffer_block.location
WHERE dbsbuffer_dataset_subscription.delete_blocks = 1
  AND dbsbuffer_dataset_subscription.subscribed = 1
  AND dbsbuffer_block.status = 'Closed'
  AND dbsbuffer_block.deleted = 0
  AND dbsbuffer_block.blockname = '/ZeroBias/Run2022C-PromptReco-v1/AOD#d54f99e8-9d28-4567-921f-edce2a207f00'
GROUP BY dbsbuffer_block.blockname,
         dbsbuffer_location.pnn,
         dbsbuffer_dataset.path,
         dbsbuffer_dataset_subscription.site,
         dbsbuffer_workflow.name,
         dbsbuffer_block.create_time
HAVING COUNT(*) = SUM(dbsbuffer_workflow.completed)
ORDER BY dbsbuffer_workflow.name;
COUNT(*) BLOCKNAME PNN PATH SITE NAME CREATE_TIME
---------- ------------------------------------------------------------------------------------------------------------------------ --------------- ---------------------------------------- ------------------ ---------------------------------------- -----------
82 /ZeroBias/Run2022C-PromptReco-v1/AOD#d54f99e8-9d28-4567-921f-edce2a207f00 T0_CH_CERN_Disk /ZeroBias/Run2022C-PromptReco-v1/AOD T1_US_FNAL_Disk PromptReco_Run357080_ZeroBias 1660249466
82 /ZeroBias/Run2022C-PromptReco-v1/AOD#d54f99e8-9d28-4567-921f-edce2a207f00 T0_CH_CERN_Disk /ZeroBias/Run2022C-PromptReco-v1/AOD T1_US_FNAL_MSS PromptReco_Run357080_ZeroBias 1660249466
Now let's lift the HAVING COUNT(*) = SUM(dbsbuffer_workflow.completed) constraint:
SELECT count(*),
       dbsbuffer_block.blockname,
       dbsbuffer_location.pnn,
       dbsbuffer_dataset.path,
       dbsbuffer_dataset_subscription.site,
       dbsbuffer_workflow.name,
       dbsbuffer_block.create_time
FROM dbsbuffer_dataset_subscription
INNER JOIN dbsbuffer_dataset ON
    dbsbuffer_dataset.id = dbsbuffer_dataset_subscription.dataset_id
INNER JOIN dbsbuffer_block ON
    dbsbuffer_block.dataset_id = dbsbuffer_dataset_subscription.dataset_id
INNER JOIN dbsbuffer_file ON
    dbsbuffer_file.block_id = dbsbuffer_block.id
INNER JOIN dbsbuffer_workflow ON
    dbsbuffer_workflow.id = dbsbuffer_file.workflow
INNER JOIN dbsbuffer_location ON
    dbsbuffer_location.id = dbsbuffer_block.location
WHERE dbsbuffer_dataset_subscription.delete_blocks = 1
  AND dbsbuffer_dataset_subscription.subscribed = 1
  AND dbsbuffer_block.status = 'Closed'
  AND dbsbuffer_block.deleted = 0
  AND dbsbuffer_block.blockname = '/ZeroBias/Run2022C-PromptReco-v1/AOD#d54f99e8-9d28-4567-921f-edce2a207f00'
GROUP BY dbsbuffer_block.blockname,
         dbsbuffer_location.pnn,
         dbsbuffer_dataset.path,
         dbsbuffer_dataset_subscription.site,
         dbsbuffer_workflow.name,
         dbsbuffer_block.create_time
ORDER BY dbsbuffer_workflow.name;
COUNT(*) BLOCKNAME PNN PATH SITE NAME CREATE_TIME
---------- ------------------------------------------------------------------------------------------------------------------------ --------------- ---------------------------------------- ------------------ ---------------------------------------- -----------
82 /ZeroBias/Run2022C-PromptReco-v1/AOD#d54f99e8-9d28-4567-921f-edce2a207f00 T0_CH_CERN_Disk /ZeroBias/Run2022C-PromptReco-v1/AOD T1_US_FNAL_Disk PromptReco_Run357080_ZeroBias 1660249466
82 /ZeroBias/Run2022C-PromptReco-v1/AOD#d54f99e8-9d28-4567-921f-edce2a207f00 T0_CH_CERN_Disk /ZeroBias/Run2022C-PromptReco-v1/AOD T1_US_FNAL_MSS PromptReco_Run357080_ZeroBias 1660249466
77 /ZeroBias/Run2022C-PromptReco-v1/AOD#d54f99e8-9d28-4567-921f-edce2a207f00 T0_CH_CERN_Disk /ZeroBias/Run2022C-PromptReco-v1/AOD T1_US_FNAL_Disk PromptReco_Run357081_ZeroBias 1660249466
77 /ZeroBias/Run2022C-PromptReco-v1/AOD#d54f99e8-9d28-4567-921f-edce2a207f00 T0_CH_CERN_Disk /ZeroBias/Run2022C-PromptReco-v1/AOD T1_US_FNAL_MSS PromptReco_Run357081_ZeroBias 1660249466
2 /ZeroBias/Run2022C-PromptReco-v1/AOD#d54f99e8-9d28-4567-921f-edce2a207f00 T0_CH_CERN_Disk /ZeroBias/Run2022C-PromptReco-v1/AOD T1_US_FNAL_Disk PromptReco_Run357096_ZeroBias 1660249466
2 /ZeroBias/Run2022C-PromptReco-v1/AOD#d54f99e8-9d28-4567-921f-edce2a207f00 T0_CH_CERN_Disk /ZeroBias/Run2022C-PromptReco-v1/AOD T1_US_FNAL_MSS PromptReco_Run357096_ZeroBias 1660249466
8 /ZeroBias/Run2022C-PromptReco-v1/AOD#d54f99e8-9d28-4567-921f-edce2a207f00 T0_CH_CERN_Disk /ZeroBias/Run2022C-PromptReco-v1/AOD T1_US_FNAL_Disk PromptReco_Run357098_ZeroBias 1660249466
8 /ZeroBias/Run2022C-PromptReco-v1/AOD#d54f99e8-9d28-4567-921f-edce2a207f00 T0_CH_CERN_Disk /ZeroBias/Run2022C-PromptReco-v1/AOD T1_US_FNAL_MSS PromptReco_Run357098_ZeroBias 1660249466
7 /ZeroBias/Run2022C-PromptReco-v1/AOD#d54f99e8-9d28-4567-921f-edce2a207f00 T0_CH_CERN_Disk /ZeroBias/Run2022C-PromptReco-v1/AOD T1_US_FNAL_Disk PromptReco_Run357100_ZeroBias 1660249466
7 /ZeroBias/Run2022C-PromptReco-v1/AOD#d54f99e8-9d28-4567-921f-edce2a207f00 T0_CH_CERN_Disk /ZeroBias/Run2022C-PromptReco-v1/AOD T1_US_FNAL_MSS PromptReco_Run357100_ZeroBias 1660249466
13 /ZeroBias/Run2022C-PromptReco-v1/AOD#d54f99e8-9d28-4567-921f-edce2a207f00 T0_CH_CERN_Disk /ZeroBias/Run2022C-PromptReco-v1/AOD T1_US_FNAL_Disk PromptReco_Run357101_ZeroBias 1660249466
13 /ZeroBias/Run2022C-PromptReco-v1/AOD#d54f99e8-9d28-4567-921f-edce2a207f00 T0_CH_CERN_Disk /ZeroBias/Run2022C-PromptReco-v1/AOD T1_US_FNAL_MSS PromptReco_Run357101_ZeroBias 1660249466
9 /ZeroBias/Run2022C-PromptReco-v1/AOD#d54f99e8-9d28-4567-921f-edce2a207f00 T0_CH_CERN_Disk /ZeroBias/Run2022C-PromptReco-v1/AOD T1_US_FNAL_Disk PromptReco_Run357102_ZeroBias 1660249466
9 /ZeroBias/Run2022C-PromptReco-v1/AOD#d54f99e8-9d28-4567-921f-edce2a207f00 T0_CH_CERN_Disk /ZeroBias/Run2022C-PromptReco-v1/AOD T1_US_FNAL_MSS PromptReco_Run357102_ZeroBias 1660249466
1 /ZeroBias/Run2022C-PromptReco-v1/AOD#d54f99e8-9d28-4567-921f-edce2a207f00 T0_CH_CERN_Disk /ZeroBias/Run2022C-PromptReco-v1/AOD T1_US_FNAL_Disk PromptReco_Run357106_ZeroBias 1660249466
1 /ZeroBias/Run2022C-PromptReco-v1/AOD#d54f99e8-9d28-4567-921f-edce2a207f00 T0_CH_CERN_Disk /ZeroBias/Run2022C-PromptReco-v1/AOD T1_US_FNAL_MSS PromptReco_Run357106_ZeroBias 1660249466
2 /ZeroBias/Run2022C-PromptReco-v1/AOD#d54f99e8-9d28-4567-921f-edce2a207f00 T0_CH_CERN_Disk /ZeroBias/Run2022C-PromptReco-v1/AOD T1_US_FNAL_Disk PromptReco_Run357112_ZeroBias 1660249466
2 /ZeroBias/Run2022C-PromptReco-v1/AOD#d54f99e8-9d28-4567-921f-edce2a207f00 T0_CH_CERN_Disk /ZeroBias/Run2022C-PromptReco-v1/AOD T1_US_FNAL_MSS PromptReco_Run357112_ZeroBias 1660249466
18 rows selected.
It is obvious that this block would keep entering the output of the original query every time a PromptReco completes, until the original query eventually returns exactly the 18 records shown above.
And indeed, the expected behavior is achieved once the field dbsbuffer_workflow.name, which further splits the groups, is removed. Here is the query without the extra splitting and also without the constraint that all the associated workflows be completed:
SELECT count(*),
       dbsbuffer_block.blockname,
       dbsbuffer_location.pnn,
       dbsbuffer_dataset.path,
       dbsbuffer_dataset_subscription.site,
       dbsbuffer_block.create_time
FROM dbsbuffer_dataset_subscription
INNER JOIN dbsbuffer_dataset ON
    dbsbuffer_dataset.id = dbsbuffer_dataset_subscription.dataset_id
INNER JOIN dbsbuffer_block ON
    dbsbuffer_block.dataset_id = dbsbuffer_dataset_subscription.dataset_id
INNER JOIN dbsbuffer_file ON
    dbsbuffer_file.block_id = dbsbuffer_block.id
INNER JOIN dbsbuffer_workflow ON
    dbsbuffer_workflow.id = dbsbuffer_file.workflow
INNER JOIN dbsbuffer_location ON
    dbsbuffer_location.id = dbsbuffer_block.location
WHERE dbsbuffer_dataset_subscription.delete_blocks = 1
  AND dbsbuffer_dataset_subscription.subscribed = 1
  AND dbsbuffer_block.status = 'Closed'
  AND dbsbuffer_block.deleted = 0
  AND dbsbuffer_block.blockname = '/ZeroBias/Run2022C-PromptReco-v1/AOD#d54f99e8-9d28-4567-921f-edce2a207f00'
GROUP BY dbsbuffer_block.blockname,
         dbsbuffer_location.pnn,
         dbsbuffer_dataset.path,
         dbsbuffer_dataset_subscription.site,
         dbsbuffer_block.create_time;
COUNT(*) BLOCKNAME PNN PATH SITE CREATE_TIME
---------- ------------------------------------------------------------------------------------------------------------------------ --------------- ---------------------------------------- ------------------ -----------
201 /ZeroBias/Run2022C-PromptReco-v1/AOD#d54f99e8-9d28-4567-921f-edce2a207f00 T0_CH_CERN_Disk /ZeroBias/Run2022C-PromptReco-v1/AOD T1_US_FNAL_Disk 1660249466
201 /ZeroBias/Run2022C-PromptReco-v1/AOD#d54f99e8-9d28-4567-921f-edce2a207f00 T0_CH_CERN_Disk /ZeroBias/Run2022C-PromptReco-v1/AOD T1_US_FNAL_MSS 1660249466
And here is the result if we apply the constraint for workflow completion.
SELECT count(*),
       dbsbuffer_block.blockname,
       dbsbuffer_location.pnn,
       dbsbuffer_dataset.path,
       dbsbuffer_dataset_subscription.site,
       dbsbuffer_block.create_time
FROM dbsbuffer_dataset_subscription
INNER JOIN dbsbuffer_dataset ON
    dbsbuffer_dataset.id = dbsbuffer_dataset_subscription.dataset_id
INNER JOIN dbsbuffer_block ON
    dbsbuffer_block.dataset_id = dbsbuffer_dataset_subscription.dataset_id
INNER JOIN dbsbuffer_file ON
    dbsbuffer_file.block_id = dbsbuffer_block.id
INNER JOIN dbsbuffer_workflow ON
    dbsbuffer_workflow.id = dbsbuffer_file.workflow
INNER JOIN dbsbuffer_location ON
    dbsbuffer_location.id = dbsbuffer_block.location
WHERE dbsbuffer_dataset_subscription.delete_blocks = 1
  AND dbsbuffer_dataset_subscription.subscribed = 1
  AND dbsbuffer_block.status = 'Closed'
  AND dbsbuffer_block.deleted = 0
  AND dbsbuffer_block.blockname = '/ZeroBias/Run2022C-PromptReco-v1/AOD#d54f99e8-9d28-4567-921f-edce2a207f00'
GROUP BY dbsbuffer_block.blockname,
         dbsbuffer_location.pnn,
         dbsbuffer_dataset.path,
         dbsbuffer_dataset_subscription.site,
         dbsbuffer_block.create_time
HAVING COUNT(*) = SUM(dbsbuffer_workflow.completed);
no rows selected
It returns no rows as expected, because there are still workflows associated with this block which are not completed.
I am afraid the same faulty logic applies to the extra granularity added by the multiple sites, just as it does with the multiple associated workflows:
dictResults = {}
for record in listResults:
    # Populates results dict and adds all workflows of the same block to a single record
    blockname = record['blockname']
Please use camelCase for the Python variables.
for record in listResults:
    # Populates results dict and adds all workflows of the same block to a single record
    blockname = record['blockname']
    if blockname in dictResults:
In general, the DAO must return the list of results as it comes from the database, and no extra data reformatting should be needed. It seems to me that if this extra aggregation by block name is needed here in the Python code, then something is not quite working as we expect in the GROUP BY clause of the SQL query.
We can do this aggregation outside of the DAO. Also, we may be able to modify the query so it ignores blocks with workflows that are not completed, but that would probably be redundant with the deletableWorkflows query.
Thanks @germanfgv! We discussed the problem at length and tested multiple times with you, and we now know in detail what is going on. I'd suggest we continue as it is. Also, since this DAO is used in a single place in the WMCore code, my previous remark that we should avoid the aggregation in the DAO and leave it to the component would only make the code at the component less readable. So let's proceed with what we already have. Just add the comments, as we discussed during our meeting, mentioning the background conditions we rely on for that logic to work, so that we remember those small but extremely important details. Thanks!
blockname = record['blockname']
if blockname in dictResults:
    dictResults[blockname]['workflowNames'].add(record['name'])
    dictResults[blockname]['sites'].add(record['site'])
IIUC this line here would just add the site to the set of sites of an already existing block name, skipping the values of the rest of the columns in the record. I am not sure this is safe. Are we not losing information here?
The other columns are block name, location, dataset, and creation time. All of those are the same for each block, regardless of the subscription in the record.
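For the record, a minimal sketch of the full aggregation under discussion, with that assumption spelled out in comments (the sample records and the else branch are illustrative, not the exact WMCore code):

# Assumed shape of the DAO results: one record per (block, workflow, site),
# where pnn, path and create_time are identical for records of the same block.
listResults = [
    {'blockname': '/A/B/AOD#1', 'pnn': 'T0_CH_CERN_Disk', 'path': '/A/B/AOD',
     'site': 'T1_US_FNAL_Disk', 'name': 'PromptReco_Run357080_ZeroBias',
     'create_time': 1660249466},
    {'blockname': '/A/B/AOD#1', 'pnn': 'T0_CH_CERN_Disk', 'path': '/A/B/AOD',
     'site': 'T1_US_FNAL_MSS', 'name': 'PromptReco_Run357081_ZeroBias',
     'create_time': 1660249466},
]

dictResults = {}
for record in listResults:
    blockName = record['blockname']
    if blockName in dictResults:
        # Only workflow name and site vary between records of the same block;
        # pnn, path and create_time are identical, so dropping them is safe.
        dictResults[blockName]['workflowNames'].add(record['name'])
        dictResults[blockName]['sites'].add(record['site'])
    else:
        # First record seen for this block defines the shared columns
        dictResults[blockName] = {'location': record['pnn'],
                                  'dataset': record['path'],
                                  'createTime': record['create_time'],
                                  'workflowNames': {record['name']},
                                  'sites': {record['site']}}

print(dictResults['/A/B/AOD#1']['workflowNames'])  # both workflows collapsed into one set
print(dictResults['/A/B/AOD#1']['sites'])          # both sites collapsed into one set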
As above, please capture this little detail in a comment in the code.
# Populates results dict and adds all workflows of the same block to a single record
blockname = record['blockname']
if blockname in dictResults:
    dictResults[blockname]['workflowNames'].add(record['name'])
Same comment as below.
Thanks @germanfgv, please squash those commits into a single one so we can merge it.
@germanfgv @todor-ivanov thanks for following this up to the end. In the future, please keep in mind that:
Hi @amaltaro, this must have been my fault. We agreed with @germanfgv that for this PR he should skip the squash step, because no unittest-related commits were needed; instead, I was supposed to push the button.
Oh, looking at the master commit history, it indeed did the job and there is only 1 commit (I had looked only at the PR itself).
This actually is a good lesson not to touch this button anymore (I will put some red tape on my screen right where it is :) ).
Fixes #11244
Status
Ready
Description
Turns the list of blocks into a dictionary to avoid duplicate records for a single block. Turns 'workflowname' into a set of all the workflows that generated the block.
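A minimal before/after illustration of that reshaping (hypothetical values; field names follow the snippets reviewed above):

# Before: one record per (block, workflow, site) combination.
listResults = [
    {'blockname': '/A/B/AOD#1', 'name': 'wf_1', 'site': 'T1_US_FNAL_Disk'},
    {'blockname': '/A/B/AOD#1', 'name': 'wf_2', 'site': 'T1_US_FNAL_MSS'},
]

# After: a single entry per block, with workflows and sites collected in sets.
dictResults = {
    '/A/B/AOD#1': {'workflowNames': {'wf_1', 'wf_2'},
                   'sites': {'T1_US_FNAL_Disk', 'T1_US_FNAL_MSS'}},
}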
Is it backward compatible (if not, which system it affects?)
YES
Related PRs
None
External dependencies / deployment changes
None