
Add extra protection for T0 to prevent archival of workflows having blocks not yet deleted #11169

Conversation

todor-ivanov
Contributor

@todor-ivanov todor-ivanov commented May 31, 2022

Fixes #11154

Status

Ready

Description

This PR adds an additional check of the deleted flag for all blocks per workflow to CleanCouchPoller, in order to prevent archiving workflows that still have undeleted blocks and thereby leaving orphaned blocks behind. The additional check applies only to T0 agents and should not affect the standard production system.

Is it backward compatible (if not, which system it affects?)

YES

Related PRs

None

External dependencies / deployment changes

None

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 2 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 1 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 1 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13259/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 4 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 6 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 6 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13260/artifact/artifacts/PullRequestReport.html

@todor-ivanov
Contributor Author

Hi @amaltaro @vkuznet, the solution proposed here is about to be tested (tomorrow, hopefully), but if you have a minute or two please take a quick look at this PR.

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 4 tests deleted
    • 3 changes in unstable tests
  • Python3 Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 6 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 36 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13261/artifact/artifacts/PullRequestReport.html

@todor-ivanov todor-ivanov force-pushed the bugfix_T0_RaceConditionArchiveDelay_fix-11154 branch from 17f2cf9 to ddc5b0b on June 1, 2022 05:53
@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 20 warnings and errors that must be fixed
    • 15 warnings
    • 99 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 36 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13262/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 21 warnings and errors that must be fixed
    • 15 warnings
    • 98 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 36 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13263/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
  • Python3 Pylint check: failed
    • 20 warnings and errors that must be fixed
    • 15 warnings
    • 98 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 36 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13264/artifact/artifacts/PullRequestReport.html

@todor-ivanov todor-ivanov force-pushed the bugfix_T0_RaceConditionArchiveDelay_fix-11154 branch 2 times, most recently from d8b247c to a115a17, on June 1, 2022 07:30
@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
  • Python3 Pylint check: failed
    • 20 warnings and errors that must be fixed
    • 15 warnings
    • 98 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 36 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13265/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 20 warnings and errors that must be fixed
    • 15 warnings
    • 98 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 36 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13266/artifact/artifacts/PullRequestReport.html

@todor-ivanov todor-ivanov force-pushed the bugfix_T0_RaceConditionArchiveDelay_fix-11154 branch from a115a17 to f4ef897 on June 1, 2022 07:59
@todor-ivanov todor-ivanov changed the title from "Add GetDeletedBlocksByWorkflow DAO to WMBS." to "Add extra protection for T0 to prevent archival of workflows having blocks not yet deleted" on Jun 1, 2022
@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 20 warnings and errors that must be fixed
    • 15 warnings
    • 98 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 36 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13267/artifact/artifacts/PullRequestReport.html

@todor-ivanov todor-ivanov force-pushed the bugfix_T0_RaceConditionArchiveDelay_fix-11154 branch from f4ef897 to 1001d1c on June 1, 2022 08:27
@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 20 warnings and errors that must be fixed
    • 15 warnings
    • 98 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 36 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13268/artifact/artifacts/PullRequestReport.html

Contributor

@vkuznet vkuznet left a comment

This code illustrates a general issue with how WMCore handles database data. self.dbi.processData returns a list of records from the database; this is the original copy of the data, kept in memory. Then, on line 88, you already convert that list to a dict, which is another copy of the same data. Then some code logic, e.g. lines 99-105, creates yet another copy of the data. A single class therefore requires roughly 3 times the memory of the data it represents, since the entire process copies the same data over and over into different representations. I understand the legacy of this approach, but I want to point out, with a concrete example, how and why the memory footprint increases in WMCore code handling database data. To avoid this overhead, a streaming approach could be adopted, which requires the following steps (a rough sketch follows the list):

  • the dbi interface should return a generator
  • the upstream classes should not create intermediate objects, but rather provide a processing pipeline which consumes the generator, processes it, and yields a generator again to the upstream code
  • I assume there is another class which consumes this class's data and yields it back to the client. Therefore, if this class provided generators, the data from ORACLE would be streamed to the client, and this and similar classes would only provide a processing pipeline. In such an approach, someone has to care about the duration of the processing pipeline, since it will hold the ORACLE connection. Here that is not an issue, since the data from dbi are converted to a list and the ORACLE connection is therefore closed, but it leads to a larger memory footprint.
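
To make the streaming idea concrete, here is a minimal, hypothetical sketch of such a pipeline in plain Python. The function and record names are made up for illustration and do not come from WMCore:

    def rowStream(rawRows):
        """Stand-in for a dbi layer that yields records one by one instead of returning a list."""
        for row in rawRows:
            yield row

    def toDicts(rows):
        """Convert each record to a dict lazily, without building another full list in memory."""
        for name, deleted in rows:
            yield {"workflowName": name, "deleted": deleted}

    def onlyUndeleted(records):
        """Filter step of the pipeline; still a generator."""
        return (rec for rec in records if rec["deleted"] == 0)

    # Example usage with fake data standing in for database records:
    raw = [("wf_A", 0), ("wf_B", 1)]
    for rec in onlyUndeleted(toDicts(rowStream(raw))):
        print(rec)  # only wf_A is printed; no intermediate list is ever materialised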

@todor-ivanov
Contributor Author

Here is how the tests are going. What we did with @germanfgv was (a sketch of the corresponding configuration values follows the list):

  • We had an agent with a replay already running and all the PromptReco workflows paused.
  • Increased blockDeletionDelayHours to 100h
  • Decreased archiveDelayHours to 1h
  • Restarted the PromptReco workflows
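
For reference, a minimal sketch of the agent configuration fragment this corresponds to. The component sections are my assumptions; the parameter names come from this thread (archiveDelayHours and blockDeletionDelayHours from the list above, useReqMgrForCompletionCheck from the code change discussed further down):

    from WMCore.Configuration import Configuration

    config = Configuration()
    config.component_("TaskArchiver")
    config.TaskArchiver.archiveDelayHours = 1                # archive finished workflows after 1 hour
    config.TaskArchiver.useReqMgrForCompletionCheck = False  # T0 agents run without ReqMgr completion checks
    config.component_("RucioInjector")
    config.RucioInjector.blockDeletionDelayHours = 100       # keep output blocks for 100 hours before deletion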

Here is the output from GetDeletedBlocksByWorkflow DAO at that stage:

[{'blocksDeleted': ['/RPCMonitor/Tier0_REPLAY_2022-v425/RAW#fbc62f6f-45c6-430d-a555-66950179e1b0'],
  'blocksNotDeleted': [],
  'workflowName': 'Repack_Run351572_StreamRPCMON_Tier0_REPLAY_2022_ID220531142559_v425_220531_1426'},
 {'blocksDeleted': [],
  'blocksNotDeleted': ['/Cosmics/Tier0_REPLAY_2022-v425/RAW#e8bec12f-3f00-44a5-ae9e-b666a7ace691',
                       '/NoBPTX/Tier0_REPLAY_2022-v425/RAW#4d343363-8dca-48ef-934f-8f85f7658a35',
                       '/HcalNZS/Tier0_REPLAY_2022-v425/RAW#0723e46e-5eb2-4605-96c5-e8128135acef',
                       '/HLTPhysics/Tier0_REPLAY_2022-v425/RAW#edde056d-4050-41a8-a02d-c6d6f827d9b2',
                       '/MinimumBias/Tier0_REPLAY_2022-v425/RAW#e9167ce5-94c3-48be-87eb-48785907df9f'],
  'workflowName': 'Repack_Run351572_StreamPhysics_Tier0_REPLAY_2022_ID220531142559_v425_220531_1426'},
 {'blocksDeleted': ['/StreamExpressCosmics/Tier0_REPLAY_2022-Express-v425/DQMIO#d2819fcf-bac2-4fbb-80cf-11b5bc9eeb00',
                    '/StreamExpressCosmics/Tier0_REPLAY_2022-TkAlCosmics0T-Express-v425/ALCARECO#34693f19-334f-45e3-9bcc-71011ed5141e',
                    '/StreamExpressCosmics/Tier0_REPLAY_2022-SiStripPCLHistos-Express-v425/ALCARECO#1a539e31-5b51-4bba-ade2-6e227fafe577',
                    '/StreamExpressCosmics/Tier0_REPLAY_2022-SiPixelCalZeroBias-Express-v425/ALCARECO#9d9c5727-436f-4007-8c2b-38d2f265b07d',
                    '/ExpressCosmics/Tier0_REPLAY_2022-Express-v425/FEVT#e1659fff-a2be-49ca-aafc-21fe4f80664d',
                    '/StreamExpressCosmics/Tier0_REPLAY_2022-SiStripCalZeroBias-Express-v425/ALCARECO#4cb4a954-a96c-43b8-9f2e-2ec5f92c06dd',
                    '/StreamExpressCosmics/Tier0_REPLAY_2022-PromptCalibProdSiStrip-Express-v425/ALCAPROMPT#ae2bc211-4092-4e0b-b26a-681532757405'],
  'blocksNotDeleted': [],
  'workflowName': 'Express_Run351572_StreamExpressCosmics_Tier0_REPLAY_2022_ID220531142559_v425_220531_1426'},
 {'blocksDeleted': ['/TestEnablesEcalHcal/Tier0_REPLAY_2022-Express-v425/RAW#8f57d4e5-1a6c-48ae-9fd4-5633687a23ab',
                    '/StreamCalibration/Tier0_REPLAY_2022-EcalTestPulsesRaw-Express-v425/ALCARECO#0afc42d0-09b9-4675-a9d7-35107dcb9608',
                    '/StreamCalibration/Tier0_REPLAY_2022-PromptCalibProdEcalPedestals-Express-v425/ALCAPROMPT#f3e73f7a-b916-436f-b5b1-7852a7578fff',
                    '/StreamCalibration/Tier0_REPLAY_2022-Express-v425/DQMIO#04991f93-9f8e-41b7-82b7-bbefbb4c1322'],
  'blocksNotDeleted': [],
  'workflowName': 'Express_Run351572_StreamCalibration_Tier0_REPLAY_2022_ID220531142559_v425_220531_1426'},
 {'blocksDeleted': ['/L1Accept/Tier0_REPLAY_2022-v425/RAW#976c74e9-873c-40fc-9d98-66f4d173bdd5'],
  'blocksNotDeleted': [],
  'workflowName': 'Repack_Run351572_StreamNanoDST_Tier0_REPLAY_2022_ID220531142559_v425_220531_1426'}]

We can clearly see 3 repack workflows:

  • Repack_Run351572_StreamRPCMON_Tier0_REPLAY_2022_ID220531142559_v425_220531_1426
  • Repack_Run351572_StreamPhysics_Tier0_REPLAY_2022_ID220531142559_v425_220531_1426
  • Repack_Run351572_StreamNanoDST_Tier0_REPLAY_2022_ID220531142559_v425_220531_1426

Two of them have no dependent workflows and had all their blocks deleted, while the one on StreamPhysics has all the remaining PromptReco workflows depending on it. Here is the output of the SQL query searching for all workflows with child workflows not yet completed:

SQL>     SELECT DISTINCT ww.name FROM wmbs_workflow ww
  2      INNER JOIN wmbs_subscription ws ON
  3          ws.workflow = ww.id
  4      INNER JOIN wmbs_fileset wfs ON
  5          wfs.id = ws.fileset
  6      INNER JOIN wmbs_fileset_files wfsf ON
  7          wfsf.fileset = wfs.id
  8      INNER JOIN wmbs_file_parent wfp ON
  9          wfp.parent = wfsf.fileid
 10      INNER JOIN wmbs_fileset_files child_fileset ON
 11          child_fileset.fileid = wfp.child
 12      INNER JOIN wmbs_subscription child_subscription ON
 13          child_subscription.fileset = child_fileset.fileset
 14      WHERE child_subscription.finished = 0;


NAME
----------------------------------------------------------------------------------------------------------------------------------
Repack_Run351572_StreamPhysics_Tier0_REPLAY_2022_ID220531142559_v425_220531_1426

The first two Repack workflows were deleted and archived, but the third one still persists in WMBS.

!!! NOTE: !!!
We must stress that this change does not prevent the workflow from being archived and CouchDB from being cleaned once archiveDelayHours has passed for the workflow in question. That is because the CouchDB cleaning process pays no attention to whether a workflow is deletable. Here: https://github.com/dmwm/WMCore/blob/master/src/python/WMComponent/TaskArchiver/CleanCouchPoller.py#L224-L256 we never query the database to check whether the workflow still has a dependent (child) workflow in the system; only the archiveDelayHours value is taken into consideration. So if the agent is configured such that not enough time is left for the whole chain of workflows to complete before archival, the monitoring data in CouchDB and WMStats will vanish regardless. What the current change does prevent is the workflow being purged from WMBS before all of its blocks have been deleted. So all of the Repack workflows did vanish from T0 WMStats, but Repack_Run351572_StreamPhysics_Tier0_REPLAY_2022_ID220531142559_v425_220531_1426 persisted in WMBS because it is protected by its dependent workflows, and all of its blocks were also visibly not deleted. Again, just as expected.
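
To make that last point concrete, here is a rough, self-contained sketch of the WMBS-side filtering this change introduces. The DAO output format, the helper function, and the key names are assumptions for illustration only, not the merged code:

    import logging

    def filterWorkflowsWithUndeletedBlocks(deletablewfs, undeletedCounts):
        """
        Drop from the deletable-workflow map any workflow that still has undeleted blocks.
        'deletablewfs' is assumed to be a dict keyed by workflow name, and 'undeletedCounts'
        the per-workflow count of blocks not yet deleted (as returned by the count DAO
        discussed later in this thread), e.g. [{"name": "...", "count": 5}].
        """
        blocked = {row["name"] for row in undeletedCounts if row["count"] > 0}
        for wfName in list(deletablewfs):
            if wfName in blocked:
                logging.info("Removing workflow: %s from the list of deletable workflows. "
                             "It still has blocks NOT deleted.", wfName)
                deletablewfs.pop(wfName)
        return deletablewfs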

Upon restarting the PromptReco workflows, the picture from the DAO output changed to:

[{'blocksDeleted': [],
  'blocksNotDeleted': ['/HLTPhysics/Tier0_REPLAY_2022-TkAlMinBias-PromptReco-v425/ALCARECO#8031d7fa-4150-4fa2-b97d-6d0ed901ec1b',
                       '/HLTPhysics/Tier0_REPLAY_2022-LogError-PromptReco-v425/RAW-RECO#a1137639-94b6-46b0-8c8b-ba56ad23bdb5',
                       '/HLTPhysics/Tier0_REPLAY_2022-PromptReco-v425/AOD#8a5a843f-ffe2-4f6e-9c22-a82675582eb5',
                       '/HLTPhysics/Tier0_REPLAY_2022-PromptReco-v425/MINIAOD#52293b0b-8e79-46d6-ad79-2ace07bd1a7b',
                       '/HLTPhysics/Tier0_REPLAY_2022-LogErrorMonitor-PromptReco-v425/USER#55b17a92-f959-489c-9928-676fbebbff7b',
                       '/HLTPhysics/Tier0_REPLAY_2022-PromptReco-v425/DQMIO#471a8230-e726-47ca-ba02-a10808bf13d2'],
  'workflowName': 'PromptReco_Run351572_HLTPhysics_Tier0_REPLAY_2022_ID220531142559_v425_220531_1430'},
 {'blocksDeleted': [],
  'blocksNotDeleted': ['/MinimumBias/Tier0_REPLAY_2022-PromptReco-v425/DQMIO#3d481b60-cc2e-4e2d-acd1-d9a09eb1d3f5',
                       '/MinimumBias/Tier0_REPLAY_2022-HcalCalHBHEMuonFilter-PromptReco-v425/ALCARECO#f8240c68-be17-43d3-9729-fbc6e16aa8ab',
                       '/MinimumBias/Tier0_REPLAY_2022-HcalCalIterativePhiSym-PromptReco-v425/ALCARECO#24cc827d-83c3-44e6-92bd-b65fa0445135',
                       '/MinimumBias/Tier0_REPLAY_2022-TkAlMinBias-PromptReco-v425/ALCARECO#0faa9978-4b31-43f1-b98f-06334f8c0c14',
                       '/MinimumBias/Tier0_REPLAY_2022-PromptReco-v425/AOD#7973d5a0-55a1-4674-bab6-207254616c6c',
                       '/MinimumBias/Tier0_REPLAY_2022-SiStripCalZeroBias-PromptReco-v425/ALCARECO#0b7f6632-ccef-4f41-82b4-9e260bdeca53',
                       '/MinimumBias/Tier0_REPLAY_2022-PromptReco-v425/MINIAOD#0f9e425d-3e98-4041-a4aa-37adc9db8f34',
                       '/MinimumBias/Tier0_REPLAY_2022-SiStripCalMinBias-PromptReco-v425/ALCARECO#40f7e18b-b6f0-4165-b38e-8565eaa3c368',
                       '/MinimumBias/Tier0_REPLAY_2022-HcalCalIsoTrkFilter-PromptReco-v425/ALCARECO#360deaee-adaa-4924-92f9-1ef13c45901a',
                       '/MinimumBias/Tier0_REPLAY_2022-HcalCalHO-PromptReco-v425/ALCARECO#127815b3-352d-4e7b-8229-e6eb62825491'],
  'workflowName': 'PromptReco_Run351572_MinimumBias_Tier0_REPLAY_2022_ID220531142559_v425_220531_1430'},
 {'blocksDeleted': [],
  'blocksNotDeleted': ['/NoBPTX/Tier0_REPLAY_2022-v425/RAW#4d343363-8dca-48ef-934f-8f85f7658a35',
                       '/HLTPhysics/Tier0_REPLAY_2022-v425/RAW#edde056d-4050-41a8-a02d-c6d6f827d9b2',
                       '/Cosmics/Tier0_REPLAY_2022-v425/RAW#e8bec12f-3f00-44a5-ae9e-b666a7ace691',
                       '/HcalNZS/Tier0_REPLAY_2022-v425/RAW#0723e46e-5eb2-4605-96c5-e8128135acef',
                       '/MinimumBias/Tier0_REPLAY_2022-v425/RAW#e9167ce5-94c3-48be-87eb-48785907df9f'],
  'workflowName': 'Repack_Run351572_StreamPhysics_Tier0_REPLAY_2022_ID220531142559_v425_220531_1426'},
 {'blocksDeleted': [],
  'blocksNotDeleted': ['/HcalNZS/Tier0_REPLAY_2022-LogError-PromptReco-v425/RAW-RECO#a1b84aa7-b529-480f-9c99-52cca5e6a97d',
                       '/HcalNZS/Tier0_REPLAY_2022-PromptReco-v425/DQMIO#25170a25-34ee-4804-814f-497edf41e8ed',
                       '/HcalNZS/Tier0_REPLAY_2022-PromptReco-v425/MINIAOD#93ae5e7c-8cd9-4325-9319-a423695a1b06',
                       '/HcalNZS/Tier0_REPLAY_2022-LogErrorMonitor-PromptReco-v425/USER#83095a20-ff9b-4e6f-8e6a-dbefd4e6830c',
                       '/HcalNZS/Tier0_REPLAY_2022-PromptReco-v425/AOD#6d532fb2-72f9-46c3-963f-7589ffba540d',
                       '/HcalNZS/Tier0_REPLAY_2022-HcalCalMinBias-PromptReco-v425/ALCARECO#a545e06b-f543-4e10-aaf4-ed3bb5ac5a84'],
  'workflowName': 'PromptReco_Run351572_HcalNZS_Tier0_REPLAY_2022_ID220531142559_v425_220531_1430'}]

So what happens at this stage is that all the PromptReco workflows are Completed, but they are held in the system because the blockDeletionDelayHours for their output blocks has not been exceeded yet. Here is the output from the SQL query for finding all deletable workflows:

SQL> SELECT DISTINCT
  2      wmbs_workflow.name
  3      FROM wmbs_subscription
  4  INNER JOIN wmbs_workflow ON
  5             wmbs_workflow.id = wmbs_subscription.workflow
  6  INNER JOIN (
  7      SELECT name FROM wmbs_workflow
  8      WHERE name NOT IN (SELECT DISTINCT ww.name FROM wmbs_workflow ww
  9                         INNER JOIN wmbs_subscription ws ON
 10                             ws.workflow = ww.id
 11                         WHERE ws.finished=0)
 12      ) complete_workflow ON
 13      complete_workflow.name = wmbs_workflow.name
 14  WHERE wmbs_workflow.name NOT IN (
 15      SELECT DISTINCT ww.name FROM wmbs_workflow ww
 16      INNER JOIN wmbs_subscription ws ON
 17          ws.workflow = ww.id
 18      INNER JOIN wmbs_fileset wfs ON
 19          wfs.id = ws.fileset
 20      INNER JOIN wmbs_fileset_files wfsf ON
 21          wfsf.fileset = wfs.id
 22      INNER JOIN wmbs_file_parent wfp ON
 23          wfp.parent = wfsf.fileid
 24      INNER JOIN wmbs_fileset_files child_fileset ON
 25          child_fileset.fileid = wfp.child
 26      INNER JOIN wmbs_subscription child_subscription ON
 27          child_subscription.fileset = child_fileset.fileset
 28      WHERE child_subscription.finished = 0
 29      );



NAME
----------------------------------------------------------------------------------------------------------------------------------
PromptReco_Run351572_HLTPhysics_Tier0_REPLAY_2022_ID220531142559_v425_220531_1430
PromptReco_Run351572_NoBPTX_Tier0_REPLAY_2022_ID220531142559_v425_220531_1430
PromptReco_Run351572_HcalNZS_Tier0_REPLAY_2022_ID220531142559_v425_220531_1430
PromptReco_Run351572_MinimumBias_Tier0_REPLAY_2022_ID220531142559_v425_220531_1430

So, as can be seen, those are already considered deletable, while the parent Repack is missing. Just as expected.

And here is the log message from TaskArchiver telling us that those workflows are going to be skipped because they still have blocks not yet deleted:

2022-06-01 17:53:36,569:140704658020096:INFO:CleanCouchPoller:Cleaning up wmbs and disk
2022-06-01 17:53:36,588:140704658020096:DEBUG:CleanCouchPoller:Removing workflow: PromptReco_Run351572_NoBPTX_Tier0_REPLAY_2022_ID220531142559_v425_220531_1430 from the list of deletable workflows. It still has blocks NOT deleted.
2022-06-01 17:53:36,588:140704658020096:DEBUG:CleanCouchPoller:Removing workflow: PromptReco_Run351572_HcalNZS_Tier0_REPLAY_2022_ID220531142559_v425_220531_1430 from the list of deletable workflows. It still has blocks NOT deleted.
2022-06-01 17:53:36,588:140704658020096:DEBUG:CleanCouchPoller:Removing workflow: PromptReco_Run351572_MinimumBias_Tier0_REPLAY_2022_ID220531142559_v425_220531_1430 from the list of deletable workflows. It still has blocks NOT deleted.
2022-06-01 17:53:36,588:140704658020096:DEBUG:CleanCouchPoller:Removing workflow: PromptReco_Run351572_HLTPhysics_Tier0_REPLAY_2022_ID220531142559_v425_220531_1430 from the list of deletable workflows. It still has blocks NOT deleted.

What we are about to do now is decrease the blockDeletionDelayHours to a value below the current lifetime of all the blocks. What we expect to happen is:

  • Once the blocks get deleted, the PromptReco workflows will be purged, and
  • Repack_Run351572_StreamPhysics_Tier0_REPLAY_2022_ID220531142559_v425_220531_1426 will show up in the query for deletable workflows
  • Since the blockDeletionDelayHours is already short enough that even the PromptRecos' timeout has been satisfied, it will be short enough to satisfy the Repack's timeout as well.
  • The Repack will get cleaned immediately once its blocks get deleted by RucioInjector

FYI: @amaltaro @klannon @vkuznet @drkovalskyi @germanfgv @khurtado

Contributor

@amaltaro amaltaro left a comment

Todor, please see some comments in the code.

Doing some grep in the repository, I do not see any mention of dbsbuffer in the WMBS factory, so I'd rather have this new DAO under the DBS3Buffer package. I also have some strong suggestions for improving the DAO.

Valentin has a very good point on how we deal with relational database data, but I'd rather track and address that in a separate issue, in the future when it comes up on our priority list.

@todor-ivanov
Contributor Author

Thanks @amaltaro and @vkuznet for your reviews. I am already working on some of your comments.

Doing some grep in the repository, I do not see any mention of dbsbuffer in the WMBS factory, so I'd rather have this new DAO under the DBS3Buffer package.

I hesitated a lot about putting the DAO in the DBS3Buffer package, but then I had some second thoughts. This DAO has the workflow names as the main selector for the aggregations and basically takes into account whether or not a workflow is present in WMBS, which is a really important detail of this DAO. So in the end I put it in WMBS.

I also have some strong suggestions for improving the DAO.

For this one, I know very well what you are talking about. When working on it, I went down the same path you are suggesting, and I was aware of the price we need to pay for doing part of the aggregations in Python. I tried to give a more detailed explanation in reply to your inline comment. Please take a look, and let's discuss offline if you want, so we can decide how to proceed. Both ways are fine with me, but it is much harder for me to convince myself that one of them gives proper results I can trust.

@amaltaro
Contributor

amaltaro commented Jun 2, 2022

@todor-ivanov I think this query will give the results that you need:

SELECT dbsbuffer_workflow.name, dbsbuffer_block.deleted, COUNT(dbsbuffer_block.blockname)
FROM dbsbuffer_block
INNER JOIN dbsbuffer_file ON
     dbsbuffer_file.block_id = dbsbuffer_block.id
INNER JOIN dbsbuffer_workflow ON
     dbsbuffer_workflow.id = dbsbuffer_file.workflow
WHERE dbsbuffer_block.deleted=0
GROUP BY dbsbuffer_workflow.name, dbsbuffer_block.deleted;

If the query providing block names is going to be used mostly for debugging, then I'd rather not use it in production (where this same query would run every couple of minutes). So feel free to provide it in this PR as well, such that the T0 team can run it if debugging is needed.

@todor-ivanov
Contributor Author

todor-ivanov commented Jun 3, 2022

Thanks @amaltaro for looking again into this.

@todor-ivanov I think this query will give the results that you need:

SELECT dbsbuffer_workflow.name, dbsbuffer_block.deleted, COUNT(dbsbuffer_block.blockname)
FROM dbsbuffer_block
INNER JOIN dbsbuffer_file ON
dbsbuffer_file.block_id = dbsbuffer_block.id
INNER JOIN dbsbuffer_workflow ON
dbsbuffer_workflow.id = dbsbuffer_file.workflow
WHERE dbsbuffer_block.deleted=0
GROUP BY dbsbuffer_workflow.name, dbsbuffer_block.deleted;

Sorry Alan, but I think what we would expect from these counts is not what would be counted in reality. The fact that we do not GROUP BY blockname masks some important details. To prove the point, let's just change the selection requirement to WHERE dbsbuffer_block.deleted=1 so that we count the blocks which are already deleted.

SQL> SELECT count(dbsbuffer_block.blockname), dbsbuffer_workflow.name, dbsbuffer_block.deleted
  2  FROM dbsbuffer_block
  3  INNER JOIN dbsbuffer_file ON
  4       dbsbuffer_file.block_id = dbsbuffer_block.id
  5  INNER JOIN dbsbuffer_workflow ON
  6       dbsbuffer_workflow.id = dbsbuffer_file.workflow
  7  WHERE dbsbuffer_block.deleted=1
  8  GROUP BY dbsbuffer_workflow.name, dbsbuffer_block.deleted;


COUNT(DBSBUFFER_BLOCK.BLOCKNAME) NAME													 DELETED
-------------------------------- ---------------------------------------------------------------------------------------------------- ----------
			       4 Express_Run351572_StreamCalibration_Tier0_REPLAY_2022_ID220531142559_v425_220531_1426			       1
			       1 Repack_Run351572_StreamNanoDST_Tier0_REPLAY_2022_ID220531142559_v425_220531_1426			       1
			       1 Repack_Run351572_StreamRPCMON_Tier0_REPLAY_2022_ID220531142559_v425_220531_1426			       1
			      36 Express_Run351572_StreamExpressCosmics_Tier0_REPLAY_2022_ID220531142559_v425_220531_1426		       1

Look at:

 36 Express_Run351572_StreamExpressCosmics_Tier0_REPLAY_2022_ID220531142559_v425_220531_1426		       1

I would not expect this Express workflow to have 36 blocks. So let's then group by blockname as well:

SQL> SELECT count(dbsbuffer_block.blockname), dbsbuffer_workflow.name, dbsbuffer_block.deleted, dbsbuffer_block.blockname
  2  FROM dbsbuffer_block
  3  INNER JOIN dbsbuffer_file ON
  4       dbsbuffer_file.block_id = dbsbuffer_block.id
  5  INNER JOIN dbsbuffer_workflow ON
  6       dbsbuffer_workflow.id = dbsbuffer_file.workflow
  7  WHERE dbsbuffer_block.deleted=1
  8  GROUP BY dbsbuffer_workflow.name, dbsbuffer_block.deleted, dbsbuffer_block.blockname;


COUNT(DBSBUFFER_BLOCK.BLOCKNAME) NAME													 DELETED BLOCKNAME
-------------------------------- ---------------------------------------------------------------------------------------------------- ---------- ----------------------------------------------------------------------------------------------------
			       1 Repack_Run351572_StreamRPCMON_Tier0_REPLAY_2022_ID220531142559_v425_220531_1426			       1 /RPCMonitor/Tier0_REPLAY_2022-v425/RAW#fbc62f6f-45c6-430d-a555-66950179e1b0
			       5 Express_Run351572_StreamExpressCosmics_Tier0_REPLAY_2022_ID220531142559_v425_220531_1426		       1 /StreamExpressCosmics/Tier0_REPLAY_2022-TkAlCosmics0T-Express-v425/ALCARECO#34693f19-334f-45e3-9bcc-
			       1 Express_Run351572_StreamCalibration_Tier0_REPLAY_2022_ID220531142559_v425_220531_1426			       1 /StreamCalibration/Tier0_REPLAY_2022-PromptCalibProdEcalPedestals-Express-v425/ALCAPROMPT#f3e73f7a-b
			       1 Repack_Run351572_StreamNanoDST_Tier0_REPLAY_2022_ID220531142559_v425_220531_1426			       1 /L1Accept/Tier0_REPLAY_2022-v425/RAW#976c74e9-873c-40fc-9d98-66f4d173bdd5
			       5 Express_Run351572_StreamExpressCosmics_Tier0_REPLAY_2022_ID220531142559_v425_220531_1426		       1 /StreamExpressCosmics/Tier0_REPLAY_2022-SiStripPCLHistos-Express-v425/ALCARECO#1a539e31-5b51-4bba-ad
			       1 Express_Run351572_StreamCalibration_Tier0_REPLAY_2022_ID220531142559_v425_220531_1426			       1 /StreamCalibration/Tier0_REPLAY_2022-EcalTestPulsesRaw-Express-v425/ALCARECO#0afc42d0-09b9-4675-a9d7
			       1 Express_Run351572_StreamCalibration_Tier0_REPLAY_2022_ID220531142559_v425_220531_1426			       1 /StreamCalibration/Tier0_REPLAY_2022-Express-v425/DQMIO#04991f93-9f8e-41b7-82b7-bbefbb4c1322
			       1 Express_Run351572_StreamCalibration_Tier0_REPLAY_2022_ID220531142559_v425_220531_1426			       1 /TestEnablesEcalHcal/Tier0_REPLAY_2022-Express-v425/RAW#8f57d4e5-1a6c-48ae-9fd4-5633687a23ab
			       6 Express_Run351572_StreamExpressCosmics_Tier0_REPLAY_2022_ID220531142559_v425_220531_1426		       1 /ExpressCosmics/Tier0_REPLAY_2022-Express-v425/FEVT#e1659fff-a2be-49ca-aafc-21fe4f80664d
			       5 Express_Run351572_StreamExpressCosmics_Tier0_REPLAY_2022_ID220531142559_v425_220531_1426		       1 /StreamExpressCosmics/Tier0_REPLAY_2022-SiPixelCalZeroBias-Express-v425/ALCARECO#9d9c5727-436f-4007-
			       5 Express_Run351572_StreamExpressCosmics_Tier0_REPLAY_2022_ID220531142559_v425_220531_1426		       1 /StreamExpressCosmics/Tier0_REPLAY_2022-SiStripCalZeroBias-Express-v425/ALCARECO#4cb4a954-a96c-43b8-
			       5 Express_Run351572_StreamExpressCosmics_Tier0_REPLAY_2022_ID220531142559_v425_220531_1426		       1 /StreamExpressCosmics/Tier0_REPLAY_2022-PromptCalibProdSiStrip-Express-v425/ALCAPROMPT#ae2bc211-4092
			       5 Express_Run351572_StreamExpressCosmics_Tier0_REPLAY_2022_ID220531142559_v425_220531_1426		       1 /StreamExpressCosmics/Tier0_REPLAY_2022-Express-v425/DQMIO#d2819fcf-bac2-4fbb-80cf-11b5bc9eeb00

Looking at:

			       5 Express_Run351572_StreamExpressCosmics_Tier0_REPLAY_2022_ID220531142559_v425_220531_1426		       1 /StreamExpressCosmics/Tier0_REPLAY_2022-TkAlCosmics0T-Express-v425/ALCARECO#34693f19-334f-45e3-9bcc-
...
			       5 Express_Run351572_StreamExpressCosmics_Tier0_REPLAY_2022_ID220531142559_v425_220531_1426		       1 /StreamExpressCosmics/Tier0_REPLAY_2022-SiStripPCLHistos-Express-v425/ALCARECO#1a539e31-5b51-4bba-ad
...
			       6 Express_Run351572_StreamExpressCosmics_Tier0_REPLAY_2022_ID220531142559_v425_220531_1426		       1 /ExpressCosmics/Tier0_REPLAY_2022-Express-v425/FEVT#e1659fff-a2be-49ca-aafc-21fe4f80664d
			       5 Express_Run351572_StreamExpressCosmics_Tier0_REPLAY_2022_ID220531142559_v425_220531_1426		       1 /StreamExpressCosmics/Tier0_REPLAY_2022-SiPixelCalZeroBias-Express-v425/ALCARECO#9d9c5727-436f-4007-
			       5 Express_Run351572_StreamExpressCosmics_Tier0_REPLAY_2022_ID220531142559_v425_220531_1426		       1 /StreamExpressCosmics/Tier0_REPLAY_2022-SiStripCalZeroBias-Express-v425/ALCARECO#4cb4a954-a96c-43b8-
			       5 Express_Run351572_StreamExpressCosmics_Tier0_REPLAY_2022_ID220531142559_v425_220531_1426		       1 /StreamExpressCosmics/Tier0_REPLAY_2022-PromptCalibProdSiStrip-Express-v425/ALCAPROMPT#ae2bc211-4092
			       5 Express_Run351572_StreamExpressCosmics_Tier0_REPLAY_2022_ID220531142559_v425_220531_1426		       1 /StreamExpressCosmics/Tier0_REPLAY_2022-Express-v425/DQMIO#d2819fcf-bac2-4fbb-80cf-11b5bc9eeb00

This now clearly shows that the actual number of blocks for this Express workflow is 7, and that some of those entries simply appear in the result more than once. A single block is obviously not deleted 5 or 6 times; the inflated counts come from the number of files in the blocks, because we go through a JOIN with the dbsbuffer_file table.

Now let's go further and add the JOIN with the wmbs_workflow table, which is really needed because we only want to iterate through blocks related to workflows that are still in the system, instead of every single block that has ever been created by the agent.

SQL> SELECT  COUNT(dbsbuffer_block.blockname), dbsbuffer_workflow.name, dbsbuffer_block.deleted
  2  FROM dbsbuffer_block
  3  INNER JOIN dbsbuffer_file ON
  4       dbsbuffer_file.block_id = dbsbuffer_block.id
  5  INNER JOIN dbsbuffer_workflow ON
  6       dbsbuffer_workflow.id = dbsbuffer_file.workflow
  7  INNER JOIN wmbs_workflow ON
  8       dbsbuffer_workflow.name = wmbs_workflow.name
  9  WHERE dbsbuffer_block.deleted=0
 10  GROUP BY dbsbuffer_workflow.name, dbsbuffer_block.deleted;



COUNT(DBSBUFFER_BLOCK.BLOCKNAME) NAME													 DELETED
-------------------------------- ---------------------------------------------------------------------------------------------------- ----------
			     150 PromptReco_Run351572_HLTPhysics_Tier0_REPLAY_2022_ID220531142559_v425_220531_1430			       0
			     248 PromptReco_Run351572_NoBPTX_Tier0_REPLAY_2022_ID220531142559_v425_220531_1430				       0
			     150 PromptReco_Run351572_HcalNZS_Tier0_REPLAY_2022_ID220531142559_v425_220531_1430 			       0
			      85 Repack_Run351572_StreamPhysics_Tier0_REPLAY_2022_ID220531142559_v425_220531_1426			       0
			     370 PromptReco_Run351572_MinimumBias_Tier0_REPLAY_2022_ID220531142559_v425_220531_1430			       0

I still think the only way of doing this is through nested SELECT statements, as I tested in my previous comment. I know the query looks much more cumbersome and heavy, but at least it gives the results that we expect.

@todor-ivanov
Contributor Author

After a long private discussion with @amaltaro about these queries, we agreed on minimizing them as much as possible. Which means:

  • We will drop the INNER JOIN with the wmbs_workflow table and bite the bullet of iterating through the very long list of workflows which is preserved in dbsbuffer even after the workflows have been archived and WMBS cleaned. (Still to be checked, though, if, where and when the dbsbuffer_block table is cleaned.)
  • We will not query for workflows with both deleted and undeleted blocks; we will only query for workflows with undeleted blocks. This avoids repeating the same SQL query with only one requirement changed (from DELETED = 0 to DELETED = 1) in the DAO and doing a LEFT OUTER JOIN on the results of both, which would be quite an expensive query.
  • We will simply count only distinct block names in the COUNT statement and ignore repetitions in the resulting query.

So the final one we agreed upon is:

SQL> SELECT dbsbuffer_workflow.name, dbsbuffer_block.deleted, COUNT(DISTINCT dbsbuffer_block.blockname)
  2  FROM dbsbuffer_block
  3  INNER JOIN dbsbuffer_file ON
  4       dbsbuffer_file.block_id = dbsbuffer_block.id
  5  INNER JOIN dbsbuffer_workflow ON
  6       dbsbuffer_workflow.id = dbsbuffer_file.workflow
  7  WHERE dbsbuffer_block.deleted=0
  8  GROUP BY dbsbuffer_workflow.name, dbsbuffer_block.deleted
  9  ;


NAME													DELETED COUNT(DISTINCTDBSBUFFER_BLOCK.BLOCKNAME)
---------------------------------------------------------------------------------------------------- ---------- ----------------------------------------
PromptReco_Run351572_HcalNZS_Tier0_REPLAY_2022_ID220531142559_v425_220531_1430				      0 				       6
PromptReco_Run351572_NoBPTX_Tier0_REPLAY_2022_ID220531142559_v425_220531_1430				      0 				       8
PromptReco_Run351572_HLTPhysics_Tier0_REPLAY_2022_ID220531142559_v425_220531_1430			      0 				       6
PromptReco_Run351572_MinimumBias_Tier0_REPLAY_2022_ID220531142559_v425_220531_1430			      0 				      10
Repack_Run351572_StreamPhysics_Tier0_REPLAY_2022_ID220531142559_v425_220531_1426			      0 				       5

It was cross-checked by temporarily implementing the inner join with the wmbs_workflow table and then comparing the result with the list of blocks returned by the currently implemented GetDeletedBlocksByWorkflow DAO, which we plan to keep for debugging purposes only, because it serves as a good tool for visualizing the actual lists of workflows and blocks known to WMBS. I am working on the new DAO now, which should be called CountUndeletedBlocksByWorkflow.
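
For orientation, here is a rough sketch of how such a DBS3Buffer DAO might look, built directly on the query agreed above. It assumes WMCore's usual DBFormatter pattern and is an illustration only, not the merged implementation:

    from WMCore.Database.DBFormatter import DBFormatter

    class CountUndeletedBlocksByWorkflow(DBFormatter):
        """Per workflow, count the distinct blocks not yet deleted (sketch)."""
        sql = """SELECT dbsbuffer_workflow.name, dbsbuffer_block.deleted,
                        COUNT(DISTINCT dbsbuffer_block.blockname) AS blockcount
                 FROM dbsbuffer_block
                 INNER JOIN dbsbuffer_file ON dbsbuffer_file.block_id = dbsbuffer_block.id
                 INNER JOIN dbsbuffer_workflow ON dbsbuffer_workflow.id = dbsbuffer_file.workflow
                 WHERE dbsbuffer_block.deleted = 0
                 GROUP BY dbsbuffer_workflow.name, dbsbuffer_block.deleted"""

        def execute(self, conn=None, transaction=False):
            # run the query and turn the rows into a list of dicts
            results = self.dbi.processData(self.sql, conn=conn, transaction=transaction)
            return self.formatDict(results)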

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 3 changes in unstable tests
  • Python3 Pylint check: failed
    • 20 warnings and errors that must be fixed
    • 15 warnings
    • 97 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 37 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13275/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 20 warnings and errors that must be fixed
    • 15 warnings
    • 97 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 37 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13276/artifact/artifacts/PullRequestReport.html

@todor-ivanov todor-ivanov requested review from vkuznet and amaltaro June 4, 2022 09:36
@todor-ivanov
Contributor Author

Thanks for your reviews @vkuznet @amaltaro. I tried to simplify and adapt the code change in this PR according to all the comments and offline discussions we had, so please feel free to take a look at the latest version again at your convenience.
I am currently working on the few minor pylint changes, which I intentionally kept for last because otherwise I would have had trouble merging and squashing.

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 20 warnings and errors that must be fixed
    • 15 warnings
    • 97 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 37 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13277/artifact/artifacts/PullRequestReport.html

@todor-ivanov todor-ivanov force-pushed the bugfix_T0_RaceConditionArchiveDelay_fix-11154 branch from e96364a to 3f7576e on June 6, 2022 08:14
@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 22 new failures
    • 3 changes in unstable tests
  • Python3 Pylint check: failed
    • 20 warnings and errors that must be fixed
    • 15 warnings
    • 97 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 37 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13278/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 22 new failures
  • Python3 Pylint check: succeeded
    • 15 warnings
    • 97 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 30 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13279/artifact/artifacts/PullRequestReport.html

@todor-ivanov
Contributor Author

Test this please

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 22 new failures
    • 1 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 15 warnings
    • 97 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 30 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13280/artifact/artifacts/PullRequestReport.html

Contributor

@vkuznet vkuznet left a comment

@todor-ivanov, could you please resolve all the conversations which are done, to simplify the review process.

Contributor

@amaltaro amaltaro left a comment

@todor-ivanov those unit tests are likely failing due to a problem/intervention on the Rucio integration server, but please talk to Eric to see whether we can have them succeeding before this PR gets merged.

I also left a few comments along the code for your consideration.

@@ -380,6 +396,16 @@ def deleteWorkflowFromWMBSAndDisk(self):
deletableWorkflowsDAO = self.daoFactory(classname="Workflow.GetDeletableWorkflows")
deletablewfs = deletableWorkflowsDAO.execute()

# For T0 subtract the workflows which are not having all their blocks deleted yet:
if not self.useReqMgrForCompletionCheck:
undeletedBlocksByWorkflowDAO = self.dbsDaoFactory(classname="CountUndeletedBlocksByWorkflow")
Contributor

I'd suggest making it an object instance and initializing it in the setup method.

Contributor Author

I do not see a reason to spread these changes across many places, since the rule of thumb for the whole CleanCouchPoller module (at least as far as I can see) is:

  • create the DAO instance where (in the method) it is needed
  • execute the DAO in place (right after instantiating it)

I know the practice is different in other modules, but we would not gain anything in terms of CPU or otherwise by moving it to an object instance, and this DAO is used here and here only. Keeping the whole thing grouped in one place seems more logical to me and makes it easier to see what is used where. (A schematic comparison of the two patterns is sketched below.)
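
For readers following the discussion, a rough, hypothetical sketch of the two instantiation patterns being debated; the class and method names are made up for illustration:

    class PollerPatternA:
        """Instantiate the DAO inside the method that uses it (the approach argued for above)."""
        def __init__(self, dbsDaoFactory):
            self.dbsDaoFactory = dbsDaoFactory

        def pollCycle(self):
            dao = self.dbsDaoFactory(classname="CountUndeletedBlocksByWorkflow")
            return dao.execute()

    class PollerPatternB:
        """Build the DAO once in a setup step and reuse it on every polling cycle (the review suggestion)."""
        def __init__(self, dbsDaoFactory):
            self.dbsDaoFactory = dbsDaoFactory
            self.countDAO = None

        def setup(self):
            self.countDAO = self.dbsDaoFactory(classname="CountUndeletedBlocksByWorkflow")

        def pollCycle(self):
            return self.countDAO.execute()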

Contributor

It does take a few extra CPU cycles to instantiate it on every cycle, but I have no strong preference here.

src/python/WMComponent/TaskArchiver/CleanCouchPoller.py (outdated review thread; resolved)
@todor-ivanov
Contributor Author

Thanks @amaltaro and @vkuznet for your reviews. I did address all of your comments. Please take another look at your convenience.

@todor-ivanov todor-ivanov requested review from amaltaro and vkuznet June 7, 2022 11:36
@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 15 warnings
    • 92 comments to review
  • Pylint py3k check: failed
    • 1 warnings
  • Pycodestyle check: succeeded
    • 30 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13282/artifact/artifacts/PullRequestReport.html

Contributor

@amaltaro amaltaro left a comment

It looks good to me, Todor. Can you please squash these commits accordingly?

@@ -380,6 +396,16 @@ def deleteWorkflowFromWMBSAndDisk(self):
deletableWorkflowsDAO = self.daoFactory(classname="Workflow.GetDeletableWorkflows")
deletablewfs = deletableWorkflowsDAO.execute()

# For T0 subtract the workflows which are not having all their blocks deleted yet:
if not self.useReqMgrForCompletionCheck:
undeletedBlocksByWorkflowDAO = self.dbsDaoFactory(classname="CountUndeletedBlocksByWorkflow")
Contributor

It does take a few extra CPU cycles to instantiate it on every cycle, but I have no strong preference here.

@todor-ivanov
Contributor Author

Thanks @amaltaro, I am squashing it now.
Just an update: this morning we reran another replay with @germanfgv, using the configuration combination that could lead to the race condition, and the full sequence of events in the agent was closely monitored and checked. We also exercised decreasing the blockDeletionDelayHours below the current workflows' lifetime and watched all workflows being cleaned from the WMBS tables.

…locks not yet deleted.

Add GetDeletedBlocksByWorkflow DAO to WMBS.

Aggregate all results in the DAO per workflowName

Add extra protection for T0 at cleanCouchPoller

Typo

Update docstrings and log messages

Remove redundant statements:

Remove redundant range() start argument

Remove redundant pass statement

Remove redundant DISTINCT statement

Typo

Add CountUndeletedBlocksByWorkflow DAO && Decrease execution complexity in workflows with undeleted blocks check.

Change log level to info.

remove keynames remapping from GetDeletedBlocksByWorkflow DAO

Pylint fixes.

Review fixes
@todor-ivanov todor-ivanov force-pushed the bugfix_T0_RaceConditionArchiveDelay_fix-11154 branch from df62783 to 71d225c on June 8, 2022 13:59
@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 2 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 15 warnings
    • 92 comments to review
  • Pylint py3k check: failed
    • 1 warnings
  • Pycodestyle check: succeeded
    • 30 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13292/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor

amaltaro commented Jun 8, 2022

Thanks Todor. Whenever it's ready, please do remove the "Do not merge" tag. I know it is, hence I removed it myself.
I am merging it now because it only touches WMAgent code, otherwise we would have to wait for tomorrow's production upgrade and final tag.

@amaltaro amaltaro merged commit 2db755c into dmwm:master Jun 8, 2022
@todor-ivanov
Contributor Author

thanks @amaltaro

Successfully merging this pull request may close these issues.

T0: Race condition with misaligned configuration values for archiveDelayHours vs. blockDeletionDelayHours