Fix JobAccountant missing parent dbsbuffer file #9997

Merged
merged 2 commits into from
Dec 8, 2020

Conversation

amaltaro
Contributor

@amaltaro amaltaro commented Oct 17, 2020

Fixes #9456

Status

In development

Description

From the GH issue, there are cases where an ACDC collection has the incorrect information regarding the parent files.
This PR provides two levels of protection (the root cause is still not identified, though it is likely in the ErrorHandler component):

  • when inserting files into WMBS (from the ACDC collection), go through all the files listed in the parents parameter, and only add those that will actually be referenced as a parent
  • when searching for parent files in JobAccountant, in addition to checking the merged flag, also check the lfn name
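
A minimal sketch of the two checks listed above, not the actual WMCore code. The record layout (dictionaries with `lfn`, `merged` and `parents` keys), the function names and the `/store/unmerged/` prefix constant are assumptions made for illustration; the real logic lives in JobAccountant and WMBSHelper and may differ.

```python
# Hypothetical illustration only: names and record layout are assumptions,
# not the WMCore API.
UNMERGED_PREFIX = "/store/unmerged/"  # assumed LFN prefix of unmerged files


def is_trackable_parent(parent):
    """True when the record is flagged as merged AND its LFN looks merged."""
    lfn = parent.get("lfn") or ""
    flagged_merged = bool(parent.get("merged"))
    return flagged_merged and bool(lfn) and not lfn.startswith(UNMERGED_PREFIX)


def parents_to_insert(acdc_file):
    """Keep only the parents that pass the check above, without duplicates."""
    seen = set()
    kept = []
    for parent in acdc_file.get("parents", []):
        if is_trackable_parent(parent) and parent["lfn"] not in seen:
            seen.add(parent["lfn"])
            kept.append(parent)
    return kept


# Example ACDC-like record with one good parent and three that get dropped.
acdc_file = {
    "lfn": "/store/unmerged/child.root",
    "parents": [
        {"lfn": "/store/data/parent_a.root", "merged": True},
        {"lfn": "/store/unmerged/parent_b.root", "merged": True},   # bad LFN
        {"lfn": "/store/data/parent_c.root", "merged": False},      # bad flag
        {"lfn": None, "merged": True},                              # no LFN
    ],
}
kept_lfns = [p["lfn"] for p in parents_to_insert(acdc_file)]
print(kept_lfns)
```

Checking both the flag and the LFN means a record with an inconsistent `merged` flag is still rejected, which is the kind of bad ACDC input described in the issue.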

UPDATE:
Second commit will ensure that we only provide parent files that have been merged (thus, only parent files that are needed in the execution of the ACDC workflow, to track the output parentage).

Is it backward compatible (if not, which system it affects?)

yes

Related PRs

none

External dependencies / deployment changes

none

@cmsdmwmbot

Jenkins results:

  • Unit tests: succeeded
  • Pylint check: succeeded
    • 3 warnings
    • 7 comments to review
  • Pycodestyle check: succeeded
    • 2 comments to review
  • Python3 compatibility checks: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/10539/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Unit tests: succeeded
  • Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 3 warnings
    • 7 comments to review
  • Pycodestyle check: succeeded
    • 2 comments to review
  • Python3 compatibility checks: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/10540/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Unit tests: succeeded
  • Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 3 warnings
    • 7 comments to review
  • Pycodestyle check: succeeded
    • 2 comments to review
  • Python3 compatibility checks: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/10541/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Unit tests: succeeded
  • Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 3 warnings
    • 7 comments to review
  • Pycodestyle check: succeeded
    • 2 comments to review
  • Python3 compatibility checks: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/10542/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Unit tests: succeeded
  • Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 3 warnings
    • 7 comments to review
  • Pycodestyle check: succeeded
    • 2 comments to review
  • Python3 compatibility checks: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/10543/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Unit tests: succeeded
  • Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 5 warnings
    • 26 comments to review
  • Pycodestyle check: succeeded
    • 4 comments to review
  • Python3 compatibility checks: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/10544/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor Author

Current patch resolves this issue at the JobAccountant level (extra protection for the parent identification) and in WMBSHelper (scanning the whole parents list and only adding to WMBS what's really needed).
Time to remove my debugging...

update WMBSHelper logic; add check for store/unmerged in addition to merged

remove debugging statements
@cmsdmwmbot

Jenkins results:

  • Unit tests: succeeded
  • Pylint check: succeeded
    • 5 warnings
    • 26 comments to review
  • Pycodestyle check: succeeded
    • 4 comments to review
  • Python3 compatibility checks: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/10545/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Unit tests: succeeded
  • Pylint check: succeeded
    • 5 warnings
    • 26 comments to review
  • Pycodestyle check: succeeded
    • 4 comments to review
  • Python3 compatibility checks: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/10546/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Unit tests: succeeded
  • Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 5 warnings
    • 29 comments to review
  • Pycodestyle check: succeeded
    • 5 comments to review
  • Python3 compatibility checks: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/10547/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor Author

First commit has been applied to submit7, which is now running without any issues.
Before this patch gets merged into master, though, it needs to be carefully validated (normal workflows, ACDCs, MySQL backend, Oracle backend).

@amaltaro
Contributor Author

This still needs thorough testing, but perhaps you can already spot any mistakes here, Todor.

Contributor

@todor-ivanov todor-ivanov left a comment

Looks good to me

@amaltaro
Contributor Author

amaltaro commented Dec 3, 2020

This is likely a good candidate to go into the upcoming WMAgent release, such that we can avoid such nasty bugs in the future. Will spend some more time validating it today/tomorrow.

@amaltaro
Contributor Author

amaltaro commented Dec 8, 2020

I have run many tests with this patch and everything seems to be working fine, including the parentage data uploaded to the ACDC server and what gets inserted into DBSBuffer when acquiring ACDC elements.

There is one possible problem, though, with TaskChains that have >= 3 consecutive tasks with KeepOutput=False. The problem shows up both with and without this patch, so I am moving on with this one.
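
For anyone wanting to check whether a given request falls into that category, here is a rough sketch. It assumes the request is a dictionary with a `TaskChain` task count and `Task1`, `Task2`, ... sub-dictionaries, and that a missing `KeepOutput` counts as True; the helper name is made up for this comment.

```python
def longest_keep_output_false_run(request):
    """Length of the longest run of consecutive tasks with KeepOutput=False."""
    longest = 0
    current = 0
    for idx in range(1, request.get("TaskChain", 0) + 1):
        task = request.get("Task%d" % idx, {})
        # Assumption: a task without an explicit KeepOutput keeps its output.
        if task.get("KeepOutput", True) is False:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


# Hypothetical 4-task request: Task2, Task3 and Task4 drop their output.
request = {
    "TaskChain": 4,
    "Task1": {"TaskName": "GenSim", "KeepOutput": True},
    "Task2": {"TaskName": "Digi", "KeepOutput": False},
    "Task3": {"TaskName": "Reco", "KeepOutput": False},
    "Task4": {"TaskName": "Mini", "KeepOutput": False},
}
affected = longest_keep_output_false_run(request) >= 3
print(affected)
```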

I'm still discussing what exactly the expected behaviour is for data parentage in such cases, and will open a new GH issue if needed.

@amaltaro amaltaro merged commit b329945 into dmwm:master Dec 8, 2020
@amaltaro
Contributor Author

amaltaro commented Dec 8, 2020

Sigh... should have squashed those commits!

Development

Successfully merging this pull request may close these issues.

JobAccountant crashing with "Column 'parent' cannot be null"
3 participants