Provide uniform way to handle DBS server errors #11176

Merged
merged 3 commits into from
Jun 21, 2022

Conversation

vkuznet
Contributor

@vkuznet vkuznet commented Jun 9, 2022

Fixes #11167

Status

in development

Description

Provide a parser function for DBS Go-based server exceptions. If it successfully parses the exception message, it returns a concise message (the DBS Go server reason); otherwise it returns the original exception message.
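
For illustration only, the behaviour described above could be sketched roughly as follows. This is not the code from the PR diff: the function name `parse_dbs_error_body` is hypothetical, and the JSON layout (a dict, or a list of dicts, with an `error` dict that carries a `reason` string) is an assumption made for the example.

```python
import json


def parse_dbs_error_body(body):
    """Return a concise reason from a server error body, else the body itself.

    ASSUMPTION (not confirmed by this PR page): the body is a JSON document
    holding a dict, or a list of dicts, with an "error" dict that carries a
    "reason" string. The function name is hypothetical as well.
    """
    try:
        data = json.loads(body)
    except (TypeError, ValueError):
        # Not JSON at all (e.g. a plain-text proxy message): keep the original.
        return body
    # If the server wrapped the record in a list, look at the first entry.
    if isinstance(data, list) and data:
        data = data[0]
    if isinstance(data, dict) and isinstance(data.get("error"), dict):
        return data["error"].get("reason", body)
    # Valid JSON, but not the layout we assumed: keep the original.
    return body


# Made-up bodies, for demonstration only.
print(parse_dbs_error_body('[{"error": {"reason": "concurrency error"}}]'))
print(parse_dbs_error_body("Proxy Error, this is a mock proxy error."))
```

The fallback to the original string is what keeps such a parser backward compatible: callers that already handle the raw message keep working when the body is not in the assumed format.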

Is it backward compatible (if not, which system it affects?)

YES

Related PRs

<If it's a follow up work; or porting a fix from a different branch, please mention them here.>

External dependencies / deployment changes

<Does it require deployment changes? Does it rely on third-party libraries?>

@vkuznet
Contributor Author

vkuznet commented Jun 9, 2022

@amaltaro, in my spare time I put together a fix for #11167 in this PR. I'll add the corresponding unit test in a second commit. Meanwhile, please have a look at the proposed changes. This is not yet a request for an official review, but rather a proposal.

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 tests no longer failing
    • 1 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 9 warnings and errors that must be fixed
    • 7 warnings
    • 86 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 19 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13295/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 tests no longer failing
    • 1 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 9 warnings and errors that must be fixed
    • 7 warnings
    • 86 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 19 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13296/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 tests no longer failing
    • 1 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 8 warnings and errors that must be fixed
    • 7 warnings
    • 86 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 19 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13297/artifact/artifacts/PullRequestReport.html

Contributor

@amaltaro amaltaro left a comment

Yes, I think you are heading in the right direction. In addition to the comments made along the code, can you please define a meaningful PR title (and commit message as well, but that can be done once you squash your src/ code changes)?

:return: either (parsed) concise exception message or original exception
"""
try:
data = json.loads(exc)
Contributor

are you sure the dbs3-client returns a json object here?

Contributor Author

the HTTP error body comes from dbs3-client which gets it from DBS server

Contributor

My point is, when this response (error) is passed back from the client to our application, is it provided as a JSON object or as a python dictionary?

src/python/WMComponent/DBS3Buffer/DBSUploadPoller.py Outdated
@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 8 warnings and errors that must be fixed
    • 7 warnings
    • 86 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 19 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13303/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 8 warnings and errors that must be fixed
    • 7 warnings
    • 86 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 19 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13305/artifact/artifacts/PullRequestReport.html

Contributor

@amaltaro amaltaro left a comment

Valentin, please find my follow up question along the code. I also left other comments for your consideration. Thanks

:return: either (parsed) concise exception message or original exception
"""
try:
data = json.loads(exc)
Contributor

My point is, when this response (error) is passed back from the client to our application, is it provided as a JSON object or as a python dictionary?

src/python/WMComponent/DBS3Buffer/DBSUploadPoller.py Outdated
@amaltaro
Contributor

And please, fix the PR title.

@vkuznet vkuznet changed the title from "Fix issue #11167" to "Provide uniform way to handle DBS server errors" on Jun 14, 2022
@vkuznet
Contributor Author

vkuznet commented Jun 14, 2022

Alan, regarding your #11176 (comment): I don't have control over the clients. The data type on the wire is bytes, which the clients then read as a string. The string may hold a JSON representation, and that's why we read it using json.loads, whose Python documentation reads:
Deserialize ``s`` (a ``str``, ``bytes`` or ``bytearray`` instance containing a JSON document) to a Python object.
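
As a small standalone check of the point being discussed (the payload below is made up, not a real DBS response), `json.loads` accepts both `str` and `bytes` and yields the same Python object, while passing an already-parsed dictionary raises `TypeError`:

```python
import json

# Made-up payload, used only to compare the input types json.loads accepts.
payload_str = '{"error": {"reason": "concurrency error"}}'
payload_bytes = payload_str.encode("utf-8")

from_str = json.loads(payload_str)      # str input
from_bytes = json.loads(payload_bytes)  # bytes input (Python 3.6+)

# Both inputs deserialize to the same Python dictionary.
same_result = from_str == from_bytes

# An already-parsed dict is not a JSON document, so json.loads rejects it.
try:
    json.loads(from_str)
    dict_accepted = True
except TypeError:
    dict_accepted = False

print(same_result, dict_accepted)
```

So if a client handed back a Python dictionary rather than a string, a parser built on `json.loads` would need to catch `TypeError` (or check the type first) to fall back gracefully.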

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 tests added
    • 3 changes in unstable tests
  • Python3 Pylint check: failed
    • 8 warnings and errors that must be fixed
    • 7 warnings
    • 86 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 19 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13316/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 9 warnings and errors that must be fixed
    • 7 warnings
    • 86 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 19 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13317/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 8 warnings and errors that must be fixed
    • 7 warnings
    • 86 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 19 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13318/artifact/artifacts/PullRequestReport.html

@vkuznet vkuznet requested review from amaltaro and todor-ivanov June 14, 2022 17:26
Contributor

@amaltaro amaltaro left a comment

And it just occurred to me that that line checking for proxy error:
https://github.com/dmwm/WMCore/pull/11176/files#diff-dc5aa561c9a43023153ce0a627108a41a95e7bae0a9b16b474398296f23b92b7R102

is extremely dangerous and we should remove it. Since you are touching this code, could you please remove that elif 'Proxy Error' in exString: as well? Otherwise, I can create another issue and get it fixed as well.

src/python/WMComponent/DBS3Buffer/DBSUploadPoller.py Outdated
@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 8 warnings and errors that must be fixed
    • 7 warnings
    • 85 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 19 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13319/artifact/artifacts/PullRequestReport.html

@amaltaro amaltaro removed the DBS label Jun 16, 2022
Contributor

@amaltaro amaltaro left a comment

Valentin, these changes look good to me and I'd like to include them in the next stable release. Can you please squash those commits accordingly? Thanks

@amaltaro
Contributor

BTW, I assume you are no longer working on it. In that case, please remove the "Work in progress" label. @vkuznet

@amaltaro
Contributor

Yet another update: I just backported it to the wmagent branch, in this PR: #11184

If you find further src/* changes to be applied, please separate that in a new commit because I will then have to cherry-pick that as well (but the best would be to avoid any further changes, if possible).

Contributor

@todor-ivanov todor-ivanov left a comment

Thanks @vkuznet
The code looks good to me.

logging.exception(msg)
logging.debug("block info: %s \n", block)
results.put({'name': name, 'success': "error", 'error': msg})

return


def parseDBSException(exBodyString):
Contributor

Hi again @vkuznet. Sorry for coming back to this after approving the PR, but I started wondering whether that function would be better off as a separate method in a dedicated DBSException class. One possible place could be here: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Services/DBS/DBSErrors.py or an equivalent one in WMComponent/DBS3Buffer

Just an idea though, not requesting any changes here. If not accepted, the current approach also works perfectly fine in my view.

Contributor Author

@todor-ivanov, if the code is independent (as it is now), it is better to keep it as a separate function and not make it part of any class. But I don't have a strong opinion about which module it should reside in. So far we use it only in DBSUploadPoller and nowhere else. As such, it is better for it to reside there and be treated as a local function. But if there is a use case for it in different parts of WMCore, then yes, DBSErrors.py may be a better place to hold this function.

Contributor

Thanks @vkuznet
Agreed :) let's move on with it as it is right now. And if we find a use case, we can always move it to the general exception class.

@vkuznet
Contributor Author

vkuznet commented Jun 16, 2022

@amaltaro squash is done, and I separated code vs unit test commits.

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 336 new failures
    • 14 tests deleted
    • 1 tests no longer failing
    • 1 tests added
    • 15 changes in unstable tests
  • Python3 Pylint check: failed
    • 8 warnings and errors that must be fixed
    • 7 warnings
    • 85 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 19 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13342/artifact/artifacts/PullRequestReport.html

@vkuznet
Contributor Author

vkuznet commented Jun 16, 2022

@amaltaro I'm not sure what happened with the Jenkins unit tests: all regular tests have failed, but the one I added succeeded, e.g.

WMComponent_t.DBS3Buffer_t.DBSUpload_t.DBSUploadTest:testParseDBSException was added with status success

I only squashed the changes and I doubt that is the issue. Please advise what to do about this.

@amaltaro
Contributor

test this please

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 8 warnings and errors that must be fixed
    • 7 warnings
    • 85 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 19 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13344/artifact/artifacts/PullRequestReport.html

@vkuznet
Contributor Author

vkuznet commented Jun 16, 2022

@amaltaro I think that with the new changes other DBS tests may need adjustments. I looked at the failed test test/python/WMComponent_t/DBS3Buffer_t/DBSUpload_t.py line 620, and the log says:
root: ERROR: Error trying to process block /Cosmics/TropicalSeason1655397413-v1/RAW#51c2c521-8ee4-442d-8052-5a20e519b853 through DBS. Error: Proxy Error, this is a mock proxy error.
which I think is the correct error message, since we removed the Proxy Error if-block. So my point is that you have at least one unit test which should be corrected to handle the Proxy Error, or we may need to put back this if-block, since the test code relies on it and in that case it returns some result.

@amaltaro
Contributor

test this please

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 4 tests no longer failing
    • 1 tests added
    • 4 changes in unstable tests
  • Python3 Pylint check: failed
    • 8 warnings and errors that must be fixed
    • 7 warnings
    • 85 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 19 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13350/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor

@vkuznet Valentin, given that the block error handling is embedded in the DBS3Upload algorithm execution, we cannot properly parse such injection "errors" in the unit tests.

Given that we consider Proxy Error an error that fails the block injection (meaning it will be retried in the next cycle), like almost all of the other exceptions, I would suggest changing our emulator to stop causing random proxy errors. What do you think about removing these lines:
https://github.com/dmwm/WMCore/blob/master/src/python/WMQuality/Emulators/DBSClient/DBS3API.py#L56-L59
?

I tested with the wmcore container and it seems to do the trick.

@vkuznet
Contributor Author

vkuznet commented Jun 17, 2022

Of course we can remove those lines, but it means that the unit tests will no longer test the proxy error. In my view we need to test all possible errors coming from CMSWEB, including the proxy error, and the correct approach is to adjust the unit tests so that they do not throw the exception but actually see the proxy error. It really depends on your view of what unit tests should do. I do not feel that I should make this decision. Please say explicitly which route to take.

@amaltaro
Contributor

I said it in my previous comment. Having exhaustive tests for every single exception we might get from CMSWEB would be ideal, but it's not practical.

If you want to do the right thing though, we should have:
a) one test for the normal and successful component cycle
b) one test where a block fails to be injected (with an error that we consider a successful request, like block already exists)
c) one test where a block fails to be injected (with a hard exception, causing the block injection to be retried in the next cycle)

This is probably much more work than what you originally planned though, so I am happy with the first suggestion as well (removing those few lines from the emulator).

@vkuznet
Contributor Author

vkuznet commented Jun 17, 2022

In this case I suggest opening a different ticket with your suggestions and keeping it around. In this one, I commented out the Proxy Error mocking. Let's see how the Jenkins tests come out.

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 2 new failures
    • 1 tests no longer failing
    • 1 tests added
    • 5 changes in unstable tests
  • Python3 Pylint check: failed
    • 12 warnings and errors that must be fixed
    • 8 warnings
    • 89 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 20 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13356/artifact/artifacts/PullRequestReport.html

@vkuznet
Contributor Author

vkuznet commented Jun 18, 2022

test this please

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 2 new failures
    • 1 tests no longer failing
    • 1 tests added
  • Python3 Pylint check: failed
    • 12 warnings and errors that must be fixed
    • 8 warnings
    • 89 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 20 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13357/artifact/artifacts/PullRequestReport.html

@vkuznet
Contributor Author

vkuznet commented Jun 20, 2022

test this please

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 12 warnings and errors that must be fixed
    • 8 warnings
    • 89 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 20 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13363/artifact/artifacts/PullRequestReport.html

@vkuznet
Contributor Author

vkuznet commented Jun 21, 2022

test this please

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 tests no longer failing
    • 1 tests added
    • 3 changes in unstable tests
  • Python3 Pylint check: failed
    • 12 warnings and errors that must be fixed
    • 8 warnings
    • 89 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 20 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13365/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor

I do not think the testThrottling unit test failure is related to the changes provided in here.

@amaltaro
Contributor

test this please

@vkuznet
Contributor Author

vkuznet commented Jun 21, 2022

@amaltaro something is really weird in Jenkins: if you look back at the unit tests, you'll see that each time I run test this please, a different unit test fails. For instance, in https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13357/artifact/artifacts/PullRequestReport.html it is

WMCore_t.Database_t.CMSCouch_t.CMSCouchTest:testRevisionHandling changed from success to failure
WMCore_t.WorkQueue_t.WorkQueue_t.WorkQueueTest:testGlobalDatasetSplitting changed from success to failure

in https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13363/artifact/artifacts/PullRequestReport.html it is

WMCore_t.WMSpec_t.StdSpecs_t.Resubmission_t.ResubmissionTests:testCustomSimpleTaskChainACDC changed from success to error

and, in your test it is testThrottling. This is a clear indication that we have an issue with the instability of unit tests in Jenkins and that the failed tests are not related to this PR.

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 2 new failures
    • 3 tests no longer failing
    • 1 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 12 warnings and errors that must be fixed
    • 8 warnings
    • 89 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 20 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13366/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 2 tests no longer failing
    • 1 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 12 warnings and errors that must be fixed
    • 8 warnings
    • 89 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 20 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13367/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor

Thanks Valentin. The testThrottling failure seems to be persisting though, but I will get it fixed afterwards.

@amaltaro amaltaro merged commit 1740fb1 into dmwm:master Jun 21, 2022
Development

Successfully merging this pull request may close these issues.

Better handle for DBS3Upload errors due to "Error: concurrency error"
4 participants