Trap SSL_connect error and Service Unavailable in DBSUploadPoller #10252

Merged
merged 2 commits into from
Feb 4, 2021

@amaltaro amaltaro commented Feb 3, 2021

Fixes #10000

Status

ready

Description

To avoid having DBS3Upload crash whenever there are instabilities in the CMSWEB cluster (as is happening now with the frontends), catch the following 2 exceptions/errors and gracefully skip the component cycle:

  • "Service Temporarily Unavailable"
  • "OpenSSL SSL_connect"

The second error might be too generic, but if we have any serious SSL issues in the agent, everything else should break as well.
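
For illustration only, the "catch and skip the cycle" behaviour described above could be sketched as follows. This is not the code in this PR: `run_cycle`, `upload_fn` and `PASSIVE_ERROR_MSGS` are hypothetical names, and only the two message strings come from the description.

```python
import logging

# Hypothetical list; the two strings are the ones named in the PR description.
PASSIVE_ERROR_MSGS = ["Service Temporarily Unavailable", "OpenSSL SSL_connect"]


def run_cycle(upload_fn):
    """Run one component cycle; return True on success, False if skipped."""
    try:
        upload_fn()
    except Exception as exc:
        # Assumption: transient problems are only recognisable by message text.
        if any(msg in str(exc) for msg in PASSIVE_ERROR_MSGS):
            logging.warning("Transient error, skipping this cycle: %s", exc)
            return False
        raise  # anything else is still treated as a hard failure
    return True
```

With this shape, a 503 from the frontends costs one skipped cycle, while any unrecognised exception still propagates and stops the component.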

In addition, I removed 2 functions that were apparently not used anywhere in WMCore or in T0.

Is it backward compatible (if not, which system it affects?)

yes

Related PRs

none

External dependencies / deployment changes

none

@cmsdmwmbot

Jenkins results:

  • Unit tests: succeeded
    • 1 tests added
  • Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 7 warnings
    • 35 comments to review
  • Pylint py3k check: failed
    • 0 errors and warnings that should be fixed
    • 1 warnings
    • 0 comments to review
  • Pycodestyle check: succeeded
    • 17 comments to review
  • Python3 compatibility checks: failed
    • fails python3 compatibility test

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/11117/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Unit tests: succeeded
    • 1 tests added
  • Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 7 warnings
    • 35 comments to review
  • Pylint py3k check: succeeded
    • 0 errors and warnings that should be fixed
    • 0 warnings
    • 0 comments to review
  • Pycodestyle check: succeeded
    • 17 comments to review
  • Python3 compatibility checks: failed
    • fails python3 compatibility test

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/11118/artifact/artifacts/PullRequestReport.html

@todor-ivanov
Contributor

Thanks @amaltaro! Is it still in a not-tested state, or did you just forget to update the comment? I am asking because, if the latter, I should proceed with the review.

@amaltaro
Contributor Author

amaltaro commented Feb 3, 2021

@todor-ivanov it's been tested via unit tests. I'm now running the final test before it gets added to the final new production release.
So, please feel free to review it.

Contributor

@todor-ivanov todor-ivanov left a comment

Hi @amaltaro. In general the code has the right structure, but there are two things I am not comfortable with. Please take a look at the inline comments.

        break
    elif passiveMsg in str(exceptionObj):
        break
else:
Contributor

I think you are missing indentation here.

Contributor Author

This is actually a for-else loop. It looks weird to my eyes, but it's a valid construction. If we complete the for loop without breaking out of it, the else block gets executed (which means, none of our error messages matched the actual http error).
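
A minimal, self-contained sketch of the for-else control flow described above, using a hypothetical helper `matches_known_error` rather than the function under review:

```python
def matches_known_error(text, known_msgs):
    """Return True if any entry of known_msgs is a substring of text."""
    for msg in known_msgs:
        if msg in text:
            break  # a match: the else block below is skipped
    else:
        # Reached only when the loop ran to completion without a break,
        # i.e. none of the known messages matched.
        return False
    return True
```

Note that the `else` lines up with the `for`, not with the `if`/`elif` inside the loop body, which is what makes it easy to misread as a missing indentation.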

Contributor

@todor-ivanov todor-ivanov Feb 4, 2021

Following the run of elif statements in the previous lines, it is easy to fool oneself into reading it as the concluding branch of the if structure above, just missing its indentation. I obviously fell into that trap :)

for passiveMsg in passiveErrorMsg:
    if passiveMsg in excReason:
        break
    elif passiveMsg in str(exceptionObj):
Contributor

Matching a string inside the whole string-converted exception object, only to identify the type of error, seems error-prone to me. I am not sure whether this is common practice, though. Maybe it is just me not being familiar with it.

Contributor Author

I don't think it's good practice, but given that we do not have different exception types (or return codes) for these exceptions, the only way we can prevent the component from crashing on transient issues in the system is by parsing the error message returned by the HTTP call.

Contributor

@todor-ivanov todor-ivanov Feb 4, 2021

Aren't all of those expected to be visible inside the exception reason? And am I correct in observing that the current list of errors covers not only WMCore exceptions but also exceptions from libraries used all over the place, like SSL? That would mean we are not in full control of where the error string appears.

Contributor Author

The exception reason contains the string with the error message, provided our request response comes back with a header.reason attribute:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Services/pycurl_manager.py#L285

However, I think there could be other problems when executing curl.perform() that would raise an exception right there, thus potentially not having the reason attribute.
Reading the traceback I posted in the GH issue, this is exactly what happened two days ago.
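
To make the two cases concrete, here is a small sketch of why both the reason and the full exception text get inspected. `FakeHTTPError` is a made-up stand-in for an error built from a response header, and the plain `RuntimeError` stands in for an exception raised before any response exists; neither is a real WMCore or pycurl class.

```python
class FakeHTTPError(Exception):
    """Hypothetical stand-in for an HTTP error that carries a reason."""

    def __init__(self, status, reason):
        super().__init__("HTTP Error %s" % status)
        self.reason = reason


def error_texts(exc):
    """Return (reason, full text); reason is '' when the attribute is absent."""
    return getattr(exc, "reason", ""), str(exc)


# Case 1: the response came back, so the message lives in the reason.
with_reason = error_texts(FakeHTTPError(503, "Service Temporarily Unavailable"))
# Case 2: the call failed outright, so only str(exc) carries the message.
without_reason = error_texts(RuntimeError("OpenSSL SSL_connect: SSL_ERROR_SYSCALL"))
```

In the first case the match has to be found in the reason; in the second the reason is empty and only the string form of the exception can match.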

excReason = getattr(exceptionObj, 'reason', '')
errorMsg = 'Failed to fetch parentage map from WMStats, skipping this cycle. '
errorMsg += 'Exception: {}. Reason: {}. Error: {}. '.format(type(exceptionObj).__name__,
                                                           excReason, str(exceptionObj))
Contributor

I am not convinced it is a good idea to take this message out of DBSUploadPoller and put it here. I would keep the messages related to particular exceptions inside the object instance of the above class, while this function looks like a decent candidate for a generic function that tests for passive vs. hard errors coming from a module. I can already foresee a place or two where I could benefit from it. Putting this message here constrains its usage to one particular type of call, and at the same time removes significant information from the class it was taken from.

Contributor Author

To make it more generic, we would have to move it to a different place and accept the list of errors as a parameter.
Moving the error message back to the class implementation is feasible, but I'm trying to avoid code duplication, and so I was just taking advantage of exception parsing that is already done in this function. Let me see if it makes sense to change it.

Contributor

Passing the list of errors as a parameter is not a bad idea indeed.
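
One possible shape for the generic variant discussed in this thread, as a rough sketch only: the checker takes the list of messages as a parameter and returns a boolean, and each caller keeps its own wording for the log message. `is_passive_error`, `UploadPollerSketch` and `describe` are hypothetical names, not what was pushed to this PR.

```python
def is_passive_error(exc, passive_msgs):
    """Return True if any message appears in the reason or the exception text."""
    reason = str(getattr(exc, "reason", ""))
    text = str(exc)
    return any(msg in reason or msg in text for msg in passive_msgs)


class UploadPollerSketch:
    """Hypothetical caller that owns both its error list and its message."""

    PASSIVE_MSGS = ("Service Temporarily Unavailable", "OpenSSL SSL_connect")

    def describe(self, exc):
        # The message wording stays in the class, not in the generic helper.
        if is_passive_error(exc, self.PASSIVE_MSGS):
            return "Transient error, skipping this cycle: %s" % exc
        return "Hard error: %s" % exc
```

Another component could then reuse `is_passive_error` with a different list of messages without inheriting a message that only makes sense for DBS uploads.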

@amaltaro
Contributor Author

amaltaro commented Feb 4, 2021

@todor-ivanov Todor, can you please have another look? I have pushed commits 3 and 4 with the error message change that you suggested. If you prefer this approach, I will then squash my commits. Thanks

Contributor

@todor-ivanov todor-ivanov left a comment

Things look much better now, with the error message back in the place where it belongs.
Thanks @amaltaro, all the rest looks fine to me.

@cmsdmwmbot

Jenkins results:

  • Unit tests: succeeded
    • 1 tests added
  • Pylint check: succeeded
    • 7 warnings
    • 35 comments to review
  • Pylint py3k check: succeeded
    • 0 errors and warnings that should be fixed
    • 0 warnings
    • 0 comments to review
  • Pycodestyle check: succeeded
    • 17 comments to review
  • Python3 compatibility checks: failed
    • fails python3 compatibility test

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/11121/artifact/artifacts/PullRequestReport.html

fix pylint

Deal with the error message inside the DBSUploadPoller class
update unit tests with error message inside the class
@cmsdmwmbot

Jenkins results:

  • Unit tests: succeeded
    • 1 tests added
  • Pylint check: succeeded
    • 7 warnings
    • 35 comments to review
  • Pylint py3k check: succeeded
    • 0 errors and warnings that should be fixed
    • 0 warnings
    • 0 comments to review
  • Pycodestyle check: succeeded
    • 17 comments to review
  • Python3 compatibility checks: failed
    • fails python3 compatibility test

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/11122/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor Author

amaltaro commented Feb 4, 2021

Tests went fine, even though I don't think we hit any frontend issues and/or instabilities...

Development

Successfully merging this pull request may close these issues.

DBS3Upload crashes with HTTP Error 503
3 participants