-
Notifications
You must be signed in to change notification settings - Fork 108
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Trap SSL_connect error and Service Unavailable in DBSUploadPoller #10252
Conversation
Jenkins results:
|
Jenkins results:
|
Thanks @amaltaro! Is it still in |
@todor-ivanov it's been tested via unit tests. I'm now running the final test before it gets added to the final new production release. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @amaltaro In general the code is having the right structure, but there are two things I am not comfortable with. Please take a look at the comments inline.
break | ||
elif passiveMsg in str(exceptionObj): | ||
break | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you are missing indentation here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually a for-else loop. It looks weird to my eyes, but it's a valid construction. If we complete the for loop without breaking out of it, the else block gets executed (which means, none of our error messages matched the actual http error).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reading the trend of elif
statements from the previous lines, it is not that hard to fool oneself it is a conclusion statement in the above if
control structure which is missing indentation. Me, obviously, easily fell into that trap :)
for passiveMsg in passiveErrorMsg: | ||
if passiveMsg in excReason: | ||
break | ||
elif passiveMsg in str(exceptionObj): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Matching a string inside the whole (string converted) exception object only to identify the type of Error, seems to me kind of error prone. I am not sure if this is a common practice, though. Mybe it is just me not being familiar with it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's a good practice, but given that we do not have different exception types (or return code) for these exceptions, the only way we can prevent the component from crashing from transient issues in the system, is by parsing the error message returned in the HTTP call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aren't all those expected to be visible inside the Exception Reason. And am I correct in the observation that in the current list of errors you consider not only WMCore exceptions but also exceptions from libraries with regular use all over the place like SSL etc. Meaning we are not in full control of where the Error string would appear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exception reason contains the string with the error message. IF, our request response comes back with a header.reason attribute:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Services/pycurl_manager.py#L285
however, I think there could be other problems when executing curl.perform()
which would actually raise an exception right there, thus potentially not having the reason
attribute.
Reading the traceback I posted in the GH issue, this is exactly what happened two days ago.
excReason = getattr(exceptionObj, 'reason', '') | ||
errorMsg = 'Failed to fetch parentage map from WMStats, skipping this cycle. ' | ||
errorMsg += 'Exception: {}. Reason: {}. Error: {}. '.format(type(exceptionObj).__name__, | ||
excReason, str(exceptionObj)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not quite confident it is a good idea of having this message taken out of DBSUploadPoller
and putting it here. I would preserve the messages related to some kind of exceptions inside the object instance of the above class, while this function here sounds to me a decent one to be considered as a generic function for testing passive vs. hard errors coming from a module. I can already foresee a place or two I could benefit from it. Putting this message here constrains its usage to only one particular type of calls and at the same time cuts significant information in the class it was taken from.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To make it more generic, we would have to move it to a different place, and accept the list of errors as a parameter.
Moving the error message back to the class object implementation is feasible, but I'm trying to avoid code duplication; thus just profiting of some exception parsing that is already done in this function. Let me see if it makes sense to change it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reading the list of errors as a parameter is not a bad idea indeed.
@todor-ivanov Todor, can you please have another look? I have pushed commits 3 and 4 with the error message change that you suggested. If you prefer this approach, I will then squash my commits. Thanks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Things look much better now, with the error message back to the place where it belongs.
Thanks @amaltaro, all the rest looks fine to me.
Jenkins results:
|
fix pylint Deal with the error message inside the DBSUploadPoller class
update unit tests with error message inside the class
Jenkins results:
|
Tests went fine, even though I don't think we hit any frontend issues and/or instabilities... |
Fixes #10000
Status
ready
Description
To avoid having DBS3Upload crashing all the time that there are instabilities with the CMSWEB cluster (like now with the frontends), catch the following 2 exceptions/errors and gracefully skip the component cycle:
the second error might be too generic, but if we have any serious SSL issues in the agent, everything else should break.
In addition to that, you can see that I removed those 2 functions that were apparently not used anywhere in WMCore nor in T0.
Is it backward compatible (if not, which system it affects?)
yes
Related PRs
none
External dependencies / deployment changes
none