-
Notifications
You must be signed in to change notification settings - Fork 108
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CouchDB] Bundle of fixes required for migration to CouchDB 3.x #11001
Conversation
Jenkins results:
|
Jenkins results:
|
Jenkins results:
|
Jenkins results:
|
Just for the record, with these 6 commits (including fix to the replication documents), we are down to:
only 4 new tests failing, when compared against master. |
Jenkins results:
|
Jenkins results:
|
Jenkins results:
|
a650c19
to
56a163e
Compare
Jenkins results:
|
With all these changes in the pipeline, I think we reached a point where the initial real integration tests can be performed. I might still make subtle changes here and there, but I think it's good enough for a first review - and I know it won't be easy to review all of this, so I suggest to completely skip the test files. I want to test at least the following two scenarios:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay, I missed this request in my emails.
Jenkins results:
|
test this please |
Jenkins results:
|
Jenkins results:
|
Jenkins results:
|
This PR had most of the changes required for CouchDB 3.x migration, but many of those developments were offloaded into their own issue/PR. With that said, I think we no longer need 2 approvals for these changes, but I'm happy to read multiple reviews as well. @vkuznet Valentin, I guess the last 4 commits contain changes that you haven't reviewed yet. PS.: Commits are still a mess and I need to properly squash them. However, the initial PR description is up-to-date. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I raised several concerns in the code, but I also would like to add an addition, a very important one, here. I don't know how code will behave if we put couch behind FE, especially if we'll use a specific end-point, e.g. https://cmsweb.cern.ch/couchdb. Alan, it would be nice if you clarify this.
@@ -157,6 +159,8 @@ def checkForCouchError(self, status, reason, data=None, result=None): | |||
raise CouchNotAcceptableError(reason, data, result) | |||
elif status == 409: | |||
raise CouchConflictError(reason, data, result) | |||
elif status == 410: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please clarify a reason why do you wrap HTTP error code into custom Couch ones? According to the full codebase you do not cover all set of HTTP error codes and only handful of them. Moreover, you convert actual HTTP error into custom one. I don't see any reason for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This dates before my arrival in WMCore (and likely in CMS).
But I think the idea is to classify http error codes in a specific human-friendly exception name.
We do not need to cover all the http error codes, and for that we have the else
clause covering all the rest.
resp = self.post('/_replicator', data) | ||
# If replication is triggered in unit tests, make sure to give it a few | ||
# seconds before proceeding with tests (new functionality in CouchDB 3.x) | ||
if "unittest" in source or "unittest" in destination: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if call to time.sleep is useful at all since it happens after the response from a server, should you sleep before making a call?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually not. Between the creation of the document in CouchDB and its actual scheduling as an active task, there is a gap of a few seconds. That's why we have to sleep for a while when creating replications in our unit tests.
For the record, I tested this with different number of seconds and 5secs is the bare minimum to be >=99% sure that the replication task will exist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@amaltaro , if this is the case the code still requires changes. What if it would be delay for 6 seconds. Instead, please make for loop with configurable parameter to retries number of times, e.g. 3 times, which will query back the CouchDB to see if the process (replication) is finished. Then, record into the log each attempt that will give more information while debugging the issue. For instance, it may be network hick-up, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not adding more complication to this PR by probing CouchDB and checking whether the replication kicked in. I want to have the BARE minimum amount of changes to move to CouchDB 3.x.
Having a parameter for the sleep time is not too bad though. If it requires only a few lines to be changed, I might add it to this PR soon. Thanks for this suggestion.
@@ -21,7 +21,10 @@ def _commonInit(self, couchURL, couchapp): | |||
self.dbName = self.couchDB.name | |||
self.couchServer = CouchServer(self.couchURL) | |||
else: | |||
couchURL = sanitizeURL(couchURL)['url'] | |||
# NOTE: starting in CouchDB 3.x, we need to provide the couch credentials in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how this code should access CouchDB, from the comment I see that it caries about couchdb credentials, while if we put couch behind FE I don't think that upstream code caries about it at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. If we are accessing a remote database, then the url
will not have any credentials encoded in it (aka., this sanitization call does not change anything). While if we are accessing CouchDB on localhost (WMAgent use case), then the url likely has credentials (using basic authentication).
# delete the test database | ||
self.server.deleteDatabase(self.testdbname) | ||
|
||
# now delete all the replicator documents |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not delete replicator db itself?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's another option, possibly a cheaper and cleaner option. Let me give this a try in an upcoming commit.
Thanks for the review, Valentin. I answered those comments inline, and also updated the code to actually delete the unit test databases instead of every single replicator document. However, that might be risky in case someone makes a bad mistake, so I added a basic protection for databases hosted in cmsweb. Changes are provided in the last 2 commits.
Requests will be made with x509 certificates, going through the standard CMSWEB authentication process. If the auth is successful, then two new http headers are added to the request sent to the backends, as implemented here: I still need to revisit this deployment change though, separate it in its own PR and make it configurable. |
Jenkins results:
|
Jenkins results:
|
check_name(dbname) | ||
dbname = urllib.parse.quote_plus(dbname) | ||
if "cmsweb" in self.url: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@amaltaro since you are making changes I strongly suggest to avoid hard-coding 'cmsweb' into the code. For example, eventually we would like to move to CouchDB as a Service which we may host on different URL. Therefore, it would be much better if you'll pass this pattern or URL as configuration parameter such that we will not need to change code again in a future when we'll change CouchDB production URL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just an extra protection to avoid deleting production databases. Since we delete databases from the unit tests, I wanted to make sure a change in those tests (or a mistake) won't cause bigger problems.
I don't see any easy way to make it configurable as well. If you have a better suggestion, please let me know. Otherwise I think this is good enough (w.r.t. to no protection at all).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@amaltaro , I'll leave it up to you to decide, my advice still remain, no hard-coded values should be in a code. The pain of code upgrade is much more than justifications. You can easily add additional optional argument to the class, e.g. prohibitUrlPattern=cmsweb
, such that it will allow to configure the CouchServer
class without hard-coded value.
resp = self.post('/_replicator', data) | ||
# If replication is triggered in unit tests, make sure to give it a few | ||
# seconds before proceeding with tests (new functionality in CouchDB 3.x) | ||
if "unittest" in source or "unittest" in destination: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@amaltaro , if this is the case the code still requires changes. What if it would be delay for 6 seconds. Instead, please make for loop with configurable parameter to retries number of times, e.g. 3 times, which will query back the CouchDB to see if the process (replication) is finished. Then, record into the log each attempt that will give more information while debugging the issue. For instance, it may be network hick-up, etc.
Jenkins results:
|
check_name(dbname) | ||
dbname = urllib.parse.quote_plus(dbname) | ||
if "cmsweb" in self.url: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@amaltaro , I'll leave it up to you to decide, my advice still remain, no hard-coded values should be in a code. The pain of code upgrade is much more than justifications. You can easily add additional optional argument to the class, e.g. prohibitUrlPattern=cmsweb
, such that it will allow to configure the CouchServer
class without hard-coded value.
@vkuznet thanks for these reviews, Valentin. |
It is not sufficient since you only defined it in ctor, what you need is to change class
|
Jenkins results:
|
and I turns out I didn't misunderstand it. You want to make it completely configurable (which is the best way to deal with it), but for that, we need to make sure that this value gets propagated from the very top of the chain (maybe even from the service configuration). Adding support to it in CMSCouch is only a small fraction of the work to make it configurable. Given that it's a parameter that will change once every many years (5? 10 years?) , I am rolling back my last commit and leaving it with the one line hard-wired value, sorry. |
Use RequestDBWriter with couch credentials Fix replication to use credentials on both source/target New Couch status 410; replace temp view by permanent one Fix RotatingDatabase class in CMSCouch - temp view Keep sanitizing url in Requests class only add replication sleep time for unittest replications do not allow cmsweb couchdb databases to be deleted log replication doc deletion and creation in AgentStatusPoller Create parameter for seconds to sleep after replication is created fix docstring
fix unittest: ContinuousSummaryHistogram_t.py:ContinuousSummaryHistogramTest.testA_BasicTest fix WorkQueue unit tests and CMSCouch fix unit tests for temporary views fix highestJobID view unit tests fix testWMBSInjectionStatus and testSpecialCharacterPasswords unit tests fix WorkQueue_t testWMBSInjectionStatus unit test fix RotatingDatabase testExpire unit test remove unused test replication code in WorkQueue_t unittest more consistent fix unit test checking url sanitization delete unit test database instead of replicator docs unit tests parameter for sleeping time
Jenkins results:
|
Thank you for all the reviews, Valentin. |
After this got merged, I think we have 16 new unit tests failing (even though they were not failing in the CouchDB specific Jenkins test). but I also paste it here in case that gets cleaned up:
it could be that it's just a matter of waiting for the replication to go through. I will check it tomorrow. |
Partially fixes #8853
Status
ready
Description
This PR provides almost all the necessary changes to switch CouchDB from version 1.6.1 to 3.1.2 (in addition to #11011).
Summary of changes are:
_dbs
API)RequestDBWriter
has the correct url+credentials to avoid authz errors410 - Gone
In terms of tests, these are the changes:
Is it backward compatible (if not, which system it affects?)
YES
Related PRs
To go after #11011
External dependencies / deployment changes
cmsdist changes: cms-sw/cmsdist#7218
Deployment changes: dmwm/deployment#1088