[CouchDB] New CouchDB 3.x APIs for tracking replications; fix replication check in AgentStatusPoller #11039

Merged
merged 2 commits into from
May 31, 2022

Conversation

amaltaro
Contributor

@amaltaro amaltaro commented Mar 15, 2022

Fixes #11037

Status

ready
Tested with WMAgents running with both CouchDB versions.

Description

This PR provides the following fixes and/or new features:

  • fix how the replication document is created for local workqueue as source
  • removed logic relying on update_sequence (given that source_seq is no longer an integer in CouchDB 3.x)
  • added a new API for fetching a short summary of CouchDB (from the root endpoint /), including its version
  • when a replication document is created, return the CouchDB response
  • updated the logic to check the state of replication tasks/docs
  • for the CouchMonitor class, the following has been implemented:
    • new getActiveTasks client API to retrieve the active tasks from the _active_tasks CouchDB endpoint
    • new getSchedulerJobs client API to retrieve jobs from the _scheduler/jobs CouchDB endpoint
    • new getSchedulerDocs client API to retrieve documents from the _scheduler/docs CouchDB endpoint
    • made a check for CouchDB and replication status compatible with both CouchDB 1.6 and 3.x
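The version-dependent endpoint selection described above can be illustrated with a minimal sketch. It is not the code from this PR: the function names `couch_version`, `replication_endpoints` and `fetch_json` are hypothetical, and the only facts relied on are that the root endpoint (/) returns a JSON document with a `version` field and that the `_scheduler/*` endpoints exist from CouchDB 2.1 onwards.

```python
import json
from urllib.request import urlopen


def couch_version(server_info):
    """Return (major, minor) parsed from the root endpoint (/) response."""
    major, minor = server_info["version"].split(".")[:2]
    return int(major), int(minor)


def replication_endpoints(server_info):
    """List the replication-tracking endpoints the server should offer."""
    endpoints = ["/_active_tasks"]  # available in both CouchDB 1.6 and 3.x
    if couch_version(server_info) >= (2, 1):
        # The scheduler endpoints were introduced in CouchDB 2.1.
        endpoints += ["/_scheduler/jobs", "/_scheduler/docs"]
    return endpoints


def fetch_json(base_url, path="/"):
    """GET a CouchDB endpoint and decode its JSON body (hypothetical helper, no auth)."""
    with urlopen(base_url.rstrip("/") + path) as response:
        return json.load(response)
```

Against a running instance, something like `replication_endpoints(fetch_json("http://localhost:5984"))` would pick the endpoints to poll; the URL is an assumption, and `_active_tasks` typically needs admin credentials, which this sketch does not handle.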

Is it backward compatible (if not, which system it affects?)

YES

Related PRs

Related to #11001, but no longer requiring it.

External dependencies / deployment changes

Changes required for CouchDB 3.x, but it can be merged before that version gets integrated in our stack.

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 7 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: succeeded
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/12870/artifact/artifacts/PullRequestReport.html

@amaltaro amaltaro changed the title from "New CouchDB 3.x APIs for tracking replications; fix replication check in AgentStatusPoller" to "[CouchDB] New CouchDB 3.x APIs for tracking replications; fix replication check in AgentStatusPoller" on Mar 17, 2022
@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 23 warnings and errors that must be fixed
    • 7 warnings
    • 205 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 34 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13253/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 6 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 36 warnings and errors that must be fixed
    • 7 warnings
    • 259 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 164 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13254/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 6 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 35 warnings and errors that must be fixed
    • 7 warnings
    • 256 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 162 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13255/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor Author

Instead of blocking this development while we wait for the final integration of CouchDB 3.x, I decided to make it compatible with both CouchDB versions, so that we can merge it sooner rather than later.

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 7 tests added
  • Python3 Pylint check: failed
    • 36 warnings and errors that must be fixed
    • 6 warnings
    • 253 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 162 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13256/artifact/artifacts/PullRequestReport.html

Contributor

@todor-ivanov todor-ivanov left a comment

Thanks @amaltaro
It looks good to me.

amaltaro added 2 commits May 30, 2022 21:01
… in AgentStatusPoller

Make Couch/replication check compatible between different versions of CouchDB

fix getActiveTasks call

fix isReplicationOK method; evaluate stale replication in the last hour

change replication checks once again
fix unit tests

update unit tests
@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 7 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 36 warnings and errors that must be fixed
    • 6 warnings
    • 253 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 162 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/13257/artifact/artifacts/PullRequestReport.html

@amaltaro amaltaro merged commit ffca53e into dmwm:master May 31, 2022
@amaltaro
Contributor Author

amaltaro commented Jun 8, 2022

Despite having tested it a few times, I deployed the latest changes today, with a fresh setup, and noticed that elements were not getting replicated from the central workqueue to the agent workqueue_inbox.

I might be wrong, but I think there might be a race condition between deleting the "old" replication documents and creating the new ones, around this code:
https://github.com/dmwm/WMCore/blob/master/src/python/WMComponent/AgentStatusWatcher/AgentStatusPoller.py#L100-L106

I'm going to provide at least better logging for this in the open PR #11001.

@vkuznet
Contributor

vkuznet commented Jun 8, 2022

@amaltaro, I had a look at https://github.com/dmwm/WMCore/blob/master/src/python/WMComponent/AgentStatusWatcher/AgentStatusPoller.py#L100-L106 and indeed you have a race condition there, since your request goes to the queue and you never check whether it was processed. What you should do is the following:

  • make a call to self.localCouchMonitor.deleteReplicatorDocs()
  • add a while loop to check whether the previous call deleted all documents from the replicator DB; quit the loop only when all of them are gone
  • call the block from lines 102-106 to start the new replication

Please note that you make a request to the CouchDB HTTP server and the replicator task is kicked off, but you never check that it has finished. You then immediately place another replicator request, which can be injected into the replicator before it has finished deleting the previous docs.
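The delete-then-wait sequence suggested above can be sketched as a small polling helper. This is only an illustration under stated assumptions: `wait_until_empty` and the `list_docs` callable are hypothetical names, not part of WMCore, and the timeout and interval values are arbitrary.

```python
import time


def wait_until_empty(list_docs, timeout=60.0, interval=1.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll list_docs() until it returns no documents or the timeout expires.

    list_docs is a hypothetical callable returning the replication
    documents still present in the replicator database.
    Returns True if the database was seen empty, False on timeout.
    """
    deadline = clock() + timeout
    while True:
        if not list_docs():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

In the poller, this would sit between the call to `deleteReplicatorDocs()` and the block that creates the new replication documents, and the new replication would be started only when the helper returns True. A bounded wait is used instead of an unconditional `while` loop so that a stuck deletion cannot hang the component.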

@amaltaro
Contributor Author

amaltaro commented Jun 8, 2022

Thank you for looking into this, Valentin.
After a further look, there is no race condition; the replication from cmsweb to localhost is actually failing:

[error] 2022-06-08T14:35:45.666481Z [email protected] <0.27500.0> -------- ChangesReader process died with reason: {changes_reader_died,{timeout,ibrowse_stream_cleanup}}
[error] 2022-06-08T14:35:45.666638Z [email protected] <0.27500.0> -------- Replication `1aacaa5d83c7aaec9343d878bc4c2fac+continuous` (`https://cmsweb-testbed.cern.ch/couchdb/workqueue/` -> `http://localhost:5984/workqueue_inbox/`) failed:
 {changes_reader_died,{timeout,ibrowse_stream_cleanup}}

which causes the replication task to be eventually removed from the active tasks (and retried). I do think the deletion/creation replication task code can be made more robust though.
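A failure like this one drops out of `_active_tasks`, but on CouchDB 3.x it should remain visible through `_scheduler/docs`, where each entry carries a `state` field. The following is a minimal sketch, not the `isReplicationOK` logic from this PR; the function name and the choice of which states count as unhealthy are assumptions.

```python
# States reported by _scheduler/docs that indicate a replication is in trouble.
# Treating "crashing" (temporary error, will be retried) as unhealthy is an
# assumption made for this sketch.
UNHEALTHY_STATES = {"crashing", "error", "failed"}


def unhealthy_replications(scheduler_docs):
    """Return (doc_id, state, error_count) for replications needing attention.

    scheduler_docs is the decoded JSON body of a _scheduler/docs response,
    i.e. a dict with a "docs" list.
    """
    return [
        (doc.get("doc_id"), doc.get("state"), doc.get("error_count", 0))
        for doc in scheduler_docs.get("docs", [])
        if doc.get("state") in UNHEALTHY_STATES
    ]
```

An empty result means no replication document is currently reported in one of those states; it does not by itself prove that data is flowing.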

Anyhow, I need to debug which configuration parameters I changed over the last days/weeks, because this is definitely something that I tested and which was working(!)

Development

Successfully merging this pull request may close these issues.

CouchDB 3.x: AgentStatusWatcher fails to parse document source_seq attr
4 participants