Switch mongo client to connect to a fully defined replicaset. #11360
Conversation
And here is a printout from a manual connection to the already up-and-running MongoDBAsAService cluster in testbed, as proof that we have the full set of replicaset participants visible from the client, and that we are connected to the master.
While the full MongoDB configuration was as follows:
NOTE: With this we can basically consider the discussion here: https://its.cern.ch/jira/browse/CMSKUBERNETES-175 to be resolved. The few minor details about creating the proper username for
FYI: @vkuznet @amaltaro @muhammadimranfarooqi @arooshap @goughes
Todor, I added a few questions and comments along the code, some for my own education.
I just wanted to highlight that the following is still pending (before the final review can be done):
- fix the failing unit tests
- create all the required configuration changes
- make sure the new secrets are also properly created and in place.
Just to add some more details:
Here is what the current production MongoDB server ports setup looks like:
So a fully defined replicaset with a list of
simply times out:
While having it configured with just a single entry point and a port, pointing to the defaults of the DNS loadbalancer sitting in front of the replicaset, works just fine:
If we need to have it all defined properly, we need to point directly to the pods, like:
Which is not optimal. We'd like to have this single server name of
TODO:
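To make the comparison concrete, here is a minimal sketch of the two raw pymongo connection styles being discussed. The member list and replica set name are borrowed from the testbed examples later in this thread; the single-entry-point hostname and port are placeholders, not the real production values:

from pymongo import MongoClient

# (a) fully defined replicaset: every member listed explicitly, so the
# client itself discovers the topology and the current primary
client = MongoClient(host=['mongodb-test.cern.ch:32001',
                           'mongodb-test.cern.ch:32002',
                           'mongodb-test.cern.ch:32003'],
                     replicaSet='cmsweb-test',
                     serverSelectionTimeoutMS=30000)

# (b) single entry point: one stable DNS name in front of the replicaset;
# the loadbalancer, not the client, picks the backend pod
client = MongoClient(host='mongodb.cern.ch', port=27017)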
@todor-ivanov, I'm not sure I'm following what you are suggesting. Here is how I see the issue from the other end (the point of view of the k8s infrastructure):
and you can easily look up node IP addresses. The
Its IP address points to an external IP within the k8s cluster when I look at the services:
Now, with this information I conclude:
Now, if you want full access to individual replica set nodes, then we need a different setup:
But of course such a redesign has its pros/cons (in particular, I doubt the CERN network team will be happy about firewall requests). That said, I'm trying to understand the actual issue here and why DMWM can't use the provided load balancer.
Thanks @vkuznet for looking into this.
This is exactly what I was referring to... I am quite aware of what causes the timeout - there is indeed a loadbalancer in front of the pods. I, on the other hand, am not aware of the specific K8s setup of this cluster. Back then, IIRC, it was because of our request to have a single name as an entry point that the loadbalancer was put in front of the cluster. Now we can indeed connect to the pods directly - they are visible (I do not know how) from inside the CERN network.
Of course this is not optimal... What we would need are simple connection names for the replicaset members to connect to (excluding the hash from the pods' names). Now, how exactly the best solution would look from the K8s perspective, I am not the right person to tell. But we already know one thing from past experience: having one dynamic component choosing a service/pod to serve the connection (from K8s) and then another one choosing the replicaset member (from MongoDB) to connect to does still work, but we do experience from time to time exceptions of the kind
@todor-ivanov please clearly outline the requirements, as I cannot guess them. For instance:
Based on the requirements we may need to adjust the k8s infrastructure setup. Then we can discuss the implementation details outlined in this PR. Sorry, but so far I see no need to review the PR without a clear idea of how the cluster should be set up and what should be exposed. That said, it does not mean we can do everything. For instance, if we need to open the firewall to each minion on a specific port, I can't guarantee that such a request would be granted by the CERN security team. Anything related to the firewall should be audited first, etc. And I would like to avoid doing unnecessary steps if we later decide not to use a specific setup. Therefore, we need a clear set of requirements and to discuss first whether we can provide a setup that matches them.
Hi @vkuznet, I do not understand why we are deviating into discussing firewall rules etc.
No, this is not supposed to be seen from outside CERN. We do not need to deal with firewall rules or anything of the sort here. This PR is consistent. It does not put any requirement on how the K8s setup is already or will be done. With this PR I only give our code the ability to utilize all possible connection setups that the standard MongoClient supports. The whole problem in the past was that we were not setting the members of the replicaset explicitly, and we were relying on some external component (in this case a piece of K8s) to give us the ability to use a single name as an entry point.
I do not understand what you mean here. The only thing that I was hoping we may get at some point is to avoid using names for connecting to database servers which contain some random "hash-like" strings - like
Again, this PR does not require any change in the Kubernetes setup. I have already tested multiple times both
Actually, I can see what may have led to confusion here. Maybe this comment: #11360 (comment) left people with the wrong impression that we MUST change the production cluster. No, we do not need to change anything. We can connect to the single server name as we were doing before, or to the explicitly listed members of the replicaset. I put that many details in that previous comment only to point out the difference between the two setups we have in production and in testbed, and also to justify my note that, if we want the current change to be backwards compatible, we do need to mind this difference.
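To illustrate the backward compatibility claim, here is a hedged sketch of the two configuration shapes the code is meant to accept after this PR, mirroring (in abbreviated form) the mongoDBConfig dictionaries printed later in this thread; the single-entry-point hostname and port are placeholders, not the real production values:

# (a) single entry point behind the DNS loadbalancer, as in production today
mongoDBConfig = {'server': 'mongodb.cern.ch',   # placeholder hostname
                 'port': 27017,                 # placeholder port
                 'replicaSet': None}

# (b) fully defined replicaset, as in testbed
mongoDBConfig = {'server': ['mongodb-test.cern.ch:32001',
                            'mongodb-test.cern.ch:32002',
                            'mongodb-test.cern.ch:32003'],
                 'port': None,
                 'replicaSet': 'cmsweb-test'}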
Todor, in light of the points made by Valentin, I think I don't understand what the goal is here either. Right now, this is what we use in the production system:
so I fail to see anywhere that we would need to connect to - or to pass in - a specific pod URL (which includes the pod hash).
Can you please clarify where we see this problem? Which service? In which scenario?
force-pushed from 3bd7bc2 to 0025ac2
From the chat that Todor and I had yesterday, my understanding of this development is:
Ideally, we should have the same architecture/setup for both the testbed and production MongoDB clusters. I updated the JIRA ticket in that regard. In addition to that, I understand that a MongoDB replicaset works as follows:
I confess that I need to educate myself on this. So far, my understanding comes mostly from discussions here and there, and I trust your investigation to make the best choices here. Please let me know if I got anything wrong and/or missed anything.
Thanks @amaltaro. Your comment here reflects the situation quite accurately. Just to add some official documentation from MongoDB explaining and visualizing that exact bit about the
Again, I need to mention that we are capable of supporting both cases - with an external loadbalancer providing a single entry point to connect to, and without it - defining all replicaset members separately in the connection string. And it is up to us to choose which one we would like to have (preferably the same in both production and testbed), but this PR does not put a strong requirement that we must change things in K8s now. This may easily be followed up with a separate GH issue.
p.s. Just a note here: further recommendations for best HA setup and data redundancy from the MongoDB documentation suggest having the different members of the replicaset on physically separate nodes. [1]
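As a small illustration of that discovery behaviour: once a pymongo client is connected with all members listed (hostnames and replica set name below are taken from the testbed examples further down in this thread), the topology the client discovered by itself can be inspected directly:

from pymongo import MongoClient

client = MongoClient(host=['mongodb-test.cern.ch:32001',
                           'mongodb-test.cern.ch:32002',
                           'mongodb-test.cern.ch:32003'],
                     replicaSet='cmsweb-test')

# the client performs the member discovery itself and tracks the roles
print(client.primary)        # (host, port) of the member elected primary
print(client.secondaries)    # set of (host, port) tuples for the secondaries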
thanks @vkuznet
Todor, these changes are looking good to me. However, I left a few comments that might actually need to be considered for this PR.
I also have a question concerning the services configuration: I see you have updated the mongoDBReplicaset name. Does it require any synchronization between the WMCore deployment and changes to the MongoDB cluster?
try:
    if mockMongoDB:
        self.client = mongomock.MongoClient()
        self.logger.info("NOTICE: MongoDB is set to use mongomock, instead of real database.")
    elif replicaset:
-       self.client = MongoClient(self.server, self.port, replicaset=replicaset, **kwargs)
+       self.client = MongoClient(host=self.server, port=self.port, replicaset=replicaset, **kwargs)
Different comment on this same line: it looks like there is no need for this elif-else block for the replicaset, since we could simply pass a None value in case it's not defined, which is the default value anyway.
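In other words, something like this minimal sketch (the buildClient wrapper is hypothetical; the point is just that pymongo's MongoClient already defaults replicaSet to None):

from pymongo import MongoClient

def buildClient(server, port=None, replicaset=None, **kwargs):
    # no elif/else branching needed: passing replicaSet=None behaves
    # exactly like omitting the argument altogether
    return MongoClient(host=server, port=port, replicaSet=replicaset, **kwargs)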
That, I believe, is a leftover from when we were trying to build the full connection string for the server URL ourselves, instead of splitting all the parameters to be passed to the client separately. Now this is indeed not needed.
Something more: I now notice that the client parameter from the documentation is actually replicaSet, not replicaset, but I did check and it was properly recognized even with the typo in it. If I change it to a non-existent replicaSet, it does time out [1]. And it works because, obviously, all the client parameters get lower-cased at runtime [2]. I am fixing it anyway, because I'd like to have all the variables following our proper naming convention (here and in all the relevant config files as well).
[1]
$ ipython -i -- /data/tmp/WMCore.venv3/srv/WMCore/bin/adhoc-scripts/mongoInit.py -c /data/tmp/WMCore.venv3/srv/current/config/reqmgr2ms-output/config-output.py
2022-11-15 10:50:06,353:INFO:mongoInit:<module>(): Connecting to MongoDB using the following mongoDBConfig:
{'connect': True,
'create': False,
'database': 'msOutputDBPreProd',
'directConnection': False,
'logger': <Logger __main__ (INFO)>,
'mockMongoDB': False,
'password': '****',
'port': None,
'replicaset': 'cmsweb-dev',
'server': ['mongodb-test.cern.ch:32001',
'mongodb-test.cern.ch:32002',
'mongodb-test.cern.ch:32003'],
'username': '****'}
...
ServerSelectionTimeoutError: No replica set members available for replica set name "cmsweb-dev", Timeout: 30s, Topology Description: <TopologyDescription id: 637360cef3d1bc38070ce01b, topology_type: ReplicaSetNoPrimary, servers: []>
[2]
$ ipython -i -- /data/tmp/WMCore.venv3/srv/WMCore/bin/adhoc-scripts/mongoInit.py -c /data/tmp/WMCore.venv3/srv/current/config/reqmgr2ms-output/config-output.py
2022-11-15 10:53:25,406:INFO:mongoInit:<module>(): Connecting to MongoDB using the following mongoDBConfig:
{'connect': True,
'create': False,
'database': 'msOutputDBPreProd',
'directConnection': False,
'logger': <Logger __main__ (INFO)>,
'mockMongoDB': False,
'password': '****',
'port': None,
'replicaSet': 'cmsweb-test',
'server': ['mongodb-test.cern.ch:32001',
'mongodb-test.cern.ch:32002',
'mongodb-test.cern.ch:32003'],
'username': '****'}
In [1]: mongoClt
Out[1]: MongoClient(host=['mongodb-test.cern.ch:32002', 'mongodb-test.cern.ch:32001', 'mongodb-test.cern.ch:32003'], document_class=dict, tz_aware=False, connect=True, replicaset='cmsweb-test', directconnection=False)
In [2]: mongoClt.topology_description.replica_set_name
Out[2]: 'cmsweb-test'
test this please
Todor, your latest changes triggered a further review; please find the comments along the code.
I also repeat my question from a previous review: "I also have a question concerning the services configuration, I see you have updated the mongoDBReplicaset name. Does it require any synchronization between WMCore deployment and changes to the MongoDB cluster?"
msg += "Skipping the current run." | ||
self.logger.error(msg) | ||
return summary | ||
# if not protectedLFNsT0: |
If you want to touch this code, I would then suggest removing it completely (and the same applies to the service configuration). Otherwise, just leave it as it was before. There is no need to fix all the pylint messages, especially when it's about code that you did not touch within this PR.
That's fine with me. The less I change in this PR, the cleaner it looks in the end. I am all for leaving it as it was before.
Changes look good to me.
Can you please:
- squash these commits accordingly
- answer the question in my previous email
Thanks!
Thanks @amaltaro!
I indeed missed answering that question. Sorry for that. No, the parameter name and the MongoDBAsAService cluster configuration have nothing to do with each other. It is only about following our proper naming practices. I did reflect the new variable name in all the relevant
Fix some config variable names && Print out a blurred configuration dictionary for mongoInit.py
Fix missing ex var && Bad collection assignment from mongoDB object.
Fixing replicaSet variable name.
Document some more parameters in MongoDB docstring && Revert commented code blocks in MSUnmerged.
Unit tests fixes - pylint warnings. Pylint fixes.
force-pushed from bf0495f to c3fc065
Fixes #11210
Status
Ready
Description
With the MongoDBAsAService configuration that we have had so far in production, for the sake of simplicity and stability (not having volatile names of a set of pods to connect to), we were relying on a DNS loadbalancer set up in front of the pods serving the database. That way we had a single entry point. While quite comfortable for us, this was causing additional work and effort for the CMSWEB team, and we were also seeing retried writes whenever we stumbled on the situation of not having a session with the replicaset master. This was due to the fact that the loadbalancer in front of the replicaset was randomly choosing the backend hostname to connect to, rather than letting the full client/server communication take place and properly discover the replica master among all the participating pods in the replicaset. This is easily avoidable if we have a hostname alias set in the pod's setup itself (as it is now for the test MongoDBAsAService cluster) and have the pods serving the replicaset always reachable under a stable name, e.g. mongodb-test.cern.ch. In that case, on our end, we may have the mongo client pointing not to a fully constructed and "percent-escaped" MongoDB URI with a single port, but rather have all hostname:port tuples listed in the connection configuration, avoiding any urllib escaping for the password and username strings. This makes things much more reliable, but it required the code changes from this PR on our end.
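As a hedged illustration of the difference (the single-entry-point hostname/port and the credentials below are placeholders; the member list and replica set name are taken from the testbed examples above):

from pymongo import MongoClient
from urllib.parse import quote_plus

user, pwd = "wmcore", "p@ss:w/rd"   # hypothetical credentials

# old style: one entry point, credentials percent-escaped into the URI
uri = "mongodb://%s:%s@mongodb.cern.ch:27017/" % (quote_plus(user), quote_plus(pwd))
client = MongoClient(uri)

# new style (this PR): all members listed, credentials passed as plain
# keyword arguments, so no urllib escaping is needed
client = MongoClient(host=['mongodb-test.cern.ch:32001',
                           'mongodb-test.cern.ch:32002',
                           'mongodb-test.cern.ch:32003'],
                     username=user, password=pwd,
                     replicaSet='cmsweb-test')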
Is it backward compatible (if not, which system it affects?)
YES, BUT it strongly depends on the services_config changes in the prod branch and on the production MongoDB cluster setup.
Related PRs
Here are the relevant services_config PRs to be added:
https://gitlab.cern.ch/cmsweb-k8s/services_config/-/merge_requests/171
https://gitlab.cern.ch/cmsweb-k8s/services_config/-/merge_requests/173
https://gitlab.cern.ch/cmsweb-k8s/services_config/-/merge_requests/172
NOTE:
Similarly, we also need the same changes for the MSUnmerged configuration and for all branches: prod, preprod, test
External dependencies / deployment changes
NOTE:
We need to test whether this connection works properly with the current production MongoDBAsAService cluster setup, such that we can connect to the cluster bypassing the DNS loadbalancer and avoiding any additional change to this setup.
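A minimal sketch of such a test, using pymongo only; the member names, ports and replica set name below are placeholders for the production values, which are not spelled out in this thread:

from pymongo import MongoClient
from pymongo.errors import ServerSelectionTimeoutError

# placeholders for the real production members and replica set name
client = MongoClient(host=['mongodb-prod-member1.cern.ch:32001',
                           'mongodb-prod-member2.cern.ch:32002'],
                     replicaSet='cmsweb-prod',
                     serverSelectionTimeoutMS=5000)
try:
    client.admin.command('ping')               # forces server selection
    print("connected, primary is:", client.primary)
except ServerSelectionTimeoutError as exc:
    print("cannot reach the replicaset directly:", exc)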