-
Notifications
You must be signed in to change notification settings - Fork 428
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix a race condition in a call to GD conn manager #1750
Conversation
Error -> | ||
?INFO_MSG("event=outgoing_conn_start_progress,server='~s'," | ||
"state='failed',error='~p'", [Server, Error]), | ||
Error | ||
end. | ||
|
||
-spec get_connection(Server :: jid:lserver()) -> pid(). | ||
get_connection(Server) -> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can be still relevant
https://github.com/esl/MongooseIM/pull/1660/files#r165371814
I am feeling that we would find more bugs in this part of GD.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since it was mentioned in another PR, maybe it should rather be resolved there? But +1 for adding some comment about the behaviour.
get_connection(Server) -> | ||
mod_global_distrib_server_mgr:get_connection(Server). | ||
case catch mod_global_distrib_server_mgr:get_connection(Server) of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would love to see a test for this fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add a TODO comment here because we need to come up with a uniform method of simulating race conditions. Without it, I would be forced to use meck+RPC which is evil and failure-prone.
?ERROR_MSG("event=cannot_acquire_outgoing_connection,server='~s'", [Server]), | ||
throw({error, {cannot_acquire_outgoing_connection, Server}}); | ||
get_connection(Server, RetriesLeft) -> | ||
case mod_global_distrib_server_sup:get_connection(Server) of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still not covered with tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above.
This PR addresses an issue observed in load tests, where sometimes a call to
get_connection
crashed withnoproc
exception (that applies to connection manager process).It may happen if a process starts outgoing connections supervisor and second one tries to access the pool in a state, when supervisor is already running but manager (child of the supervisor) is not (yet).
Also please check the comments in the code.
Load tests confirm that this branch fixes the issue.