Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix a race condition in a call to GD conn manager #1750

Merged
merged 2 commits into from
Mar 5, 2018

Conversation

fenek
Copy link
Member

@fenek fenek commented Mar 2, 2018

This PR addresses an issue observed in load tests, where sometimes a call to get_connection crashed with noproc exception (that applies to connection manager process).

It may happen if a process starts outgoing connections supervisor and second one tries to access the pool in a state, when supervisor is already running but manager (child of the supervisor) is not (yet).

Also please check the comments in the code.

Load tests confirm that this branch fixes the issue.

Error ->
?INFO_MSG("event=outgoing_conn_start_progress,server='~s',"
"state='failed',error='~p'", [Server, Error]),
Error
end.

-spec get_connection(Server :: jid:lserver()) -> pid().
get_connection(Server) ->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be still relevant
https://github.com/esl/MongooseIM/pull/1660/files#r165371814

I am feeling that we would find more bugs in this part of GD.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it was mentioned in another PR, maybe it should rather be resolved there? But +1 for adding some comment about the behaviour.

get_connection(Server) ->
mod_global_distrib_server_mgr:get_connection(Server).
case catch mod_global_distrib_server_mgr:get_connection(Server) of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would love to see a test for this fix.

Copy link
Member Author

@fenek fenek Mar 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add a TODO comment here because we need to come up with a uniform method of simulating race conditions. Without it, I would be forced to use meck+RPC which is evil and failure-prone.

?ERROR_MSG("event=cannot_acquire_outgoing_connection,server='~s'", [Server]),
throw({error, {cannot_acquire_outgoing_connection, Server}});
get_connection(Server, RetriesLeft) ->
case mod_global_distrib_server_sup:get_connection(Server) of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still not covered with tests.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above.

@fenek fenek merged commit fc9abfe into master Mar 5, 2018
@fenek fenek deleted the fix-gd-mgr-race-condition branch March 5, 2018 08:47
@fenek fenek added this to the 3.0.0 milestone Mar 14, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants