leofs-adm "version all" and "status" hang when node is offline #817
@vstax thanks for filing this issue. I can reproduce this behavior on my dev box and have found the reason it happens. We are going to improve this by setting an appropriate timeout (several seconds) for both commands, and we will also consider checking the node status before connecting and showing an error result if the node is stopped.
It also turned out that there are other code paths that can cause leofs-adm to hang, so I will fix those paths as well as the status / version all cases.
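For readers following along, the two ideas above (an explicit timeout, plus a status check before connecting) could look roughly like the sketch below. This is not the actual LeoFS fix: the module name `rpc_guard`, the function `call_if_running/6`, and the use of the atom `stop` for the recorded node state are all assumptions made for illustration. Only `rpc:call/5` is a real OTP call.

```erlang
-module(rpc_guard).
-export([call_if_running/6]).

%% Hypothetical helper, not LeoFS code.
%% State is whatever the manager has recorded for the node; the atom
%% `stop` is an assumed representation of a stopped node.

%% The node is already known to be stopped: answer at once, no RPC.
call_if_running(stop, Node, _Mod, _Fun, _Args, _Timeout) ->
    {error, {Node, stopped}};
%% Otherwise call with an explicit timeout (in milliseconds) so the
%% caller is not left waiting on the default of infinity.
call_if_running(_State, Node, Mod, Fun, Args, Timeout) ->
    case rpc:call(Node, Mod, Fun, Args, Timeout) of
        {badrpc, Reason} -> {error, {Node, Reason}};
        Result -> {ok, Result}
    end.
```

With this shape, a command such as status would report the stopped state immediately for a node the manager already lists as stopped, and would only pay the timeout for nodes that are believed to be running but do not answer.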
The timeout fix works for me (although I don't get "5" seconds; it seems to float from much smaller values to 7 seconds or so per node):
but this is probably not a problem, just curious behavior.
@vstax thanks for confirming.
Hmm, interesting.
My quick research showed that an rpc:call to an unreachable host with the timeout set to 5 seconds takes anywhere from about 3 seconds to about 7 seconds, as in the following shell session:
(manager_0@127.0.0.1)3> timer:tc(rpc, call, ['[email protected]', erlang, node, [], 5000]).
{7003172,{badrpc,nodedown}}
(manager_0@127.0.0.1)4> timer:tc(rpc, call, ['[email protected]', erlang, node, [], 5000]).
{7003355,{badrpc,nodedown}}
(manager_0@127.0.0.1)5> timer:tc(rpc, call, ['[email protected]', erlang, node, [], 5000]).
{7001668,{badrpc,nodedown}}
(manager_0@127.0.0.1)6> timer:tc(rpc, call, ['[email protected]', erlang, node, [], 5000]).
{7003688,{badrpc,nodedown}}
(manager_0@127.0.0.1)7> timer:tc(rpc, call, ['[email protected]', erlang, node, [], 5000]).
{7004304,{badrpc,nodedown}}
(manager_0@127.0.0.1)8> timer:tc(rpc, call, ['[email protected]', erlang, node, [], 5000]).
{3011308,{badrpc,nodedown}}
(manager_0@127.0.0.1)9> timer:tc(rpc, call, ['[email protected]', erlang, node, [], 5000]).
{2997826,{badrpc,nodedown}}
(manager_0@127.0.0.1)10> timer:tc(rpc, call, ['[email protected]', erlang, node, [], 5000]).
{2999722,{badrpc,nodedown}}
That said, the Erlang runtime might mark the host as unreachable not because of the timeout but because of some other factor (a heartbeat running in the background might be involved), and in those cases the time taken can deviate from 5 seconds, I think.
@vstax I'm going to close this issue and file another issue to investigate why the response time deviates from what we expect (5 seconds in this case). Thoughts?
@mocchira Sure, if you think that's important (for example, if there are other cases where the timeout might not be what it's supposed to be).
@vstax thanks.
Yes, I'm a little concerned about it, so I'll keep investigating on another ticket.
After (gracefully) shutting down leo_storage and powering off the system running it (so it can't answer network connections), the general node status shows "stop" for that node:
But the command "leofs-adm status [email protected]" will hang / time out (even though it should be able to know that the node is "stopped", so there is no way to get its status); what's worse, "version all" hangs in the same way.