leofs-adm "version all" and "status" hang when node is offline #817

Closed
vstax opened this issue Sep 5, 2017 · 8 comments
@vstax
Contributor

vstax commented Sep 5, 2017

After (gracefully) shutting down leo_storage and powering off the system running it (so it can't answer network connections), the general node status shows "stop" for that node:

-------+-----------------------------+--------------+----------------+----------------+----------------------------
 type  |            node             |    state     |  current ring  |   prev ring    |          updated at
-------+-----------------------------+--------------+----------------+----------------+----------------------------
  S    | [email protected]      | running      | c1d863d0       | c1d863d0       | 2017-09-05 20:14:06 +0300
  S    | [email protected]      | running      | c1d863d0       | c1d863d0       | 2017-09-05 20:16:29 +0300
  S    | [email protected]      | stop         |                |                | 2017-09-05 20:19:12 +0300
  G    | [email protected]      | running      | c1d863d0       | c1d863d0       | 2017-09-05 19:01:26 +0300
-------+-----------------------------+--------------+----------------+----------------+----------------------------

But the command "leofs-adm status [email protected]" hangs / times out (even though
it should already know that the node is "stopped", so there is no way to get its status); what's worse, "version all" hangs in exactly the same way.

@mocchira
Member

mocchira commented Sep 7, 2017

@vstax thanks for filing this issue.

I can reproduce this behavior on my dev-box and have found out why it happens.
When only the leo_storage process is shut down, the hang doesn't happen; instead the console displays "[ERROR] nodedown" immediately, because the Erlang runtime can still connect to epmd running on the remote leo_storage host and find out whether the process is alive or not. OTOH, when the whole leo_storage node is stopped, the hang happens (strictly speaking, leofs-adm status waits for 120 seconds, while leofs-adm version all hangs indefinitely).

We are going to improve this behavior by setting an appropriate timeout (several seconds) for both of them, and we are also considering checking the node status before connecting and showing an error result if it's stopped.
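
For illustration, here is a minimal sketch (not the actual LeoFS code) of what such a fix could look like: consult the manager's own view of the node state first, and cap the remote call with a timeout of a few seconds instead of waiting indefinitely. get_node_state/1 is a hypothetical stand-in for a lookup against the manager's membership table, and the leo_storage application name is an assumption; rpc:call/5 and application:get_key/2 are standard OTP.

-module(sketch_status).
-export([node_version/1]).

-define(RPC_TIMEOUT, 5000).  %% ms; the "several seconds" proposed above

%% Hypothetical stand-in for a lookup against the manager's membership
%% table (the same data a plain `leofs-adm status` already shows).
get_node_state(_Node) ->
    running.

node_version(Node) ->
    case get_node_state(Node) of
        stop ->
            %% The manager already knows the node is stopped, so fail fast
            %% instead of trying to connect to it.
            {error, nodedown};
        _ ->
            %% application:get_key/2 is standard OTP; here it reads the
            %% version of the (assumed) leo_storage application.
            case rpc:call(Node, application, get_key, [leo_storage, vsn],
                          ?RPC_TIMEOUT) of
                {badrpc, _Reason} -> {error, nodedown};
                {ok, Vsn}         -> {ok, Vsn};
                undefined         -> {error, not_found}
            end
    end.

With a shape like this, both "status <node>" and "version all" would report "[ERROR] nodedown" for the stopped node within the timeout rather than blocking.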

@mocchira
Member

mocchira commented Sep 7, 2017

It also turned out that there are other code paths that can cause leofs-adm to hang, so I will fix those paths as well as the status / "version all" cases.

@vstax
Contributor Author

vstax commented Sep 13, 2017

The timeout fix works for me (although I don't get "5" seconds; it seems to float from much smaller values up to 7 seconds or so per node):

[root@leo-m0 ~]# time /usr/local/bin/leofs-adm status [email protected]
[ERROR] nodedown

real	0m7.024s
user	0m0.015s
sys	0m0.008s
[root@leo-m0 ~]# time /usr/local/bin/leofs-adm status [email protected]
[ERROR] nodedown

real	0m1.057s
user	0m0.012s
sys	0m0.008s
[root@leo-m0 ~]# time /usr/local/bin/leofs-adm status [email protected]
[ERROR] nodedown

real	0m3.040s
user	0m0.012s
sys	0m0.020s

[root@leo-m0 ~]# time /usr/local/bin/leofs-adm version all
 [Version of Node(s)]
-------+-----------------------------+----------------------------------+
 type  |            node             |             version              | 
-------+-----------------------------+----------------------------------+
  M    | [email protected]      | 1.3.7                            | 
  M    | [email protected]      | 1.3.7                            | 
  S    | [email protected]      | 1.3.7                            | 
  S    | [email protected]      | Failed to get the version        | 
  S    | [email protected]      | Failed to get the version        | 
  G    | [email protected]      | 1.3.7                            | 
-------+-----------------------------+----------------------------------+


real	0m10.043s
user	0m0.015s
sys	0m0.010s

[root@leo-m0 ~]# time /usr/local/bin/leofs-adm version all
 [Version of Node(s)]
-------+-----------------------------+----------------------------------+
 type  |            node             |             version              | 
-------+-----------------------------+----------------------------------+
  M    | [email protected]      | 1.3.7                            | 
  M    | [email protected]      | 1.3.7                            | 
  S    | [email protected]      | 1.3.7                            | 
  S    | [email protected]      | Failed to get the version        | 
  S    | [email protected]      | 1.3.7                            | 
  G    | [email protected]      | 1.3.7                            | 
-------+-----------------------------+----------------------------------+


real	0m1.439s
user	0m0.017s
sys	0m0.007s

But this is probably not a problem, just curious behavior.

@mocchira
Member

@vstax thanks for confirming.

The timeout fix works for me (although I don't get "5" seconds; it seems to float from much smaller values up to 7 seconds or so per node):

Hmm, interesting.
I will check how such timeouts could happen after getting other high-priority tasks done.

@mocchira
Member

My quick research revealed that the time taken by rpc:call to an unreachable host, with the timeout set to 5 seconds, floats from around 3 seconds to 7 seconds, like below:

(manager_0@127.0.0.1)3> timer:tc(rpc, call, ['[email protected]', erlang, node, [], 5000]).
{7003172,{badrpc,nodedown}}
(manager_0@127.0.0.1)4> timer:tc(rpc, call, ['[email protected]', erlang, node, [], 5000]).
{7003355,{badrpc,nodedown}}
(manager_0@127.0.0.1)5> timer:tc(rpc, call, ['[email protected]', erlang, node, [], 5000]).
{7001668,{badrpc,nodedown}}
(manager_0@127.0.0.1)6> timer:tc(rpc, call, ['[email protected]', erlang, node, [], 5000]).
{7003688,{badrpc,nodedown}}
(manager_0@127.0.0.1)7> timer:tc(rpc, call, ['[email protected]', erlang, node, [], 5000]).
{7004304,{badrpc,nodedown}}
(manager_0@127.0.0.1)8> timer:tc(rpc, call, ['[email protected]', erlang, node, [], 5000]).
{3011308,{badrpc,nodedown}}
(manager_0@127.0.0.1)9> timer:tc(rpc, call, ['[email protected]', erlang, node, [], 5000]).
{2997826,{badrpc,nodedown}}
(manager_0@127.0.0.1)10> timer:tc(rpc, call, ['[email protected]', erlang, node, [], 5000]).
{2999722,{badrpc,nodedown}}

That said, the Erlang runtime might mark the host as unreachable not via the timeout but via some other factor (a heartbeat running in the background might be involved), and in those non-timeout cases the time taken might deviate from the 5 seconds, I think.
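
One plausible factor, as an assumption rather than anything confirmed here, is that the rpc:call timeout only bounds the call itself, while the initial connection attempt to an unreachable host is governed separately by the kernel parameter net_setuptime, whose default is 7 seconds; that would line up with the ~7-second measurements above. A quick way to check from the manager's Erlang shell (both calls are standard OTP; results omitted):

%% See whether net_setuptime is explicitly configured (unset means the
%% 7-second default applies), then repeat the measurement against the
%% unreachable node.
application:get_env(kernel, net_setuptime).
timer:tc(rpc, call, ['[email protected]', erlang, node, [], 5000]).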

@mocchira
Member

@vstax I'm going to close this issue and file another issue that aims to investigate why the response time deviates from what we expect (5 sec in this case). Thoughts?

@vstax
Contributor Author

vstax commented Sep 15, 2017

@mocchira Sure, if you think that's important (like, there are other cases where the timeout might not be what it's supposed to be).

@mocchira
Member

@vstax thanks.

there are other cases where the timeout might not be what it's supposed to be

Yes, I've been a little concerned about it, so I'll keep vetting it on the other ticket.
