leofs-adm "version all" and "status" hang when node is offline #817

Closed
vstax opened this issue Sep 5, 2017 · 8 comments
@vstax
Contributor

vstax commented Sep 5, 2017

After (gracefully) shutting down leo_storage and powering off the system running it (so it can't answer network connections), the general node status shows "stop" for that node:

-------+-----------------------------+--------------+----------------+----------------+----------------------------
 type  |            node             |    state     |  current ring  |   prev ring    |          updated at
-------+-----------------------------+--------------+----------------+----------------+----------------------------
  S    | [email protected]      | running      | c1d863d0       | c1d863d0       | 2017-09-05 20:14:06 +0300
  S    | [email protected]      | running      | c1d863d0       | c1d863d0       | 2017-09-05 20:16:29 +0300
  S    | [email protected]      | stop         |                |                | 2017-09-05 20:19:12 +0300
  G    | [email protected]      | running      | c1d863d0       | c1d863d0       | 2017-09-05 19:01:26 +0300
-------+-----------------------------+--------------+----------------+----------------+----------------------------

But the command "leofs-adm status [email protected]" hangs / times out (even though
it should already know that the node is "stopped", so there is no way to get its status); what's worse, "version all" hangs in exactly the same way.

@mocchira
Member

mocchira commented Sep 7, 2017

@vstax thanks for filing this issue.

I can reproduce this behavior on my dev-box and have found out why it happens.
When only the leo_storage process is shut down, the hang doesn't happen; instead the console displays "[ERROR] nodedown" immediately, because the Erlang runtime can still connect to epmd running on the remote leo_storage host and find out whether the process is alive or not. OTOH, when the whole leo_storage node is stopped, the hang happens (strictly speaking, leofs-adm status waits for 120 seconds, while leofs-adm version all hangs indefinitely).

We are going to improve this behavior by setting an appropriate timeout (several seconds) for both of them, and we are also considering checking the node status before connecting and showing an error result if it's stopped.
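
For illustration, here is a minimal sketch (not the actual LeoFS code) of what such a fix could look like: consult the manager's own view of the node state first, and cap the remote call with a timeout of a few seconds instead of waiting indefinitely. get_node_state/1 is a hypothetical stand-in for a lookup against the manager's membership table, and the leo_storage application name is an assumption; rpc:call/5 and application:get_key/2 are standard OTP.

-module(sketch_status).
-export([node_version/1]).

-define(RPC_TIMEOUT, 5000).  %% ms; the "several seconds" proposed above

%% Hypothetical stand-in for a lookup against the manager's membership
%% table (the same data a plain `leofs-adm status` already shows).
get_node_state(_Node) ->
    running.

node_version(Node) ->
    case get_node_state(Node) of
        stop ->
            %% The manager already knows the node is stopped, so fail fast
            %% instead of trying to connect to it.
            {error, nodedown};
        _ ->
            %% application:get_key/2 is standard OTP; here it reads the
            %% version of the (assumed) leo_storage application.
            case rpc:call(Node, application, get_key, [leo_storage, vsn],
                          ?RPC_TIMEOUT) of
                {badrpc, _Reason} -> {error, nodedown};
                {ok, Vsn}         -> {ok, Vsn};
                undefined         -> {error, not_found}
            end
    end.

With a shape like this, both "status <node>" and "version all" would report "[ERROR] nodedown" for the stopped node within the timeout rather than blocking.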

@mocchira
Member

mocchira commented Sep 7, 2017

It also turned out that there are other code paths that can cause leofs-adm to hang, so I will fix those paths as well as the status / "version all" cases.

@vstax
Contributor Author

vstax commented Sep 13, 2017

The timeout fix works for me (although I don't get "5" seconds; it seems to float from much smaller values up to 7 seconds or so per node):

[root@leo-m0 ~]# time /usr/local/bin/leofs-adm status [email protected]
[ERROR] nodedown

real	0m7.024s
user	0m0.015s
sys	0m0.008s
[root@leo-m0 ~]# time /usr/local/bin/leofs-adm status [email protected]
[ERROR] nodedown

real	0m1.057s
user	0m0.012s
sys	0m0.008s
[root@leo-m0 ~]# time /usr/local/bin/leofs-adm status [email protected]
[ERROR] nodedown

real	0m3.040s
user	0m0.012s
sys	0m0.020s

[root@leo-m0 ~]# time /usr/local/bin/leofs-adm version all
 [Version of Node(s)]
-------+-----------------------------+----------------------------------+
 type  |            node             |             version              | 
-------+-----------------------------+----------------------------------+
  M    | [email protected]      | 1.3.7                            | 
  M    | [email protected]      | 1.3.7                            | 
  S    | [email protected]      | 1.3.7                            | 
  S    | [email protected]      | Failed to get the version        | 
  S    | [email protected]      | Failed to get the version        | 
  G    | [email protected]      | 1.3.7                            | 
-------+-----------------------------+----------------------------------+


real	0m10.043s
user	0m0.015s
sys	0m0.010s

[root@leo-m0 ~]# time /usr/local/bin/leofs-adm version all
 [Version of Node(s)]
-------+-----------------------------+----------------------------------+
 type  |            node             |             version              | 
-------+-----------------------------+----------------------------------+
  M    | [email protected]      | 1.3.7                            | 
  M    | [email protected]      | 1.3.7                            | 
  S    | [email protected]      | 1.3.7                            | 
  S    | [email protected]      | Failed to get the version        | 
  S    | [email protected]      | 1.3.7                            | 
  G    | [email protected]      | 1.3.7                            | 
-------+-----------------------------+----------------------------------+


real	0m1.439s
user	0m0.017s
sys	0m0.007s

But this is probably not a problem, just curious behavior.

@mocchira
Member

@vstax thanks for confirming.

The timeout fix works for me (although I don't get "5" seconds; it seems to float from much smaller values up to 7 seconds or so per node):

Hmm, interesting.
I will check how such timeouts could happen after getting other high-priority tasks done.

@mocchira
Member

My quick research revealed that the time taken by rpc:call to an unreachable host, with the timeout set to 5 seconds, floats from around 3 seconds to 7 seconds, like below:

(manager_0@127.0.0.1)3> timer:tc(rpc, call, ['[email protected]', erlang, node, [], 5000]).
{7003172,{badrpc,nodedown}}
(manager_0@127.0.0.1)4> timer:tc(rpc, call, ['[email protected]', erlang, node, [], 5000]).
{7003355,{badrpc,nodedown}}
(manager_0@127.0.0.1)5> timer:tc(rpc, call, ['[email protected]', erlang, node, [], 5000]).
{7001668,{badrpc,nodedown}}
(manager_0@127.0.0.1)6> timer:tc(rpc, call, ['[email protected]', erlang, node, [], 5000]).
{7003688,{badrpc,nodedown}}
(manager_0@127.0.0.1)7> timer:tc(rpc, call, ['[email protected]', erlang, node, [], 5000]).
{7004304,{badrpc,nodedown}}
(manager_0@127.0.0.1)8> timer:tc(rpc, call, ['[email protected]', erlang, node, [], 5000]).
{3011308,{badrpc,nodedown}}
(manager_0@127.0.0.1)9> timer:tc(rpc, call, ['[email protected]', erlang, node, [], 5000]).
{2997826,{badrpc,nodedown}}
(manager_0@127.0.0.1)10> timer:tc(rpc, call, ['[email protected]', erlang, node, [], 5000]).
{2999722,{badrpc,nodedown}}

That said, the Erlang runtime might mark the host as unreachable not via the timeout but via some other factor (a heartbeat running in the background might be involved), and in those non-timeout cases the time taken might deviate from the 5 seconds, I think.
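
One plausible factor, as an assumption rather than anything confirmed here, is that the rpc:call timeout only bounds the call itself, while the initial connection attempt to an unreachable host is governed separately by the kernel parameter net_setuptime, whose default is 7 seconds; that would line up with the ~7-second measurements above. A quick way to check from the manager's Erlang shell (both calls are standard OTP; results omitted):

%% See whether net_setuptime is explicitly configured (unset means the
%% 7-second default applies), then repeat the measurement against the
%% unreachable node.
application:get_env(kernel, net_setuptime).
timer:tc(rpc, call, ['[email protected]', erlang, node, [], 5000]).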

@mocchira
Member

@vstax I'm going to close this issue and file another issue that aims to investigate why the response time deviates from what we expect (5 sec in this case). Thoughts?

@vstax
Contributor Author

vstax commented Sep 15, 2017

@mocchira Sure, if you think that's important (like, there are other cases where the timeout might not be what it's supposed to be).

@mocchira
Member

@vstax thanks.

there are other cases where the timeout might not be what it's supposed to be

Yes, I've been a little concerned about it, so I'll keep vetting it on the other ticket.
