
storage node empty after 1.1.2 update and compaction #236

Closed
Licenser opened this issue Sep 7, 2014 · 6 comments

@Licenser
Contributor

Licenser commented Sep 7, 2014

I updated three storage nodes from 1.0.1 to 1.1.2 the other day and ran a compaction afterwards. Two of the nodes have the 'correct' size; the third one has only a fraction of the files.

The files are really gone (the size on the filesystem matches what the node reports).

I also haven't seen the other nodes 'repair' the broken node and repopulate it with data (N=3). Is there any way to force that?

@yosukehara
Member

Thank you for your report. So that we can reproduce and fix this issue ASAP, I'd like to know the following about the affected node during the compaction:

  • the error log of the affected node during the compaction
  • the state of the affected node during the compaction [running, suspended, stopped, or restarted]
  • the number of replicas in leo_manager_0.conf

Also, we provide the recover-node command; you can recover the affected node with it:

$ leofs-adm recover-node <incorrect-node>
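
For reference, a full recover-and-verify pass might look like the following. This is a sketch only: the node name is taken from the logs later in this thread, and the follow-up check simply reuses the du command whose output appears below.

$ leofs-adm recover-node [email protected]

# once the replication queues have drained, the object counts from
# "leofs-adm du" should converge with those of the healthy replicas:
$ leofs-adm du [email protected]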

@Licenser
Contributor Author

Licenser commented Sep 7, 2014

Hi @yosukehara,

  • the number of replicas is 3: consistency.num_of_replicas = 3
  • The error logs show nothing special (see attached)
  • I ran recover-node, however it seems to have no effect (this shows the output before and after running it):
du [email protected]
 active number of objects: 1104
  total number of objects: 1112
   active size of objects: 5626193903
    total size of objects: 5626195959
     ratio of active size: 100.0%
    last compaction start: 2014-09-05 15:36:24 +0000
      last compaction end: 2014-09-05 15:36:25 +0000

The logs after starting the recovery stay entirely empty.
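
One way to quantify the gap is to compare the du counters across all three replicas. A sketch, where the affected node name comes from the logs below and the two healthy node names are hypothetical placeholders:

# the *_1 and *_2 node names below are hypothetical placeholders
$ for n in [email protected] [email protected] [email protected]; do leofs-adm du $n | grep 'number of objects'; done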

Logs of the compaction after update:

[W]     [email protected]       2014-09-02 18:59:15.881899 +0000        1409684355      leo_storage_replicator:replicate_fun/2  183     key:fifo-snapshots/ff8d9045-7a76-4a54-b169-7b3e4232d51a/23043bc5-deff-4975-a5b5-51d1e056d27a, node:local, reqid:106286477, cause:not_found
[E]     [email protected]       2014-09-04 10:08:54.701695 +0000        1409825334      null:null       0       ["/var/db/leo_storage/queue/1/index0/1.bitcask.hint"]
[E]     [email protected]       2014-09-04 10:08:56.54477 +0000 1409825336      null:null       0       ["/var/db/leo_storage/queue/1/index1/1.bitcask.hint"]
[E]     [email protected]       2014-09-04 10:08:57.413895 +0000        1409825337      null:null       0       ["/var/db/leo_storage/queue/1/index2/1.bitcask.hint"]
[E]     [email protected]       2014-09-04 10:08:58.765452 +0000        1409825338      null:null       0       ["/var/db/leo_storage/queue/1/index3/1.bitcask.hint"]
[E]     [email protected]       2014-09-04 10:09:00.71899 +0000 1409825340      null:null       0       ["/var/db/leo_storage/queue/1/index4/1.bitcask.hint"]
[E]     [email protected]       2014-09-04 10:09:01.407317 +0000        1409825341      null:null       0       ["/var/db/leo_storage/queue/1/index5/1.bitcask.hint"]
[E]     [email protected]       2014-09-04 10:09:02.742560 +0000        1409825342      null:null       0       ["/var/db/leo_storage/queue/1/index6/1.bitcask.hint"]
[E]     [email protected]       2014-09-04 10:09:04.70086 +0000 1409825344      null:null       0       ["/var/db/leo_storage/queue/1/index7/1.bitcask.hint"]
[E]     [email protected]       2014-09-04 10:09:05.355439 +0000        1409825345      null:null       0       ["/var/db/leo_storage/queue/1/message1/1.bitcask.hint"]
[E]     [email protected]       2014-09-04 10:09:06.529988 +0000        1409825346      null:null       0       ["/var/db/leo_storage/queue/1/message2/1.bitcask.hint"]
[E]     [email protected]       2014-09-04 10:09:07.883603 +0000        1409825347      null:null       0       ["/var/db/leo_storage/queue/1/message3/1.bitcask.hint"]
[E]     [email protected]       2014-09-04 10:09:09.329150 +0000        1409825349      null:null       0       ["/var/db/leo_storage/queue/1/message4/1.bitcask.hint"]
[E]     [email protected]       2014-09-04 10:09:10.689358 +0000        1409825350      null:null       0       ["/var/db/leo_storage/queue/1/message5/1.bitcask.hint"]
[E]     [email protected]       2014-09-04 10:09:12.53335 +0000 1409825352      null:null       0       ["/var/db/leo_storage/queue/1/message6/1.bitcask.hint"]
[E]     [email protected]       2014-09-04 10:09:13.545194 +0000        1409825353      null:null       0       ["/var/db/leo_storage/queue/1/message7/1.bitcask.hint"]
[E]     [email protected]       2014-09-04 14:42:29.11179 +0000 1409841749      leo_storage_api:register_in_monitor/1   85      manager:'[email protected]', cause:{badrpc,{'EXIT',{timeout,{gen_server,call,[leo_manager_cluster_monitor,{register,{registration,<0.64.0>,'[email protected]',storage,again,[],[],168,13077}},30000]}}}}
[E]     [email protected]       2014-09-04 14:45:23.895714 +0000        1409841923      leo_membership_cluster_local:compare_with_remote_chksum/3       401     {'[email protected]',nodedown}
[E]     [email protected]       2014-09-04 14:45:28.899194 +0000        1409841928      leo_membership_cluster_local:compare_with_remote_chksum/3       401     {'[email protected]',nodedown}
[E]     [email protected]       2014-09-04 14:45:38.903254 +0000        1409841938      leo_membership_cluster_local:compare_with_remote_chksum/3       401     {'[email protected]',nodedown}
[E]     [email protected]       2014-09-04 14:45:53.914130 +0000        1409841953      leo_membership_cluster_local:compare_with_remote_chksum/3       401     {'[email protected]',nodedown}
[E]     [email protected]       2014-09-04 14:46:23.937073 +0000        1409841983      leo_membership_cluster_local:notify_error_to_manager/3  428     {'[email protected]',{error,"Fail to synchronize RING"}}
[E]     [email protected]       2014-09-04 14:46:28.946490 +0000        1409841988      leo_membership_cluster_local:notify_error_to_manager/3  428     {'[email protected]',{error,"Fail to synchronize RING"}}
[E]     [email protected]       2014-09-04 14:46:33.954922 +0000        1409841993      leo_membership_cluster_local:notify_error_to_manager/3  428     {'[email protected]',{error,"Fail to synchronize RING"}}

@yosukehara
Member

I've modified the node-judgement function in leo_redundant_manager so that 'true' is returned even if it detects an error in the RING - leo-project/leo_redundant_manager@d467a28.

Also, to find the cause of recover-node having no effect, I'd like to see the following:

  • During the "recover-node" command:
    • the error logs of the storage nodes (excluding "[email protected]")
    • the error log of the manager node
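
Separately, whether the RING itself is consistent across the cluster (the "Fail to synchronize RING" errors above suggest it may not be) can be checked from the manager. A sketch, assuming the status command's usual per-node ring-hash columns:

$ leofs-adm status
# in the node table, the current and previous ring-hash values should
# be identical on every node; a mismatch would line up with the
# RING-synchronization errors in the log above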

@yosukehara yosukehara added this to the 1.1.3 milestone Sep 8, 2014
@Licenser
Contributor Author

Licenser commented Sep 8, 2014

Hi mate,
Very good call on the logs of the other storage nodes :) so here you go:

https://gist.github.com/Licenser/6cea82808fccc65f3885

The managers execute without any errors, so I've not included their logs (they are entirely empty).

@yosukehara
Member

Thank you for sharing. Linked the related issue: #237

@yosukehara
Member

We fixed this issue with v1.1.3.
