
storage node empty after 1.1.2 update and compaction #236

Closed
Licenser opened this issue Sep 7, 2014 · 6 comments

@Licenser
Contributor

Licenser commented Sep 7, 2014

I updated three storage nodes from 1.0.1 to 1.1.2 the other day and ran a compaction afterwards. Two of the nodes have the 'correct' size; the third one has only a fraction of the files.

The files are really gone (the size on the filesystem matches what the node reports).

I also haven't seen the other nodes 'repair' the broken node and repopulate it with data (N=3). Is there any way to force that?

@yosukehara
Member

Thank you for your report. So that we can reproduce and fix this issue ASAP, I'd like to know the following about the affected node during the compaction:

  • the error log of the affected node during the compaction
  • the state of the affected node during the compaction [running, suspended, stopped, or restarted]
  • the number of replicas in leo_manager_0.conf

Also, we provide the recover-node command; you can recover the affected node with it:

$ leofs-adm recover-node <incorrect-node>
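
For reference, a full recover-and-verify pass might look like the following. This is a sketch only: the node name is taken from the logs later in this thread, and the follow-up check simply reuses the du command whose output appears below.

$ leofs-adm recover-node [email protected]

# once the replication queues have drained, the object counts from
# "leofs-adm du" should converge with those of the healthy replicas:
$ leofs-adm du [email protected]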

@Licenser
Contributor Author

Licenser commented Sep 7, 2014

Hi @yosukehara,

  • the number of replicas is 3: consistency.num_of_replicas = 3
  • The error logs show nothing special (see attached)
  • I ran recover-node, however it seems to have no effect (this shows the output before and after running it):
du [email protected]
 active number of objects: 1104
  total number of objects: 1112
   active size of objects: 5626193903
    total size of objects: 5626195959
     ratio of active size: 100.0%
    last compaction start: 2014-09-05 15:36:24 +0000
      last compaction end: 2014-09-05 15:36:25 +0000

The logs after starting the recovery stay entirely empty.
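
One way to quantify the gap is to compare the du counters across all three replicas. A sketch, where the affected node name comes from the logs below and the two healthy node names are hypothetical placeholders:

# the *_1 and *_2 node names below are hypothetical placeholders
$ for n in [email protected] [email protected] [email protected]; do leofs-adm du $n | grep 'number of objects'; done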

Logs of the compaction after update:

[W]     [email protected]       2014-09-02 18:59:15.881899 +0000        1409684355      leo_storage_replicator:replicate_fun/2  183     key:fifo-snapshots/ff8d9045-7a76-4a54-b169-7b3e4232d51a/23043bc5-deff-4975-a5b5-51d1e056d27a, node:local, reqid:106286477, cause:not_found
[E]     [email protected]       2014-09-04 10:08:54.701695 +0000        1409825334      null:null       0       ["/var/db/leo_storage/queue/1/index0/1.bitcask.hint"]
[E]     [email protected]       2014-09-04 10:08:56.54477 +0000 1409825336      null:null       0       ["/var/db/leo_storage/queue/1/index1/1.bitcask.hint"]
[E]     [email protected]       2014-09-04 10:08:57.413895 +0000        1409825337      null:null       0       ["/var/db/leo_storage/queue/1/index2/1.bitcask.hint"]
[E]     [email protected]       2014-09-04 10:08:58.765452 +0000        1409825338      null:null       0       ["/var/db/leo_storage/queue/1/index3/1.bitcask.hint"]
[E]     [email protected]       2014-09-04 10:09:00.71899 +0000 1409825340      null:null       0       ["/var/db/leo_storage/queue/1/index4/1.bitcask.hint"]
[E]     [email protected]       2014-09-04 10:09:01.407317 +0000        1409825341      null:null       0       ["/var/db/leo_storage/queue/1/index5/1.bitcask.hint"]
[E]     [email protected]       2014-09-04 10:09:02.742560 +0000        1409825342      null:null       0       ["/var/db/leo_storage/queue/1/index6/1.bitcask.hint"]
[E]     [email protected]       2014-09-04 10:09:04.70086 +0000 1409825344      null:null       0       ["/var/db/leo_storage/queue/1/index7/1.bitcask.hint"]
[E]     [email protected]       2014-09-04 10:09:05.355439 +0000        1409825345      null:null       0       ["/var/db/leo_storage/queue/1/message1/1.bitcask.hint"]
[E]     [email protected]       2014-09-04 10:09:06.529988 +0000        1409825346      null:null       0       ["/var/db/leo_storage/queue/1/message2/1.bitcask.hint"]
[E]     [email protected]       2014-09-04 10:09:07.883603 +0000        1409825347      null:null       0       ["/var/db/leo_storage/queue/1/message3/1.bitcask.hint"]
[E]     [email protected]       2014-09-04 10:09:09.329150 +0000        1409825349      null:null       0       ["/var/db/leo_storage/queue/1/message4/1.bitcask.hint"]
[E]     [email protected]       2014-09-04 10:09:10.689358 +0000        1409825350      null:null       0       ["/var/db/leo_storage/queue/1/message5/1.bitcask.hint"]
[E]     [email protected]       2014-09-04 10:09:12.53335 +0000 1409825352      null:null       0       ["/var/db/leo_storage/queue/1/message6/1.bitcask.hint"]
[E]     [email protected]       2014-09-04 10:09:13.545194 +0000        1409825353      null:null       0       ["/var/db/leo_storage/queue/1/message7/1.bitcask.hint"]
[E]     [email protected]       2014-09-04 14:42:29.11179 +0000 1409841749      leo_storage_api:register_in_monitor/1   85      manager:'[email protected]', cause:{badrpc,{'EXIT',{timeout,{gen_server,call,[leo_manager_cluster_monitor,{register,{registration,<0.64.0>,'[email protected]',storage,again,[],[],168,13077}},30000]}}}}
[E]     [email protected]       2014-09-04 14:45:23.895714 +0000        1409841923      leo_membership_cluster_local:compare_with_remote_chksum/3       401     {'[email protected]',nodedown}
[E]     [email protected]       2014-09-04 14:45:28.899194 +0000        1409841928      leo_membership_cluster_local:compare_with_remote_chksum/3       401     {'[email protected]',nodedown}
[E]     [email protected]       2014-09-04 14:45:38.903254 +0000        1409841938      leo_membership_cluster_local:compare_with_remote_chksum/3       401     {'[email protected]',nodedown}
[E]     [email protected]       2014-09-04 14:45:53.914130 +0000        1409841953      leo_membership_cluster_local:compare_with_remote_chksum/3       401     {'[email protected]',nodedown}
[E]     [email protected]       2014-09-04 14:46:23.937073 +0000        1409841983      leo_membership_cluster_local:notify_error_to_manager/3  428     {'[email protected]',{error,"Fail to synchronize RING"}}
[E]     [email protected]       2014-09-04 14:46:28.946490 +0000        1409841988      leo_membership_cluster_local:notify_error_to_manager/3  428     {'[email protected]',{error,"Fail to synchronize RING"}}
[E]     [email protected]       2014-09-04 14:46:33.954922 +0000        1409841993      leo_membership_cluster_local:notify_error_to_manager/3  428     {'[email protected]',{error,"Fail to synchronize RING"}}

@yosukehara
Member

I've modified the node-judgement function in leo_redundant_manager so that 'true' is returned even if it detects an error in the RING - leo-project/leo_redundant_manager@d467a28.

Also, to find the cause of recover-node having no effect, I'd like to see the following:

  • During the "recover-node" command:
    • the error logs of the storage nodes (excluding "[email protected]")
    • the error log of the manager node
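
Separately, whether the RING itself is consistent across the cluster (the "Fail to synchronize RING" errors above suggest it may not be) can be checked from the manager. A sketch, assuming the status command's usual per-node ring-hash columns:

$ leofs-adm status
# in the node table, the current and previous ring-hash values should
# be identical on every node; a mismatch would line up with the
# RING-synchronization errors in the log above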

@yosukehara yosukehara added this to the 1.1.3 milestone Sep 8, 2014
@Licenser
Contributor Author

Licenser commented Sep 8, 2014

Hi mate,
Very good call on the logs of the other storage nodes :) so here you go:

https://gist.github.com/Licenser/6cea82808fccc65f3885

The managers execute without any errors, so I've not included their logs (they are entirely empty).

@yosukehara
Member

Thank you for sharing. Linked the related issue: #237

@yosukehara
Member

We fixed this issue with v1.1.3.
