Errors on gateway/storage when uploading lots of files #722

Closed
vstax opened this issue May 2, 2017 · 22 comments

@vstax
Contributor

vstax commented May 2, 2017

Sorry for the long ticket and muddy description; I might be squeezing a few unrelated problems in here, but I wanted to list them all because I don't know the relations between them.
I'm trying to migrate existing data to LeoFS. I wrote a script that uploads data from the filesystem and am trying to run it on a small (development) subset of objects; object sizes range from 1 byte to 20-25 MB, 120 KB on average. leo_gateway is running in S3 mode and the script basically wraps boto3's upload_file() method. Multi-threaded upload is disabled in boto3, as it can consume lots of RAM on the client; the number of retries is set to 1, because I wanted this experiment not to hide possible problems.

Upload logic is as follows:

Create a list of files on a given path (each name is 115 bytes, with a pattern like fad90104322ec7545b7ec7a19562773b4f77b82aeb73b715809547e874f86938f5b95361337c87edcf91b887d19bfc1dced4000000000000.xz). For each file in the list:

  1. Check for object existence in the target storage, e.g. HEAD body/fa/d9/01/fad90104322ec7545b7ec7a19562773b4f77b82aeb73b715809547e874f86938f5b95361337c87edcf91b887d19bfc1dced4000000000000.xz (where "body" is the bucket name)
  2. If it doesn't exist, call boto3's upload_file() to upload it to the target key

Some (1-2%) of the objects are already uploaded to this cluster, which is why the existence check is needed; it also allows stopping and resuming these uploads at any point. No objects are replaced, only new objects are uploaded. Autocompaction is disabled. Data is uploaded in parallel (6-10 processes at once) to a test LeoFS cluster (3 storage nodes, N=2, W=1, R=1); I want this upload experiment to fully succeed before actually creating a real cluster (and repeating these uploads after that). However, I'm facing various problems and need some advice.
(Yes, I'm aware that operations against the gateway may sometimes fail due to high load / timeouts / watchdog / etc.; that's what application logic, retries and so on are for. The problem here is not a failure by itself, and the current upload script deliberately does no retries. In real production the amount of data we'll need to move is orders of magnitude bigger than in this experiment; I just expect this small amount of data, ~130 GB in this case, to be moved without a hitch in order to be sure that the production cluster will be able to handle its load.)
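For illustration, a minimal sketch of that upload loop (the endpoint, bucket constant and helper names like key_for() are illustrative, not the actual script):

import os
import boto3
from boto3.s3.transfer import TransferConfig
from botocore.exceptions import ClientError

BUCKET = "body"  # target bucket (illustrative)
# Gateway endpoint and credentials are assumptions; boto3 reads keys from the environment.
s3 = boto3.client("s3", endpoint_url="http://leo-gateway.example:8080")
# Single-threaded transfers, to keep client-side RAM usage low
transfer_cfg = TransferConfig(use_threads=False)

def key_for(name):
    # fad90104... -> "fa/d9/01/fad90104...": the first three byte pairs become the key prefix
    return "/".join((name[0:2], name[2:4], name[4:6], name))

def object_exists(key):
    try:
        s3.head_object(Bucket=BUCKET, Key=key)   # step 1: HEAD check
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "404":
            return False
        raise

def upload_one(path):
    key = key_for(os.path.basename(path))
    if not object_exists(key):                   # step 2: upload only missing objects
        s3.upload_file(path, BUCKET, key, Config=transfer_cfg)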

Experiment 1

(Performed on Apr 21 with a development version from around that time; it supposedly has the fixes for the "badarg" related to leveldb and such.) I tried to upload data in 10 processes, each process supposed to upload around 70K objects (~11 GB). The task was running at night, so there should've been no other load on the storage holding the LeoFS VMs. However, since it's the same storage for all VMs, some bad effects were possible.

Problem 1: around 80 objects failed to upload. On the client the error was:

boto3.exceptions.S3UploadFailedError: Failed to upload /mnt/share2/storage/17/f3/17f3648377033b3fcc28c6253775e68a3e2243d208953886a3f5f01634d14f502e5a5c02df7085e763cfe7df0cf294150494000000000000.xz to body/17/f3/64/17f3648377033b3fcc28c6253775e68a3e2243d208953886a3f5f01634d14f502e5a5c02df7085e763cfe7df0cf294150494000000000000.xz: An error occurred (ServiceUnavailable) when calling the PutObject operation (reached max retries: 0): Please reduce your request rate.

On the gateway: (no error)
On the storage nodes: I can see this object in the log of a single storage node.

[E]	[email protected]	2017-04-21 03:01:55.463434 +0300	1492732915	leo_storage_handler_object:put/2	355	[{from,gateway},{method,put},{key,<<"body/17/f3/64/17f3648377033b3fcc28c6253775e68a3e2243d208953886a3f5f01634d14f502e5a5c02df7085e763cfe7df0cf294150494000000000000.xz">>},{req_id,111293944},{cause,unavailable}]

Problem 2: some timeouts and other strange errors. For example, on the gateway:

[W]     [email protected]  2017-04-21 02:44:19.441589 +0300        1492731859      leo_gateway_api:register_in_monitor/3   212     [{manager,'[email protected]'},{cause,timeout}]
[W]     [email protected]  2017-04-21 02:45:43.402525 +0300        1492731943      leo_gateway_rpc_handler:handle_error/5  303     [{node,'[email protected]'},{mod,leo_storage_handler_object},{method,put},{cause,{'EXIT',{timeout,{gen_server,call,[leo_logger_file_i,{append,{{message_log,0,"leo_storage_handler_object","put/2",344,"~p",[[{from,gateway},{method,put},{key,<<"body/76/19/e9/7619e9b5c3bea34eefd0c7c5fbd70e9041eb6e9d64e8673ba2be10d0902dbf0719633dc3607a9277cce28dcf737b391094f7000000000000.xz">>},{req_id,111794675}]],[],[]},0}}]}}}}]
[W]     [email protected]  2017-04-21 02:45:43.433566 +0300        1492731943      leo_gateway_rpc_handler:handle_error/5  303     [{node,'[email protected]'},{mod,leo_storage_handler_object},{method,put},{cause,timeout}]
[W]     [email protected]  2017-04-21 02:45:43.523553 +0300        1492731943      leo_gateway_rpc_handler:handle_error/5  303     [{node,'[email protected]'},{mod,leo_storage_handler_object},{method,put},{cause,timeout}]

On storage node:

[W]     [email protected]  2017-04-21 03:01:40.715237 +0300        1492732900      leo_storage_replicator:loop/6   209     [{method,put},{key,<<"body/37/ef/00/37ef00984eede36a1d2db16386ac895336a6f123299b6788929d87f4b7befb553c3f4d3a4b0da72a0d3c1db3caef6edbf8b1020000000000.xz">>},{cause,timeout}]
[W]     [email protected]  2017-04-21 03:01:41.712105 +0300        1492732901      leo_storage_replicator:replicate/5      121     [{method,put},{key,<<"body/37/ef/00/37ef00984eede36a1d2db16386ac895336a6f123299b6788929d87f4b7befb553c3f4d3a4b0da72a0d3c1db3caef6edbf8b1020000000000.xz">>},{cause,timeout}]
[E]     [email protected]  2017-04-21 03:01:41.712497 +0300        1492732901      leo_storage_handler_object:put/2        355     [{from,gateway},{method,put},{key,<<"body/37/ef/00/37ef00984eede36a1d2db16386ac895336a6f123299b6788929d87f4b7befb553c3f4d3a4b0da72a0d3c1db3caef6edbf8b1020000000000.xz">>},{req_id,25202181},{cause,"Replicate failure"}]

There are quite a few cases of cause,timeout, cause,"Replicate failure" and cause,unavailable on the storage nodes. What does "cause,unavailable" mean for a PUT request, anyway?

One of the nodes also got these two messages in its log:

error.20170421.02.1:[W]	[email protected]	2017-04-21 02:46:38.812308 +0300	1492731998	leo_storage_handler_object:read_and_repair_3/3	1149	[{key,<<"body/b1/99/6a/b1996a0e93adaa226806534c314b619d269fea2ed6b00e4f71c7a9663b71fac8e749f7e6018fca982a724c29d1d1e8cfced4000000000000.xz">>},{cause,{'EXIT',{timeout,{gen_server,call,[leo_logger_file_i,{append,{{message_log,0,"leo_storage_handler_object","get/1",72,"~p",[[{from,storage},{method,get},{key,<<"body/b1/99/6a/b1996a0e93adaa226806534c314b619d269fea2ed6b00e4f71c7a9663b71fac8e749f7e6018fca982a724c29d1d1e8cfced4000000000000.xz">>}]],[],[]},0}}]}}}}]
error.20170421.02.1:[E]	[email protected]	2017-04-21 02:46:38.823113 +0300	1492731998	leo_storage_handler_object:get/3	149	[{from,gateway},{method,get},{key,<<"body/b1/99/6a/b1996a0e93adaa226806534c314b619d269fea2ed6b00e4f71c7a9663b71fac8e749f7e6018fca982a724c29d1d1e8cfced4000000000000.xz">>},{req_id,81466965},{cause,{'EXIT',{timeout,{gen_server,call,[leo_logger_file_i,{append,{{message_log,0,"leo_storage_handler_object","get/1",72,"~p",[[{from,storage},{method,get},{key,<<"body/b1/99/6a/b1996a0e93adaa226806534c314b619d269fea2ed6b00e4f71c7a9663b71fac8e749f7e6018fca982a724c29d1d1e8cfced4000000000000.xz">>}]],[],[]},0}}]}}}}]

Problem 3: storage nodes restarting under load.

Unfortunately, other than the information in erlang.log on the storage node about the restart and the "EXIT" mentioned in the gateway logs, the restarting storage node leaves no logs at all. The log runs up to the point when other nodes mention this one restarting, then a minute later the logs start again; there are absolutely no errors related to the restart, no crash dumps, nothing. A few restarts of storage nodes happened during the course of the experiment, and some errors on other storage nodes correlate to the moments when a storage node got restarted.

Experiment 2

(Performed on May 1 with a development version that was virtually the same as 1.3.3.) I made changes to the configuration of the storage nodes: set ERL_CRASH_DUMP_SECONDS=-1 and disabled the disk and error watchdogs that were enabled during the previous experiment (the disk watchdog does more harm than good in its current state; I just forgot to disable it before). I was uploading in 6 threads, with the same properties (~70K objects from each thread, or 11 GB); 2-3% of this data already existed on the storage, and these objects were not re-uploaded. Only the new objects were written.

Interestingly enough, there were no restarts of storage nodes this time. The errors were completely different this time around; none of the problems from the previous experiment happened here.

Problem (a funny case):

On the client, I got:

boto3.exceptions.S3UploadFailedError: Failed to upload /mnt/share2/storage/f9/95/f995c8b60e77fe7a79658fd7dd4169ecf5eaabf6662796d8b6eef323c8c044e69e683a09ddee733409265144bccacd7778597a0000000000.xz to body/f9/95/c8/f995c8b60e77fe7a79658fd7dd4169ecf5eaabf6662796d8b6eef323c8c044e69e683a09ddee733409265144bccacd7778597a0000000000.xz: An error occurred (AccessDenied) when calling the CompleteMultipartUpload operation: Access Denied

In gateway logs:

[W]	[email protected]	2017-05-01 17:37:28.153425 +0300	1493649448	leo_gateway_rpc_handler:handle_error/5	298	[{node,'[email protected]'},{mod,leo_storage_handler_object},{method,get},{cause,"Recover failure"}]

On storage_0:

error.20170501.17.1:[W]	[email protected]	2017-05-01 17:37:28.149035 +0300	1493649448	leo_storage_read_repairer:compare/4	165	[{node,'[email protected]'},{addr_id,285012524970156924905764134664763102054},{key,<<"body/f9/95/c8/f995c8b60e77fe7a79658fd7dd4169ecf5eaabf6662796d8b6eef323c8c044e69e683a09ddee733409265144bccacd7778597a0000000000.xz\n46f025cb3c617e50b1514988016e8f39">>},{clock,1493649447851454},{cause,not_found}]
error.20170501.17.1:[E]	[email protected]	2017-05-01 17:37:28.152321 +0300	1493649448	leo_storage_handler_object:get/3	149	[{from,gateway},{method,get},{key,<<"body/f9/95/c8/f995c8b60e77fe7a79658fd7dd4169ecf5eaabf6662796d8b6eef323c8c044e69e683a09ddee733409265144bccacd7778597a0000000000.xz\n46f025cb3c617e50b1514988016e8f39">>},{req_id,88456568},{cause,"Recover failure"}]

On storage_1:

error.20170501.17.1:[E]	[email protected]	2017-05-01 17:37:28.155367 +0300	1493649448	leo_storage_handler_object:put/4	391	[{from,storage},{method,delete},{key,<<"body/f9/95/c8/f995c8b60e77fe7a79658fd7dd4169ecf5eaabf6662796d8b6eef323c8c044e69e683a09ddee733409265144bccacd7778597a0000000000.xz\n">>},{req_id,41211863},{cause,not_found}]

Delete?? What does that mean? No one was issuing a "delete"...

@mocchira
Member

mocchira commented May 3, 2017

@vstax Thanks for reporting in detail.

Experiment 1
Problem 1: around 80 object failed to upload. On client error was

In LeoFS, {cause,unavailable} only happens while LeoFS is reducing incoming requests after leo_watchdog has been triggered for some reason, and that unavailable error results in a 503 being returned to the client.
So there should be some logs showing that leo_watchdog was triggered.
(If there are none, that is a problem.)
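For reference, on the client side this throttling surfaces as a 503 ("Please reduce your request rate"), which application-level retry logic can absorb. A minimal sketch using boto3's put_object; the error codes checked and the backoff values are assumptions, not part of LeoFS:

import time
from botocore.exceptions import ClientError

def put_with_backoff(s3, path, bucket, key, attempts=5, base_delay=0.5):
    """Retry PUTs that fail with throttling-style 503 errors (watchdog load shedding)."""
    for attempt in range(attempts):
        try:
            with open(path, "rb") as f:
                s3.put_object(Bucket=bucket, Key=key, Body=f)
            return
        except ClientError as e:
            code = e.response["Error"]["Code"]
            # Retry only throttling-style errors; re-raise everything else immediately
            if code not in ("ServiceUnavailable", "SlowDown") or attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff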

Problem 2: some timeouts and other strange errors. Example,

on gateway:

It seems the [email protected] network isolated from other nodes around that time.

On storage node

The first of these three log entries represents the root problem, while the others were caused by it.
The initial timeout can be caused by:

  • network congestion
  • the receiver communicating through leo_storage_replicator going down (being restarted)

In this case, the latter seems to have happened.

What does "cause,unavailable" mean for PUT request, anyway?

As I commented above, this can happen while leo_watchdog reduces incoming requests.

One of the nodes also got these two messages in log:

This log is interesting because it seems the process handling log output (leo_logger) got overloaded.
Given this, the restarts of leo_storage might have been caused by this log process rather than by the badarg from eleveldb. We will vet the logger.

Problem 3: storage nodes restarting under load.

As I mentioned above, the restarts might have been caused by leo_logger.

Problem (a funny case):

It seems a multipart upload (which causes multiple chunks to be created in LeoFS) failed in the middle, and the deletes were issued to clean up the partially uploaded chunks, so it looks fine to me.

However..

error.20170501.17.1:[W]	[email protected]	2017-05-01 17:37:28.149035 +0300	1493649448	leo_storage_read_repairer:compare/4	165	[{node,'[email protected]'},{addr_id,285012524970156924905764134664763102054},{key,<<"body/f9/95/c8/f995c8b60e77fe7a79658fd7dd4169ecf5eaabf6662796d8b6eef323c8c044e69e683a09ddee733409265144bccacd7778597a0000000000.xz\n46f025cb3c617e50b1514988016e8f39">>},{clock,1493649447851454},{cause,not_found}]
error.20170501.17.1:[E]	[email protected]	2017-05-01 17:37:28.152321 +0300	1493649448	leo_storage_handler_object:get/3	149	[{from,gateway},{method,get},{key,<<"body/f9/95/c8/f995c8b60e77fe7a79658fd7dd4169ecf5eaabf6662796d8b6eef323c8c044e69e683a09ddee733409265144bccacd7778597a0000000000.xz\n46f025cb3c617e50b1514988016e8f39">>},{req_id,88456568},{cause,"Recover failure"}]

This error looks weird.
Since this file name includes \n, the file is a chunk of a large object in LeoFS.
When the file is a chunk, the tail part following the \n must be a positive integer, like

  • body/f9/95/c8/f995c8b60e77fe7a79658fd7dd4169ecf5eaabf6662796d8b6eef323c8c044e69e683a09ddee733409265144bccacd7778597a0000000000.xz\n1
    • means the first chunk of body/f9/95/c8/f995c8...

However, in this case the tail part following the \n is 46f025cb3c617e50b1514988016e8f39, so obviously a not_found error will happen with such a file name.
We will vet why such a suffix was generated.

@vstax
Contributor Author

vstax commented May 3, 2017

@mocchira Thank you for analyzing.

In LeoFS, {cause,unavailable} only happen while reducing incoming requests after leo_watchdog got triggered for some reason and that unavailable error results in responding 503 to a client.
So there should be some logs that represent leo_watchdog triggered.
(if nothing, that's problem)

This should be the case. In the first experiment I had the error and disk watchdogs enabled (as well as debug logs). There were multiple cases of

info.20170421.02.1:[D]	[email protected]	2017-04-21 02:02:33.51663 +0300	1492729353	leo_storage_watchdog_sub:handle_notify/3	86	received an alart - id:leo_watchdog_sub_1, alarm:{watchdog_alarm,leo_watchdog_disk,{watchdog_state,leo_watchdog_disk,70,disk_util,[{disk_util,100.0}]},1492729348}
info.20170421.02.1:[D]	[email protected]	2017-04-21 02:02:38.52704 +0300	1492729358	leo_storage_watchdog_sub:handle_notify/3	86	received an alart - id:leo_watchdog_sub_1, alarm:{watchdog_alarm,leo_watchdog_disk,{watchdog_state,leo_watchdog_disk,70,disk_util,[{disk_util,100.0}]},1492729348}

Btw, does "error" watchdog react to errors like these? I have a lot of them in general since application logic now tries to download objects from LeoFS first before old storage, and not all data is moved to LeoFS. I assumed this is not a problem so filtered out all these messages from logs because there are so many of them.

[E]     [email protected]  2017-04-21 00:03:34.780440 +0300        1492722214      leo_storage_handler_object:get/1        84      [{from,storage},{method,get},{key,<<"body/f9/4f/56/f94f56db6bac9dea57250483eccb72a141dda0939fa049964a52e497180ff123e8df1b6051a732f7af86e9232eb28a10ced4000000000000">>},{cause,not_found}]
[E]     [email protected]  2017-04-21 00:03:34.832588 +0300        1492722214      leo_storage_handler_object:get/1        84      [{from,storage},{method,get},{key,<<"body/7f/15/b8/7f15b8702ab975296c0b10565c8feb10daad74d0246238412993eac109d73bce821ed4f5ac796dc77818485b6c15d960ced4000000000000">>},{cause,not_found}]

This explains why the second experiment, with all watchdogs except "rex" disabled, had no errors like the first one. Point taken: either disable the watchdogs for these bulk-load scenarios or get ready to retry a lot (and don't use the current version of the "disk watchdog", anyway).

It seems the [email protected] network isolated from other nodes around that time.

That is... highly unlikely. All LeoFS nodes and the source system (from which I was uploading data) are VMs running on the same host and connected over a virtual switch on that host. There is nothing relevant in "dmesg" for these VMs at that time either.

This log is interesting because it seems the process related to log output (leo_logger) got overloaded.
At this time, Restarts on leo_storage might be caused by this log process instead of badarg from eleveldb. We will vet around logger.

Thanks, please do!
I had lots of logs (since debug logging was enabled). E.g. on storage_0 during these hours:

$ wc -l info.20170421.0[23]*
  178530 info.20170421.02.1
  122570 info.20170421.03.1

Still, 170K lines an hour doesn't exactly sound like an insanely huge amount. The log is not fsync'ed, right?

A bit off-topic, but since you mentioned badarg from leveldb: a few minutes before the second experiment, I tried to do "s3cmd ls" on a bucket containing ~6K names. This failed (as expected, I guess; I just felt like trying), but for the minute when the nodes were trying to process this request I got this:

error.20170501.16.1:[E]	[email protected]	2017-05-01 16:39:08.868042 +0300	1493645948	leo_backend_db_eleveldb:first_n/2281	{badarg,[{eleveldb,async_iterator,[#Ref<0.0.17039361.23122>,<<>>,[]],[]},{eleveldb,iterator,2,[{file,"src/eleveldb.erl"},{line,200}]},{leo_backend_db_eleveldb,fold,4,[{file,"src/leo_backend_db_eleveldb.erl"},{line,345}]},{leo_backend_db_eleveldb,first_n,2,[{file,"src/leo_backend_db_eleveldb.erl"},{line,272}]},{leo_backend_db_server,handle_call,3,[{file,"src/leo_backend_db_server.erl"},{line,323}]},{gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,629}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,661}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}
error.20170501.16.1:[E]	[email protected]	2017-05-01 16:39:08.868490 +0300	1493645948	leo_mq_server:handle_call/3	287	{badarg,[{eleveldb,async_iterator,[#Ref<0.0.17039361.23122>,<<>>,[]],[]},{eleveldb,iterator,2,[{file,"src/eleveldb.erl"},{line,200}]},{leo_backend_db_eleveldb,fold,4,[{file,"src/leo_backend_db_eleveldb.erl"},{line,345}]},{leo_backend_db_eleveldb,first_n,2,[{file,"src/leo_backend_db_eleveldb.erl"},{line,272}]},{leo_backend_db_server,handle_call,3,[{file,"src/leo_backend_db_server.erl"},{line,323}]},{gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,629}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,661}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,240}]}]}

and this in erlang.log

=====
===== LOGGING STARTED Mon May  1 16:39:18 MSK 2017
=====
Exec: /usr/local/leofs/current/leo_storage/erts-7.3/bin/erlexec -boot /usr/local/leofs/current/leo_storage/releases/1/leo_storage -mode minimal -config /etc/leofs/leo_storage/app.config -args_file /etc/leofs/leo_storage/vm.args -- console
Root: /usr/local/leofs/current/leo_storage
Erlang/OTP 18 [erts-7.3] [source-d2a6d81] [64-bit] [smp:8:8] [async-threads:32] [hipe] [kernel-poll:true]


=INFO REPORT==== 1-May-2017::16:39:19 ===
    alarm_handler: {set,{system_memory_high_watermark,[]}}

The second log is the one that seems interesting: this type of load managed to hit the "high watermark", yet memory usage was fine all around that time, e.g. from sar:

00:00:01    kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit  kbactive   kbinact   kbdirty
16:20:01       363604   7646400     95,46    223224   5917488   1164408      9,54   3090336   3324080       340
16:30:01       366972   7643032     95,42    223224   5917496   1162644      9,53   3088996   3324080         0
16:40:01       448456   7561548     94,40    224868   5883596   1149868      9,42   3145420   3184900       352

Since there was no other IO load at that moment that could fill the cache (I started the upload experiment about 7 minutes later), the process couldn't have allocated memory when that message happened: as you can see, buffers+cached is well over 6 GB before and after it happened. If the process had allocated memory (and then freed it, or restarted), there would've been less memory allocated to the cache at 16:40. In other words, the "system_memory_high_watermark" message happened without any actual memory usage by the process. I've observed similar problems in experiments with the NFS gateway in the past.

However in this case, the tail part followed by \n is n46f025cb3c617e50b1514988016e8f39 so obviously not_found error should happen with such file name.
We will vet why such suffix generated.

Yes, this looks suspicious. Just in case, I've verified that I have no objects that have 46f025cb3c617e50b1514988016e8f39 as part of their name. The object itself is 6.6 MB, so there should've been two chunks. This string is 32 characters, an ETag, maybe? Well, it doesn't match the md5 of the whole file, the md5 of the first/second chunk, or the ETag of this file.
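(For reference, an AWS-style multipart ETag is not the md5 of the whole object but the md5 of the concatenated binary md5 digests of the parts, with a "-<part count>" suffix. A sketch of that calculation, assuming the part size used by the uploader is known and that LeoFS follows the same convention, which may not hold:)

import hashlib

def multipart_etag(path, part_size=5 * 1024 * 1024):
    """AWS-style multipart ETag: md5 of the concatenated binary part digests, plus "-<parts>"."""
    digests = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(part_size)
            if not chunk:
                break
            digests.append(hashlib.md5(chunk).digest())
    if len(digests) == 1:
        return digests[0].hex()  # non-multipart upload: ETag is the plain md5
    return hashlib.md5(b"".join(digests)).hexdigest() + "-" + str(len(digests))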

The AccessDenied I got on the client is a problem as well: the fact that it logged a failed clean-up attempt is one thing, but there has to be a reason why the error happened in the first place...

Also, despite this AccessDenied, HEAD showed that this object exists in the cluster. I deleted it and tried to re-upload it with the same tool; this time it uploaded fine.

@vstax
Contributor Author

vstax commented May 3, 2017

Just wanted to add that I've performed this experiment once on a brand-new 1.3.4 with debug logs disabled and the disk & error watchdogs enabled (for the sake of the experiment) and had zero problems. I even switched the load from 10 processes to 16 in the middle, but all objects (~1M, around 140 GB in size) were still uploaded without any errors.

Logs on the gateway and storage nodes are all clean except for these messages from the watchdog, which did not affect the gateway:

[W]	[email protected]	2017-05-04 00:38:01.11789 +0300	1493847481	leo_watchdog_disk:disk_stats_1/2	406	[{triggered_watchdog,disk_throughput},{disk_rkb,12.0},{disk_wkb,350464.0}]
[W]	[email protected]	2017-05-04 00:38:31.31775 +0300	1493847511	leo_watchdog_disk:disk_stats_1/2	406	[{triggered_watchdog,disk_throughput},{disk_rkb,20.0},{disk_wkb,369212.0}]
[W]	[email protected]	2017-05-04 00:39:01.64387 +0300	1493847541	leo_watchdog_disk:disk_stats_1/2	406	[{triggered_watchdog,disk_throughput},{disk_rkb,12.0},{disk_wkb,342004.0}]
[W]	[email protected]	2017-05-04 00:49:21.622622 +0300	1493848161	leo_watchdog_disk:disk_stats_1/2	387	[{triggered_watchdog,disk_util},{disk_util,100.0}]
[W]	[email protected]	2017-05-04 00:50:21.682875 +0300	1493848221	leo_watchdog_disk:disk_stats_1/2	387	[{triggered_watchdog,disk_util},{disk_util,100.0}]
[W]	[email protected]	2017-05-04 01:11:12.909000 +0300	1493849472	leo_watchdog_disk:disk_stats_1/2	406	[{triggered_watchdog,disk_throughput},{disk_rkb,0.0},{disk_wkb,199172.0}]
[W]	[email protected]	2017-05-04 01:48:05.282170 +0300	1493851685	leo_watchdog_disk:disk_stats_1/2	406	[{triggered_watchdog,disk_throughput},{disk_rkb,0.0},{disk_wkb,207936.0}]

I think this adds to the suspicion that the logger might be causing some problems. I'll try to repeat the experiment a few more times to maybe reproduce that AccessDenied problem on a multipart object.

@mocchira
Member

mocchira commented May 4, 2017

@vstax

Thanks for the quick validation.

I think this adds to the suspicions that logger might be causing some problems. I'll try to repeat the experiment a few more times to maybe get that AccessDenied problem on multipart object.

Yes, I think so.
Large-volume log output can cause problems.
As I mentioned in the previous comment, we will vet leo_logger.

A note on why upgrading leo_watchdog makes the logs cleaner than before:
my guess is that leo-project/leo_watchdog@8a30a17 might help keep Erlang less overloaded under high load.
This is one of the fixes for #653, so in your experiment on 1.3.4 with watchdogs enabled and LeoFS under high load, this fix would have helped LeoFS avoid getting overloaded, resulting in fewer errors and complete uploads.
We will consider making #653 a higher-priority task and applying the same hack across the whole LeoFS source tree.

Below are answers to your questions.

does "error" watchdog react to errors like these?

No. not_found is excluded from the targets the "error" watchdog monitors.

That is.. highly unlikely. All LeoFS nodes and the source system (from which I was uploading data) are VMs running on the same host and connected over virtual switch on that system. There is nothing relevant in "dmesg" for these VMs at that time as well.

Got it. This might be explained by the above comment (LeoFS can get overloaded without leo-project/leo_watchdog@8a30a17).

Still, 170K lines an hour doesn't exactly sound like insanely huge amounts. The log is not fsync'ed, right?

Right. 170K lines an hour looks sane, and the log is not fsync'ed.

A bit offtopic, but since you mentioned badarg from leveldb: this was just a few minutes before second experiment, I've tried to do "s3cmd ls " on a bucket containing ~6K names. This failed (as expected, I guess, I just felt like trying), but for a minute when nodes were trying to process this request I had this

This looks like another kind of badarg from leveldb. I will vet it.

Since there was no other IO load at that moment which could fill the cache (I started upload experiment 7 minutes later or so), process couldn't have allocated memory when that message happened: as you can see, buffers+cached is well over 6GB before and after it happened, if the process had allocated memory (and then free'ed it or restarted), at 16:40 there would've been less memory allocated to cache. In other words, the "system_memory_high_watermark" message happened without any actual memory usage by process. I've observed similar problems in experiments with NFS gateway in the past.

There are cases where the Erlang runtime malloc'ed memory regions for its own heap from the OS and did NOT give them back to the OS even when they were no longer needed.
That said, system_memory_high_watermark doesn't depend on the actual memory usage reported by the OS but on the heap managed by the Erlang runtime itself.

Yes, this looks suspicious. Just in case, I've verified that I have no objects that have 46f025cb3c617e50b1514988016e8f39 as part of their name. The object itself is 6.6 MB, so there should've been two chunks. This string is 32 characters - ETag, maybe? Well it doesn't match with md5 of whole file, md5 of first/second chunk or the ETag of this file.
The AccessDenied I got on a client is a problem as well: the fact that it logged a failed clean-up attempt is one thing, but there has to be a reason why the error happened in the first place...

WIP.

@mocchira
Member

@vstax I've found explanations for almost everything you've faced, so I will share them below one by one.

I think this adds to the suspicions that logger might be causing some problems.

Filed as another issue on #729.

The AccessDenied I got on a client is a problem as well:

Filed as another issue on #730.

This looks another kind of badarg from leveldb. I will vet.

It turned out that the badarg from leveldb at this time could happen around when leo_storage restarted, so this badarg itself isn't a problem.

As this file name includes \n, that means this file is a chunk of a large object in LeoFS.
In case the file is a chunk, the tail part followed by \n must be positive integer

Sorry, this was wrong.
There is a case where the tail part following the \n can be a 32-character hex string when multipart upload is used (the 32-character string denotes the Upload ID), so this kind of file name itself isn't a problem.

Now that the root causes have been covered in #729, #730 and also #653,
can you close this issue if you don't find anything uncovered here?

@vstax
Contributor Author

vstax commented May 10, 2017

@mocchira Thank you for the investigation!
I'd like to do a few more experiments for now; I didn't do any after the "delete bucket" experiment since those "stalled queues" confused me, but now that the cause has been found I can do some more runs. I'll close the ticket if there are no new problems.

The one that really bothers me is the AccessDenied problem. Not the error code itself, but the errors from the failed cleanup attempt and the fact that a HEAD request showed the object present in storage, and the whereis command showed that as well. The code uploading that object didn't do any retries, and no one else was uploading that object. I have a (maybe false) suspicion that, as a result of some errors during upload, or errors during the cleanup attempt on the parts of the multipart object (when the upload fails due to errors and the gateway tries to clean up), something is left on storage. The fact that the object didn't upload properly (there was a cleanup attempt and everything), yet HEAD after that error showed it as present, is alarming; it can break application logic.
There was a ticket #561 which just got closed for some reason (is it done? or won't it be done?); could it be that the problem behind that ticket, combined with cleanup attempts, or errors during cleanup attempts when there are problems with multipart objects, causes inconsistency on storage? Not a one-time logical inconsistency between storage and application like #561, but real inconsistency, as if some parts are present while others are not, or the object header is present while its parts are not.
(Like, I don't know, with W=2 enough nodes were present during uploading of the first parts, but not enough during uploading of the later parts. Though this isn't quite that situation, since the experiment was running with W=1; something else has happened here.)

Unfortunately, after looking at the HEAD result and whereis output I was too quick to delete (the remains of that object?) on storage and re-upload it, which was a mistake. I hope to see this problem once more to check whether the data is broken on storage or not. However, given that it happened once per 2M uploads and there were no other errors during that experiment, I don't know what the chances are of encountering it again.

@mocchira
Member

@vstax

Prerequisite

  • It turned out that almost every S3 client issues the multipart abort request when something goes wrong during a multipart upload (initiate, upload parts multiple times, and finally complete|abort) in order to clean up all the temporary parts.
    In LeoFS, the abort request causes the code block at https://github.com/leo-project/leofs/blob/1.3.4/apps/leo_gateway/src/leo_gateway_s3_api.erl#L945-L956 to run.

  • Eventual consistency
    This means a read (HEAD/GET) can succeed even if it's issued after confirming that a delete of the same object succeeded. Note: the Read After Write (RAW) consistency AWS S3 provides guarantees this can't happen; LeoFS, however, doesn't provide it at the moment.

  • Last Write Wins (LWW)
    All replicas stored in LeoFS are kept up to date based on the most recent timestamp, in line with the eventual consistency model.

Answers

but in fact HEAD after that error was showing that it is present is alarming, it can break application logic.

Unfortunately, this can happen due to the inherent nature of eventual consistency, regardless of whether the object is uploaded through multipart or single-part.

My guess at how it happened in your case (a sketch of this client-side sequence follows the list):

  1. The S3 client sent the multipart upload completion request to leo_gateway
  2. The multipart upload completion request succeeded on one of the leo_storage nodes (the object became available at this point)
  3. However, leo_gateway failed to receive the success reply from the leo_storage nodes
  4. The S3 client received an error reply from leo_gateway
  5. The S3 client sent the multipart upload abort request to leo_gateway
  6. The multipart upload abort request succeeded on one of the leo_storage nodes (the availability of the object became ambiguous, because one replica was deleted while the other was still available)
  7. The S3 client reported that the multipart upload failed
  8. The S3 client sent a HEAD request and got a 200 reply (because one of the replicas could still be available)
  9. The object should be deleted on both replicas eventually.
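A rough sketch of that client-side sequence using boto3's low-level API (this is roughly what upload_file does internally, simplified; the abort in the except branch is the cleanup step that can race with a completion that already succeeded server-side):

import boto3
from botocore.exceptions import ClientError

def multipart_upload(s3, bucket, key, parts_data):
    """parts_data: list of bytes objects, one per part (each >= 5 MB except the last)."""
    upload_id = s3.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]
    try:
        parts = []
        for num, data in enumerate(parts_data, start=1):
            resp = s3.upload_part(Bucket=bucket, Key=key, UploadId=upload_id,
                                  PartNumber=num, Body=data)
            parts.append({"PartNumber": num, "ETag": resp["ETag"]})
        # Steps 1-2: the completion request may succeed on a storage node even if
        # the client never receives the success reply from the gateway.
        s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id,
                                     MultipartUpload={"Parts": parts})
    except ClientError:
        # Steps 5-6: on any error the client aborts, deleting the uploaded parts;
        # this abort can arrive after a completion that already took effect.
        s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
        raise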

There was a ticket #561 which just got closed for some reason (it's done? or it won't be done?);

It won't be done because it's the spec at the moment.
However, IMHO a RAW consistency guarantee could be worth discussing to make application logic easier to reason about.
Let me know what you think about it.

could it be that the problem behind that ticket combined with cleanup attempts, or errors during cleanup attempts when there are problems with multipart objects cause inconsistency on storage? Not one-time logical inconsistency between storage and application like from #561, but real inconsistency, as if some parts are present while others were not, or object header present while its parts are not.
(like, IDK, with W=2 enough nodes were present during uploading of first parts, but not enough during uploading of later parts. Though this isn't about this situation since the experiment was running with W=1, something else has happened here).

If I understand what you asked correctly:
yes, there is #719, so any update to a large object can be non-atomic (in other (your) words, the object header being present while its parts are not could happen).

@vstax
Contributor Author

vstax commented May 11, 2017

@mocchira
Thank you for the detailed description. Some comments:

It turned out that almost every S3 clients call the multipart abort request when something goes wrong during a multipart upload process (initiate, upload parts multiple times and finally complete|abort) to clean up every temporal parts.

Interesting; I thought this was to be handled on the server side, not on the client. Does that mean that if a network failure occurred between the client and the gateway, or if the client just died (OOM, power outage, etc.), the remains of objects (some parts and such) may be left on LeoFS? In more detail:

Everything will be fine if the same object is uploaded again. However, I have questions about other cases, when the original client didn't re-upload the object.

  1. Let's suppose the client couldn't finish the cleanup - it just died during the upload of a multipart object. Do the already-uploaded object chunks remain in LeoFS under some conditions, i.e. won't they be removed during compaction?
  2. If some client does a "GET" for such object remains, will LeoFS detect the inconsistency and perform the clean-up instead of returning something broken to the client?
  3. If some client does a "HEAD" before a "GET" for such an object, can it be told that the object exists in storage? I don't mean that this can happen for some time due to eventual consistency; I mean as a stable state: the client does "HEAD" again and again and always sees that the object exists, even though it's impossible to retrieve it through "GET" because it's broken. But if no one does "GET", "PUT" or "DELETE" on it, then "HEAD" will report that the object exists forever. If this is somehow possible, then "HEAD" is not to be trusted..
  4. If another, smaller (non-multipart, or one that has fewer parts than the original) object with the same name is uploaded, can already-uploaded chunks of the previous object remain on storage? (This question only makes sense if the answer to the first question is "yes".)

but in fact HEAD after that error was showing that it is present is alarming, it can break application logic.

Unfortunately it could happen due to the inherent nature of the eventual consistency regardless of its operations whether it's uploaded through multipart or singlepart.

You're absolutely right. It's my bad, I forgot about this detail for a moment and asked the wrong question. Yes, eventual consistency is no problem and can be handled in an application designed with that in mind (also, there is always the option of having R + (W or D) > N, plus disabling the LeoGateway cache, to get strong consistency in LeoFS).

Anyhow, my question was wrong: "HEAD" or "GET" working right after "DELETE" won't break an application that can handle eventual consistency. What I really meant to ask was question №2 from the above: whether it's possible to keep getting "object exists" from "HEAD" over and over even when the cluster seems fully consistent and nothing is broken on the storage nodes or in the queues. In other words, a stable inconsistency between the object presence flag and being able to actually get that object. This is only relevant for a multipart upload broken by trouble between the client and the gateway (my understanding is that it's impossible to get such an inconsistency for a single-part object, as long as you rule out things like the gateway cache, at least).

S3 Client sent the head request and got the 200 reply (because one of them could be still available here)
The object should be deleted on both replicas eventually.

Why would it? After I did "HEAD" I executed the "whereis" command and the "del" flag wasn't present on either of the two replicas. So it wouldn't have been deleted by itself, I think. I had to do "DELETE" myself and re-upload it after that...

It won't be done because it's spec at the moment.
However IMHO the RAW consistency grantee could be arguable to make application logic understandable.

Hmm, but they are about somewhat different things? RAW is about getting strong consistency in the most important place to make application logic easier, while #561 is about writes ignoring the W setting internally: the application thinks the write didn't succeed while in reality it did. There is a similarity between them, but the use case is different: RAW is important for normal cluster operations, while #561 is needed to get rid of problems when the cluster is in a degraded state..

Well, no matter: they can both be worked around (one can live without RAW by passing the ETag together with the object name between clients, and without #561 by always re-uploading the object or checking its presence with HEAD after an error). And even if #561 were implemented, there are some theoretical cases where the client gets an error after a successful upload, even though that would be rare..
I'm more interested in a possible bug in the current implementation which, if present, can make that workaround for #561 impossible.

If I understand what you asked correctly,
Yes there is #719 so any update to a large object can be not atomic (In other(your) words, the object header present while its part are not could happen).

Actually, I meant the initial upload of a multipart object, not its update as described in #719. In this test, the logic was strict:

  1. Check for object presence in LeoFS with HEAD
  2. If it doesn't exist, upload

After it failed with AccessDenied and produced those errors in the logs from the clean-up attempt, I executed it again hours later and it said that the object does exist. I executed "whereis" and the object was not deleted. So I had to delete it, check with "whereis" that it's marked as deleted on both nodes, and launch the application that uploaded this object again, this time without problems. I mentioned #561 only as a possible thought about what might have been going on behind the scenes; this test does not rely on whether the problem from #561 is present or not. I'm bothered because there clearly was not only an error returned to the client, but a clean-up attempt as well, so why anything would still be present after that is beyond me.

I understand that this is very theoretical; I should've gathered more information about that object before removing it, which is why I hope to encounter the problem again...

@mocchira
Member

@vstax thanks for your insightful reply.

Interesting; I thought this is to be handled on server side, not on client. Does that mean that if network failure has occured between client and the gateway, or if client just died (OOM or power outage etc.), the remains of objects (some parts and such) may remain on LeoFS? In more detail:

Yes, and the original AWS S3 behaves the same way (it relies on clients by default, and you have to pay for storing such garbage parts!).
However, S3 provides http://docs.aws.amazon.com/AmazonS3/latest/dev/mpuoverview.html#mpu-abort-incomplete-mpu-lifecycle-config to mitigate this risk as much as possible.
We may have to consider implementing such a garbage collector.
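(For reference, on AWS S3 the mitigation linked above is expressed as a bucket lifecycle rule; a sketch with boto3 follows. Note this is the AWS API, not something LeoFS implements at the moment, and the bucket name and 7-day threshold are illustrative.)

import boto3

s3 = boto3.client("s3")
# Automatically abort multipart uploads that have not completed within 7 days,
# so their orphaned parts do not keep accruing storage costs.
s3.put_bucket_lifecycle_configuration(
    Bucket="body",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "abort-incomplete-mpu",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},
            "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
        }]
    },
)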

Let's suppose the client couldn't finish cleanup - it just died during upload of multipart object. Do already uploaded object chunks remain on LeoFS under some condition, i.e. won't be removed during compaction?

If its delete flag is not set,
yes, they remain and won't be removed during compaction.

If some client does "GET" for such object remains, will LeoFS detect inconsistency and perform the clean-up instead of returning something broken to the client?

This behavior depends on how far the upload progressed.
The cases below can happen:

  1. The parent object has still NOT been updated (complete/abort didn't reach the leo_storage node(s)) and only the children (parts) exist.
    leo_gateway returns 404 (and the parts remain).

  2. The parent object was updated by complete on the leo_storage node(s) and all the children that are supposed to exist are present.
    leo_gateway returns 200 if consistency.R is satisfied, or 500 if not, and subsequently read-repair will happen.

  3. The parent object was updated by abort on the leo_storage node(s).
    leo_gateway returns 404 (and the parts are deleted eventually in the background).

So there should be no case that returns something broken to the client; however, as I said above, garbage parts can remain.

If some client does "HEAD" before "GET" for such object, can it get information that object exists on a storage? I don't mean the fact that it can happen for some time due to eventual consistency. I mean as a stable case - client does "HEAD" again and again and always gets the object exists, even though it's impossible to retrieve it though "GET" because it's broken. But if no one does "GET", "PUT" or "DELETE" on it, then "HEAD" will return that object exists forever. If this is somehow possible, then "HEAD" is not to be trusted..

As a stable case, and aside from #719, there should be no chance of that.

If another, smaller (non-multipart, or one that has less parts than original one) object with the same name is uploaded, can already uploaded chunks of previous object remain on storage? (this question only makes sense if answer to the first question is "yes")

No, the chunks of the previous object are removed.

Anyhow, my question was wrong, "HEAD" or "GET" working right after "DELETE" won't break application that can handle eventual consistency. What I really meant to ask was question №2 from the above. If it's possible to be getting "object exists" from "HEAD" over and over even when cluster seems fully consistent and nothing was broken on storage nodes or in queues. In other words, stable inconsistency between object presence flag and being able to actually get that object. This is only relevant for broken multipart upload because of trouble between client and gateway (my understanding is that it's impossible to get such inconsistency for single-part object, as long as you rule out things like gateway cache, at least).

Got it.

stable inconsistency between object presence flag and being able to actually get that object

The stable inconsistency should not happen, even when the object is handled through multipart upload.

Why would it? After I did "HEAD" I executed "whereis" command and "del" flag wasn't present on any of two replicas. So it wouldn't be deleted by itself, I think. I had to do "DELETE" myself and re-upload it after that...

Sorry, the last paragraph was wrong (I missed that in the situation you actually faced, the object still existed hours later).
We have no idea at the moment; however, theoretically the object should be deleted eventually if you didn't touch (GET or update) the object, so I guess there might be something.
I will vet it in detail.

I'm more interested in possible bug in current implementation, which - if present - can make that workaround for #561 impossible.

Got it.

I understand that this is very theoretical, I should've gathered more information about that object before removing it. Which is why I hope to encounter that problem again..

I hope so too; please share if you have any luck.

@mocchira
Member

@vstax

I think the stable inconsistency problem here would be explained by #735.
Let me know if you think differently.

@vstax
Contributor Author

vstax commented May 18, 2017

@mocchira Thank you for analyzing this problem. I see that you've found some pretty well-hidden bugs! Impressive.

I don't really know what happened there; other than the logs from the gateway and storage_1/storage_2 that I've shown before, I only have application logs, and they don't show much more than "AccessDenied" (well, I also know that it happened between 17:37:27, the last successful operation, and 17:37:30, the start of the next operation), but that doesn't tell anything new. I had retries completely disabled in boto3; it's configured to fail after a single operation goes wrong (for our applications, it's better to get an error as fast as possible than bear with the increasing timeouts that boto3 and many other S3 clients use between retries). There obviously was no network failure. As for the watchdog, well, there are no messages in the info/error logs related to it (or any other messages in the logs around that time, for that matter) on any node or gateway, but as you mention, it could be something else that was not logged.
(Just in case: there is no official way to disable retries in boto3 right now, so it was done through a hack like

# Reach into botocore's private event-handler registry and override the retry
# checker's maximum attempts (private API; see boto/botocore#882)
_handler = s3client.meta.events._unique_id_handlers['retry-config-s3']['handler']
_handler._checker.__dict__['_max_attempts'] = self.max_attempts

as per boto/botocore#882)

Anyhow, after the latest pull request is merged, I'll try to do a few tests with both uploads and deletes and share my results.

Also, I have this idea: there is some code that returned AccessDenied (#730); would it be possible to make a hack so that this "CompleteMultipartUpload" operation on the gateway always switches to that logic and returns 403 to the client, i.e. fails whenever CompleteMultipartUpload is called? With a few experiments (moving the code block that returns that error around, so it can be tried before or after the actual logic that handles the CompleteMultipartUpload call) it should be possible to trigger this. The situation will be a bit different, but I think it's likely to show how the client and gateway behave, and maybe even reproduce the exact same problem. boto3+botocore is quite a bit over-engineered, so I don't know how to track its logic by looking at its code...
If you think making such a hack is possible and you can spare some time for it, I can do the testing (I should be able to handle moving that code around to check a few different situations)...

@mocchira
Member

@vstax

Also, I have this idea: there is some code that returned AccessDenied (#730); would it be possible to make a hack that this "CompleteMultipartUpload" operation on gateway always switches to that logic and returns 403 to client, then fails whenever CompleteMultipartUpload is called? With a few experiments (by moving code block that returns that error around so it can be tried before or after actual logic that handles this CompleteMultipartUpload call) it should be possible to trigger this; the situation will be a bit different, but I think that it's likely to see how client and gateway behaves then and maybe even see the same exact problem. boto3+botocore is quite a bit over-engineered so I don't know how to track its logic by looking at its code...

Sure. I will make a branch for that and share it here later.

boto3+botocore is quite a bit over-engineered so I don't know how to track its logic by looking at its code...

Absolutely ;).

@mocchira
Member

@vstax Sorry for the long delay.

The patch below is what you want.

diff --git a/apps/leo_gateway/src/leo_gateway_s3_api.erl b/apps/leo_gateway/src/leo_gateway_s3_api.erl
index e330cd3..8c155e3 100644
--- a/apps/leo_gateway/src/leo_gateway_s3_api.erl
+++ b/apps/leo_gateway/src/leo_gateway_s3_api.erl
@@ -959,20 +959,26 @@ handle_2({ok,_AccessKeyId}, Req, ?HTTP_DELETE,_Key,

 %% For Multipart Upload - Completion
 handle_2({ok,_AccessKeyId}, Req, ?HTTP_POST,_Key,
-         #req_params{bucket_info = BucketInfo,
+         #req_params{bucket_info = _BucketInfo,
                      path = Path,
-                     chunked_obj_len = ChunkedLen,
+                     chunked_obj_len = _ChunkedLen,
                      is_upload = false,
                      upload_id = UploadId,
                      upload_part_num = PartNum,
                      transfer_decode_fun = TransferDecodeFun,
                      transfer_decode_state = TransferDecodeState}, State) when UploadId /= <<>>,
                                                                                PartNum == 0 ->
-    Res = cowboy_req:has_body(Req),
-
-    {ok, Req_2} = handle_multi_upload_1(
-                    Res, Req, Path, UploadId,
-                    ChunkedLen, TransferDecodeFun, TransferDecodeState, BucketInfo),
+    BodyOpts = case TransferDecodeFun of
+                   undefined ->
+                       [];
+                   _ ->
+                       [{transfer_decode, TransferDecodeFun, TransferDecodeState}]
+               end,
+    _Res = cowboy_req:has_body(Req),
+    Ret = cowboy_req:body(Req, BodyOpts),
+    {ok, Req_2} =
+        ?reply_forbidden([?SERVER_HEADER], ?XML_ERROR_CODE_AccessDenied,
+                          ?XML_ERROR_MSG_AccessDenied, Path, <<>>, Req),
     {ok, Req_2, State};

Let me know if you find something.

@vstax
Contributor Author

vstax commented May 25, 2017

@mocchira Thank you!
Unfortunately, nothing so far.
A brief log from boto3 with retries set to 1 and this code applied:

<- POST /bodytest/f9/95/c8/f995c8b60e77fe7a79658fd7dd4169ecf5eaabf6662796d8b6eef323c8c044e69e683a09ddee733409265144bccacd7778597a0000000000__.xz?uploads HTTP/1.1
-> HTTP/1.1 200 OK
-> HTTP/1.1 100 Continue
<- (binary data)
<- PUT /bodytest/f9/95/c8/f995c8b60e77fe7a79658fd7dd4169ecf5eaabf6662796d8b6eef323c8c044e69e683a09ddee733409265144bccacd7778597a0000000000__.xz?partNumber=1&uploadId=80fd0c357723e9434f5a81d751842e78 HTTP/1.1
-> HTTP/1.1 200 OK
-> HTTP/1.1 100 Continue
<- (binary data)
-> HTTP/1.1 200 OK
<- POST /bodytest/f9/95/c8/f995c8b60e77fe7a79658fd7dd4169ecf5eaabf6662796d8b6eef323c8c044e69e683a09ddee733409265144bccacd7778597a0000000000__.xz?uploadId=80fd0c357723e9434f5a81d751842e78 HTTP/1.1
-> HTTP/1.1 403 Forbidden
<- DELETE /bodytest/f9/95/c8/f995c8b60e77fe7a79658fd7dd4169ecf5eaabf6662796d8b6eef323c8c044e69e683a09ddee733409265144bccacd7778597a0000000000__.xz?uploadId=80fd0c357723e9434f5a81d751842e78 HTTP/1.1
-> HTTP/1.1 204 No Content

The boto3 error is exactly the same:

boto3.exceptions.S3UploadFailedError: Failed to upload f995c8b60e77fe7a79658fd7dd4169ecf5eaabf6662796d8b6eef323c8c044e69e683a09ddee733409265144bccacd7778597a0000000000__.xz to bodytest/f9/95/c8/f995c8b60e77fe7a79658fd7dd4169ecf5eaabf6662796d8b6eef323c8c044e69e683a09ddee733409265144bccacd7778597a0000000000__.xz: An error occurred (AccessDenied) when calling the CompleteMultipartUpload operation: Access Denied

so we can assume that the same single DELETE request was performed in the original case as well.

An error logged on storage_0:

[E] [email protected]  2017-05-25 22:52:27.341085 +0300    1495741947  leo_storage_handler_object:put/4    416 [{from,storage},{method,delete},{key,<<"bodytest/f9/95/c8/f995c8b60e77fe7a79658fd7dd4169ecf5eaabf6662796d8b6eef323c8c044e69e683a09ddee733409265144bccacd7778597a0000000000__.xz\n">>},{req_id,12622454},{cause,not_found}]

But the object is correctly deleted:

# /usr/local/leofs/current/leofs-adm -p 10011 whereis bodytest/f9/95/c8/f995c8b60e77fe7a79658fd7dd4169ecf5eaabf6662796d8b6eef323c8c044e69e683a09ddee733409265144bccacd7778597a0000000000__.xz
-------+-----------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
 del?  |            node             |             ring address             |    size    |   checksum   |  has children  |  total chunks  |     clock      |             when            
-------+-----------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
  *    | [email protected]      | 4719bdbe5de0de058f49ac3f562243f9     |         0B |   d41d8cd98f | false          |              0 | 5505e8f723068  | 2017-05-25 22:52:27 +0300
  *    | [email protected]      | 4719bdbe5de0de058f49ac3f562243f9     |         0B |   d41d8cd98f | false          |              0 | 5505e8f723068  | 2017-05-25 22:52:27 +0300

For comparison, in the original case the logs of storage_0, the gateway and storage_1 were (same ones as posted before, just summarized here):

[W]     [email protected]  2017-05-01 17:37:28.149035 +0300        1493649448      leo_storage_read_repairer:compare/4     165     [{node,'[email protected]'},{addr_id,285012524970156924905764134664763102054},{key,<<"body/f9/95/c8/f995c8b60e77fe7a79658fd7dd4169ecf5eaabf6662796d8b6eef323c8c044e69e683a09ddee733409265144bccacd7778597a0000000000.xz\n46f025cb3c617e50b1514988016e8f39">>},{clock,1493649447851454},{cause,not_found}]
[E]     [email protected]  2017-05-01 17:37:28.152321 +0300        1493649448      leo_storage_handler_object:get/3        149     [{from,gateway},{method,get},{key,<<"body/f9/95/c8/f995c8b60e77fe7a79658fd7dd4169ecf5eaabf6662796d8b6eef323c8c044e69e683a09ddee733409265144bccacd7778597a0000000000.xz\n46f025cb3c617e50b1514988016e8f39">>},{req_id,88456568},{cause,"Recover failure"}]

[W]     [email protected]  2017-05-01 17:37:28.153425 +0300        1493649448      leo_gateway_rpc_handler:handle_error/5  298     [{node,'[email protected]'},{mod,leo_storage_handler_object},{method,get},{cause,"Recover failure"}]

[E]     [email protected]  2017-05-01 17:37:28.155367 +0300        1493649448      leo_storage_handler_object:put/4        391     [{from,storage},{method,delete},{key,<<"body/f9/95/c8/f995c8b60e77fe7a79658fd7dd4169ecf5eaabf6662796d8b6eef323c8c044e69e683a09ddee733409265144bccacd7778597a0000000000.xz\n">>},{req_id,41211863},{cause,not_found}]

The last error is the same as in this experiment; just like in that one, it's logged only on the secondary node (which is storage_0 for this file name but was storage_1 for the original one). The errors before that should be unrelated to DELETE, which doesn't explain why DELETE failed to actually delete the object and doesn't lead anywhere.

I'll try to think of something else, and perform some high-load experiments next week.

@mocchira
Member

@vstax thanks!

FWIW, let me share one thing.

The errors before that should be unrelated to DELETE.. which doesn't explain why DELETE had failed to actually delete the object and doesn't lead anywhere.

Yes, that should be unrelated to the DELETE itself, and it also doesn't explain why it failed.
However, the earlier errors are the root cause of the DELETE happening, due to there being no file suffixed by the Upload ID (which means the multipart upload complete request was processed even though leo_gateway responded 403 to the client).

And... I noticed something while writing this reply.

There might be a case where:

  • the multipart upload complete request is processed on the server side behind the scenes
  • the multipart abort request is issued by the client

These operations can be inverted chronologically if processing the complete request is somehow delayed, because the two operations are totally independent (there is no causality between them).

This case can be mitigated by #736; however, there are other corner cases where such inverted operations could happen, so I will keep vetting.
Also, please share your thoughts if you find something related to this topic.

@vstax
Contributor Author

vstax commented Jun 6, 2017

@mocchira
I did some more upload tests on a proper version: 1.3.5-rc1 plus the very latest leo_mq and leo_watchdog, but without the #752 change (because, well, for this test it's unimportant what is returned; the important question is why it is returned). I managed to encounter this AccessDenied problem again, but it happened a bit differently this time. There were no other problems at all during 5+ hours of non-stop parallel uploads.

The problem causing AccessDenied happened twice, for different clients, within a 2-minute range. At that point there was higher load on the storage system hosting these test VMs; I'm pretty confident that the original cause that eventually led to this situation is high IO load. However, even though the disk watchdog was triggering from time to time (utilization and throughput), it did not write anything in the logs right when the problem happened (I'm showing an excerpt from the logs with some unrelated lines - the ones from before and after the time of the problem). Another interesting thing is that only one of these two problems is logged on the gateway; there must be some difference, i.e. the first situation was apparently considered more "normal" by the gateway, even though the end result was the same for the client.

"Object not deleted" bug did not happen this time. Both objects are correctly deleted on storage. I think that it went through code paths that caused the similar problem, but cleaned up correctly this time. Still no idea what happened the last time. Some race condition? Error during error handling / cleanup?

Anyhow, here are the logs. I've merged the info and error logs for the storage nodes. Please note that some of it is unrelated - e.g. a memory watermark was hit 2 minutes after the problem. However, that is still a bit strange because it did not happen again during the next few hours of stress testing. The cluster was completely wiped for this test, and this AccessDenied problem actually happened within a few minutes of the start of the test, when the cluster held no more than 10 GB of data. Autocompaction is disabled. Well, like I said above, I'm sure that higher load on the storage system happened at that moment and it's likely to be related.

Clients:

2017-06-05 21:54:06 boto3.exceptions.S3UploadFailedError: Failed to upload /mnt/share2/storage/60/51/60513d5c454f467f05ab51fe8c527e35649a5bd1b17b094f4c49ffea46cc4b4aed27a26210327518e0cfd98495af1f027e56970000000000.xz to bodytest/60/51/3d/60513d5c454f467f05ab51fe8c527e35649a5bd1b17b094f4c49ffea46cc4b4aed27a26210327518e0cfd98495af1f027e56970000000000.xz: An error occurred (AccessDenied) when calling the CompleteMultipartUpload operation: Access Denied

2017-06-05 21:56:05 boto3.exceptions.S3UploadFailedError: Failed to upload /mnt/share2/storage/90/81/90819974e82b6889c5efef7ac337c19aa7be9c49df92bfa3b03c55c578f60c5e8fb403baf2b7de9a4726d4bd3103b67af83b650000000000.xz to bodytest/90/81/99/90819974e82b6889c5efef7ac337c19aa7be9c49df92bfa3b03c55c578f60c5e8fb403baf2b7de9a4726d4bd3103b67af83b650000000000.xz: An error occurred (AccessDenied) when calling the CompleteMultipartUpload operation: Access Denied

Gateway:

[W]	[email protected]	2017-06-05 21:56:05.872319 +0300	1496688965	leo_gateway_rpc_handler:handle_error/5	298	[{node,'[email protected]'},{mod,leo_storage_handler_object},{method,get},{cause,"Recover failure"}]

storage_0:

[I]	[email protected]	2017-06-05 21:58:37.412132 +0300	1496689117	null:null	0	["alarm_handler",58,32,"{set,{system_memory_high_watermark,[]}}"]

storage_1:

[W]	[email protected]	2017-06-05 21:34:49.848292 +0300	1496687689	leo_watchdog_disk:disk_stats_1/2	387	[{triggered_watchdog,disk_util},{disk_util,90.9}]
[E]	[email protected]	2017-06-05 21:54:06.182428 +0300	1496688846	leo_storage_handler_object:get/1	89	[{from,storage},{method,get},{key,<<"bodytest/60/51/3d/60513d5c454f467f05ab51fe8c527e35649a5bd1b17b094f4c49ffea46cc4b4aed27a26210327518e0cfd98495af1f027e56970000000000.xz\n356d4977cd5f9741083c7a991ea02e3b">>},{cause,not_found}]
[E]	[email protected]	2017-06-05 21:54:06.194017 +0300	1496688846	leo_storage_handler_object:put/4	416	[{from,storage},{method,delete},{key,<<"bodytest/60/51/3d/60513d5c454f467f05ab51fe8c527e35649a5bd1b17b094f4c49ffea46cc4b4aed27a26210327518e0cfd98495af1f027e56970000000000.xz\n">>},{req_id,115857409},{cause,not_found}]
[W]	[email protected]	2017-06-05 21:56:05.859868 +0300	1496688965	leo_storage_read_repairer:compare/4	165	[{node,'[email protected]'},{addr_id,92693022232580438140728862238119976981},{key,<<"bodytest/90/81/99/90819974e82b6889c5efef7ac337c19aa7be9c49df92bfa3b03c55c578f60c5e8fb403baf2b7de9a4726d4bd3103b67af83b650000000000.xz\n3222353f1cef553ff7cfb62a84f3a141">>},{clock,1496688965354819},{cause,not_found}]
[E]	[email protected]	2017-06-05 21:56:05.862211 +0300	1496688965	leo_storage_handler_object:get/3	156	[{from,gateway},{method,get},{key,<<"bodytest/90/81/99/90819974e82b6889c5efef7ac337c19aa7be9c49df92bfa3b03c55c578f60c5e8fb403baf2b7de9a4726d4bd3103b67af83b650000000000.xz\n3222353f1cef553ff7cfb62a84f3a141">>},{req_id,80325131},{cause,"Recover failure"}]
[E]	[email protected]	2017-06-05 21:56:05.879601 +0300	1496688965	leo_storage_handler_object:put/4	416	[{from,storage},{method,delete},{key,<<"bodytest/90/81/99/90819974e82b6889c5efef7ac337c19aa7be9c49df92bfa3b03c55c578f60c5e8fb403baf2b7de9a4726d4bd3103b67af83b650000000000.xz\n">>},{req_id,22187278},{cause,not_found}]
[I]	[email protected]	2017-06-05 21:57:38.689050 +0300	1496689058	null:null	0	["alarm_handler",58,32,"{set,{system_memory_high_watermark,[]}}"]
[W]	[email protected]	2017-06-05 21:58:20.941571 +0300	1496689100	leo_watchdog_disk:disk_stats_1/2	406	[{triggered_watchdog,disk_throughput},{disk_rkb,8.0},{disk_wkb,209140.0}]

storage_2

[W]	[email protected]	2017-06-05 21:52:01.229374 +0300	1496688721	leo_watchdog_disk:disk_stats_1/2	387	[{triggered_watchdog,disk_util},{disk_util,100.0}]
[W]	[email protected]	2017-06-05 21:52:31.261130 +0300	1496688751	leo_watchdog_disk:disk_stats_1/2	406	[{triggered_watchdog,disk_throughput},{disk_rkb,0.0},{disk_wkb,269612.0}]
[W]	[email protected]	2017-06-05 21:54:06.183677 +0300	1496688846	leo_storage_read_repairer:compare/4	165	[{node,'[email protected]'},{addr_id,211487923316672901417193183379909812883},{key,<<"bodytest/60/51/3d/60513d5c454f467f05ab51fe8c527e35649a5bd1b17b094f4c49ffea46cc4b4aed27a26210327518e0cfd98495af1f027e56970000000000.xz\n356d4977cd5f9741083c7a991ea02e3b">>},{clock,1496688845526322},{cause,not_found}]
[I]	[email protected]	2017-06-05 21:58:39.762525 +0300	1496689119	null:null	0	["alarm_handler",58,32,"{set,{system_memory_high_watermark,[]}}"]
[W]	[email protected]	2017-06-05 22:06:22.51799 +0300	1496689582	leo_watchdog_disk:disk_stats_1/2	406	[{triggered_watchdog,disk_throughput},{disk_rkb,0.0},{disk_wkb,3.448e5}]

Current state of objects (seems fine):

[root@leo-m1 ~]# /usr/local/leofs/current/leofs-adm -p 10011 whereis bodytest/60/51/3d/60513d5c454f467f05ab51fe8c527e35649a5bd1b17b094f4c49ffea46cc4b4aed27a26210327518e0cfd98495af1f027e56970000000000.xz
-------+-----------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
 del?  |            node             |             ring address             |    size    |   checksum   |  has children  |  total chunks  |     clock      |             when            
-------+-----------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
  *    | [email protected]      | 965a3c8f6aa5d7a46509b80fcced3d63     |         0B |   d41d8cd98f | false          |              0 | 5513b07045c98  | 2017-06-05 21:54:06 +0300
  *    | [email protected]      | 965a3c8f6aa5d7a46509b80fcced3d63     |         0B |   d41d8cd98f | false          |              0 | 5513b07045c98  | 2017-06-05 21:54:06 +0300

[root@leo-m1 ~]# /usr/local/leofs/current/leofs-adm -p 10011 whereis 'bodytest/60/51/3d/60513d5c454f467f05ab51fe8c527e35649a5bd1b17b094f4c49ffea46cc4b4aed27a26210327518e0cfd98495af1f027e56970000000000.xz\n356d4977cd5f9741083c7a991ea02e3b'
-------+-----------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
 del?  |            node             |             ring address             |    size    |   checksum   |  has children  |  total chunks  |     clock      |             when            
-------+-----------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
       | [email protected]      | 9f1b17aa182d3c21f77ed80ea30ca293     |         0B |   d41d8cd98f | false          |              0 | 5513b06fa1532  | 2017-06-05 21:54:05 +0300
       | [email protected]      | 9f1b17aa182d3c21f77ed80ea30ca293     |         0B |   d41d8cd98f | false          |              0 | 5513b06fa1532  | 2017-06-05 21:54:05 +0300

[root@leo-m1 ~]# /usr/local/leofs/current/leofs-adm -p 10011 whereis 'bodytest/90/81/99/90819974e82b6889c5efef7ac337c19aa7be9c49df92bfa3b03c55c578f60c5e8fb403baf2b7de9a4726d4bd3103b67af83b650000000000.xz'
-------+-----------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
 del?  |            node             |             ring address             |    size    |   checksum   |  has children  |  total chunks  |     clock      |             when            
-------+-----------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
  *    | [email protected]      | 7faeede606e842be597056e5d9f24b0c     |         0B |   d41d8cd98f | false          |              0 | 5513b0e26a0c8  | 2017-06-05 21:56:05 +0300
  *    | [email protected]      | 7faeede606e842be597056e5d9f24b0c     |         0B |   d41d8cd98f | false          |              0 | 5513b0e26a0c8  | 2017-06-05 21:56:05 +0300

[root@leo-m1 ~]# /usr/local/leofs/current/leofs-adm -p 10011 whereis 'bodytest/90/81/99/90819974e82b6889c5efef7ac337c19aa7be9c49df92bfa3b03c55c578f60c5e8fb403baf2b7de9a4726d4bd3103b67af83b650000000000.xz\n3222353f1cef553ff7cfb62a84f3a141'
-------+-----------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
 del?  |            node             |             ring address             |    size    |   checksum   |  has children  |  total chunks  |     clock      |             when            
-------+-----------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
       | [email protected]      | 45bc06d6d1294cba9e323aa8fdc66015     |         0B |   d41d8cd98f | false          |              0 | 5513b0e1e8543  | 2017-06-05 21:56:05 +0300
       | [email protected]      | 45bc06d6d1294cba9e323aa8fdc66015     |         0B |   d41d8cd98f | false          |              0 | 5513b0e1e8543  | 2017-06-05 21:56:05 +0300

@mocchira
Member

mocchira commented Jun 7, 2017

@vstax Thanks for reproducing!

"Object not deleted" bug did not happen this time. Both objects are correctly deleted on storage. I think that it went through code paths that caused the similar problem, but cleaned up correctly this time. Still no idea what happened the last time. Some race condition? Error during error handling / cleanup?

Yes, this issue is a kind of race condition, as I commented above in #722 (comment).
Those operations (put/delete) being inverted chronologically should barely ever happen in theory, but it still can (you were very (un)lucky the first time!).
We still have not come to a conclusion on how to fix it, although at least one thing we have to solve is to establish causality between the put(s) and the deletes caused by an abort multipart upload (the deletes must happen after the puts).

@vstax
Contributor Author

vstax commented Jun 8, 2017

@mocchira You won't believe my luck, but I managed to reproduce the original problem again. Well, kind of, partially: the object is deleted on one node and not deleted on the other, but with R=1 that is enough to produce the permanent "object exists" reply from HEAD no matter how many times I repeat the request, until a single GET is performed; the GET returns 404 and triggers a read repair, after which HEAD keeps returning "object exists" for a short while but eventually starts to return that the object does not exist. Retries are disabled, so the client tried CompleteMultipartUpload just once. The watchdog was disabled.

Client error (it's not 403 anymore with the latest dev version):

2017-06-08 12:09:29,869 boto3.exceptions.S3UploadFailedError: Failed to upload /mnt/share2/storage/b0/37/b0372ccddde07b884813b8f8c6fe146903d48fed18dbd735de4066e28206d6de6862ce1dab8b72b792892ef9cec3b3d700e0570000000000.xz to bodytest/b0/37/2c/b0372ccddde07b884813b8f8c6fe146903d48fed18dbd735de4066e28206d6de6862ce1dab8b72b792892ef9cec3b3d700e0570000000000.xz: An error occurred (ServiceUnavailable) when calling the CompleteMultipartUpload operation (reached max retries: 0): Please reduce your request rate.

storage_0:

[W]	[email protected]	2017-06-08 12:09:29.848714 +0300	1496912969	leo_storage_read_repairer:compare/4	165	[{node,'[email protected]'},{addr_id,51656637283870959635461270016942004031},{key,<<"bodytest/b0/37/2c/b0372ccddde07b884813b8f8c6fe146903d48fed18dbd735de4066e28206d6de6862ce1dab8b72b792892ef9cec3b3d700e0570000000000.xz\ne56bf0db47ea5af855fe6f2869ec6546">>},{clock,1496912969533108},{cause,not_found}]
[E]	[email protected]	2017-06-08 12:09:29.872532 +0300	1496912969	leo_storage_handler_object:put/4	416	[{from,storage},{method,delete},{key,<<"bodytest/b0/37/2c/b0372ccddde07b884813b8f8c6fe146903d48fed18dbd735de4066e28206d6de6862ce1dab8b72b792892ef9cec3b3d700e0570000000000.xz\n">>},{req_id,72291120},{cause,not_found}]

storage_1:

[E]	[email protected]	2017-06-08 12:09:29.859918 +0300	1496912969	leo_storage_handler_object:get/1	89	[{from,storage},{method,get},{key,<<"bodytest/b0/37/2c/b0372ccddde07b884813b8f8c6fe146903d48fed18dbd735de4066e28206d6de6862ce1dab8b72b792892ef9cec3b3d700e0570000000000.xz\ne56bf0db47ea5af855fe6f2869ec6546">>},{cause,not_found}]

storage_2 and gateway - nothing at all.

Body state:

[root@leo-m1 ~]# /usr/local/leofs/current/leofs-adm -p 10011 whereis bodytest/b0/37/2c/b0372ccddde07b884813b8f8c6fe146903d48fed18dbd735de4066e28206d6de6862ce1dab8b72b792892ef9cec3b3d700e0570000000000.xz
-------+-----------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
 del?  |            node             |             ring address             |    size    |   checksum   |  has children  |  total chunks  |     clock      |             when
-------+-----------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
       | [email protected]      | 1f40fc266b30114d4ad5674777f24cc1     |         0B |   d41d8cd98f | false          |              0 | 5516f35d43341  | 2017-06-08 12:09:29 +0300
  *    | [email protected]      | 1f40fc266b30114d4ad5674777f24cc1     |         0B |   d41d8cd98f | false          |              0 | 5516f35d43b97  | 2017-06-08 12:09:29 +0300

[root@leo-m1 ~]# /usr/local/leofs/current/leofs-adm -p 10011 whereis "bodytest/b0/37/2c/b0372ccddde07b884813b8f8c6fe146903d48fed18dbd735de4066e28206d6de6862ce1dab8b72b792892ef9cec3b3d700e0570000000000.xz\ne56bf0db47ea5af855fe6f2869ec6546"
-------+-----------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
 del?  |            node             |             ring address             |    size    |   checksum   |  has children  |  total chunks  |     clock      |             when
-------+-----------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
       | [email protected]      | 26dcb4da5ff5f2b896161716fdc9d73f     |         0B |   d41d8cd98f | false          |              0 | 5516f35cf06b4  | 2017-06-08 12:09:29 +0300
       | [email protected]      | 26dcb4da5ff5f2b896161716fdc9d73f     |         0B |   d41d8cd98f | false          |              0 | 5516f35cf06b4  | 2017-06-08 12:09:29 +0300

Problem with HEAD (I tried it a few times beforehand as well to make sure that the output is stable):

>>> s3client.head_object(Bucket='bodytest', Key='b0/37/2c/b0372ccddde07b884813b8f8c6fe146903d48fed18dbd735de4066e28206d6de6862ce1dab8b72b792892ef9cec3b3d700e0570000000000.xz')
{u'ContentLength': 0,
 u'ContentType': 'application/octet-stream',
 u'ETag': '"d41d8cd98f00b204e9800998ecf8427e"',
 u'LastModified': datetime.datetime(2017, 6, 8, 9, 9, 29, tzinfo=tzutc()),
 u'Metadata': {},
 'ResponseMetadata': {'HTTPHeaders': {'connection': 'keep-alive',
                                      'content-length': '0',
                                      'content-type': 'application/octet-stream',
                                      'date': 'Thu, 08 Jun 2017 09:22:00 GMT',
                                      'etag': '"d41d8cd98f00b204e9800998ecf8427e"',
                                      'last-modified': 'Thu, 08 Jun 2017 09:09:29 GMT',
                                      'server': 'LeoFS'},
                      'HTTPStatusCode': 200,
                      'RetryAttempts': 0}}

So, it's really HEAD persistently returning that the object exists (but with a size of 0). It also has an ETag that matches the MD5 of a 0-byte object!
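
A quick check in a Python shell confirms that this ETag is indeed the MD5 of zero bytes:

>>> import hashlib
>>> hashlib.md5(b"").hexdigest()
'd41d8cd98f00b204e9800998ecf8427e'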

GET, however, fails (returns 404):

obj=s3client.get_object(Bucket='bodytest', Key='b0/37/2c/b0372ccddde07b884813b8f8c6fe146903d48fed18dbd735de4066e28206d6de6862ce1dab8b72b792892ef9cec3b3d700e0570000000000.xz')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/site-packages/botocore/client.py", line 253, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/lib/python2.7/site-packages/botocore/client.py", line 557, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.errorfactory.NoSuchKey: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.

It causes repair:

[W]	[email protected]	2017-06-08 12:22:36.344658 +0300	1496913756	leo_storage_read_repairer:compare/4	165	[{node,'[email protected]'},{addr_id,41543489079341922987482890594795277505},{key,<<"bodytest/b0/37/2c/b0372ccddde07b884813b8f8c6fe146903d48fed18dbd735de4066e28206d6de6862ce1dab8b72b792892ef9cec3b3d700e0570000000000.xz">>},{clock,1496912969872193},{cause,primary_inconsistency}]

Trying HEAD again a few seconds later, it succeeds at first:

>>> s3client.head_object(Bucket='bodytest', Key='b0/37/2c/b0372ccddde07b884813b8f8c6fe146903d48fed18dbd735de4066e28206d6de6862ce1dab8b72b792892ef9cec3b3d700e0570000000000.xz')
{u'ContentLength': 0,
 u'ContentType': 'application/octet-stream',
 u'ETag': '"d41d8cd98f00b204e9800998ecf8427e"',
 u'LastModified': datetime.datetime(2017, 6, 8, 9, 9, 29, tzinfo=tzutc()),
 u'Metadata': {},
 'ResponseMetadata': {'HTTPHeaders': {'connection': 'keep-alive',
                                      'content-length': '0',
                                      'content-type': 'application/octet-stream',
                                      'date': 'Thu, 08 Jun 2017 09:22:41 GMT',
                                      'etag': '"d41d8cd98f00b204e9800998ecf8427e"',
                                      'last-modified': 'Thu, 08 Jun 2017 09:09:29 GMT',
                                      'server': 'LeoFS'},
                      'HTTPStatusCode': 200,
                      'RetryAttempts': 0}}

but eventually starts to fail:

>>> s3client.head_object(Bucket='bodytest', Key='b0/37/2c/b0372ccddde07b884813b8f8c6fe146903d48fed18dbd735de4066e28206d6de6862ce1dab8b72b792892ef9cec3b3d700e0570000000000.xz')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/site-packages/botocore/client.py", line 253, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/lib/python2.7/site-packages/botocore/client.py", line 557, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (404) when calling the HeadObject operation: Not Found

Final state of the body:

[root@leo-m1 ~]# /usr/local/leofs/current/leofs-adm -p 10011 whereis bodytest/b0/37/2c/b0372ccddde07b884813b8f8c6fe146903d48fed18dbd735de4066e28206d6de6862ce1dab8b72b792892ef9cec3b3d700e0570000000000.xz
-------+-----------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
 del?  |            node             |             ring address             |    size    |   checksum   |  has children  |  total chunks  |     clock      |             when            
-------+-----------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
  *    | [email protected]      | 1f40fc266b30114d4ad5674777f24cc1     |         0B |   d41d8cd98f | false          |              0 | 5516f35d43b97  | 2017-06-08 12:09:29 +0300
  *    | [email protected]      | 1f40fc266b30114d4ad5674777f24cc1     |         0B |   d41d8cd98f | false          |              0 | 5516f35d43b97  | 2017-06-08 12:09:29 +0300

When I encountered this the very first time, the del flag wasn't present on either node; no idea whether that would have allowed the inconsistency to be detected or not. Also, I have object_storage.is_strict_check = true set; it could be that the storage node detected that this 0-byte object doesn't match the actual ETag stored on the node and the fix was performed because of that, and without that option it would have been different (no idea).

@mocchira
Member

mocchira commented Jun 9, 2017

@vstax Many thanks!

Still WIP, however it turned out that there is another problem that could increase the odds of this race condition (one replica deleted while the other is not) happening when a multipart abort occurs.
I will share the details once I'm confident in my assumption.

EDIT: This code block https://github.com/leo-project/leofs/blob/1.3.4/apps/leo_gateway/src/leo_gateway_s3_api.erl#L949-L953 (put a zero-byte object and subsequently delete it) seems to be the culprit, as updates against the same object at (nearly) the same time increase the odds that inconsistencies between replicas could happen in a system adopting eventual consistency. Now we are validating whether the code putting the zero-byte object can be removed without any unexpected side effects.
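
To make the failure mode concrete, here is a toy model (plain Python, not LeoFS code) of what can happen if the zero-byte put and the immediately following delete reach the two replicas without any ordering guarantee: one replica applies them in the intended order, the other sees the delete first (it is dropped as not_found, as in the logs above) and then applies the put, so the replicas diverge - matching the whereis output shown earlier, where one copy carries the del flag and the other does not.

def apply_op(replica, op):
    if op == "put":
        replica["state"] = "exists"
    elif op == "delete":
        if replica.get("state") == "exists":
            replica["state"] = "deleted"
        # else: the delete hits a missing key and is dropped
        # (cf. the {method,delete},...,{cause,not_found} log lines)

replica_a, replica_b = {}, {}

apply_op(replica_a, "put")
apply_op(replica_a, "delete")   # intended order: put, then delete

apply_op(replica_b, "delete")   # replication delay inverts the order here
apply_op(replica_b, "put")

print(replica_a)   # {'state': 'deleted'}
print(replica_b)   # {'state': 'exists'} -> replicas have diverged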

@vstax
Contributor Author

vstax commented Jun 9, 2017

@mocchira I see! Nice find. That definitely explains the correct ETag for a 0-byte object, something which really confused me.

I can't be sure about the very first time I saw this problem, but it's quite likely that back then the object had a 0B size on both nodes as well.

This - putting a 0B object before the delete - seems to be general logic, though: during my experiments with deleting buckets I also noticed that when I execute diagnose-start after those massive deletes, each object appears twice in the output log, once with the "deleted" flag and once without it, but with a zero-byte size.

@mocchira
Member

@vstax Now that #768 has been merged into develop, I believe you won't be "lucky" enough to face this inconsistency problem again. Let me know if you still are; also, please close the issue if there is nothing left to be solved.

@vstax
Contributor Author

vstax commented Jun 22, 2017

@mocchira Thank you for your support!

Over lots of testing, I managed to encounter a multipart abort 5 times; in every case, after it happens the main object looks as if it never existed, e.g.

2017-06-22 23:32:42,001 ERROR MainProcess Error uploading (S3UploadFailedError('Failed to upload /mnt/share2/storage/b6/f9/b6f9776a003308bdb3fe2f8789f0a954c48554e3466f76a5710ed80018815239f17acc59e4a0ee4bdb3a7d0e72bdf38200ba990000000000.xz to body/b6/f9/77/b6f9776a003308bdb3fe2f8789f0a954c48554e3466f76a5710ed80018815239f17acc59e4a0ee4bdb3a7d0e72bdf38200ba990000000000.xz: An error occurred (ServiceUnavailable) when calling the CompleteMultipartUpload operation (reached max retries: 0): Please reduce your request rate.',)): b6f9776a003308bdb3fe2f8789f0a954c48554e3466f76a5710ed80018815239f17acc59e4a0ee4bdb3a7d0e72bdf38200ba990000000000.xz
[W]	[email protected]	2017-06-22 23:32:41.799239 +0300	1498163561	leo_storage_read_repairer:compare/4	165	[{node,'[email protected]'},{addr_id,58094798111872953951320957945385087342},{key,<<"body/b6/f9/77/b6f9776a003308bdb3fe2f8789f0a954c48554e3466f76a5710ed80018815239f17acc59e4a0ee4bdb3a7d0e72bdf38200ba990000000000.xz\nc63e0b2494dbee6f08a559117891c8ac">>},{clock,1498163561215448},{cause,not_found}]
[E]	[email protected]	2017-06-22 23:32:41.861678 +0300	1498163561	leo_storage_handler_object:get/3	156	[{from,gateway},{method,get},{key,<<"body/b6/f9/77/b6f9776a003308bdb3fe2f8789f0a954c48554e3466f76a5710ed80018815239f17acc59e4a0ee4bdb3a7d0e72bdf38200ba990000000000.xz\nc63e0b2494dbee6f08a559117891c8ac">>},{req_id,77170589},{cause,"Recover failure"}]
[E]	[email protected]	2017-06-22 23:32:41.873054 +0300	1498163561	leo_storage_handler_object:put/4	416	[{from,storage},{method,delete},{key,<<"body/b6/f9/77/b6f9776a003308bdb3fe2f8789f0a954c48554e3466f76a5710ed80018815239f17acc59e4a0ee4bdb3a7d0e72bdf38200ba990000000000.xz">>},{req_id,96277461},{cause,not_found}]
[E]	[email protected]	2017-06-22 23:32:41.873877 +0300	1498163561	leo_storage_handler_object:put/4	416	[{from,storage},{method,delete},{key,<<"body/b6/f9/77/b6f9776a003308bdb3fe2f8789f0a954c48554e3466f76a5710ed80018815239f17acc59e4a0ee4bdb3a7d0e72bdf38200ba990000000000.xz\n">>},{req_id,67988416},{cause,not_found}]
[root@leo-m1 ~]# /usr/local/leofs/current/leofs-adm -p 10011 whereis 'body/b6/f9/77/b6f9776a003308bdb3fe2f8789f0a954c48554e3466f76a5710ed80018815239f17acc59e4a0ee4bdb3a7d0e72bdf38200ba990000000000.xz'
-------+-----------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
 del?  |            node             |             ring address             |    size    |   checksum   |  has children  |  total chunks  |     clock      |             when            
-------+-----------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
       | [email protected]      |                                      |            |              |                |                |                | 
       | [email protected]      |                                      |            |              |                |                |                | 

[root@leo-m1 ~]# /usr/local/leofs/current/leofs-adm -p 10011 whereis 'body/b6/f9/77/b6f9776a003308bdb3fe2f8789f0a954c48554e3466f76a5710ed80018815239f17acc59e4a0ee4bdb3a7d0e72bdf38200ba990000000000.xz\nc63e0b2494dbee6f08a559117891c8ac'
-------+-----------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
 del?  |            node             |             ring address             |    size    |   checksum   |  has children  |  total chunks  |     clock      |             when            
-------+-----------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
  *    | [email protected]      | 2bb4a6b06f438baee8739bde18a0716e     |         0B |   d41d8cd98f | false          |              0 | 5529264bde3df  | 2017-06-22 23:33:12 +0300
  *    | [email protected]      | 2bb4a6b06f438baee8739bde18a0716e     |         0B |   d41d8cd98f | false          |              0 | 5529264bde3df  | 2017-06-22 23:33:12 +0300

so it's definitely different from before. I think it's safe to say that the problem is gone. I'm not aware of any other problems during uploads either.
