Errors on gateway/storage when uploading lots of files #722
@vstax Thanks for reporting in detail.
In LeoFS, {cause,unavailable} only happens while incoming requests are being throttled after leo_watchdog has been triggered for some reason, and that unavailable error results in a 503 response to the client.
It seems [email protected] was network-isolated from the other nodes around that time.
The first log of these three represents the root problem, while the others were caused by it.
In this case, the latter one seemed to happen.
As I commented above, this can happen while leo_watchdog reduces incoming requests.
This log is interesting because it seems the process related to log output (leo_logger) got overloaded.
As I mentioned above, the restarting might be caused by leo_logger.
It seems the multipart upload (which causes multiple chunks to be created on LeoFS) failed in the middle, and the deletes were issued to clean up all the partially uploaded chunks, so it looks fine to me. However..
This error looks weird.
However, in this case, the tail part followed by
@mocchira Thank you for analyzing.
This should be the case. In the first experiment I had the error and disk watchdogs enabled (as well as debug logs). There were multiple cases of
Btw, does the "error" watchdog react to errors like these? I have a lot of them in general, since the application logic now tries to download objects from LeoFS first before the old storage, and not all data is moved to LeoFS. I assumed this is not a problem, so I filtered out all these messages from the logs because there are so many of them.
This explains why the second experiment - with all watchdogs except for "rex" disabled - had no errors like the first one. Point taken: either disable the watchdog for these bulk-load scenarios or get ready to retry a lot (and don't use the current version of the "disk watchdog", anyway).
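For the "retry a lot" option, a thin client-side wrapper along these lines could absorb the 503 responses the gateway sends while leo_watchdog is throttling. This is a hedged sketch, not code from this project; the backoff parameters and the s3 client in the usage comment are assumptions.

# Hedged sketch: retry a boto3 call when the gateway answers 503 while the
# watchdog throttles traffic; backoff parameters are illustrative only.
import time
from botocore.exceptions import ClientError

def with_retries(call, attempts=5, base_delay=0.5):
    for attempt in range(attempts):
        try:
            return call()
        except ClientError as e:
            status = e.response.get("ResponseMetadata", {}).get("HTTPStatusCode")
            if status != 503 or attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff

# usage (assuming an s3 client, local_path and key are defined elsewhere):
# with_retries(lambda: s3.upload_file(local_path, "body", key))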
That is.. highly unlikely. All LeoFS nodes and the source system (from which I was uploading data) are VMs running on the same host and connected over a virtual switch on that system. There is nothing relevant in "dmesg" for these VMs at that time either.
Thanks, please do!
Still, 170K lines an hour doesn't exactly sound like an insanely huge amount. The log is not fsync'ed, right? A bit offtopic, but since you mentioned badarg from leveldb: a few minutes before the second experiment, I tried to do "s3cmd ls" on a bucket containing ~6K names. This failed (as expected, I guess, I just felt like trying), but for the minute when the nodes were trying to process this request I had this
and this in erlang.log
The second log is the one that seems interesting. This type of load managed to hit the "high watermark"; memory usage was fine all around that time, e.g. from
Since there was no other IO load at that moment that could fill the cache (I started the upload experiment about 7 minutes later), the process couldn't have allocated memory when that message happened: as you can see, buffers+cached is well over 6GB before and after it happened; if the process had allocated memory (and then freed it or restarted), there would have been less memory allocated to the cache at 16:40. In other words, the "system_memory_high_watermark" message happened without any actual memory usage by the process. I've observed similar problems in experiments with the NFS gateway in the past.
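For reference, the kind of check described above can be scripted; the sketch below (not part of LeoFS, standard /proc/meminfo field names assumed) reports how much memory is in use once buffers and page cache are discounted, which is the figure a memory watermark arguably should be judged against.

# Hedged sketch: compute memory usage excluding buffers/page cache to
# sanity-check a system_memory_high_watermark alarm on Linux.
def really_used_fraction(path="/proc/meminfo"):
    fields = {}
    with open(path) as f:
        for line in f:
            key, value = line.split(":", 1)
            fields[key] = int(value.strip().split()[0])  # values are in kB
    reclaimable = fields["MemFree"] + fields["Buffers"] + fields["Cached"]
    return (fields["MemTotal"] - reclaimable) / fields["MemTotal"]

if __name__ == "__main__":
    print(f"memory in use excluding buffers/cache: {really_used_fraction():.1%}")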
Yes, this looks suspicious. Just in case, I've verified that I have no objects that have ... The ... Also, despite this AccessDenied, HEAD had shown that this object exists in the cluster. I deleted it and tried to re-upload it with the same tool - this time it uploaded fine.
Just wanted to add that I've performed this experiment once on a brand-new 1.3.4 with debug logs disabled and the disk & error watchdogs enabled (for the sake of the experiment) and had 0 problems; I even switched the load from 10 processes to 16 in the middle, but still all objects (~1M, around 140GB in size) were uploaded without any errors. Logs on the gateway and storage nodes are all clean except for these messages from the watchdog, but they did not affect the gateway
I think this adds to the suspicion that the logger might be causing some problems. I'll try to repeat the experiment a few more times to maybe get that AccessDenied problem on a multipart object.
Thanks for the quick validation.
Yes, I think so. Note on why upgrading leo_watchdog makes the logs cleaner than before. The answers to your questions are below.
No. not_found is excluded from the targets the "error" watchdog monitors.
Got it. This might be explained by the above comment (leofs can be overloaded without leo-project/leo_watchdog@8a30a17).
Right. 170K lines an hour looks sane, and the log is not fsync'ed.
This looks like another kind of badarg from leveldb. I will vet it.
There are some cases where the Erlang runtime malloc'ed memory regions for its own heap from the OS and did NOT give them back to the OS even when they were no longer needed.
WIP.
@vstax As I've found almost everything that can explain what you've faced, I will share it below, one by one.
Filed as another issue on #729.
Filed as another issue on #730.
It turned out that the badarg from leveldb at this time could happen around when leo_storage restarted, so this badarg itself isn't a problem.
Sorry, this is wrong. Now, as the root causes have been covered in #729, #730 and also #653,
@mocchira Thank you for investigation! The one that really bothers me is Unfortunately, after looking at |
Prerequisite
Answers
Unfortunately it could happen due to the inherent nature of eventual consistency, regardless of whether the object is uploaded through multipart or single-part. My guess at how it happened in your case:
It won't be done, because that's the spec at the moment.
If I understand what you asked correctly,
@mocchira
Interesting; I thought this was to be handled on the server side, not on the client. Does that mean that if a network failure has occurred between the client and the gateway, or if the client just died (OOM, power outage, etc.), the remains of objects (some parts and such) may stay on LeoFS? In more detail: everything will be fine if the same object is uploaded again. However, I have questions about other cases, when the original client didn't re-upload the object.
You're absolutely right. It's my bad, I forgot about this detail for a moment and asked the wrong question. Yes, eventual consistency is no problem and can be handled in an application designed with that in mind (there is also always the option of having R + (W or D) > N, plus disabling the LeoGateway cache, to get strong consistency in LeoFS). Anyhow, my question was wrong; "HEAD" or "GET" working right after "DELETE" won't break an application that can handle eventual consistency. What I really meant to ask was question №2 from above: whether it's possible to keep getting "object exists" from "HEAD" over and over even when the cluster seems fully consistent and nothing was broken on the storage nodes or in the queues. In other words, a stable inconsistency between the object-presence flag and being able to actually get that object. This is only relevant for a multipart upload broken by trouble between the client and the gateway (my understanding is that it's impossible to get such an inconsistency for a single-part object, as long as you rule out things like the gateway cache, at least).
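To make that "stable inconsistency" concrete, a probe like the hedged sketch below could detect it: HEAD keeps reporting the object while GET fails. The boto3/botocore calls are real; the endpoint URL, bucket and key are placeholders.

# Hedged sketch: detect an object that HEAD reports as present but GET cannot fetch.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3", endpoint_url="http://leo-gateway:8080")  # placeholder endpoint

def probe(bucket, key):
    try:
        head = s3.head_object(Bucket=bucket, Key=key)
    except ClientError:
        return "HEAD: not found"
    try:
        s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        return f"consistent (ETag {head['ETag']})"
    except ClientError as e:
        return f"inconsistent: HEAD sees it, GET failed with {e.response['Error']['Code']}"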
Why would it? After I did "HEAD" I executed the "whereis" command, and the "del" flag wasn't present on either of the two replicas. So it wouldn't be deleted by itself, I think. I had to do "DELETE" myself and re-upload it after that...
Hmm, but they are about somewhat different things? RAW is about getting strong consistency in the most important place to make the application logic easier, while #561 is about writes ignoring the W setting internally - the application thinks that the write didn't succeed while in reality it did. There is a similarity between them, but the use case is different: RAW is important for normal cluster operations, while #561 is needed to get rid of problems when the cluster is in a degraded state. Well, no matter: they both can be worked around (one can live without RAW by passing the ETag together with the object name between clients, and without #561 by always re-uploading the object or checking its presence with HEAD after an error). And even if #561 were implemented, there are some theoretical cases where the client gets an error after a successful upload, even though that would be rare..
Actually, I meant the initial upload of a multipart object, not its update as described in #719. In this test, the logic was strict:
After it failed with AccessDenied and these errors appeared in the logs from the clean-up attempt, I executed it again hours later and it said that the object does exist. I executed "whereis" and the object was not deleted. So I had to delete it, check with "whereis" that it's marked as deleted on both nodes, and launch the application that uploaded this object again, this time without problems. I mentioned #561 only as a possible thought about what might have been going on behind the scenes; this test does not rely on whether the problem from #561 is present or not. I'm bothered because there clearly was not only an error to the client but a clean-up attempt as well, so why something would still be present after that is beyond me. I understand that this is very theoretical; I should have gathered more information about that object before removing it. Which is why I hope to encounter the problem again...
@vstax thanks for your insightful reply.
Yes, and the original AWS S3 behaves the same way (it relies on clients by default, and you have to pay for storing such garbage parts!).
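As a hedged illustration of that client-side responsibility, the standard S3 calls below list and abort incomplete multipart uploads so the garbage parts stop accumulating; whether a given gateway fully supports ListMultipartUploads is an assumption, and the endpoint is a placeholder.

# Hedged sketch: abort every incomplete multipart upload left behind in a bucket.
import boto3

s3 = boto3.client("s3", endpoint_url="http://leo-gateway:8080")  # placeholder endpoint

def abort_stale_uploads(bucket):
    resp = s3.list_multipart_uploads(Bucket=bucket)
    for upload in resp.get("Uploads", []):
        s3.abort_multipart_upload(Bucket=bucket,
                                  Key=upload["Key"],
                                  UploadId=upload["UploadId"])
        print("aborted", upload["Key"], upload["UploadId"])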
If its delete flag is not set,
This behavior depends on how much progress there is.
so there should be no case of returning something broken to the client; however, as I said above, garbage parts can remain.
As for the stable case, and aside from #719, there should be no chance.
Removed.
Got it.
The stable inconsistency should not happen even if the object is handled through multipart.
Sorry, the last paragraph was wrong (I missed the situation you actually faced, where the object still existed hours later).
Got it.
I hope so too; please share if you have any luck.
@mocchira Thank you for analyzing this problem. I see that you've found some pretty well-hidden bugs! Impressive. I don't really know what happened there; other than the logs from the gateway & storage_1/storage_2 that I've shown before, I only have the application logs, and they don't show much more than "AccessDenied" (well, I also know that it happened between 17:37:27 - the last successful operation - and 17:37:30 - the start of the next operation), but that doesn't tell anything new. I had retries completely disabled in boto3; it's configured to fail after a single operation goes wrong (for our applications, it's better to get an error as fast as possible than to bear the increasing timeouts that boto3 and many other S3 clients use between retries). There obviously was no network failure; as for the watchdog, well, there are no messages in the info/error log related to that (or any other messages in the logs around that time, for that matter) on any node or gateway, but as you mention, it could be something else that maybe was not logged.
as per boto/botocore#882) Anyhow, after the latest pull request is merged I'll try to do a few tests with both uploads and deletes and share my results. Also, I have this idea: there is some code that returned AccessDenied (#730); would it be possible to make a hack so that this "CompleteMultipartUpload" operation on the gateway always switches to that logic and returns 403 to the client, i.e. fails, whenever CompleteMultipartUpload is called? With a few experiments (by moving the code block that returns that error around, so it can be tried before or after the actual logic that handles this CompleteMultipartUpload call) it should be possible to trigger this; the situation will be a bit different, but I think it's likely to show how the client and the gateway behave, and maybe even reproduce the same exact problem. boto3+botocore is quite a bit over-engineered, so I don't know how to track its logic by looking at its code...
Sure. I will make a branch for that and share it here later.
Absolutely ;).
@vstax Sorry for the long delay. The patch below is what you want.

diff --git a/apps/leo_gateway/src/leo_gateway_s3_api.erl b/apps/leo_gateway/src/leo_gateway_s3_api.erl
index e330cd3..8c155e3 100644
--- a/apps/leo_gateway/src/leo_gateway_s3_api.erl
+++ b/apps/leo_gateway/src/leo_gateway_s3_api.erl
@@ -959,20 +959,26 @@ handle_2({ok,_AccessKeyId}, Req, ?HTTP_DELETE,_Key,
%% For Multipart Upload - Completion
handle_2({ok,_AccessKeyId}, Req, ?HTTP_POST,_Key,
- #req_params{bucket_info = BucketInfo,
+ #req_params{bucket_info = _BucketInfo,
path = Path,
- chunked_obj_len = ChunkedLen,
+ chunked_obj_len = _ChunkedLen,
is_upload = false,
upload_id = UploadId,
upload_part_num = PartNum,
transfer_decode_fun = TransferDecodeFun,
transfer_decode_state = TransferDecodeState}, State) when UploadId /= <<>>,
PartNum == 0 ->
- Res = cowboy_req:has_body(Req),
-
- {ok, Req_2} = handle_multi_upload_1(
- Res, Req, Path, UploadId,
- ChunkedLen, TransferDecodeFun, TransferDecodeState, BucketInfo),
+ BodyOpts = case TransferDecodeFun of
+ undefined ->
+ [];
+ _ ->
+ [{transfer_decode, TransferDecodeFun, TransferDecodeState}]
+ end,
+ _Res = cowboy_req:has_body(Req),
+ Ret = cowboy_req:body(Req, BodyOpts),
+ {ok, Req_2} =
+ ?reply_forbidden([?SERVER_HEADER], ?XML_ERROR_CODE_AccessDenied,
+ ?XML_ERROR_MSG_AccessDenied, Path, <<>>, Req),
{ok, Req_2, State};

Let me know if you find something.
@mocchira Thank you!
The boto3 error is exactly the same:
so we can assume that this single DELETE request was performed in the last case as well. An error logged on storage_0:
But the object is correctly deleted:
For comparison, in the original case the logs of storage_0, the gateway and storage_1 were (the same ones as posted before, just summarized here):
The last error is the same as in this experiment; just like in that one, it's logged only on the secondary node (which is storage_0 for this file name but was storage_1 for the original one). The errors before that should be unrelated to DELETE, which doesn't explain why DELETE had failed to actually delete the object and doesn't lead anywhere. I'll try to think of something else, and perform some high-load experiments next week.
@vstax thanks! FWIW, let me share one thing.
Yes, that should be unrelated to DELETE itself and also doesn't explain why it failed. And... I noticed something while writing this reply. There might be a case where
those operations can be inverted chronologically if processing the complete request is delayed somehow, because the operations are totally independent (there is no causality between them). This case can be mitigated by #736; however, there are other corner cases where such inverted operations could happen, so I will keep vetting.
@mocchira The problem causing the "Object not deleted" bug did not happen this time. Both objects are correctly deleted on storage. I think that it went through the code paths that caused the similar problem, but cleaned up correctly this time. Still no idea what happened the last time. Some race condition? An error during error handling / cleanup? Anyhow, here are the logs. I've merged the info and error logs for the storage nodes. Please note that some of it is unrelated - e.g. a memory watermark was hit 2 minutes after the problem. However, it's still a bit strange, because it did not happen any more during the next few hours of stress testing. The cluster was completely wiped for this test, and this
Gateway:
storage_0:
storage_1:
storage_2:
Current state of objects (seems fine):
@vstax Thanks for reproducing!
Yes, this issue is a kind of race condition, as I commented above in #722 (comment).
@mocchira You won't believe my luck, but I managed to reproduce the original problem again. Well, kind of, partially: it's deleted on one node and not deleted on the other, but with R=1 that's enough to produce the permanent "object exists" reply from HEAD as many times as I do the request, until a single GET is performed, which returns 404 and causes a repair to be performed; after that HEAD returns "object exists" for a short while, but eventually starts to return that the object does not exist. Retries are disabled, so the client tried to do CompleteMultipartUpload just once. The watchdog was disabled. Client error (it's not 403 anymore with the latest dev version):
storage_0:
storage_1:
storage_2 and gateway - nothing at all. Body state:
Problem with HEAD (I've tried it a few times before as well to make sure that output is stable):
So, it's really HEAD persistently returning that the object exists (but with a size of 0). It also has an ETag that matches the MD5 of a 0-byte object! GET, however, fails (returns 404):
It causes repair:
Trying HEAD again a few seconds later, it succeeds at first:
but eventually starts to fail:
Final state of the body:
When I encountered this for the very first time, the del flag wasn't present on either node; no idea whether that would allow the inconsistency to be detected or not. Also, I have
@vstax Many thanks! Still WIP; however, it turned out that there is another problem that could increase the odds of a race condition (one replica deleted while the other is not) happening when a multipart abort happens. EDIT: This code block https://github.com/leo-project/leofs/blob/1.3.4/apps/leo_gateway/src/leo_gateway_s3_api.erl#L949-L953 (put the zero-byte object and subsequently delete it) seems to be the culprit, as updates against the same object at the same time increase the odds that inconsistencies between replicas could happen in a system adopting eventual consistency. Now we are validating whether the code putting the zero-byte object could be removed without any unexpected side effects.
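As a toy illustration (plain Python, not LeoFS code, operation names made up): two replicas that each apply a zero-byte PUT and a DELETE independently can end up divergent if one of them receives the two messages out of order, which matches the "one replica deleted, the other holds a 0-byte object" state observed here.

# Hedged illustration: apply {put 0-byte, delete} to two replicas in different
# orders and show that their final states can diverge.
import itertools

def apply_ops(ops):
    state = "absent"
    for op in ops:
        state = {"put_zero_byte": "0-byte object", "delete": "deleted"}[op]
    return state

ops = ("put_zero_byte", "delete")
for order_a, order_b in itertools.product(itertools.permutations(ops), repeat=2):
    a, b = apply_ops(order_a), apply_ops(order_b)
    if a != b:
        print(f"replica A applied {order_a} -> {a}; replica B applied {order_b} -> {b}")

The toy model only shows that two independent updates issued at nearly the same instant give the replicas two chances to disagree, which is presumably why dropping the extra zero-byte PUT narrows the window for this race.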
@mocchira I see! Nice find. That definitely explains the correct ETag for a 0-byte object, something which really confused me. I can't be sure about the very first time I saw this problem, but it's quite likely that back then the object had 0B size on both nodes as well. This - put 0B before delete - seems like general logic, though, as during my experiments with deleting buckets I've also noticed that when I execute
@mocchira Thank you for your support! Over lots of testing, I've managed to encounter a multipart abort 5 times; in all cases, after it happens the main object looks as if it never existed, e.g.
so it's definitely different from before. I think it's safe to say that the problem is gone. I'm not aware of any more problems during uploads either.
Sorry for the long ticket and the muddy description; I might be trying to squeeze a few unrelated problems in here, but I wanted to list them all because I don't know the relations between them.
I'm trying to migrate existing data to LeoFS. I wrote a script that uploads data from the filesystem, and I'm trying to run it on a small (development) subset of objects; their sizes range from 1 byte to 20-25 MB, 120 KB on average. leo_gateway is running in S3 mode, and the script basically wraps boto3's upload_file() method. Multi-threaded upload is disabled in boto3, as it can consume lots of RAM on the client; the number of retries is set to 1, because I wanted this experiment not to hide possible problems.
Upload logic is as follows:
Create a list of files on the given path (each name is 115 bytes with a pattern like
fad90104322ec7545b7ec7a19562773b4f77b82aeb73b715809547e874f86938f5b95361337c87edcf91b887d19bfc1dced4000000000000.xz
). For each file in the list: HEAD body/fa/d9/01/fad90104322ec7545b7ec7a19562773b4f77b82aeb73b715809547e874f86938f5b95361337c87edcf91b887d19bfc1dced4000000000000.xz (where "body" is the name of the bucket). A sketch of this loop follows the note below.
Some (1-2%) of the objects are already uploaded to this cluster, which is why checking for object existence is needed. Plus, it allows these uploads to be stopped and resumed at any point. No objects are replaced; only new objects are uploaded. Autocompaction is disabled. Data is uploaded in parallel (6-10 processes at once) to a test LeoFS cluster (3 storage nodes, N=2, W=1, R=1); I want this upload experiment to fully succeed before actually creating a real cluster (and repeating these uploads after that). However, I'm facing various problems and need some advice.
(Yes, I'm aware that sometimes operations with the gateway may fail due to high load / timeouts / watchdog / etc.; that's what application logic, retries and so on are for. The problem here is not a failure by itself; the current upload script doesn't do retries on purpose - in real production, the amount of data we'll need to move is orders of magnitude bigger than for this experiment, and I just expect this small amount of data - ~130GB in this case - to be moved without a hitch in order to be sure that the production cluster will be able to handle its load.)
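For concreteness, a minimal sketch of the client-side loop described above might look like the following; the endpoint URL, bucket name and key layout are placeholders, while the boto3/botocore calls themselves are real.

# Hedged sketch of the migration loop: HEAD first, upload only if the object is
# missing, with single-threaded transfers and retries kept to a minimum.
import boto3
from botocore.config import Config
from botocore.exceptions import ClientError
from boto3.s3.transfer import TransferConfig

s3 = boto3.client(
    "s3",
    endpoint_url="http://leo-gateway:8080",      # placeholder LeoGateway endpoint
    config=Config(retries={"max_attempts": 1}),  # fail fast instead of retrying
)
transfer_cfg = TransferConfig(use_threads=False)  # no multi-threaded upload

def upload_if_missing(local_path, key, bucket="body"):
    """HEAD the object; upload it only if it is not already present."""
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return "skipped"                          # already in the cluster
    except ClientError as e:
        if e.response["Error"]["Code"] != "404":
            raise                                 # surface real errors immediately
    s3.upload_file(local_path, bucket, key, Config=transfer_cfg)
    return "uploaded"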
Experiment 1
(Performed on Apr 21 with a development version from around that time; it supposedly has fixes for the "badarg" related to leveldb and such.) I tried to upload data in 10 processes; each process was supposed to upload around 70K objects (~11 GB). The task was running at night, so there should have been no other load on the storage holding the VMs with LeoFS. However, since it's the same storage for all VMs, some bad effects could be possible.
Problem 1: around 80 objects failed to upload. On the client, the error was
On gateway: (no error)
On storage nodes: I can see this object in the log on a single storage node.
Problem 2: some timeouts and other strange errors. For example, on the gateway:
On storage node:
There are quite a few cases of cause,timeout, cause,"Replicate failure" and cause,unavailable on the storage nodes. What does "cause,unavailable" mean for a PUT request, anyway? One of the nodes also got these two messages in the log:
Problem 3: storage nodes restarting under load.
Unfortunately, other than the information in erlang.log on the storage node about the restart and the "EXIT" mentioned in the log of the gateway nodes, the restarting storage node leaves no logs at all. The log goes on to the point when other nodes mention this one restarting, then a minute later the logs start again; there are absolutely no errors related to the restart, or any crash dumps, or anything. A few restarts of storage nodes happened during the course of the experiment, and some errors on other storage nodes correlate with the moments when a storage node got restarted.
Experiment 2
(Performed on May 1 with a development version that was virtually the same as 1.3.3.) I made changes to the configuration of the storage nodes: set ERL_CRASH_DUMP_SECONDS=-1, and disabled the disk and error watchdogs that were enabled during the previous experiment (the disk watchdog does more harm than good in its current state; I just forgot to disable it before). I was uploading in 6 threads with the same properties (~70K objects from each thread, or 11 GB); 2-3% of this data already existed on the storage, and these objects were not re-uploaded. Only the new objects were written.
Interestingly enough, there were no restarts of storage nodes this time. The errors were completely different this time around; none of the problems from the previous experiment happened here.
Problem (a funny case):
On the client, I got
In gateway logs:
On storage_0:
On storage_1:
Delete?? What does it mean? No one was doing "delete"...