
[leo_storage] GET can be false negative more than necessary #553

Closed
mocchira opened this issue Dec 21, 2016 · 12 comments
Comments

@mocchira
Member

Reported here:
https://groups.google.com/d/msg/leoproject_leofs/tLgNlvK7Eps/uEm1J9qGDgAJ

Summary picked from the above thread:

  • Condition
    N=2 (it could be 3, but 2 makes the explanation easier), R=1, W=1 (for performance)

There are a few storage nodes. For the sake of the experiment, let's consider a file named "X" which should go to nodes S0 and S1.
  • Procedure
0) Initially, both S0 and S1 are up.

== Application actions: "GET X". Result: we get object X.

1) A power outage happens in the rack holding nodes S0 and S1. The gateway, management nodes and the application are not affected, so the load continues. Now:

== Application actions: "GET X". Result: we get error 500, since all nodes holding object X are unavailable.

2) S0 node returns, but one of its disks (holding data & metadata for, say, AVS N5) is lost. It was the one holding object X. 

== Application actions: "GET X". Result: 404 Not Found (oops!), the same as for a missing object.

3) S1 node returns.

== Application actions: "GET X". Result: we get object X. (also, it's repaired on S0).
  • Conclusion
    In the above scenario, eventual consistency is violated, as the object's state changes "exists -> doesn't exist -> exists".
    We should respond 500 instead of 404 at step 2).
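
A minimal sketch in Erlang (module and atom names are purely illustrative, not LeoFS code) of the status code the application sees at each step, comparing the current behaviour with the one proposed here:

%% Sketch only: the HTTP status the application gets for "GET X" at each
%% step of the scenario above, as {CurrentBehaviour, ProposedBehaviour}.
-module(scenario_sketch).
-export([get_x_status/1]).

get_x_status(before_outage)     -> {200, 200};  % both replicas alive
get_x_status(rack_outage)       -> {500, 500};  % step 1): S0 and S1 both down
get_x_status(s0_back_disk_lost) -> {404, 500};  % step 2): the false negative to fix
get_x_status(s1_back)           -> {200, 200}.  % step 3): object served, repaired on S0
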
@windkit
Contributor

windkit commented Dec 21, 2016

To me, when all replicas are out, the data is already lost. One could be lucky and get some of it back later on, but that's not guaranteed any more.

And I would not say eventual consistency is violated in this case: there is a change at each step, and those changes would eventually propagate to all copies.

Nodes Down -> We have lost the data
Nodes Back -> We recover the data

Not much difference from

Delete Object -> 404
Put Back Object -> 200

@mocchira
Member Author

@windkit

To me, when all replicas are out, the data is lost already.
One could be lucky to get back some later on but that's not guaranteed any more.

At first glance, reading this thread, I thought the same as you,
but after a while I changed my mind: it would be good for the object not to appear lost from the application (user) perspective before a rebalance, since there is a chance it will come back.

Just in case,
let's take a close look at what LeoFS (currently|ideally) should respond in each case; I've summarized it below.

https://docs.google.com/spreadsheets/d/1Ebmy6MzRknp3NGpcjDpXljQMKTZqq8GnsJWULN5pVyI/edit?usp=sharing

Obviously the current behaviour of responding 404 (not the truth) when one of the replicas is down should be fixed by responding 500 (unknown), as that at least is not a lie.

@windkit
Contributor

windkit commented Dec 21, 2016

First of all, let me align our understanding of the chart.

For the case at cell J6, it means:

Node | Obj | Status
S0   | No  | Alive
S1   | Yes | Dead

Now it would reply 404, as the only alive replica for the object cannot find it locally.
So you are suggesting that it should reply 500 instead; that means S0 would have to be aware that someone else has the object even though it doesn't.

That makes things complicated: now the node has much more state to remember.

  1. I have the object
  2. I don't have the object
  3. I don't have the object but someone has

Things go crazier if we consider versioning too (e.g. I have v1 but I know someone has v1+)

To me, keeping things simple is the key to a good consistency model.
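
For illustration only, a minimal sketch (hypothetical module, not LeoFS code) of the extra per-key state such tracking would introduce:

%% Sketch only: the presence states a node would have to distinguish if it
%% tracked "I don't have it but a replica does", plus the versioning variant.
-module(presence_state_sketch).
-export_type([presence/0]).

-type presence() :: have
                  | dont_have
                  | dont_have_but_a_replica_does     %% extra state 3 above
                  | have_v1_but_a_newer_one_exists.  %% versioning makes it worse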

@windkit
Contributor

windkit commented Dec 21, 2016

And actually, how do you define "does not exist but actually exists" and "exists but actually does not"?
Does it have anything to do with the inconsistency between metadata and the object file? That's something we definitely have to fix, though.

@windkit
Contributor

windkit commented Dec 21, 2016

About consistency, one of the biggest problems is that incomplete writes are not rolled back (there is no 2PC).

With (N=3, W=2), when two replicas fail, a write is rejected with code 500.
Later on, when the two replicas return, they get "fixed" to hold the rejected write, while the user expects it was not written (a minimal sketch of this effect follows the command output below).

For example, let's update an object:

Begin

$ ./leofs-adm whereis test/test.txt
-------+------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
 del?  |          node          |             ring address             |    size    |   checksum   |  has children  |  total chunks  |     clock      |             when
-------+------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
       | [email protected]      | 5ca400ad12387f6fd23150ba54f2a1b7     |         3K |   63834dbb20 | false          |              0 | 5440fb2c55ce3  | 2016-12-20 13:44:59 +0900
       | [email protected]      | 5ca400ad12387f6fd23150ba54f2a1b7     |         3K |   63834dbb20 | false          |              0 | 5440fb2c55ce3  | 2016-12-20 13:44:59 +0900
       | [email protected]      | 5ca400ad12387f6fd23150ba54f2a1b7     |         3K |   63834dbb20 | false          |              0 | 5440fb2c55ce3  | 2016-12-20 13:44:59 +0900

Failure of two replicas

$ ./leofs-adm whereis test/test.txt
-------+------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
 del?  |          node          |             ring address             |    size    |   checksum   |  has children  |  total chunks  |     clock      |             when
-------+------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
       | [email protected]      | 5ca400ad12387f6fd23150ba54f2a1b7     |         3K |   63834dbb20 | false          |              0 | 5440fb2c55ce3  | 2016-12-20 13:44:59 +0900
       | [email protected]      |                                      |            |              |                |                |                |
       | [email protected]      |                                      |            |              |                |                |                |

Attempt to Update (Should be rejected) [OK]

$ ./s3cmd put README.md s3://test/test.txt
WARNING: Module python-magic is not available. Guessing MIME types based on file extensions.
upload: 'README.md' -> 's3://test/test.txt'  [1 of 1]
 14932 of 14932   100% in    0s     3.22 MB/s  done
WARNING: Upload failed: /test.txt (500 (InternalError): We encountered an internal error. Please try again.)
WARNING: Waiting 3 sec...

Replicas back

$ ./leofs-adm whereis test/test.txt
-------+------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
 del?  |          node          |             ring address             |    size    |   checksum   |  has children  |  total chunks  |     clock      |             when
-------+------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
       | [email protected]      | 5ca400ad12387f6fd23150ba54f2a1b7     |        15K |   52e1144808 | false          |              0 | 54423e7dde244  | 2016-12-21 13:51:19 +0900
       | [email protected]      | 5ca400ad12387f6fd23150ba54f2a1b7     |         3K |   63834dbb20 | false          |              0 | 5440fb2c55ce3  | 2016-12-20 13:44:59 +0900
       | [email protected]      | 5ca400ad12387f6fd23150ba54f2a1b7     |         3K |   63834dbb20 | false          |              0 | 5440fb2c55ce3  | 2016-12-20 13:44:59 +0900

Read (Return Newest Version) [??]

$ ./s3cmd get s3://test/test.txt dl
download: 's3://test/test.txt' -> 'dl'  [1 of 1]
 14932 of 14932   100% in    0s     3.23 MB/s  done

Replicas got fixed [??]

$ ./leofs-adm whereis test/test.txt
-------+------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
 del?  |          node          |             ring address             |    size    |   checksum   |  has children  |  total chunks  |     clock      |             when
-------+------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
       | [email protected]      | 5ca400ad12387f6fd23150ba54f2a1b7     |        15K |   52e1144808 | false          |              0 | 54423e7dde244  | 2016-12-21 13:51:19 +0900
       | [email protected]      | 5ca400ad12387f6fd23150ba54f2a1b7     |        15K |   52e1144808 | false          |              0 | 54423e7dde244  | 2016-12-21 13:51:19 +0900
       | [email protected]      | 5ca400ad12387f6fd23150ba54f2a1b7     |        15K |   52e1144808 | false          |              0 | 54423e7dde244  | 2016-12-21 13:51:19 +0900
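
A minimal sketch of why the rejected write ends up winning here, assuming read-repair simply adopts the replica with the largest clock (module name and result shapes are illustrative only, not the actual LeoFS code):

%% Sketch only (assumed read-repair behaviour): given per-replica metadata
%% {Node, Clock, Checksum}, pick the entry with the newest clock; it gets
%% copied over the others, even if the PUT that produced it was answered
%% with a 500.
-module(read_repair_sketch).
-export([winner/1]).

winner([First | Rest]) ->
    lists:foldl(fun({_, C, _} = Meta, {_, AccC, _}) when C > AccC -> Meta;
                   (_Meta, Acc) -> Acc
                end, First, Rest).

%% e.g. winner([{s1, 16#5440fb2c55ce3, old}, {s0, 16#54423e7dde244, new}])
%% returns the s0 entry: the rejected 15K write "wins".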

@mocchira
Member Author

@windkit
Yes, your understanding is right.
This looks pretty easy to implement to me.
There is no need to remember any additional status, so there is no need to define ("xx but actually ..." and so on).
Just add an additional check on the errors in leo_storage_handler_object
that returns 500 when all responses from the storage nodes are errors and a 500 is included among them.
No additional status, no additional RPC.
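
A minimal sketch in Erlang of that check (hypothetical module and result shapes, not the real leo_storage_handler_object code):

%% Sketch only: decide the gateway reply from the per-replica GET results.
%% Result shapes assumed here: {ok, Object} | not_found | {error, Reason}.
-module(get_reply_sketch).
-export([reply_for/1]).

reply_for(Results) ->
    case [Obj || {ok, Obj} <- Results] of
        [Obj | _] ->
            {200, Obj};                   % some replica still has the object
        [] ->
            AnyError = lists:any(fun({error, _}) -> true;
                                    (_)          -> false
                                 end, Results),
            case AnyError of
                true  -> 500;  % a replica was unreachable -> the truth is unknown
                false -> 404   % every reachable replica answered not_found
            end
    end.

With this rule, step 2) of the scenario ([not_found, {error, node_down}] under the assumed shapes) yields 500 instead of 404, while a genuine miss ([not_found, not_found]) still yields 404.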

And regarding your reply about consistency:
as I think that is a separate problem (here we are only talking about handling GET from the gateway(s)),
please open a new issue (IMHO, at least what you pasted should be documented as a spec).

@windkit
Contributor

windkit commented Dec 21, 2016

@mocchira Ya, I agree the latter is a side track; a new issue would be suitable.

So back to the case of

Node | Obj | Status
S0   | No  | Alive
S1   | Yes | Dead

You are suggesting that S0 should reply 500 in this case, when it finds S1 is unavailable, without knowing whether S1 actually has the object?

Node | Obj | Status
S0   | No  | Alive
S1   | No  | Dead

Also got a 500 instead of 404?

@mocchira
Member Author

mocchira commented Dec 21, 2016

@windkit

@mocchira Ya, I agree the latter is a side track; a new issue would be suitable.

thank you.

Also got a 500 instead of 404?

yes.

Impl          | Case1   | Case2
Current (404) | Wrong   | Right
Suggest (500) | Unknown | Unknown

The purpose of responding 500 is to avoid false negatives where possible.

For instance, an application with code like the following

### pseudo
code := http.GET(X).Response().Code
if code != 200 {
  if code == 404 {
    // object "does not exist" -> (re)create it
    http.PUT(X, NewBody)
  } else {
    // retry later when getting 500/503/504
    ...
  }
}

may issue unnecessary PUT operations with the current behaviour, which can be avoided with the new one
(and the reporter on the ML probably has this kind of code, hence the concern).

Also, the title (cited from the ML thread) is slightly different from what eventual consistency actually means
and may give someone the wrong impression, so I'd like to change it to a more appropriate one.

@mocchira mocchira changed the title [leo_storage] eventually consistency can be violated [leo_storage] GET can be false positive more than necessary Dec 21, 2016
@mocchira mocchira changed the title [leo_storage] GET can be false positive more than necessary [leo_storage] GET can be false negative more than necessary Dec 21, 2016
@windkit
Contributor

windkit commented Dec 21, 2016

@mocchira OK, I got it now. I agree that would be a simple solution.

@mocchira
Member Author

@windkit Thank you for sparing your time.

I updated #532 based on your comments ( #553 (comment) ) about consistency.
I think there are still some edge cases that are not explicitly documented, so please comment on or update #532 if you notice anything that should be made explicit as spec, or maybe fixed.

@mocchira
Member Author

IMPL

windkit added a commit to windkit/leofs that referenced this issue Dec 5, 2017
@windkit
Contributor

windkit commented Dec 5, 2017

PR: #948

windkit added a commit to windkit/leofs that referenced this issue Dec 6, 2017
windkit added a commit to windkit/leofs that referenced this issue Dec 6, 2017
windkit added a commit to windkit/leofs that referenced this issue Dec 6, 2017
mocchira pushed a commit that referenced this issue Dec 7, 2017