
[leo_storage] GET can be false negative more than necessary #553

Closed
mocchira opened this issue Dec 21, 2016 · 12 comments
Comments

@mocchira
Member

Reported here:
https://groups.google.com/d/msg/leoproject_leofs/tLgNlvK7Eps/uEm1J9qGDgAJ

Summary picked from the above thread:

  • Condition
    N=2 (it could be 3, but 2 makes the explanation easier), R=1, W=1 (for performance)

There are a few storage nodes. For the sake of the experiment, let's consider a file named "X" which should go to nodes S0 and S1.
  • Procedure
0) Initially, both S0 and S1 are up.

== Application actions: "GET X". Result: we get object X.

1) A power outage happens in the rack holding nodes S0 and S1. The gateway, management nodes and the application are not affected, so the load continues. Now:

== Application actions: "GET X". Result: we get error 500, since all nodes holding object X are unavailable.

2) S0 node returns, but one of its disks (holding data & metadata for, say, AVS N5) is lost. It was the one holding object X. 

== Application actions: "GET X". Result: 404 Not Found (oops!), the same as for a missing object.

3) S1 node returns.

== Application actions: "GET X". Result: we get object X. (also, it's repaired on S0).
  • Conclusion
    In the above scenario, eventual consistency is violated, as the object's state changes "exists -> doesn't exist -> exists".
    We should respond 500 instead of 404 at step 2).
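
A minimal sketch in Erlang (module and atom names are purely illustrative, not LeoFS code) of the status code the application sees at each step, comparing the current behaviour with the one proposed here:

%% Sketch only: the HTTP status the application gets for "GET X" at each
%% step of the scenario above, as {CurrentBehaviour, ProposedBehaviour}.
-module(scenario_sketch).
-export([get_x_status/1]).

get_x_status(before_outage)     -> {200, 200};  % both replicas alive
get_x_status(rack_outage)       -> {500, 500};  % step 1): S0 and S1 both down
get_x_status(s0_back_disk_lost) -> {404, 500};  % step 2): the false negative to fix
get_x_status(s1_back)           -> {200, 200}.  % step 3): object served, repaired on S0
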
@windkit
Contributor

windkit commented Dec 21, 2016

To me, when all replicas are out, the data is already lost. One could be lucky and get some of it back later on, but that's not guaranteed any more.

And I would not say eventual consistency is violated in this case: there is a change at each step, and those changes would eventually propagate to all copies.

Nodes Down -> We have lost the data
Nodes Back -> We recover the data

Not much difference from

Delete Object -> 404
Put Back Object -> 200

@mocchira
Member Author

@windkit

To me, when all replicas are out, the data is lost already.
One could be lucky to get back some later on but that's not guaranteed any more.

At first glance, reading this thread, I thought the same as you,
but after a while I changed my mind: it would be good for the object not to appear lost from the application (user) perspective before a rebalance, since there is a chance it will come back.

Just in case,
let's take a close look at what LeoFS (currently|ideally) should respond in each case; I've summarized it below.

https://docs.google.com/spreadsheets/d/1Ebmy6MzRknp3NGpcjDpXljQMKTZqq8GnsJWULN5pVyI/edit?usp=sharing

Obviously the current behaviour of responding 404 (not the truth) when one of the replicas is down should be fixed by responding 500 (unknown), as that at least is not a lie.

@windkit
Contributor

windkit commented Dec 21, 2016

First of all, let me align our understanding of the chart.

For the case at cell J6, it means:

Node | Obj | Status
S0   | No  | Alive
S1   | Yes | Dead

Now it would reply 404, as the only alive replica for the object cannot find it locally.
So you are suggesting that it should reply 500 instead; that means S0 would have to be aware that someone else has the object even though it doesn't.

That makes things complicated: now the node has much more state to remember.

  1. I have the object
  2. I don't have the object
  3. I don't have the object but someone has

Things go crazier if we consider versioning too (e.g. I have v1 but I know someone has v1+)

To me, keeping things simple is the key to a good consistency model.
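
For illustration only, a minimal sketch (hypothetical module, not LeoFS code) of the extra per-key state such tracking would introduce:

%% Sketch only: the presence states a node would have to distinguish if it
%% tracked "I don't have it but a replica does", plus the versioning variant.
-module(presence_state_sketch).
-export_type([presence/0]).

-type presence() :: have
                  | dont_have
                  | dont_have_but_a_replica_does     %% extra state 3 above
                  | have_v1_but_a_newer_one_exists.  %% versioning makes it worse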

@windkit
Contributor

windkit commented Dec 21, 2016

And actually, how do you define "does not exist but actually exists" and "exists but actually does not"?
Does it have anything to do with the inconsistency between metadata and the object file? That's something we definitely have to fix, though.

@windkit
Contributor

windkit commented Dec 21, 2016

About consistency, one of the biggest problems is that incomplete writes are not rolled back (there is no 2PC).

With (N=3, W=2), when two replicas fail, a write is rejected with code 500.
Later on, when the two replicas return, they get "fixed" to hold the rejected write, while the user expects it was not written (a minimal sketch of this effect follows the command output below).

For example, let's update an object:

Begin

$ ./leofs-adm whereis test/test.txt
-------+------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
 del?  |          node          |             ring address             |    size    |   checksum   |  has children  |  total chunks  |     clock      |             when
-------+------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
       | [email protected]      | 5ca400ad12387f6fd23150ba54f2a1b7     |         3K |   63834dbb20 | false          |              0 | 5440fb2c55ce3  | 2016-12-20 13:44:59 +0900
       | [email protected]      | 5ca400ad12387f6fd23150ba54f2a1b7     |         3K |   63834dbb20 | false          |              0 | 5440fb2c55ce3  | 2016-12-20 13:44:59 +0900
       | [email protected]      | 5ca400ad12387f6fd23150ba54f2a1b7     |         3K |   63834dbb20 | false          |              0 | 5440fb2c55ce3  | 2016-12-20 13:44:59 +0900

Failure of two replicas

$ ./leofs-adm whereis test/test.txt
-------+------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
 del?  |          node          |             ring address             |    size    |   checksum   |  has children  |  total chunks  |     clock      |             when
-------+------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
       | [email protected]      | 5ca400ad12387f6fd23150ba54f2a1b7     |         3K |   63834dbb20 | false          |              0 | 5440fb2c55ce3  | 2016-12-20 13:44:59 +0900
       | [email protected]      |                                      |            |              |                |                |                |
       | [email protected]      |                                      |            |              |                |                |                |

Attempt to Update (Should be rejected) [OK]

$ ./s3cmd put README.md s3://test/test.txt
WARNING: Module python-magic is not available. Guessing MIME types based on file extensions.
upload: 'README.md' -> 's3://test/test.txt'  [1 of 1]
 14932 of 14932   100% in    0s     3.22 MB/s  done
WARNING: Upload failed: /test.txt (500 (InternalError): We encountered an internal error. Please try again.)
WARNING: Waiting 3 sec...

Replicas back

$ ./leofs-adm whereis test/test.txt
-------+------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
 del?  |          node          |             ring address             |    size    |   checksum   |  has children  |  total chunks  |     clock      |             when
-------+------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
       | [email protected]      | 5ca400ad12387f6fd23150ba54f2a1b7     |        15K |   52e1144808 | false          |              0 | 54423e7dde244  | 2016-12-21 13:51:19 +0900
       | [email protected]      | 5ca400ad12387f6fd23150ba54f2a1b7     |         3K |   63834dbb20 | false          |              0 | 5440fb2c55ce3  | 2016-12-20 13:44:59 +0900
       | [email protected]      | 5ca400ad12387f6fd23150ba54f2a1b7     |         3K |   63834dbb20 | false          |              0 | 5440fb2c55ce3  | 2016-12-20 13:44:59 +0900

Read (Return Newest Version) [??]

$ ./s3cmd get s3://test/test.txt dl
download: 's3://test/test.txt' -> 'dl'  [1 of 1]
 14932 of 14932   100% in    0s     3.23 MB/s  done

Replicas got fixed [??]

$ ./leofs-adm whereis test/test.txt
-------+------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
 del?  |          node          |             ring address             |    size    |   checksum   |  has children  |  total chunks  |     clock      |             when
-------+------------------------+--------------------------------------+------------+--------------+----------------+----------------+----------------+----------------------------
       | [email protected]      | 5ca400ad12387f6fd23150ba54f2a1b7     |        15K |   52e1144808 | false          |              0 | 54423e7dde244  | 2016-12-21 13:51:19 +0900
       | [email protected]      | 5ca400ad12387f6fd23150ba54f2a1b7     |        15K |   52e1144808 | false          |              0 | 54423e7dde244  | 2016-12-21 13:51:19 +0900
       | [email protected]      | 5ca400ad12387f6fd23150ba54f2a1b7     |        15K |   52e1144808 | false          |              0 | 54423e7dde244  | 2016-12-21 13:51:19 +0900
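
A minimal sketch of why the rejected write ends up winning here, assuming read-repair simply adopts the replica with the largest clock (module name and result shapes are illustrative only, not the actual LeoFS code):

%% Sketch only (assumed read-repair behaviour): given per-replica metadata
%% {Node, Clock, Checksum}, pick the entry with the newest clock; it gets
%% copied over the others, even if the PUT that produced it was answered
%% with a 500.
-module(read_repair_sketch).
-export([winner/1]).

winner([First | Rest]) ->
    lists:foldl(fun({_, C, _} = Meta, {_, AccC, _}) when C > AccC -> Meta;
                   (_Meta, Acc) -> Acc
                end, First, Rest).

%% e.g. winner([{s1, 16#5440fb2c55ce3, old}, {s0, 16#54423e7dde244, new}])
%% returns the s0 entry: the rejected 15K write "wins".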

@mocchira
Member Author

@windkit
Yes, your understanding is right.
This looks pretty easy to implement to me.
There is no need to remember any additional status, so there is no need to define ("xx but actually ..." and so on).
Just add an additional check on the errors in leo_storage_handler_object
that returns 500 when all responses from the storage nodes are errors and a 500 is included among them.
No additional status, no additional RPC.
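
A minimal sketch in Erlang of that check (hypothetical module and result shapes, not the real leo_storage_handler_object code):

%% Sketch only: decide the gateway reply from the per-replica GET results.
%% Result shapes assumed here: {ok, Object} | not_found | {error, Reason}.
-module(get_reply_sketch).
-export([reply_for/1]).

reply_for(Results) ->
    case [Obj || {ok, Obj} <- Results] of
        [Obj | _] ->
            {200, Obj};                   % some replica still has the object
        [] ->
            AnyError = lists:any(fun({error, _}) -> true;
                                    (_)          -> false
                                 end, Results),
            case AnyError of
                true  -> 500;  % a replica was unreachable -> the truth is unknown
                false -> 404   % every reachable replica answered not_found
            end
    end.

With this rule, step 2) of the scenario ([not_found, {error, node_down}] under the assumed shapes) yields 500 instead of 404, while a genuine miss ([not_found, not_found]) still yields 404.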

And regarding your reply about consistency:
as I think that is a separate problem (here we are only talking about handling GET from the gateway(s)),
please open a new issue (IMHO, at least what you pasted should be documented as a spec).

@windkit
Contributor

windkit commented Dec 21, 2016

@mocchira Ya, I agree the latter is a side track; a new issue would be suitable.

So back to the case of

Node | Obj | Status
S0   | No  | Alive
S1   | Yes | Dead

You are suggesting that S0 should reply 500 in this case, when it finds S1 is unavailable, without knowing whether S1 actually has the object?

Node | Obj | Status
S0   | No  | Alive
S1   | No  | Dead

Also got a 500 instead of 404?

@mocchira
Member Author

mocchira commented Dec 21, 2016

@windkit

@mocchira Ya, I agree the latter is a side track; a new issue would be suitable.

thank you.

Also got a 500 instead of 404?

yes.

Impl          | Case1   | Case2
Current (404) | Wrong   | Right
Suggest (500) | Unknown | Unknown

The purpose of responding 500 is to avoid false negatives where possible.

For instance, an application with code like the following

### pseudo
code := http.GET(X).Response().Code
if code != 200 {
  if code == 404 {
    // object "does not exist" -> (re)create it
    http.PUT(X, NewBody)
  } else {
    // retry later when getting 500/503/504
    ...
  }
}

may issue unnecessary PUT operations with the current behaviour, which can be avoided with the new one
(and the reporter on the ML probably has this kind of code, hence the concern).

Also, the title (cited from the ML thread) is slightly different from what eventual consistency actually means
and may give someone the wrong impression, so I'd like to change it to a more appropriate one.

@mocchira mocchira changed the title [leo_storage] eventually consistency can be violated [leo_storage] GET can be false positive more than necessary Dec 21, 2016
@mocchira mocchira changed the title [leo_storage] GET can be false positive more than necessary [leo_storage] GET can be false negative more than necessary Dec 21, 2016
@windkit
Contributor

windkit commented Dec 21, 2016

@mocchira OK, I got it now. I agree that would be a simple solution.

@mocchira
Member Author

@windkit Thank you for sparing your time.

I updated #532 based on your comments ( #553 (comment) ) about consistency.
I think there are still some edge cases that are not explicitly documented, so please comment on or update #532 if you notice anything that should be made explicit as spec, or maybe fixed.

@mocchira
Member Author

IMPL

windkit added a commit to windkit/leofs that referenced this issue Dec 5, 2017
@windkit
Contributor

windkit commented Dec 5, 2017

PR: #948

windkit added a commit to windkit/leofs that referenced this issue Dec 6, 2017
windkit added a commit to windkit/leofs that referenced this issue Dec 6, 2017
windkit added a commit to windkit/leofs that referenced this issue Dec 6, 2017
mocchira pushed a commit that referenced this issue Dec 7, 2017