-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Race in Cluster.onConnect that leaves cluster in shutdown state indefinitely? #18
Comments
I can now trip this somewhat reliably in my own code (unpublished) by starting a handful of nodes (~5-10) and putting them through a short series of brief zk connection losses. I think the root problem is that the code sometimes uses separate Here's one simple example of where this problem can occur:
In the case where the two separate |
As for a fix, I'm imaging something along these lines:
Then any |
This bug seems to amplify a ginormous design flaw in Twitter's Zookeeper API, that is, creating a method that do two things (retrieve or create and retrieve) instead of one. I proposed a fix some time ago, but job commitments got in my way. Your bug pitch shows this question should be addressed as soon as possible! Maybe we can start at ZKClient and move up the stack? |
I don't have hard evidence for this, but I think the following is possible:
SyncConnected
for sessionid"a"
onConnect()
"a"
joinCluster()
zk.get().getSessionId
establishes new session with sessionid"b"
and returns"b"
/<name>/nodes/<nodeID>
created with connectionID"b"
, lifetime bound to sessionid"b"
joinCluster()
,onConnect()
,SyncConnected
SyncConnected
for sessionid"b"
onConnect()
previousZKSessionStillActive()
previousZKSessionStillActive()
uses session"b"
to read/<name>/nodes/<nodeID>
, finds connectionID"b"
, returnstrue
onConnect()
returns early, skipping cluster setupAnd somewhere in there an
Expired
event for sessionid"a"
shuts down the cluster (I'm not sure if it matters when), leaving the cluster in a shutdown state indefinitely becauseonConnect()
for sessionid"b"
was tricked into skipping cluster setup.The text was updated successfully, but these errors were encountered: