Recover from transient gossip failures #1446
@mrjana In the working state, if I shut down the interface on the worker that is used for connectivity with the manager, the gossip cluster fails. If I bring it back up, even after a minute, the gossip cluster seems to be set up correctly; i.e., new services created across nodes can be reached. Without the changes from this PR, is there some other sequence that can re-establish the gossip cluster? In this case the gRPC session also gets created again; I'm not sure whether that plays a role.
@sanimej If you shut down the interface directly on the node, the IP address on the interface is cleared, so the Go API call that sends the UDP packet returns an error immediately. When that happens, memberlist (correctly) treats it as a problem on the local end rather than a remote node failure, and keeps retrying. Once you re-establish the link, the probe succeeds and everything works. To simulate a remote node failure, you need to drop packets or bring down the switch/bridge that connects these nodes.
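The distinction drawn above, between a local send error and a missing reply from the peer, can be sketched as follows. This is a minimal illustration, not memberlist's actual code: `errLocalSend`, `errAckTimeout`, `peerState`, and `classifyProbe` are hypothetical names introduced only for this example.

```go
package main

import "errors"

// Hypothetical sentinel errors for the two ways a probe can go wrong.
var (
	// errLocalSend: the local UDP write itself failed (e.g. interface down).
	errLocalSend = errors.New("local send failed")
	// errAckTimeout: the packet left the node but no ack came back.
	errAckTimeout = errors.New("no ack from peer")
)

// peerState is a simplified view of what the prober believes about a peer.
type peerState int

const (
	peerAlive peerState = iota
	peerSuspect
)

// classifyProbe returns the peer's next state given the probe result.
// A local send error says nothing about the peer, so the current state is
// kept and the caller simply retries on the next probe interval.
func classifyProbe(current peerState, err error) peerState {
	switch {
	case err == nil:
		return peerAlive
	case errors.Is(err, errAckTimeout):
		return peerSuspect
	default:
		// Includes errLocalSend: a local problem, not a remote failure.
		return current
	}
}
```

Under this model, shutting the local interface only ever yields `errLocalSend`, so no peer is marked as failed; dropping packets in the network yields `errAckTimeout` and exercises the failure path.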
// has these in leaving/deleting state still. This is
// to facilitate fast convergence after recovering from a gossip
// failure.
nDB.updateLocalStateTime()
If we update the local state to a new time, shouldn't the updates be sent to all nodes, including the ones still in the cluster?
This update is only there to achieve fast convergence on the node that is recovering from the failure. Since it doesn't carry any real state change, there is no need to update all the nodes in the cluster.
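One way to picture why a fresh local timestamp speeds up convergence is a Lamport-clock merge rule in which a newer timestamp always wins. The sketch below is an assumption about the general mechanism, not the body of `updateLocalStateTime`: `lamportClock`, `entry`, `restampLocal`, and `merge` are hypothetical names, and the code is single-goroutine for brevity.

```go
package main

// lamportClock is a minimal logical clock (not safe for concurrent use).
type lamportClock struct{ t uint64 }

// increment advances the clock and returns the new time.
func (c *lamportClock) increment() uint64 {
	c.t++
	return c.t
}

// entry is a simplified piece of gossiped state.
type entry struct {
	owner    string // node that owns this entry
	ltime    uint64 // logical time of the last update
	deleting bool   // true while a peer considers the entry on its way out
}

// restampLocal gives every entry owned by self a fresh logical time
// without changing its content.
func restampLocal(c *lamportClock, self string, entries map[string]*entry) {
	for _, e := range entries {
		if e.owner == self {
			e.ltime = c.increment()
		}
	}
}

// merge applies a remote entry only if it is strictly newer than what the
// receiver already holds, and reports whether it was applied.
func merge(local map[string]*entry, key string, remote entry) bool {
	if cur, ok := local[key]; ok && cur.ltime >= remote.ltime {
		return false
	}
	e := remote
	local[key] = &e
	return true
}
```

In this model, a peer that still holds the recovering node's entry in a deleting state with a higher timestamp would reject the node's old copy; after `restampLocal`, the same content carries a newer time and replaces the stale record on the next sync.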
Currently, if there is a transient gossip failure on any node, the recovery process depends on other nodes propagating the information indirectly. If these transient failures affect all the nodes that this node has in its memberlist, then this node will be permanently cut off from the gossip channel. Added node state management code in networkdb to address these problems by trying to rejoin the cluster via the failed nodes when there is a failure. This also requires adding new messages, called node event messages, to differentiate between node leave and node failure. Signed-off-by: Jana Radhakrishnan <[email protected]>
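The node state management described in the commit message can be sketched roughly as below. This is a simplified model under stated assumptions, not the networkdb implementation: `nodeTable` and its methods are hypothetical, and the `join` callback stands in for whatever call actually contacts a peer.

```go
package main

// nodeTable tracks peers in three hypothetical buckets.
type nodeTable struct {
	active map[string]bool // peers currently believed reachable
	failed map[string]bool // peers lost without announcing a leave
	left   map[string]bool // peers that announced a graceful leave
}

func newNodeTable() *nodeTable {
	return &nodeTable{
		active: map[string]bool{},
		failed: map[string]bool{},
		left:   map[string]bool{},
	}
}

// handleJoin records a peer as active.
func (t *nodeTable) handleJoin(name string) {
	delete(t.failed, name)
	delete(t.left, name)
	t.active[name] = true
}

// handleLeaveEvent records an explicit leave announced by the peer itself
// (the role played by a node event message in this sketch).
func (t *nodeTable) handleLeaveEvent(name string) {
	delete(t.active, name)
	delete(t.failed, name)
	t.left[name] = true
}

// handleUnreachable is called when the failure detector drops a peer.
// Without a prior leave event, the peer is treated as failed, not gone.
func (t *nodeTable) handleUnreachable(name string) {
	if t.left[name] {
		return
	}
	delete(t.active, name)
	t.failed[name] = true
}

// rejoin tries to re-enter the cluster via each failed peer and returns
// how many peers were recovered. Peers that left are never retried.
func (t *nodeTable) rejoin(join func(name string) error) int {
	recovered := 0
	for name := range t.failed {
		if err := join(name); err != nil {
			continue // still unreachable; try again on the next round
		}
		delete(t.failed, name)
		t.active[name] = true
		recovered++
	}
	return recovered
}
```

Calling `rejoin` periodically means a node whose entire member list failed at once still has addresses to retry, which is the permanent cut-off case the commit message describes.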
LGTM