This repository has been archived by the owner on Dec 19, 2017. It is now read-only.
Handle inconsistencies in command response event indexes #293
This PR fixes an issue in Copycat's `ClientSequencer` that can prevent the client from progressing when inconsistencies exist in server-side session sequences.

This is not so much an issue in the client sequencer as it is an issue with non-deterministic event sequences in state machines. This problem can occur when a client switches from a server with one set of events to a server with a different set of events. Once an event sequence is sent to the client from the first server, if the client switches servers and expects an event that doesn't exist on the new server, that can result in a live lock in the `ClientSequencer`.

The specific scenario that led to the discovery of this bug is as follows:
The client receives the following response from server A:
Thereafter, the client switches to a new server which doesn't have any event at index `140711`. Instead, it gets a `PublishRequest` with `previousIndex=140709`:

However, the client is waiting on the event at index `140711` to complete the response, and that event will never come (at least not from the server to which it's connected).

Ideally, state machines will ensure that events are published deterministically. While Copycat's replication protocol allows commands to be excluded from replication once they've been released, Copycat already ensures that all commands that create events that have not yet been acknowledged by all clients will be replicated. That is, when a command publishes events, the command that created the events will be live and continue to be replicated until the events have been acknowledged by all sessions. That should ensure that any server to which a client is connected will have a consistent sequence of events. But if a state machine does not publish events deterministically, consistency checks on the client can leave the client waiting for events that will never come. During that time, the client cannot complete either events or responses.
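The stall can be illustrated with a small model. This is only a sketch under stated assumptions, not Copycat's `ClientSequencer`: the class `StallSketch`, the nested `Publish` type, and the methods `onPublish` and `tryCompleteResponse` are hypothetical names, and the model assumes that events received while a response is outstanding are queued behind it, and that a response completes only once the client has completed an event at or beyond the response's `eventIndex`.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical, simplified model of the stall; this is not Copycat's ClientSequencer.
final class StallSketch {
    /** The two indexes of a PublishRequest that matter for sequencing. */
    static final class Publish {
        final long previousIndex;
        final long eventIndex;

        Publish(long previousIndex, long eventIndex) {
            this.previousIndex = previousIndex;
            this.eventIndex = eventIndex;
        }
    }

    private final Queue<Publish> pendingEvents = new ArrayDeque<>();
    private long eventIndex; // index of the last event the client completed

    StallSketch(long lastCompletedEventIndex) {
        this.eventIndex = lastCompletedEventIndex;
    }

    /** Assumption: events received while a response is outstanding are queued. */
    void onPublish(Publish publish) {
        pendingEvents.add(publish);
    }

    /**
     * Tries to complete a response that refers to the event at responseEventIndex.
     * Returns false while the client has not yet seen that event.
     */
    boolean tryCompleteResponse(long responseEventIndex) {
        // Complete queued events that precede (or are) the response's event.
        while (!pendingEvents.isEmpty()
                && pendingEvents.peek().eventIndex <= responseEventIndex) {
            eventIndex = pendingEvents.remove().eventIndex;
        }
        // Strict check with no gap detection: wait until the event has been seen.
        return responseEventIndex <= eventIndex;
    }

    long eventIndex() {
        return eventIndex;
    }
}
```

With a client whose last completed event is assumed to be `140709`, a response carrying `eventIndex` `140711`, and a queued `PublishRequest` with `previousIndex=140709` and an invented `eventIndex` of `140715`, `tryCompleteResponse(140711)` returns `false` on every call: the queued event is held behind the response, and the response is held behind an event the new server will never send.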
So, this has been a known issue in the past, and 2b6f47f actually prevented this live lock from occurring with respect to `PublishRequest`. If a client switched servers and received a `PublishRequest` from a server that didn't have the client's last event, the server can set `previousIndex` to the client's last indicated received event to ensure that events can be sequenced on the client. However, this only fixed part of the issue. `eventIndex` is also sent in responses to the client. So, if a client receives an `eventIndex` in a response and then switches to a server that doesn't have an event at that index, the client will not be able to sequence events.

The fix for this is to simply check events in the sequencer's queue to detect gaps in expected events during sequencing. When the client is sequencing a response, if the client doesn't have the event prior to the response, it checks the events it has received to determine whether a `PublishRequest` has a `previousIndex` that matches the sequencer's current `eventIndex`. This allows the `Math.max` call to reset the state inside the sequencer and allow it to sequence operation responses that have a missing `eventIndex`.
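The gap check can be sketched as follows. This is a simplified illustration under assumptions, not the code in this PR: `GapCheckSketch`, `Publish`, `onPublish`, `tryCompleteResponse`, and `pendingEventCount` are hypothetical names, and the sketch models only the queue scan for a matching `previousIndex`. It does not attempt to reproduce the `Math.max` state reset.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical, simplified model of the gap check; this is not Copycat's ClientSequencer.
final class GapCheckSketch {
    /** The two indexes of a PublishRequest that matter for sequencing. */
    static final class Publish {
        final long previousIndex;
        final long eventIndex;

        Publish(long previousIndex, long eventIndex) {
            this.previousIndex = previousIndex;
            this.eventIndex = eventIndex;
        }
    }

    private final Queue<Publish> pendingEvents = new ArrayDeque<>();
    private long eventIndex; // index of the last event the client completed

    GapCheckSketch(long lastCompletedEventIndex) {
        this.eventIndex = lastCompletedEventIndex;
    }

    /** Assumption: events received while a response is outstanding are queued. */
    void onPublish(Publish publish) {
        pendingEvents.add(publish);
    }

    /** Returns true if the response can be sequenced now. */
    boolean tryCompleteResponse(long responseEventIndex) {
        // Complete queued events that precede (or are) the response's event.
        while (!pendingEvents.isEmpty()
                && pendingEvents.peek().eventIndex <= responseEventIndex) {
            eventIndex = pendingEvents.remove().eventIndex;
        }
        if (responseEventIndex <= eventIndex) {
            return true; // the event prior to the response has been seen
        }
        // Gap check: a queued event whose previousIndex equals the current
        // eventIndex means this server has no event in between, so the event
        // the response refers to will never arrive from it.
        for (Publish publish : pendingEvents) {
            if (publish.previousIndex == eventIndex) {
                return true;
            }
        }
        return false; // no evidence of a gap yet; keep waiting
    }

    long eventIndex() {
        return eventIndex;
    }

    int pendingEventCount() {
        return pendingEvents.size();
    }
}
```

In the scenario above (last completed event assumed to be `140709`, response `eventIndex` of `140711`, and a queued `PublishRequest` with `previousIndex=140709` and an invented `eventIndex` of `140715`), `tryCompleteResponse(140711)` now returns `true`, and the queued event remains available to be completed after the response. If no queued event has a matching `previousIndex`, the response keeps waiting, which preserves the ordering guarantee when the expected event is merely delayed.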