Skip to content
This repository has been archived by the owner on Dec 19, 2017. It is now read-only.

Handle inconsistencies in command response event indexes #293

Merged
merged 3 commits into from
Mar 3, 2017

Conversation

kuujo
Copy link
Member

@kuujo kuujo commented Mar 3, 2017

This PR fixes an issue in Copycat's ClientSequencer that can prevent the client from progressing when inconsistencies exist in server-side session sequences.

This is not so much an issue in the client sequencer as it is an issue with non-deterministic event sequences in state machines. This problem can occur when a client switches from a server with one set of events to a server with a different set of events. Once an event sequence is sent to the client from the first server, if the client switches servers and expects an event that doesn't exist on the new server, that can result in a live lock in the ClientSequencer.

The specific scenario that led to the discovery of this bug is as follows:

The client receives the following response from server A:

2017-03-01 21:24:21,653   | ClientSession             | 77377 - Received CommandResponse[status=OK, error=null, index=140761, eventIndex=140711, result=null]

Thereafter, the client switches to a new server which doesn't have any event at index 140711. Instead, if gets a PublishRequest with previousIndex=140709:

2017-03-01 21:24:21,730   | ClientSession             | 77377 - Received PublishRequest[session=77377, eventIndex=140761, previousIndex=140709, events=[Event[event=leadershipChangeEvents, message=InstanceEvent[resource=74, message=[Change{oldValue=Leadership{topic=work-partition-7, leader=Leader{nodeId=172.17.0.3, term=22, termStartTime=1488403219390}, candidates=[172.17.0.3, 172.17.0.4, 172.17.0.2]}, newValue=Leadership{topic=work-partition-7, leader=Leader{nodeId=172.17.0.4, term=23, termStartTime=1488403461730}, candidates=[172.17.0.4, 172.17.0.2]}}]]]]]

However, the client is waiting on the event at index 140711 to complete the response, and that event will never come (at least not from the server to which it's connected).

Ideally, state machines will ensure that events are published deterministically. While Copycat's replication protocol allows commands to be excluded from replication once they've been released, Copycat already ensures that all commands that create events that have not yet been acknowledged by all clients will be replicated. That is, when a command publishes events, the command that created the events will be live and continue to be replicated until the events have been acknowledged by all sessions. That should ensure that any server to which a client is connected will have a consistent sequence of events. But in the event a state machine does not publish events deterministically, consistency checks on the client can result in the client waiting for events that will never come. During that time, the client cannot complete either events or responses.

So, this has been a known issue in the past, and 2b6f47f actually prevented this live lock from occurring with respect to PublishRequest. If a client switched servers and received a PublishRequest from a server that didn't have the client's last event, the server can set previousIndex to the client's last indicated received event to ensure that events can be sequenced on the client. However, this only fixed part of the issue. eventIndex is also sent in responses to the client. So, if a client receives an eventIndex in a response and then switches to a server that doesn't have an event at that index, the client will not be able to sequence events.

The fix for this is to simply check events in the sequencer's queue to detect gaps in expected events during sequencing. When the client is sequencing a response, if the client doesn't have the event prior to the response, it checks the events it has received to determine whether a PublishRequest has a previousIndex that matches the sequencer's current eventIndex. This allows the Math.max call to reset the state inside the sequencer and allow it to sequence operation responses that have a missing eventIndex.

@coveralls
Copy link

Coverage Status

Changes Unknown when pulling e20f732 on client-sequencer-missing-response-event into ** on master**.

@kuujo kuujo merged commit 875b923 into master Mar 3, 2017
@kuujo kuujo deleted the client-sequencer-missing-response-event branch March 3, 2017 11:12
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants