Snapshot synchronization could remove committed log entries that are not included in snapshots #167

xirc · 2022-08-01T08:34:11Z

This happened in some fault injection tests.

An entity (referred to as entity X) on RaftActor (replica-group-2) ended up with inconsistent data:

Recovery logs of entity X (~ 08:05:24.836):
1. Entity X started a recovery...
2. Entity X received an ApplySnapshot(entitySnapshot=[None]) message
3. Entity X sent a FetchEntityEvents(..., from=[1], to=[4179], ...) message
4. Entity X received a RecoveryState(snapshot=[None], events=([11] entries)) message
5. ...
Recovery logs of entity X (~ 08:16:10.985):
1. Entity X started a recovery...
2. Entity X received an ApplySnapshot(entitySnapshot=[None]) message
3. Entity X sent a FetchEntityEvents(..., from=[1], to=[4179], ...) message
4. Entity X received a RecoveryState(snapshot=[None], events=([0] entries)) message
5. ...

During the second recovery, entity X did not receive the events that the first recovery had returned (0 entries instead of 11), which means that entity X's data became inconsistent.

Meanwhile, RaftActor (replica-group-2) ran snapshot synchronization as follows:

08:05:03.392: RaftActor (replica-group-2) started snapshot synchronization:
- [Follower] Applying event [SnapshotSyncStarted], state diff: [lastSnapshotStatus: SnapshotStatus(Term(16),3730,Term(16),3730) -> SnapshotStatus(Term(16),3730,Term(17),3746)]
08:16:10.517: RaftActor (replica-group-2) completed snapshot synchronization:
- [Follower] Applying event [SnapshotSyncCompleted], state diff: [replicatedLog: ReplicatedLog(ancestorTerm=Term(14), ancestorIndex=3630, 549 entries with indices Some(3631)...Some(4179)) -> ReplicatedLog(ancestorTerm=Term(17), ancestorIndex=3746, 0 entries with indices None...None), lastSnapshotStatus: SnapshotStatus(Term(16),3730,Term(17),3746) -> SnapshotStatus(Term(17),3746,Term(17),3746)]

RaftActor (replica-group-2) had committed entries (indices 3746 ~ 3748) at 08:02:27.449, before the synchronization started.
The snapshot synchronization above removed committed log entries that were not included in the snapshot.
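
The effect of the SnapshotSyncCompleted event can be reproduced with a small model. The following is a minimal sketch in Python, not the library's implementation: `ReplicatedLog`, `reset_to_snapshot`, and `install_snapshot_retaining_suffix` are hypothetical names, and the per-entry terms are placeholders because the issue does not list them. The first function mirrors the state diff shown above (the log becomes empty). The second follows the receiver rule from the Raft paper, which retains the entries after the snapshot when the log agrees with the snapshot at its last included index.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ReplicatedLog:
    """Toy model of a Raft log: a compacted prefix plus entries keyed by index."""
    ancestor_term: int
    ancestor_index: int
    entries: Dict[int, int] = field(default_factory=dict)  # index -> term


def reset_to_snapshot(log, snapshot_term, snapshot_index):
    """Discard the whole log and restart it right after the snapshot."""
    return ReplicatedLog(snapshot_term, snapshot_index, {})


def install_snapshot_retaining_suffix(log, snapshot_term, snapshot_index):
    """Keep the entries after the snapshot if the log agrees with it at snapshot_index."""
    if log.entries.get(snapshot_index) == snapshot_term:
        suffix = {i: t for i, t in log.entries.items() if i > snapshot_index}
        return ReplicatedLog(snapshot_term, snapshot_index, suffix)
    return reset_to_snapshot(log, snapshot_term, snapshot_index)


# Indices are taken from the log line above; term 17 for every entry is a placeholder.
before = ReplicatedLog(14, 3630, {i: 17 for i in range(3631, 4180)})

naive = reset_to_snapshot(before, 17, 3746)
retained = install_snapshot_retaining_suffix(before, 17, 3746)

print(len(before.entries))    # 549
print(len(naive.entries))     # 0
print(len(retained.entries))  # 433 (indices 3747..4179)
```

In this model the naive reset drops every entry after index 3746, including the committed ones, whereas the retaining variant keeps them.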

RaftActor (replica-group-1) was the leader and updated its indices for replica-group-2 as follows:

08:04:13.090: Applying event [SucceededAppendEntries]: next index = 3952 -> 3953, match index = 3951 -> 3952
08:04:14.210: Applying event [BecameLeader]: next index = 3953 -> None, match index = 3952 -> None
08:04:50.558: Applying event [DeniedAppendEntries]: next index = None -> 4058
08:04:50.558: Applying event [DeniedAppendEntries]: next index = 4058 -> 4057
...
08:05:05.632: Applying event [DeniedAppendEntries]: next index = 3213 -> 3212

The next index was lower than expected, similar to the situation described in #165 (comment).
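
To make the drift concrete, the following sketch replays the leader-side bookkeeping. It assumes that the first DeniedAppendEntries initializes the next index and that every later one decrements it by exactly one, as the visible log lines show for 4058 -> 4057. The elided lines may not follow that pattern exactly. `FollowerProgress` and `on_denied` are hypothetical names, not the library's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FollowerProgress:
    """Leader-side view of one follower (hypothetical model)."""
    next_index: Optional[int] = None
    match_index: Optional[int] = None

    def on_denied(self, initial_next_index: int) -> None:
        """Initialize next_index on the first denial, then step back by one."""
        if self.next_index is None:
            self.next_index = initial_next_index
        else:
            self.next_index -= 1


progress = FollowerProgress()
denials = 0
# Replay denials until the next index reaches the last value in the log above.
while progress.next_index is None or progress.next_index > 3212:
    progress.on_denied(4058)
    denials += 1

follower_committed_up_to = 3748  # from the issue: indices 3746 ~ 3748 were committed
print(denials)                                          # 847
print(progress.next_index < follower_committed_up_to)   # True
```

Under these assumptions the next index ends up far below indices that replica-group-2 had already committed.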

xirc · 2022-08-01T08:39:56Z

As in #165 (comment), there might be at least two possible solutions:

- Improve the mechanism for decrementing the next index.
- Improve the handling of received InstallSnapshot messages.
  - RaftActor can skip snapshot synchronization and reply with InstallSnapshotSucceeded immediately, so that it does not remove committed log entries that are not included in snapshots.
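
The second option could look roughly like the following. This is a sketch under assumptions, not the library's actual fix: the message shapes, `should_skip_snapshot_sync`, `handle_install_snapshot`, and the choice of comparing against the follower's commit index are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class InstallSnapshot:
    """Hypothetical shape of the leader's request."""
    snapshot_last_term: int
    snapshot_last_index: int


@dataclass(frozen=True)
class InstallSnapshotSucceeded:
    """Hypothetical shape of the follower's reply."""
    snapshot_last_index: int


def should_skip_snapshot_sync(msg: InstallSnapshot, commit_index: int) -> bool:
    """A snapshot ending at or before commit_index covers only entries already committed here."""
    return msg.snapshot_last_index <= commit_index


def handle_install_snapshot(
    msg: InstallSnapshot,
    commit_index: int,
    start_sync: Callable[[InstallSnapshot], None],
) -> Optional[InstallSnapshotSucceeded]:
    """Reply immediately when the snapshot is redundant; otherwise start synchronization."""
    if should_skip_snapshot_sync(msg, commit_index):
        return InstallSnapshotSucceeded(msg.snapshot_last_index)
    start_sync(msg)
    return None  # the reply is sent later, once synchronization completes


started = []
# Values from this issue: snapshot ends at index 3746, follower committed up to 3748.
reply = handle_install_snapshot(InstallSnapshot(17, 3746), 3748, started.append)
print(reply)    # InstallSnapshotSucceeded(snapshot_last_index=3746)
print(started)  # []
```

With the values from this incident the check skips synchronization, so the log entries after index 3746 are left untouched.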

xirc added this to the v2.1.1 milestone Aug 1, 2022

xirc mentioned this issue Aug 3, 2022

Skip unnecessary snapshot synchronization #168

Merged

negokaz closed this as completed in #168 Aug 4, 2022

xirc added the bug label Oct 28, 2022
