Snapshot synchronization could remove committed log entries that are not included in snapshots #167

Closed
xirc opened this issue Aug 1, 2022 · 1 comment · Fixed by #168
xirc commented Aug 1, 2022

This happened during fault injection tests.

An entity (called entity X below) on RaftActor (replica-group-2) ended up with inconsistent data:

  • Recovery logs of entity X (~ 08:05:24.836):
    1. Entity X started a recovery...
    2. Entity X received an ApplySnapshot(entitySnapshot=[None]) message
    3. Entity X sent a FetchEntityEvents(..., from=[1], to=[4179], ...) message
    4. Entity X received a RecoveryState(snapshot=[None], events=([11] entries)) message
    5. ...
  • Recovery logs of entity X (~ 08:16:10.985):
    1. Entity X started a recovery...
    2. Entity X received an ApplySnapshot(entitySnapshot=[None]) message
    3. Entity X sent a FetchEntityEvents(..., from=[1], to=[4179], ...) message
    4. Entity X received a RecoveryState(snapshot=[None], events=([0] entries)) message
    5. ...

On the second recovery, entity X did not receive the 11 events that the first recovery returned, which means that entity X's data became inconsistent.

Meanwhile, RaftActor (replica-group-2) ran snapshot synchronization as follows:

  • 08:05:03.392: RaftActor (replica-group-2) started snapshot synchronization:
    • [Follower] Applying event [SnapshotSyncStarted], state diff: [lastSnapshotStatus: SnapshotStatus(Term(16),3730,Term(16),3730) -> SnapshotStatus(Term(16),3730,Term(17),3746)]
  • 08:16:10.517: RaftActor (replica-group-2) completed snapshot synchronization:
    • [Follower] Applying event [SnapshotSyncCompleted], state diff: [replicatedLog: ReplicatedLog(ancestorTerm=Term(14), ancestorIndex=3630, 549 entries with indices Some(3631)...Some(4179)) -> ReplicatedLog(ancestorTerm=Term(17), ancestorIndex=3746, 0 entries with indices None...None), lastSnapshotStatus: SnapshotStatus(Term(16),3730,Term(17),3746) -> SnapshotStatus(Term(17),3746,Term(17),3746)]

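RaftActor (replica-group-2) had committed entries (indices 3746 ~ 3748) at 08:02:27.449.
The snapshot synchronization above replaced the log with an empty one whose ancestor index is 3746, discarding all 549 entries. It therefore removed committed log entries that were not included in the snapshot.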
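To make the loss concrete, here is a minimal Python sketch that replays the numbers from the state diff above. `ReplicatedLog` and `lost_committed_indices` are toy names invented for this illustration, not the library's types, and the commit index of 3748 is only the lower bound reported above (the real commit index at 08:16:10 may have been higher).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReplicatedLog:
    """Toy model: the log holds entries ancestor_index+1 .. last_index."""
    ancestor_index: int
    last_index: int  # equals ancestor_index when the log is empty

    def indices(self) -> range:
        return range(self.ancestor_index + 1, self.last_index + 1)


def lost_committed_indices(before: ReplicatedLog,
                           snapshot_index: int,
                           commit_index: int) -> List[int]:
    """Committed indices in `before` that the snapshot does not cover.

    These are the entries that disappear if the log is reset to an
    empty log whose ancestor is the snapshot index.
    """
    return [i for i in before.indices() if snapshot_index < i <= commit_index]


# Values taken from the SnapshotSyncCompleted state diff.
before = ReplicatedLog(ancestor_index=3630, last_index=4179)
print(len(before.indices()))                       # 549
# commit_index=3748 is an assumed lower bound, see the note above.
print(lost_committed_indices(before, 3746, 3748))  # [3747, 3748]
```

Any entity event stored only in one of the lost entries can no longer be returned by a later FetchEntityEvents, which is consistent with the empty second recovery.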

RaftActor (replica-group-1) was the leader and updated its indices for replica-group-2 as follows:

  • 08:04:13.090: Applying event [SucceededAppendEntries]: next index = 3952 -> 3953, match index = 3951 -> 3952
  • 08:04:14.210: Applying event [BecameLeader]: next index = 3953 -> None, match index = 3952 -> None
  • 08:04:50.558: Applying event [DeniedAppendEntries]: next index = None -> 4058
  • 08:04:50.558: Applying event [DeniedAppendEntries]: next index = 4058 -> 4057
  • ...
  • 08:05:05.632: Applying event [DeniedAppendEntries]: next index = 3213 -> 3212

The next index dropped lower than expected, similar to the situation described in #165 (comment).
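The back-off visible in the log can be sketched as below. This is only an illustration of the one-step decrement seen in the [DeniedAppendEntries] events; the function names are hypothetical, and the follower commit index of 3748 is the lower bound taken from the commit reported at 08:02:27.449. Why the follower kept denying is not shown here.

```python
def decrement_next_index(next_index: int) -> int:
    """One-step back-off: each DeniedAppendEntries lowers next index by 1."""
    return max(1, next_index - 1)


def denials_until(next_index: int, target: int) -> int:
    """Count the denials needed to move next index down to `target`."""
    count = 0
    while next_index > target:
        next_index = decrement_next_index(next_index)
        count += 1
    return count


# Assumption: the follower's commit index was at least 3748.
follower_commit_index = 3748

# From 4058 (08:04:50.558) down to 3212 (08:05:05.632).
print(denials_until(4058, 3212))      # 846
# The probe ended up below entries the follower had already committed.
print(3212 <= follower_commit_index)  # True
```

In Raft, committed entries are expected to be identical on the leader and the follower, so a probe should not normally need to go below the follower's commit index. That is why 3212 looks too low.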


xirc commented Aug 1, 2022

As in #165 (comment), there are at least two possible solutions:

  1. Improve the mechanism for decrementing the next index
  2. Improve the handling of received InstallSnapshot messages
    • RaftActor can skip snapshot synchronization and reply with InstallSnapshotSucceeded immediately, which avoids removing committed log entries that are not included in snapshots.
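A rough Python sketch of the second option follows. It assumes the skip condition is the one the Raft InstallSnapshot rule uses: if the follower already holds an entry with the same index and term as the snapshot's last included entry, it keeps its log and replies right away. `FollowerLog` and `should_skip_snapshot_sync` are hypothetical names, and the per-entry terms in the example are made up, since the issue does not list them.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FollowerLog:
    """Toy follower log: an ancestor (index, term) plus index -> term entries."""
    ancestor_index: int
    ancestor_term: int
    terms: Dict[int, int]

    def term_at(self, index: int) -> Optional[int]:
        if index == self.ancestor_index:
            return self.ancestor_term
        return self.terms.get(index)  # None when the entry is absent


def should_skip_snapshot_sync(log: FollowerLog,
                              snapshot_last_term: int,
                              snapshot_last_index: int) -> bool:
    """True when the log already contains the snapshot's last entry.

    In that case the follower could reply with InstallSnapshotSucceeded
    without touching its log, so later committed entries survive.
    """
    return log.term_at(snapshot_last_index) == snapshot_last_term


# Hypothetical terms: assume every entry 3631..4179 was written in term 17.
log = FollowerLog(ancestor_index=3630, ancestor_term=14,
                  terms={i: 17 for i in range(3631, 4180)})
print(should_skip_snapshot_sync(log, 17, 3746))    # True

# A follower that is genuinely behind still needs the snapshot.
stale = FollowerLog(ancestor_index=3000, ancestor_term=10, terms={})
print(should_skip_snapshot_sync(stale, 17, 3746))  # False
```

Whether the entity snapshots on the follower are also up to date is a separate question that this check does not answer.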

@xirc xirc added this to the v2.1.1 milestone Aug 1, 2022
@xirc xirc added the bug Something isn't working label Oct 28, 2022