
Consecutive snapshot synchronizations could not synchronize new entity snapshots #215

Closed
xirc opened this issue Oct 23, 2023 · 1 comment · Fixed by #217

xirc commented Oct 23, 2023

version: 2.3.0

Log diagnosis

The following log was captured in a fault injection test. It suggests that an entity (running in replica-group-1) might be facing data inconsistency (detecting event number constraint violations is described in https://github.com/lerna-stack/akka-entity-replication/blob/v2.3.0/docs/typed/implementation_guide.md#detecting-data-inconsistencies-by-entity-implementation):

06:13:23.767 TRACE   [Inactive] Received Activate: recoveryIndex=[9077], shardSnapshotStore=[Actor[akka://***/system/sharding/raft-shard-***-replica-group-1/1/1/ShardSnapshotStore:1#721724213]]
06:13:23.767 TRACE   [Recovering] Sending FetchSnapshot: entityId=[0000000966], to=[Actor[akka://***/system/sharding/raft-shard-***-replica-group-1/1/1/ShardSnapshotStore:1#721724213]]
06:13:23.767 TRACE   [Recovering] Starting single RecoveryTimeoutTimer: delay=[35 seconds]
06:13:23.767 TRACE   [Recovering] Stashing Replica: index=[9078], term=[8], entityId=[Some(0000000966)], eventType=[***]
06:13:23.767 TRACE   [Recovering] Received ApplySnapshot: index=[Some(2001)], entityId=[Some(0000000966)], stateType=[Some(***)]
06:13:23.767 TRACE   [Recovering] Sending FetchEntityEvents: entityId=[0000000966], fromIndex=[2002], toIndex=[9077], replyTo=[Actor[akka://***/system/sharding/raft-shard-***-replica-group-1/1/1/0000000966/$$a-adapter#-394328591]], to=[Actor[akka://***/system/sharding/raft-shard-***-replica-group-1/1/1#-1477116895]]
06:13:23.768 TRACE   [Recovering] Received RecoveryState: snapshot.index=[Some(2001)], snapshot.entityId=[Some(0000000966)], snapshot.stateType=[Some(***)], events.size=[0], events.head=[None], events.last=[None]
06:13:23.768 TRACE   [Recovering] Recovering with initial state: index=[2001], stateType=[***]
06:13:23.768 TRACE   [Recovering] Recovered with state: index=[2001], stateType=[***]
06:13:23.768 TRACE   [Ready] Received Replica: index=[9078], term=[8], entityId=[Some(0000000966)], eventType=[***]
06:13:23.768 ERROR   [EntityId=0000000966] Violated event number constraint: eventNo=[6] should be equal to state.nextEventNo: [3]; event=[***]
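
For reference, the event number constraint in the ERROR line above follows the detection pattern described in the linked implementation guide. A minimal sketch of such a check, with hypothetical Event and State types rather than the actual entity implementation, might look like this:

final case class Event(eventNo: Long /* plus domain fields */)

final case class State(nextEventNo: Long /* plus domain fields */) {
  def applyEvent(event: Event): State = {
    // Events must arrive with strictly sequential numbers; recovering from a
    // stale snapshot makes this check fail, as seen in the ERROR log above.
    require(
      event.eventNo == nextEventNo,
      s"Violated event number constraint: eventNo=[${event.eventNo}] " +
        s"should be equal to state.nextEventNo: [$nextEventNo]"
    )
    copy(nextEventNo = nextEventNo + 1)
  }
}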

The above log also shows that the entity recovered from a snapshot (index=2001) with no events.
However, RaftActor had replicated the following events for that entity:

05:42:36.278 akka://***/system/sharding/raft-shard-***-replica-group-1/1/1/0000000966      [WaitForReplication] Received ReplicationSucceeded: index=[1464], instanceId=[Some(72848)], eventType=[***]
05:42:36.831 akka://***/system/sharding/raft-shard-***-replica-group-1/1/1/0000000966      [WaitForReplication] Received ReplicationSucceeded: index=[1470], instanceId=[Some(72949)], eventType=[***]
05:42:36.982 akka://***/system/sharding/raft-shard-***-replica-group-1/1/1/0000000966      [WaitForReplication] Received ReplicationSucceeded: index=[1471], instanceId=[Some(72979)], eventType=[***]
05:57:59.938 akka://***/system/sharding/raft-shard-***-replica-group-2/1/1/0000000966      [WaitForReplication] Received ReplicationSucceeded: index=[5298], instanceId=[Some(279516)], eventType=[***]
05:58:00.503 akka://***/system/sharding/raft-shard-***-replica-group-2/1/1/0000000966      [WaitForReplication] Received ReplicationSucceeded: index=[5304], instanceId=[Some(279621)], eventType=[***]
05:58:00.644 akka://***/system/sharding/raft-shard-***-replica-group-2/1/1/0000000966      [WaitForReplication] Received ReplicationSucceeded: index=[5305], instanceId=[Some(279646)], eventType=[***]
06:13:23.614 akka://***/system/sharding/raft-shard-***-replica-group-2/1/1/0000000966      [WaitForReplication] Received ReplicationSucceeded: index=[9078], instanceId=[Some(481219)], eventType=[***]
06:13:24.173 akka://***/system/sharding/raft-shard-***-replica-group-2/1/1/0000000966      [WaitForReplication] Received ReplicationSucceeded: index=[9083], instanceId=[Some(481322)], eventType=[***]
06:13:24.324 akka://***/system/sharding/raft-shard-***-replica-group-2/1/1/0000000966      [WaitForReplication] Received ReplicationSucceeded: index=[9085], instanceId=[Some(481351)], eventType=[***]

Before that happened, SnapshotSyncManager (running in replica-group-1) executed snapshot synchronizations as below:

05:54:25.283 INFO    Recovery completed: state=[SyncProgress(NoOffset)]
05:54:25.283 INFO    Snapshot synchronization started: (typeName: ***, memberIndex: replica-group-2, snapshotLastLogTerm: 8, snapshotLastLogIndex: 3875) -> (typeName: ***, memberIndex: replica-group-1, snapshotLastLogTerm: 2, snapshotLastLogIndex: 1013)
05:57:04.684 INFO    Stopping itself.
05:57:04.847 INFO    Recovery completed: state=[SyncProgress(NoOffset)]
05:57:04.847 INFO    Snapshot synchronization started: (typeName: ***, memberIndex: replica-group-2, snapshotLastLogTerm: 8, snapshotLastLogIndex: 4802) -> (typeName: ***, memberIndex: replica-group-1, snapshotLastLogTerm: 2, snapshotLastLogIndex: 1013)
05:59:55.112 INFO    Snapshot synchronization completed: (typeName: ***, memberIndex: replica-group-2) -> (typeName: ***, memberIndex: replica-group-1, snapshotLastLogTerm: 2, snapshotLastLogIndex: 1013)
05:59:55.113 INFO    Snapshot synchronization started: (typeName: ***, memberIndex: replica-group-2, snapshotLastLogTerm: 8, snapshotLastLogIndex: 5741) -> (typeName: ***, memberIndex: replica-group-1, snapshotLastLogTerm: 2, snapshotLastLogIndex: 1013)
05:59:55.114 INFO    Succeeded to saveSnapshot given metadata [SnapshotMetadata(SnapshotSyncManager:***:replica-group-2:replica-group-1:1, 3, 1697435995112, None)]
06:00:05.212 INFO    Snapshot synchronization completed: (typeName: ***, memberIndex: replica-group-2) -> (typeName: ***, memberIndex: replica-group-1, snapshotLastLogTerm: 2, snapshotLastLogIndex: 1013)
06:00:05.213 INFO    Succeeded to saveSnapshot given metadata [SnapshotMetadata(SnapshotSyncManager:***:replica-group-2:replica-group-1:1, 5, 1697436005212, None)]
06:00:05.213 INFO    Stopping itself.

A SnapshotStore (for the entity, running in replica-group-2) wrote snapshots as below:

05:44:29.922 DEBUG   Saved EntitySnapshot: entityId=[0000000966], logEntryIndex=[2001], stateType=[***]
05:59:48.773 DEBUG   Saved EntitySnapshot: entityId=[0000000966], logEntryIndex=[5741], stateType=[***]

However, the SnapshotStore for the same entity running in replica-group-1 did not write the snapshot with index=5741; it only wrote the following:

05:59:37.347 DEBUG   Saved EntitySnapshot: entityId=[0000000966], logEntryIndex=[2001], stateType=[***]

The log below, which was newly added for this diagnosis, shows the EntitySnapshotsUpdated events (i.e., CompactionCompleted) to which SnapshotSyncManager subscribed:

05:57:04.652 DEBUG   [entity-snapshots-updated-events] Element: EntitySnapshotsUpdated(snapshotLastLogTerm: 2, snapshotLastLogIndex: 1030, eventType: lerna.akka.entityreplication.raft.RaftActor$CompactionCompleted, persistenceId: raft:***:1:replica-group-2, sequenceNr: 1033, offset: TimeBasedUUID(90b42800-6be6-11ee-ad67-25a25fca1ba5))
05:57:04.657 DEBUG   [entity-snapshots-updated-events] Element: EntitySnapshotsUpdated(snapshotLastLogTerm: 8, snapshotLastLogIndex: 2001, eventType: lerna.akka.entityreplication.raft.RaftActor$CompactionCompleted, persistenceId: raft:***:1:replica-group-2, sequenceNr: 2014, offset: TimeBasedUUID(1336f410-6be7-11ee-ad67-25a25fca1ba5))
05:57:04.663 DEBUG   [entity-snapshots-updated-events] Element: EntitySnapshotsUpdated(snapshotLastLogTerm: 8, snapshotLastLogIndex: 2953, eventType: lerna.akka.entityreplication.raft.RaftActor$CompactionCompleted, persistenceId: raft:***:1:replica-group-2, sequenceNr: 2966, offset: TimeBasedUUID(a101c810-6be7-11ee-ad67-25a25fca1ba5))
05:57:04.670 DEBUG   [entity-snapshots-updated-events] Element: EntitySnapshotsUpdated(snapshotLastLogTerm: 8, snapshotLastLogIndex: 3875, eventType: lerna.akka.entityreplication.raft.RaftActor$CompactionCompleted, persistenceId: raft:***:1:replica-group-2, sequenceNr: 3890, offset: TimeBasedUUID(2bc72440-6be8-11ee-ad67-25a25fca1ba5))
05:57:04.670 DEBUG   [entity-snapshots-updated-events] Downstream finished, cause: SubscriptionWithCancelException$StageWasCompleted$: null
05:59:16.036 DEBUG   [entity-snapshots-updated-events] Element: EntitySnapshotsUpdated(snapshotLastLogTerm: 2, snapshotLastLogIndex: 1030, eventType: lerna.akka.entityreplication.raft.RaftActor$CompactionCompleted, persistenceId: raft:***:1:replica-group-2, sequenceNr: 1033, offset: TimeBasedUUID(90b42800-6be6-11ee-ad67-25a25fca1ba5))
05:59:16.046 DEBUG   [entity-snapshots-updated-events] Element: EntitySnapshotsUpdated(snapshotLastLogTerm: 8, snapshotLastLogIndex: 2001, eventType: lerna.akka.entityreplication.raft.RaftActor$CompactionCompleted, persistenceId: raft:***:1:replica-group-2, sequenceNr: 2014, offset: TimeBasedUUID(1336f410-6be7-11ee-ad67-25a25fca1ba5))
05:59:16.053 DEBUG   [entity-snapshots-updated-events] Element: EntitySnapshotsUpdated(snapshotLastLogTerm: 8, snapshotLastLogIndex: 2953, eventType: lerna.akka.entityreplication.raft.RaftActor$CompactionCompleted, persistenceId: raft:***:1:replica-group-2, sequenceNr: 2966, offset: TimeBasedUUID(a101c810-6be7-11ee-ad67-25a25fca1ba5))
05:59:16.058 DEBUG   [entity-snapshots-updated-events] Element: EntitySnapshotsUpdated(snapshotLastLogTerm: 8, snapshotLastLogIndex: 3875, eventType: lerna.akka.entityreplication.raft.RaftActor$CompactionCompleted, persistenceId: raft:***:1:replica-group-2, sequenceNr: 3890, offset: TimeBasedUUID(2bc72440-6be8-11ee-ad67-25a25fca1ba5))
05:59:16.059 DEBUG   [entity-snapshots-updated-events] Downstream finished, cause: SubscriptionWithCancelException$StageWasCompleted$: null
05:59:41.724 DEBUG   [entity-snapshots-updated-events] Element: EntitySnapshotsUpdated(snapshotLastLogTerm: 8, snapshotLastLogIndex: 3875, eventType: lerna.akka.entityreplication.raft.RaftActor$CompactionCompleted, persistenceId: raft:***:1:replica-group-2, sequenceNr: 3890, offset: TimeBasedUUID(2bc72440-6be8-11ee-ad67-25a25fca1ba5))
05:59:41.727 DEBUG   [entity-snapshots-updated-events] Element: EntitySnapshotsUpdated(snapshotLastLogTerm: 8, snapshotLastLogIndex: 4802, eventType: lerna.akka.entityreplication.raft.RaftActor$CompactionCompleted, persistenceId: raft:***:1:replica-group-2, sequenceNr: 4819, offset: TimeBasedUUID(ae7c9910-6be8-11ee-ad67-25a25fca1ba5))
05:59:46.750 DEBUG   [entity-snapshots-updated-events] Upstream finished.
05:59:55.123 DEBUG   [entity-snapshots-updated-events] Element: EntitySnapshotsUpdated(snapshotLastLogTerm: 8, snapshotLastLogIndex: 5741, eventType: lerna.akka.entityreplication.raft.RaftActor$CompactionCompleted, persistenceId: raft:***:1:replica-group-2, sequenceNr: 5757, offset: TimeBasedUUID(36e567a0-6be9-11ee-ad67-25a25fca1ba5))
06:00:00.139 DEBUG   [entity-snapshots-updated-events] Upstream finished.

Also, the log below, which was also newly added for this diagnosis, shows the entity snapshots that SnapshotSyncManager copied:

05:59:37.255 DEBUG   Copying EntitySnapshot: entityId=[0000000966], from=[Actor[akka://***/system/sharding/raft-shard-***-replica-group-1/1/1/SnapshotSyncManager:***/$a#-2034853661]], to=[Actor[akka://***/system/sharding/raft-shard-***-replica-group-1/1/1/ShardSnapshotStore:1#721724213]]
05:59:37.348 DEBUG   Copied EntitySnapshot: entityId=[0000000966], from=[Actor[akka://***/system/sharding/raft-shard-***-replica-group-1/1/1/SnapshotSyncManager:***/$a#-2034853661]], to=[Actor[akka://***/system/sharding/raft-shard-***-replica-group-1/1/1/ShardSnapshotStore:1#721724213]], sourceEntitySnapshotMetadata=[EntitySnapshotMetadata(NormalizedEntityId(0000000966),2001)], destinationEntitySnapshotMetadata=[EntitySnapshotMetadata(NormalizedEntityId(0000000966),2001)]
06:00:05.089 DEBUG   Copying EntitySnapshot: entityId=[0000000966], from=[Actor[akka://***/system/sharding/raft-shard-***-replica-group-1/1/1/SnapshotSyncManager:***/$a#-2034853661]], to=[Actor[akka://***/system/sharding/raft-shard-***-replica-group-1/1/1/ShardSnapshotStore:1#721724213]]
06:00:05.089 DEBUG   Copied EntitySnapshot: entityId=[0000000966], from=[Actor[akka://***/system/sharding/raft-shard-***-replica-group-1/1/1/SnapshotSyncManager:***/$a#-2034853661]], to=[Actor[akka://***/system/sharding/raft-shard-***-replica-group-1/1/1/ShardSnapshotStore:1#721724213]], sourceEntitySnapshotMetadata=[EntitySnapshotMetadata(NormalizedEntityId(0000000966),2001)], destinationEntitySnapshotMetadata=[EntitySnapshotMetadata(NormalizedEntityId(0000000966),2001)]

The above logs suggest that SnapshotSyncManager read an old entity snapshot from a source SnapshotStore.

Note that RaftActor (running in replica-group-2) executed Raft log compactions as below:

05:40:50.718 INFO    [Follower] compaction started (logEntryIndex: 1030, number of entities: 339)
05:40:51.081 INFO    [Follower] compaction completed (term: Term(2), logEntryIndex: 1030)
05:44:29.638 INFO    [Leader] compaction started (logEntryIndex: 2001, number of entities: 299)
05:44:30.026 INFO    [Leader] compaction completed (term: Term(8), logEntryIndex: 2001)
05:48:27.318 INFO    [Leader] compaction started (logEntryIndex: 2953, number of entities: 336)
05:48:28.103 INFO    [Leader] compaction completed (term: Term(8), logEntryIndex: 2953)
05:52:20.308 INFO    [Leader] compaction started (logEntryIndex: 3875, number of entities: 306)
05:52:20.728 INFO    [Leader] compaction completed (term: Term(8), logEntryIndex: 3875)
05:55:59.609 INFO    [Leader] compaction started (logEntryIndex: 4802, number of entities: 309)
05:56:00.031 INFO    [Leader] compaction completed (term: Term(8), logEntryIndex: 4802)
05:59:48.508 INFO    [Leader] compaction started (logEntryIndex: 5741, number of entities: 314)
05:59:48.878 INFO    [Leader] compaction completed (term: Term(8), logEntryIndex: 5741)
06:03:38.708 INFO    [Leader] compaction started (logEntryIndex: 6691, number of entities: 317)
06:03:39.058 INFO    [Leader] compaction completed (term: Term(8), logEntryIndex: 6691)
06:07:34.478 INFO    [Leader] compaction started (logEntryIndex: 7621, number of entities: 311)
06:07:34.848 INFO    [Leader] compaction completed (term: Term(8), logEntryIndex: 7621)
06:11:10.428 INFO    [Leader] compaction started (logEntryIndex: 8522, number of entities: 300)
06:11:10.829 INFO    [Leader] compaction completed (term: Term(8), logEntryIndex: 8522)

Scenario

The log diagnosis suggests that the following scenario happened. Suppose this sequence:

  1. SnapshotSyncManager in replica-group-1 is synchronizing entity snapshots from replica-group-2.
  2. During the synchronization, RaftActor in replica-group-2 executes a Raft log compaction.
  3. After the SnapshotSyncManager completes the synchronization, RaftActor in replica-group-1 starts a new snapshot synchronization.

If a new SnapshotSyncManager instance (different from the one that ran the first synchronization) handles the new snapshot synchronization, it succeeds. However, if the same SnapshotSyncManager instance runs both synchronizations, the second one sometimes cannot synchronize new entity snapshots.
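
Whether the same instance is reused depends on how the parent looks up its child. The following is an illustration only (not the actual RaftActor code) of the kind of child-reuse pattern, in classic Akka, that would hand consecutive synchronizations to the same SnapshotSyncManager instance:

import akka.actor.{Actor, ActorRef, Props}

// Hypothetical message and parent; only the reuse pattern matters here.
case object StartSnapshotSynchronization

class ParentLikeRaftActor(snapshotSyncManagerProps: Props) extends Actor {
  private val childName = "SnapshotSyncManager"

  // Reuses an existing child if one is still running, otherwise spawns a new one.
  private def snapshotSyncManager(): ActorRef =
    context.child(childName).getOrElse(context.actorOf(snapshotSyncManagerProps, childName))

  override def receive: Receive = {
    case StartSnapshotSynchronization =>
      // A reused child keeps the SnapshotStore children (and their in-memory
      // snapshot caches) it spawned for the previous synchronization.
      snapshotSyncManager() ! StartSnapshotSynchronization
  }
}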

The following sequence diagram illustrates the details of this scenario:

sequenceDiagram
    autonumber
%% Actors
    participant RaftActor1 as RaftActor<br/>(replica-group-1)
    participant SnapshotSyncManager as SnapshotSyncManager<br/>(replica-group-1)
    participant ShardSnapshotStore1 as ShardSnapshotStore<br/>(replica-group-1)
    participant SnapshotStore1 as SnapshotStore<br/>(replica-group-1, EntityId=1)
    participant ShardSnapshotStore2ForSync as ShardSnapshotStore<br/>(replica-group-2)
    participant SnapshotStore2ForSync as SnapshotStore<br/>(replica-group-2, EntityId=1)
    participant PersistentDataStore as Persistent Data Store<br/>(i.e. Cassandra)
    participant RaftActor2 as RaftActor<br/>(replica-group-2)
    participant ShardSnapshotStore2 as ShardSnapshotStore<br/>(replica-group-2)
    participant SnapshotStore2 as SnapshotStore<br/>(replica-group-2, EntityId=1)
%% Snapshot Synchronization
    RaftActor1 --> RaftActor1: Attempt to start snapshot synchronization (source=replica-group-2)
    activate RaftActor1
    RaftActor1 --> SnapshotSyncManager: Spawn SnapshotSyncManager actor
    activate SnapshotSyncManager
    RaftActor1 ->> SnapshotSyncManager: Send SyncSnapshot
    SnapshotSyncManager --> PersistentDataStore: Recovery from persisted data
    SnapshotSyncManager --> SnapshotSyncManager: Start snapshot synchronization
    activate SnapshotSyncManager
    SnapshotSyncManager --> ShardSnapshotStore2ForSync: Spawn ShardSnapshotStore
    SnapshotSyncManager ->> ShardSnapshotStore2ForSync: Send FetchSnapshot
    ShardSnapshotStore2ForSync --> SnapshotStore2ForSync: Spawn SnapshotStore
    ShardSnapshotStore2ForSync ->> SnapshotStore2ForSync: Forward FetchSnapshot
    SnapshotStore2ForSync --> PersistentDataStore: Recovery from persisted data
    SnapshotStore2ForSync ->> SnapshotSyncManager: Reply SnapshotFound

    par Reading EntitySnapshot from the persistent data store
        SnapshotSyncManager ->> ShardSnapshotStore1: Send SaveSnapshot
        ShardSnapshotStore1 --> SnapshotStore1: Spawn SnapshotStore
        SnapshotStore1 --> PersistentDataStore: Recovery from persisted data
        ShardSnapshotStore1 ->> SnapshotStore1: Forward SaveSnapshot
        SnapshotStore1 -->> PersistentDataStore: Persist EntitySnapshot
        SnapshotStore1 ->> SnapshotSyncManager: Reply SaveSnapshotSuccess
        SnapshotSyncManager -->> PersistentDataStore: Persist SyncCompleted
        SnapshotSyncManager ->> RaftActor1: SyncSnapshotSucceeded
        deactivate SnapshotSyncManager
        deactivate RaftActor1

    and Writing EntitySnapshot to the persistent data store
    %% Compaction
        RaftActor2 --> RaftActor2: Start compaction
        activate RaftActor2
        RaftActor2 ->> ShardSnapshotStore2: Send SaveSnapshot
        ShardSnapshotStore2 ->> SnapshotStore2: Forward SaveSnapshot
        SnapshotStore2 --> PersistentDataStore: Recovery from persisted data
        SnapshotStore2 -->> PersistentDataStore: Persist EntitySnapshot
        Note over SnapshotStore2, PersistentDataStore: Another SnapshotStore (replica-group-2, EntityId=1) doesn't read this persisted EntitySnapshot<br/>if it has already been recovered.
        SnapshotStore2 ->> RaftActor2: Reply SaveSnapshotSuccess
        RaftActor2 -->> PersistentDataStore: Persist CompactionCompleted
        deactivate RaftActor2
        RaftActor2 -->> RaftActor1: Send heartbeats, which tell RaftActor1 that it must start a new snapshot synchronization.
    %% Snapshot Synchronization
        RaftActor1 --> RaftActor1: Attempt to start snapshot synchronization (source=replica-group-2)
        activate RaftActor1
        RaftActor1 ->> SnapshotSyncManager: SyncSnapshot

    end

    SnapshotSyncManager --> SnapshotSyncManager: Start snapshot synchronization
    activate SnapshotSyncManager
    SnapshotSyncManager ->> ShardSnapshotStore2ForSync: Send FetchSnapshot
    ShardSnapshotStore2ForSync ->> SnapshotStore2ForSync: Forward FetchSnapshot
    Note over SnapshotStore2ForSync: SnapshotStore (replica-group-2, EntityId=1) returns the cached EntitySnapshot.<br/>It doesn't return data that another SnapshotStore instance writes after that caching.
    SnapshotStore2ForSync ->> SnapshotSyncManager: Reply SnapshotFound
    SnapshotSyncManager ->> ShardSnapshotStore1: Send SaveSnapshot
    ShardSnapshotStore1 ->> SnapshotStore1: Forward SaveSnapshot
    SnapshotStore1 -->> SnapshotStore1: Skip the EntitySnapshot save if it has already saved the snapshot.
    SnapshotStore1 ->> SnapshotSyncManager: Reply SaveSnapshotSuccess
    SnapshotSyncManager -->> PersistentDataStore: Persist SyncCompleted
    SnapshotStore1 ->> SnapshotSyncManager: Reply SaveSnapshotSuccess
    SnapshotSyncManager ->> RaftActor1: SyncSnapshotSucceeded
    deactivate SnapshotSyncManager
    deactivate SnapshotSyncManager
    deactivate RaftActor1

The following figure shows the hierarchy of these actors. While the SnapshotSyncManager in replica-group-1 spawns a source SnapshotStore as its child for fetching an entity snapshot, RaftActor in replica-group-2 spawns another SnapshotStore with the same persistence ID for compaction.

flowchart
%% Actors
    RaftActor1["RaftActor (replica-group-1)"]
    SnapshotSyncManager["SnapshotSyncManager (replica-group-1)"]
    ShardSnapshotStore1["ShardSnapshotStore (replica-group-1)"]
    SnapshotStore1["SnapshotStore (replica-group-1, EntityId=1)"]
    ShardSnapshotStore2ForSync["ShardSnapshotStore (replica-group-2)"]
    SnapshotStore2ForSync["SnapshotStore (replica-group-2, EntityId=1)"]
    RaftActor2["RaftActor (replica-group-2)"]
    ShardSnapshotStore2["ShardSnapshotStore (replica-group-2)"]
    SnapshotStore2["SnapshotStore (replica-group-2, EntityId=1)"]
%% Hierarchy
    RaftActor1 --- SnapshotSyncManager
    RaftActor1 --- ShardSnapshotStore1
    ShardSnapshotStore1 --- SnapshotStore1
    subgraph Run for Snapshot Synchronization
        SnapshotSyncManager --- ShardSnapshotStore2ForSync
        ShardSnapshotStore2ForSync --- SnapshotStore2ForSync
    end
    RaftActor2 --- ShardSnapshotStore2
    ShardSnapshotStore2 --- SnapshotStore2

A SnapshotStore caches an entity snapshot in memory when it replays events from the persistent store. If a SnapshotSyncManager reuses the same SnapshotStore instance, it will read the cached entity snapshot, which could be stale, and then write that stale snapshot to the destination SnapshotStore.
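
The following is a minimal sketch, in classic Akka Persistence, of the caching behavior described above; the message and class names are illustrative, not the actual SnapshotStore implementation:

import akka.actor.ActorRef
import akka.persistence.PersistentActor

final case class EntitySnapshot(logEntryIndex: Long, state: Any)
final case class FetchSnapshot(replyTo: ActorRef)
final case class SnapshotFound(snapshot: EntitySnapshot)
final case class SaveSnapshot(snapshot: EntitySnapshot, replyTo: ActorRef)
case object SaveSnapshotSuccess

class SnapshotStoreLike(override val persistenceId: String) extends PersistentActor {
  // The snapshot recovered or persisted by *this* instance is cached in memory.
  private var cached: Option[EntitySnapshot] = None

  override def receiveRecover: Receive = {
    case snapshot: EntitySnapshot => cached = Some(snapshot)
  }

  override def receiveCommand: Receive = {
    case FetchSnapshot(replyTo) =>
      // Replies from the in-memory cache only. If another actor with the same
      // persistenceId persisted a newer EntitySnapshot after this instance
      // recovered, that newer snapshot is never read here.
      cached.foreach(snapshot => replyTo ! SnapshotFound(snapshot))
    case SaveSnapshot(snapshot, replyTo) =>
      persist(snapshot) { persisted =>
        cached = Some(persisted)
        replyTo ! SaveSnapshotSuccess
      }
  }
}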

Considerations

Possible solutions might be the following:

  • SnapshotSyncManager doesn't reuse any SnapshotStore instance.
    It will start SnapshotStore instances when it starts a snapshot synchronization and
    stop them when it completes the synchronization (a rough sketch follows after this list).
  • SnapshotSyncManager handles only one snapshot synchronization.
    Another new snapshot synchronization will be handled by another new SnapshotSyncManager instance.
  • SnapshotSyncManager doesn't spawn any SnapshotStore as its child.
    Instead, it interacts with the stores that run on the node in the source replica group such that
    only one SnapshotStore with the same persistence ID runs in the cluster at the same time.
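
As a rough sketch of the first option, under the assumption that it is enough to discard the source stores' in-memory state between synchronizations (all names are illustrative, not the actual implementation):

import akka.actor.{Actor, ActorRef, PoisonPill, Props}

// Illustrative messages; the real protocol carries more information.
case object SyncSnapshot
case object SynchronizationCompleted

class SnapshotSyncManagerLike(sourceShardSnapshotStoreProps: Props) extends Actor {

  override def receive: Receive = idle

  private def idle: Receive = {
    case SyncSnapshot =>
      // A fresh source store per synchronization recovers from the persistent
      // data store and therefore sees every snapshot written so far.
      val sourceStore = context.actorOf(sourceShardSnapshotStoreProps)
      // ... fetch entity snapshots via sourceStore and copy them to the destination ...
      context.become(synchronizing(sourceStore))
  }

  private def synchronizing(sourceStore: ActorRef): Receive = {
    case SynchronizationCompleted =>
      // Stopping the source store (and its SnapshotStore children) discards any
      // in-memory snapshot cache before the next synchronization starts.
      sourceStore ! PoisonPill
      context.become(idle)
  }
}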

xirc commented Oct 23, 2023

Some diagnostic logs of SnapshotSyncManager are introduced by PR #216
