version: 2.3.0

Log diagnosis

The following log was obtained in a fault injection test. It suggests that an entity (running in replica-group-1) might be facing data inconsistency (detecting an event number constraint violation is described in https://github.com/lerna-stack/akka-entity-replication/blob/v2.3.0/docs/typed/implementation_guide.md#detecting-data-inconsistencies-by-entity-implementation):
```
06:13:23.767 TRACE [Inactive] Received Activate: recoveryIndex=[9077], shardSnapshotStore=[Actor[akka://***/system/sharding/raft-shard-***-replica-group-1/1/1/ShardSnapshotStore:1#721724213]]
06:13:23.767 TRACE [Recovering] Sending FetchSnapshot: entityId=[0000000966], to=[Actor[akka://***/system/sharding/raft-shard-***-replica-group-1/1/1/ShardSnapshotStore:1#721724213]]
06:13:23.767 TRACE [Recovering] Starting single RecoveryTimeoutTimer: delay=[35 seconds]
06:13:23.767 TRACE [Recovering] Stashing Replica: index=[9078], term=[8], entityId=[Some(0000000966)], eventType=[***]
06:13:23.767 TRACE [Recovering] Received ApplySnapshot: index=[Some(2001)], entityId=[Some(0000000966)], stateType=[Some(***)]
06:13:23.767 TRACE [Recovering] Sending FetchEntityEvents: entityId=[0000000966], fromIndex=[2002], toIndex=[9077], replyTo=[Actor[akka://***/system/sharding/raft-shard-***-replica-group-1/1/1/0000000966/$$a-adapter#-394328591]], to=[Actor[akka://***/system/sharding/raft-shard-***-replica-group-1/1/1#-1477116895]]
06:13:23.768 TRACE [Recovering] Received RecoveryState: snapshot.index=[Some(2001)], snapshot.entityId=[Some(0000000966)], snapshot.stateType=[Some(***)], events.size=[0], events.head=[None], events.last=[None]
06:13:23.768 TRACE [Recovering] Recovering with initial state: index=[2001], stateType=[***]
06:13:23.768 TRACE [Recovering] Recovered with state: index=[2001], stateType=[***]
06:13:23.768 TRACE [Ready] Received Replica: index=[9078], term=[8], entityId=[Some(0000000966)], eventType=[***]
06:13:23.768 ERROR [EntityId=0000000966] Violated event number constraint: eventNo=[6] should be equal to state.nextEventNo: [3]; event=[***]
```
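For reference, the ERROR above comes from an entity-side consistency check of the kind described in the implementation guide linked above. The following is a minimal, hypothetical Scala sketch: the eventNo / nextEventNo names are taken from the log message, while the state and event types are invented for illustration and are not the real entity implementation.

```scala
object EventNumberCheck {
  final case class EntityState(nextEventNo: Long)
  final case class EntityEvent(eventNo: Long, payload: String)

  // Apply a replicated event, verifying the event number is the one the state expects.
  def applyEvent(state: EntityState, event: EntityEvent): EntityState = {
    // A gap (e.g. eventNo=6 while nextEventNo=3, as in the ERROR above) means some
    // replicated events are missing from the entity's recovered state.
    require(
      event.eventNo == state.nextEventNo,
      s"Violated event number constraint: eventNo=[${event.eventNo}] " +
        s"should be equal to state.nextEventNo: [${state.nextEventNo}]"
    )
    state.copy(nextEventNo = state.nextEventNo + 1)
  }
}
```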
The above log also shows that the entity recovered from a snapshot (index=2001) with no subsequent events.
However, RaftActor had replicated the following events for that entity:
```
05:42:36.278 akka://***/system/sharding/raft-shard-***-replica-group-1/1/1/0000000966 [WaitForReplication] Received ReplicationSucceeded: index=[1464], instanceId=[Some(72848)], eventType=[***]
05:42:36.831 akka://***/system/sharding/raft-shard-***-replica-group-1/1/1/0000000966 [WaitForReplication] Received ReplicationSucceeded: index=[1470], instanceId=[Some(72949)], eventType=[***]
05:42:36.982 akka://***/system/sharding/raft-shard-***-replica-group-1/1/1/0000000966 [WaitForReplication] Received ReplicationSucceeded: index=[1471], instanceId=[Some(72979)], eventType=[***]
05:57:59.938 akka://***/system/sharding/raft-shard-***-replica-group-2/1/1/0000000966 [WaitForReplication] Received ReplicationSucceeded: index=[5298], instanceId=[Some(279516)], eventType=[***]
05:58:00.503 akka://***/system/sharding/raft-shard-***-replica-group-2/1/1/0000000966 [WaitForReplication] Received ReplicationSucceeded: index=[5304], instanceId=[Some(279621)], eventType=[***]
05:58:00.644 akka://***/system/sharding/raft-shard-***-replica-group-2/1/1/0000000966 [WaitForReplication] Received ReplicationSucceeded: index=[5305], instanceId=[Some(279646)], eventType=[***]
06:13:23.614 akka://***/system/sharding/raft-shard-***-replica-group-2/1/1/0000000966 [WaitForReplication] Received ReplicationSucceeded: index=[9078], instanceId=[Some(481219)], eventType=[***]
06:13:24.173 akka://***/system/sharding/raft-shard-***-replica-group-2/1/1/0000000966 [WaitForReplication] Received ReplicationSucceeded: index=[9083], instanceId=[Some(481322)], eventType=[***]
06:13:24.324 akka://***/system/sharding/raft-shard-***-replica-group-2/1/1/0000000966 [WaitForReplication] Received ReplicationSucceeded: index=[9085], instanceId=[Some(481351)], eventType=[***]
```
Before that happened, SnapshotSyncManager (running in replica-group-1) executed snapshot synchronizations as below:
```
05:54:25.283 INFO Recovery completed: state=[SyncProgress(NoOffset)]
05:54:25.283 INFO Snapshot synchronization started: (typeName: ***, memberIndex: replica-group-2, snapshotLastLogTerm: 8, snapshotLastLogIndex: 3875) -> (typeName: ***, memberIndex: replica-group-1, snapshotLastLogTerm: 2, snapshotLastLogIndex: 1013)
05:57:04.684 INFO Stopping itself.
05:57:04.847 INFO Recovery completed: state=[SyncProgress(NoOffset)]
05:57:04.847 INFO Snapshot synchronization started: (typeName: ***, memberIndex: replica-group-2, snapshotLastLogTerm: 8, snapshotLastLogIndex: 4802) -> (typeName: ***, memberIndex: replica-group-1, snapshotLastLogTerm: 2, snapshotLastLogIndex: 1013)
05:59:55.112 INFO Snapshot synchronization completed: (typeName: ***, memberIndex: replica-group-2) -> (typeName: ***, memberIndex: replica-group-1, snapshotLastLogTerm: 2, snapshotLastLogIndex: 1013)
05:59:55.113 INFO Snapshot synchronization started: (typeName: ***, memberIndex: replica-group-2, snapshotLastLogTerm: 8, snapshotLastLogIndex: 5741) -> (typeName: ***, memberIndex: replica-group-1, snapshotLastLogTerm: 2, snapshotLastLogIndex: 1013)
05:59:55.114 INFO Succeeded to saveSnapshot given metadata [SnapshotMetadata(SnapshotSyncManager:***:replica-group-2:replica-group-1:1, 3, 1697435995112, None)]
06:00:05.212 INFO Snapshot synchronization completed: (typeName: ***, memberIndex: replica-group-2) -> (typeName: ***, memberIndex: replica-group-1, snapshotLastLogTerm: 2, snapshotLastLogIndex: 1013)
06:00:05.213 INFO Succeeded to saveSnapshot given metadata [SnapshotMetadata(SnapshotSyncManager:***:replica-group-2:replica-group-1:1, 5, 1697436005212, None)]
06:00:05.213 INFO Stopping itself.
```
A SnapshotStore (for the entity, running in replica-group-2) wrote snapshots as below:
However, a SnapshotStore (for the entity, running in replica-group-1) didn't write the snapshot (index=5741), as below:
The log below, which was newly added for this diagnosis, shows the EntitySnapshotsUpdated events (i.e., CompactionCompleted) to which SnapshotSyncManager subscribed:
Also, the log below, which was also newly added for this diagnosis, shows the entity snapshots that SnapshotSyncManager copied:
The above logs suggest that SnapshotSyncManager read an old entity snapshot from a source SnapshotStore.
Note that RaftActor (running in replica-group-2) executed Raft log compactions as below:
```
05:40:50.718 INFO [Follower] compaction started (logEntryIndex: 1030, number of entities: 339)
05:40:51.081 INFO [Follower] compaction completed (term: Term(2), logEntryIndex: 1030)
05:44:29.638 INFO [Leader] compaction started (logEntryIndex: 2001, number of entities: 299)
05:44:30.026 INFO [Leader] compaction completed (term: Term(8), logEntryIndex: 2001)
05:48:27.318 INFO [Leader] compaction started (logEntryIndex: 2953, number of entities: 336)
05:48:28.103 INFO [Leader] compaction completed (term: Term(8), logEntryIndex: 2953)
05:52:20.308 INFO [Leader] compaction started (logEntryIndex: 3875, number of entities: 306)
05:52:20.728 INFO [Leader] compaction completed (term: Term(8), logEntryIndex: 3875)
05:55:59.609 INFO [Leader] compaction started (logEntryIndex: 4802, number of entities: 309)
05:56:00.031 INFO [Leader] compaction completed (term: Term(8), logEntryIndex: 4802)
05:59:48.508 INFO [Leader] compaction started (logEntryIndex: 5741, number of entities: 314)
05:59:48.878 INFO [Leader] compaction completed (term: Term(8), logEntryIndex: 5741)
06:03:38.708 INFO [Leader] compaction started (logEntryIndex: 6691, number of entities: 317)
06:03:39.058 INFO [Leader] compaction completed (term: Term(8), logEntryIndex: 6691)
06:07:34.478 INFO [Leader] compaction started (logEntryIndex: 7621, number of entities: 311)
06:07:34.848 INFO [Leader] compaction completed (term: Term(8), logEntryIndex: 7621)
06:11:10.428 INFO [Leader] compaction started (logEntryIndex: 8522, number of entities: 300)
06:11:10.829 INFO [Leader] compaction completed (term: Term(8), logEntryIndex: 8522)
```
Scenario
The log diagnosis suggests that the following scenario happened. Suppose this sequence:
1. SnapshotSyncManager in replica-group-1 is synchronizing entity snapshots from replica-group-2.
2. During the synchronization, RaftActor in replica-group-2 executes a Raft log compaction.
3. After SnapshotSyncManager completes the first synchronization, it starts a new snapshot synchronization.
4. If a new SnapshotSyncManager instance (different from the instance that ran the first synchronization) handles the new synchronization, it succeeds. However, if the same SnapshotSyncManager instance runs both synchronizations, the new one sometimes cannot synchronize the new entity snapshots.
The following sequence diagram illustrates the details of this scenario:
```mermaid
sequenceDiagram
autonumber
%% Actors
participant RaftActor1 as RaftActor<br/>(replica-group-1)
participant SnapshotSyncManager as SnapshotSyncManager<br/>(replica-group-1)
participant ShardSnapshotStore1 as ShardSnapshotStore<br/>(replica-group-1)
participant SnapshotStore1 as SnapshotStore<br/>(replica-group-1, EntityId=1)
participant ShardSnapshotStore2ForSync as ShardSnapshotStore<br/>(replica-group-2)
participant SnapshotStore2ForSync as SnapshotStore<br/>(replica-group-2, EntityId=1)
participant PersistentDataStore as Persistent Data Store<br/>(i.e. Cassandra)
participant RaftActor2 as RaftActor<br/>(replica-group-2)
participant ShardSnapshotStore2 as ShardSnapshotStore<br/>(replica-group-2)
participant SnapshotStore2 as SnapshotStore<br/>(replica-group-2, EntityId=1)
%% Snapshot Synchronization
RaftActor1 --> RaftActor1: Attempt to start snapshot synchronization (source=replica-group-2)
activate RaftActor1
RaftActor1 --> SnapshotSyncManager: Spawn SnapshotSyncManager actor
activate SnapshotSyncManager
RaftActor1 ->> SnapshotSyncManager: Send SyncSnapshot
SnapshotSyncManager --> PersistentDataStore: Recovery from persisted data
SnapshotSyncManager --> SnapshotSyncManager: Start snapshot synchronization
activate SnapshotSyncManager
SnapshotSyncManager --> ShardSnapshotStore2ForSync: Spawn ShardSnapshotStore
SnapshotSyncManager ->> ShardSnapshotStore2ForSync: Send FetchSnapshot
ShardSnapshotStore2ForSync --> SnapshotStore2ForSync: Spawn SnapshotStore
ShardSnapshotStore2ForSync ->> SnapshotStore2ForSync: Forward FetchSnapshot
SnapshotStore2ForSync --> PersistentDataStore: Recovery from persisted data
SnapshotStore2ForSync ->> SnapshotSyncManager: Reply SnapshotFound
par Reading EntitySnapshot from the persistent data store
SnapshotSyncManager ->> ShardSnapshotStore1: Send SaveSnapshot
ShardSnapshotStore1 --> SnapshotStore1: Spawn SnapshotStore
SnapshotStore1 --> PersistentDataStore: Recovery from persisted data
ShardSnapshotStore1 ->> SnapshotStore1: Forward SaveSnapshot
SnapshotStore1 -->> PersistentDataStore: Persist EntitySnapshot
SnapshotStore1 ->> SnapshotSyncManager: Reply SaveSnapshotSuccess
SnapshotSyncManager -->> PersistentDataStore: Persist SyncCompleted
SnapshotSyncManager ->> RaftActor1: SyncSnapshotSucceeded
deactivate SnapshotSyncManager
deactivate RaftActor1
and Writing EntitySnapshot to the persistent data store
%% Compaction
RaftActor2 --> RaftActor2: Start compaction
activate RaftActor2
RaftActor2 ->> ShardSnapshotStore2: Send SaveSnapshot
ShardSnapshotStore2 ->> SnapshotStore2: Forward SaveSnapshot
SnapshotStore2 --> PersistentDataStore: Recovery from persisted data
SnapshotStore2 -->> PersistentDataStore: Persist EntitySnapshot
Note over SnapshotStore2, PersistentDataStore: Another SnapshotStore (replica-group-2, EntityId=1) doesn't read this persisted EntitySnapshot<br/>if it has already been recovered.
SnapshotStore2 ->> RaftActor2: Reply SaveSnapshotSuccess
RaftActor2 -->> PersistentDataStore: Persist CompactionCompleted
deactivate RaftActor2
RaftActor2 -->> RaftActor1: Send heartbeats, which tell RaftActor1 that it must start a new snapshot synchronization.
%% Snapshot Synchronization
RaftActor1 --> RaftActor1: Attempt to start snapshot synchronization (source=replica-group-2)
activate RaftActor1
RaftActor1 ->> SnapshotSyncManager: SyncSnapshot
end
SnapshotSyncManager --> SnapshotSyncManager: Start snapshot synchronization
activate SnapshotSyncManager
SnapshotSyncManager ->> ShardSnapshotStore2ForSync: Send FetchSnapshot
ShardSnapshotStore2ForSync ->> SnapshotStore2ForSync: Forward FetchSnapshot
Note over SnapshotStore2ForSync: SnapshotStore (replica-group-2, EntityId=1) returns the cached EntitySnapshot.<br/>It doesn't return data that another SnapshotStore instance wrote after this cache was populated.
SnapshotStore2ForSync ->> SnapshotSyncManager: Reply SnapshotFound
SnapshotSyncManager ->> ShardSnapshotStore1: Send SaveSnapshot
ShardSnapshotStore1 ->> SnapshotStore1: Forward SaveSnapshot
SnapshotStore1 -->> SnapshotStore1: Skip the EntitySnapshot save if it has already saved the snapshot.
SnapshotStore1 ->> SnapshotSyncManager: Reply SaveSnapshotSuccess
SnapshotSyncManager -->> PersistentDataStore: Persist SyncCompleted
SnapshotSyncManager ->> RaftActor1: SyncSnapshotSucceeded
deactivate SnapshotSyncManager
deactivate SnapshotSyncManager
deactivate RaftActor1
```
The following figure shows the hierarchy of these actors. While the SnapshotSyncManager in replica-group-1 spawns a source SnapshotStore as its child for fetching an entity snapshot, RaftActor in replica-group-2 spawns another SnapshotStore with the same persistence ID for compaction.
A SnapshotStore caches an entity snapshot in memory when it replays events from a persistent store. If a SnapshotSyncManager reuses the same SnapshotStore instance, it will read the cached entity snapshot,
which could be old, and then write the old one to a destination SnapshotStore.
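The essence of that failure mode can be sketched as a store that fills an in-memory cache once at recovery and serves all later fetches from it. This is a simplified illustration under assumed names (EntitySnapshot, SnapshotJournal, CachedSnapshotStore), not the library's actual SnapshotStore implementation:

```scala
object StaleCacheSketch {
  final case class EntitySnapshot(logEntryIndex: Long, state: String)

  // Abstraction over the persistent data store (e.g. Cassandra) for this sketch.
  trait SnapshotJournal {
    def loadLatest(persistenceId: String): Option[EntitySnapshot]
    def write(persistenceId: String, snapshot: EntitySnapshot): Unit
  }

  // Two instances of this class may share the same persistence ID (one spawned by
  // SnapshotSyncManager, one by RaftActor), but each keeps its own in-memory copy.
  final class CachedSnapshotStore(persistenceId: String, journal: SnapshotJournal) {
    // Populated once at "recovery" time and never refreshed afterwards.
    private var cached: Option[EntitySnapshot] = journal.loadLatest(persistenceId)

    // Serves fetches from the in-memory copy: newer snapshots written by the other
    // instance after this one recovered are never observed.
    def fetchSnapshot(): Option[EntitySnapshot] = cached

    def saveSnapshot(snapshot: EntitySnapshot): Unit = {
      journal.write(persistenceId, snapshot)
      cached = Some(snapshot) // only this instance's cache sees the update
    }
  }
}
```

If the instance used for synchronization recovers before the other instance (spawned for compaction) writes a newer snapshot, fetchSnapshot() keeps returning the recovery-time value, which matches the logs above, where the entity ended up recovering from the old index=2001 snapshot.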
Considerations
Possible solutions might be the following:
1. SnapshotSyncManager doesn't reuse any SnapshotStore instance: it starts SnapshotStore instances when a snapshot synchronization starts and stops them when the synchronization completes (a rough sketch of this option follows the list).
2. SnapshotSyncManager handles only one snapshot synchronization; any new snapshot synchronization is handled by a new SnapshotSyncManager instance.
3. SnapshotSyncManager doesn't spawn any SnapshotStore as its child. Instead, it interacts with the stores running on the node in the source replica group, such that only one SnapshotStore with a given persistence ID runs in the cluster at a time.
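As a rough sketch of the first option, under assumed message and actor names (StartSync, SyncCompleted, sourceSnapshotStore; not the library's real protocol), the manager could spawn a fresh source store per synchronization and stop it when the run completes, so a later run can never observe a cache populated by an earlier one:

```scala
import akka.actor.typed.{ActorRef, Behavior}
import akka.actor.typed.scaladsl.Behaviors

// Hypothetical sketch of option 1; the actual SnapshotSyncManager protocol differs.
object PerSyncSnapshotStoreSketch {
  sealed trait Command
  final case class StartSync(replyTo: ActorRef[SyncCompleted]) extends Command
  final case class SyncCompleted(succeeded: Boolean)

  // Stand-in for the behavior that reads entity snapshots from the persistent store.
  def sourceSnapshotStore(): Behavior[Nothing] = Behaviors.empty

  def apply(): Behavior[Command] =
    Behaviors.receive[Command] { (context, message) =>
      message match {
        case StartSync(replyTo) =>
          // Spawn a fresh source store per synchronization: its in-memory state is
          // rebuilt from the persistent data store, so it cannot serve a cache that
          // was populated during an earlier synchronization.
          val source = context.spawnAnonymous[Nothing](sourceSnapshotStore())
          // ... copy entity snapshots from `source` to the destination store here ...
          context.stop(source) // discard the instance (and its cache) with this run
          replyTo ! SyncCompleted(succeeded = true)
          Behaviors.same
      }
    }
}
```

A similar effect could also come from the second option: stopping the SnapshotSyncManager itself after each synchronization means a fresh instance, and therefore fresh children, for the next one.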