Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Starting a follower member later than leader completes a compaction may break ReplicatedLog of the follower #105

Closed
negokaz opened this issue Nov 29, 2021 · 0 comments · Fixed by #110
Labels
bug Something isn't working
Milestone

Comments

@negokaz
Copy link
Contributor

negokaz commented Nov 29, 2021

version

2.0.0

Details

The following log is taken from CI.

2021-11-29T01:40:07.6088032Z [JVM-4] === Transition: Recovering -> Follower ===
2021-11-29T01:40:07.6089055Z [JVM-4] === [Recovering] election-timeout after 828 ms ===
2021-11-29T01:40:07.6243011Z [JVM-3] === Transition: Recovering -> Follower ===
2021-11-29T01:40:07.6245596Z [JVM-3] === [Recovering] election-timeout after 1451 ms ===
2021-11-29T01:40:08.4506281Z [JVM-4] === [Follower] election timeout ===
2021-11-29T01:40:08.4509293Z [JVM-4] === [Follower] broadcast RequestVote(NormalizedShardId(118),Term(1),member-3,0,Term(0)) ===
2021-11-29T01:40:08.4568858Z [JVM-3] === [Follower] accept RequestVote(NormalizedShardId(118),Term(1),member-3,0,Term(0)) ===
2021-11-29T01:40:08.4642075Z [JVM-3] === [Follower] persisting time: 9 ms ===
2021-11-29T01:40:08.4645587Z [JVM-3] === [Follower] election-timeout after 1287 ms ===
2021-11-29T01:40:08.4705631Z [JVM-4] === [Follower] persisting time: 19 ms ===
2021-11-29T01:40:08.4706871Z [JVM-4] === Transition: Follower -> Candidate ===
2021-11-29T01:40:08.4708044Z [JVM-4] === [Follower] election-timeout after 1015 ms ===
2021-11-29T01:40:08.4709174Z [JVM-4] === [Candidate] accept self RequestVote ===
2021-11-29T01:40:08.4795805Z [JVM-4] === [Candidate] persisting time: 8 ms ===
2021-11-29T01:40:08.4797535Z [JVM-4] === [Candidate] accept for member-2 ===
2021-11-29T01:40:08.4798767Z [JVM-4] === [Candidate] election-timeout after 754 ms ===
2021-11-29T01:40:08.4799938Z [JVM-4] === [Candidate] accept for member-3 ===
2021-11-29T01:40:08.4801615Z [JVM-4] === Transition: Candidate -> Leader ===
2021-11-29T01:40:08.4802734Z [JVM-4] === [Leader] Heartbeat after 100 ms ===
2021-11-29T01:40:08.4804215Z [JVM-4] [Leader] New leader was elected (term: Term(1), lastLogTerm: Term(0), lastLogIndex: 0)
2021-11-29T01:40:08.4857115Z [JVM-4] === [Leader] persisting time: 5 ms ===
2021-11-29T01:40:08.4858429Z [JVM-4] === [Leader] Heartbeat after 100 ms ===
2021-11-29T01:40:08.4860135Z [JVM-4] === [Leader] publish AppendEntries(NormalizedShardId(118),Term(1),member-3,0,Term(0),List(LogEntry(1, EntityEvent(None,NoOp), Term(1))),0) to member-1 ===
2021-11-29T01:40:08.4862454Z [JVM-4] === [Leader] publish AppendEntries(NormalizedShardId(118),Term(1),member-3,0,Term(0),List(LogEntry(1, EntityEvent(None,NoOp), Term(1))),0) to member-2 ===
2021-11-29T01:40:08.4882041Z [JVM-3] === [Follower] append AppendEntries(NormalizedShardId(118),Term(1),member-3,0,Term(0),Vector(LogEntry(1, EntityEvent(None,NoOp), Term(1))),0) ===
2021-11-29T01:40:08.4965696Z [JVM-3] === [Follower] persisting time: 8 ms ===
2021-11-29T01:40:08.4970818Z [JVM-3] === Transition: Follower -> Follower ===
2021-11-29T01:40:08.4973102Z [JVM-3] === [Follower] election-timeout after 1018 ms ===
2021-11-29T01:40:08.5004947Z [JVM-4] === [Leader] committed LogEntry(1, EntityEvent(None,NoOp), Term(1)) and will notify it to ClientContext(Actor[akka://ReplicationRegionSpec/system/sharding/raft-shard-typeName-6-member-3/118/118#121746048],None,None) ===
2021-11-29T01:40:08.5007878Z [JVM-4] === [Leader] broadcast TryCreateEntity(NormalizedShardId(118),NormalizedEntityId(replication-7)) ===
2021-11-29T01:40:08.5010007Z [JVM-4] === [Leader] created an entity (NormalizedEntityId(replication-7)) ===
2021-11-29T01:40:08.5012667Z [JVM-4] === [Leader] broadcast TryCreateEntity(NormalizedShardId(118),NormalizedEntityId(replication-7)) ===
2021-11-29T01:40:08.5014685Z [JVM-4] === [Leader] broadcast TryCreateEntity(NormalizedShardId(118),NormalizedEntityId(replication-7)) ===
2021-11-29T01:40:08.5016988Z [JVM-4] === [Leader] broadcast TryCreateEntity(NormalizedShardId(118),NormalizedEntityId(replication-7)) ===
2021-11-29T01:40:08.5020041Z [JVM-4] === akka://ReplicationRegionSpec/system/sharding/raft-shard-typeName-6-member-3/118/118/replication-7 started ===
2021-11-29T01:40:08.5170432Z [JVM-3] === [Follower] created an entity (NormalizedEntityId(replication-7)) ===
2021-11-29T01:40:08.5172747Z [JVM-3] === akka://ReplicationRegionSpec/system/sharding/raft-shard-typeName-6-member-2/118/118/replication-7 started ===
2021-11-29T01:40:08.5269739Z [JVM-4] === [Leader] persisting time: 18 ms ===
2021-11-29T01:40:08.5271286Z [JVM-4] === [Leader] Heartbeat after 100 ms ===
2021-11-29T01:40:08.5273739Z [JVM-4] === [Leader] publish AppendEntries(NormalizedShardId(118),Term(1),member-3,0,Term(0),List(LogEntry(1, EntityEvent(None,NoOp), Term(1)), LogEntry(2, EntityEvent(Some(NormalizedEntityId(replication-7)),received), Term(1))),1) to member-1 ===
2021-11-29T01:40:08.5277363Z [JVM-4] === [Leader] publish AppendEntries(NormalizedShardId(118),Term(1),member-3,1,Term(1),List(LogEntry(2, EntityEvent(Some(NormalizedEntityId(replication-7)),received), Term(1))),1) to member-2 ===
2021-11-29T01:40:08.5350282Z [JVM-3] === [Follower] append AppendEntries(NormalizedShardId(118),Term(1),member-3,1,Term(1),Vector(LogEntry(2, EntityEvent(Some(NormalizedEntityId(replication-7)),received), Term(1))),1) ===
2021-11-29T01:40:08.5415919Z [JVM-3] === [Follower] persisting time: 11 ms ===
2021-11-29T01:40:08.5417542Z [JVM-3] === Transition: Follower -> Follower ===
2021-11-29T01:40:08.5420013Z [JVM-3] === [Follower] election-timeout after 1047 ms ===
2021-11-29T01:40:08.5476597Z [JVM-4] === [Leader] committed LogEntry(2, EntityEvent(Some(NormalizedEntityId(replication-7)),received), Term(1)) and will notify it to ClientContext(Actor[akka://ReplicationRegionSpec/system/sharding/raft-shard-typeName-6-member-3/118/118/replication-7#-1166201394],Some(EntityInstanceId(4)),Some(Actor[akka://ReplicationRegionSpec/system/testActor-2#-562953879])) ===
2021-11-29T01:40:08.5480275Z [JVM-4] updateState
2021-11-29T01:40:08.5551610Z [JVM-4] === [Leader] persisting time: 6 ms ===
2021-11-29T01:40:08.5557949Z [JVM-4] === [Leader] Heartbeat after 100 ms ===
2021-11-29T01:40:08.5569675Z [JVM-4] === [Leader] publish AppendEntries(NormalizedShardId(118),Term(1),member-3,0,Term(0),List(LogEntry(1, EntityEvent(None,NoOp), Term(1)), LogEntry(2, EntityEvent(Some(NormalizedEntityId(replication-7)),received), Term(1)), LogEntry(3, EntityEvent(Some(NormalizedEntityId(replication-7)),received), Term(1))),2) to member-1 ===
2021-11-29T01:40:08.5572835Z [JVM-4] === [Leader] publish AppendEntries(NormalizedShardId(118),Term(1),member-3,2,Term(1),List(LogEntry(3, EntityEvent(Some(NormalizedEntityId(replication-7)),received), Term(1))),2) to member-2 ===
2021-11-29T01:40:08.5583768Z [JVM-3] === [Follower] append AppendEntries(NormalizedShardId(118),Term(1),member-3,2,Term(1),Vector(LogEntry(3, EntityEvent(Some(NormalizedEntityId(replication-7)),received), Term(1))),2) ===
2021-11-29T01:40:08.5632158Z [JVM-3] === [Follower] persisting time: 4 ms ===
2021-11-29T01:40:08.5643330Z [JVM-3] === [Follower] applying received to ReplicationActor ===
2021-11-29T01:40:08.5644707Z [JVM-3] === Transition: Follower -> Follower ===
2021-11-29T01:40:08.5646529Z [JVM-3] === [Follower] election-timeout after 1119 ms ===
2021-11-29T01:40:08.5647793Z [JVM-3] updateState
2021-11-29T01:40:08.5658162Z [JVM-4] === [Leader] committed LogEntry(3, EntityEvent(Some(NormalizedEntityId(replication-7)),received), Term(1)) and will notify it to ClientContext(Actor[akka://ReplicationRegionSpec/system/sharding/raft-shard-typeName-6-member-3/118/118/replication-7#-1166201394],Some(EntityInstanceId(4)),Some(Actor[akka://ReplicationRegionSpec/system/testActor-2#-562953879])) ===
2021-11-29T01:40:08.5661126Z [JVM-4] updateState
2021-11-29T01:40:08.5719364Z [JVM-4] === [Leader] persisting time: 4 ms ===
2021-11-29T01:40:08.5720831Z [JVM-4] === [Leader] Heartbeat after 100 ms ===
2021-11-29T01:40:08.5724012Z [JVM-4] === [Leader] publish AppendEntries(NormalizedShardId(118),Term(1),member-3,0,Term(0),List(LogEntry(1, EntityEvent(None,NoOp), Term(1)), LogEntry(2, EntityEvent(Some(NormalizedEntityId(replication-7)),received), Term(1)), LogEntry(3, EntityEvent(Some(NormalizedEntityId(replication-7)),received), Term(1)), LogEntry(4, EntityEvent(Some(NormalizedEntityId(replication-7)),received), Term(1))),3) to member-1 ===
2021-11-29T01:40:08.5728722Z [JVM-4] === [Leader] publish AppendEntries(NormalizedShardId(118),Term(1),member-3,3,Term(1),List(LogEntry(4, EntityEvent(Some(NormalizedEntityId(replication-7)),received), Term(1))),3) to member-2 ===
2021-11-29T01:40:08.5742753Z [JVM-3] === [Follower] append AppendEntries(NormalizedShardId(118),Term(1),member-3,3,Term(1),Vector(LogEntry(4, EntityEvent(Some(NormalizedEntityId(replication-7)),received), Term(1))),3) ===
2021-11-29T01:40:08.5796262Z [JVM-3] === [Follower] persisting time: 4 ms ===
2021-11-29T01:40:08.5797722Z [JVM-3] === [Follower] applying received to ReplicationActor ===
2021-11-29T01:40:08.5799194Z [JVM-3] === Transition: Follower -> Follower ===
2021-11-29T01:40:08.5800381Z [JVM-3] === [Follower] election-timeout after 1069 ms ===
2021-11-29T01:40:08.5802030Z [JVM-3] updateState
2021-11-29T01:40:08.5814218Z [JVM-4] === [Leader] committed LogEntry(4, EntityEvent(Some(NormalizedEntityId(replication-7)),received), Term(1)) and will notify it to ClientContext(Actor[akka://ReplicationRegionSpec/system/sharding/raft-shard-typeName-6-member-3/118/118/replication-7#-1166201394],Some(EntityInstanceId(4)),Some(Actor[akka://ReplicationRegionSpec/system/testActor-2#-562953879])) ===
2021-11-29T01:40:08.5816804Z [JVM-4] updateState
2021-11-29T01:40:08.5817754Z [JVM-4] receive: GetStatusWithEnsuringConsistency: 3
2021-11-29T01:40:08.5868483Z [JVM-4] === [Leader] persisting time: 4 ms ===
2021-11-29T01:40:08.5869399Z [JVM-4] === [Leader] Heartbeat after 100 ms ===
2021-11-29T01:40:08.5872854Z [JVM-4] === [Leader] publish AppendEntries(NormalizedShardId(118),Term(1),member-3,0,Term(0),List(LogEntry(1, EntityEvent(None,NoOp), Term(1)), LogEntry(2, EntityEvent(Some(NormalizedEntityId(replication-7)),received), Term(1)), LogEntry(3, EntityEvent(Some(NormalizedEntityId(replication-7)),received), Term(1)), LogEntry(4, EntityEvent(Some(NormalizedEntityId(replication-7)),received), Term(1)), LogEntry(5, EntityEvent(Some(NormalizedEntityId(replication-7)),NoOp), Term(1))),4) to member-1 ===
2021-11-29T01:40:08.5884437Z [JVM-4] === [Leader] publish AppendEntries(NormalizedShardId(118),Term(1),member-3,4,Term(1),List(LogEntry(5, EntityEvent(Some(NormalizedEntityId(replication-7)),NoOp), Term(1))),4) to member-2 ===
2021-11-29T01:40:08.5887307Z [JVM-3] === [Follower] append AppendEntries(NormalizedShardId(118),Term(1),member-3,4,Term(1),Vector(LogEntry(5, EntityEvent(Some(NormalizedEntityId(replication-7)),NoOp), Term(1))),4) ===
2021-11-29T01:40:08.5945533Z [JVM-3] === [Follower] persisting time: 4 ms ===
2021-11-29T01:40:08.5946553Z [JVM-3] === [Follower] applying received to ReplicationActor ===
2021-11-29T01:40:08.5947529Z [JVM-3] === Transition: Follower -> Follower ===
2021-11-29T01:40:08.5949958Z [JVM-3] === [Follower] election-timeout after 1072 ms ===
2021-11-29T01:40:08.5950938Z [JVM-3] updateState
2021-11-29T01:40:08.6016936Z [JVM-4] === [Leader] committed LogEntry(5, EntityEvent(Some(NormalizedEntityId(replication-7)),NoOp), Term(1)) and will notify it to ClientContext(Actor[akka://ReplicationRegionSpec/system/sharding/raft-shard-typeName-6-member-3/118/118/replication-7#-1166201394],Some(EntityInstanceId(4)),Some(Actor[akka://ReplicationRegionSpec/system/testActor-2#-562953879])) ===
2021-11-29T01:40:08.6097062Z [JVM-4] [Leader] compaction started (logEntryIndex: 5, number of entities: 1)
2021-11-29T01:40:08.6167771Z [JVM-3] [Follower] compaction started (logEntryIndex: 4, number of entities: 1)
2021-11-29T01:40:08.6236212Z [JVM-4] === [Leader] persisting time: 8 ms ===
2021-11-29T01:40:08.6237435Z [JVM-4] [Leader] compaction completed (term: Term(1), logEntryIndex: 5)
2021-11-29T01:40:08.6352304Z [JVM-3] === [Follower] persisting time: 13 ms ===
2021-11-29T01:40:08.6353486Z [JVM-3] [Follower] compaction completed (term: Term(1), logEntryIndex: 4)
2021-11-29T01:40:08.6999164Z [JVM-4] === [Leader] Heartbeat after 100 ms ===
2021-11-29T01:40:08.7000884Z [JVM-4] === [Leader] publish AppendEntries(NormalizedShardId(118),Term(1),member-3,0,Term(0),List(LogEntry(5, EntityEvent(Some(NormalizedEntityId(replication-7)),NoOp), Term(1))),5) to member-1 ===
2021-11-29T01:40:08.7002654Z [JVM-4] === [Leader] publish AppendEntries(NormalizedShardId(118),Term(1),member-3,5,Term(1),List(),5) to member-2 ===
2021-11-29T01:40:08.7037435Z [JVM-3] === [Follower] append AppendEntries(NormalizedShardId(118),Term(1),member-3,5,Term(1),Vector(),5) ===
2021-11-29T01:40:08.7046710Z [JVM-3] === Transition: Follower -> Follower ===
2021-11-29T01:40:08.7047957Z [JVM-3] === [Follower] election-timeout after 944 ms ===
2021-11-29T01:40:08.8206990Z [JVM-4] === [Leader] Heartbeat after 100 ms ===
2021-11-29T01:40:08.8209555Z [JVM-4] === [Leader] publish AppendEntries(NormalizedShardId(118),Term(1),member-3,0,Term(0),List(LogEntry(5, EntityEvent(Some(NormalizedEntityId(replication-7)),NoOp), Term(1))),5) to member-1 ===
2021-11-29T01:40:08.8211191Z [JVM-4] === [Leader] publish AppendEntries(NormalizedShardId(118),Term(1),member-3,5,Term(1),List(),5) to member-2 ===
2021-11-29T01:40:08.8226811Z [JVM-3] === [Follower] append AppendEntries(NormalizedShardId(118),Term(1),member-3,5,Term(1),Vector(),5) ===
2021-11-29T01:40:08.8239893Z [JVM-3] === Transition: Follower -> Follower ===
2021-11-29T01:40:08.8241240Z [JVM-3] === [Follower] election-timeout after 1317 ms ===
2021-11-29T01:40:08.9386334Z [JVM-4] === [Leader] Heartbeat after 100 ms ===
2021-11-29T01:40:08.9391154Z [JVM-4] === [Leader] publish AppendEntries(NormalizedShardId(118),Term(1),member-3,0,Term(0),List(LogEntry(5, EntityEvent(Some(NormalizedEntityId(replication-7)),NoOp), Term(1))),5) to member-1 ===
2021-11-29T01:40:08.9393248Z [JVM-4] === [Leader] publish AppendEntries(NormalizedShardId(118),Term(1),member-3,5,Term(1),List(),5) to member-2 ===
2021-11-29T01:40:08.9418004Z [JVM-3] === [Follower] append AppendEntries(NormalizedShardId(118),Term(1),member-3,5,Term(1),Vector(),5) ===
2021-11-29T01:40:08.9419669Z [JVM-3] === Transition: Follower -> Follower ===
2021-11-29T01:40:08.9420657Z [JVM-3] === [Follower] election-timeout after 1370 ms ===
2021-11-29T01:40:09.0595264Z [JVM-4] === [Leader] Heartbeat after 100 ms ===
2021-11-29T01:40:09.0600777Z [JVM-4] === [Leader] publish AppendEntries(NormalizedShardId(118),Term(1),member-3,0,Term(0),List(LogEntry(5, EntityEvent(Some(NormalizedEntityId(replication-7)),NoOp), Term(1))),5) to member-1 ===
2021-11-29T01:40:09.0604698Z [JVM-4] === [Leader] publish AppendEntries(NormalizedShardId(118),Term(1),member-3,5,Term(1),List(),5) to member-2 ===
2021-11-29T01:40:09.0618736Z [JVM-3] === [Follower] append AppendEntries(NormalizedShardId(118),Term(1),member-3,5,Term(1),Vector(),5) ===
2021-11-29T01:40:09.0627021Z [JVM-3] === Transition: Follower -> Follower ===
2021-11-29T01:40:09.0628435Z [JVM-3] === [Follower] election-timeout after 1002 ms ===
2021-11-29T01:40:09.1792066Z [JVM-4] === [Leader] Heartbeat after 100 ms ===
2021-11-29T01:40:09.1795222Z [JVM-4] === [Leader] publish AppendEntries(NormalizedShardId(118),Term(1),member-3,0,Term(0),List(LogEntry(5, EntityEvent(Some(NormalizedEntityId(replication-7)),NoOp), Term(1))),5) to member-1 ===
2021-11-29T01:40:09.1797560Z [JVM-4] === [Leader] publish AppendEntries(NormalizedShardId(118),Term(1),member-3,5,Term(1),List(),5) to member-2 ===
2021-11-29T01:40:09.1828432Z [JVM-3] === [Follower] append AppendEntries(NormalizedShardId(118),Term(1),member-3,5,Term(1),Vector(),5) ===
2021-11-29T01:40:09.1830055Z [JVM-3] === Transition: Follower -> Follower ===
2021-11-29T01:40:09.1831340Z [JVM-3] === [Follower] election-timeout after 1386 ms ===
2021-11-29T01:40:09.2998241Z [JVM-4] === [Leader] Heartbeat after 100 ms ===
2021-11-29T01:40:09.3027415Z [JVM-4] === [Leader] publish AppendEntries(NormalizedShardId(118),Term(1),member-3,0,Term(0),List(LogEntry(5, EntityEvent(Some(NormalizedEntityId(replication-7)),NoOp), Term(1))),5) to member-1 ===
2021-11-29T01:40:09.3029718Z [JVM-4] === [Leader] publish AppendEntries(NormalizedShardId(118),Term(1),member-3,5,Term(1),List(),5) to member-2 ===
2021-11-29T01:40:09.3031537Z [JVM-3] === [Follower] append AppendEntries(NormalizedShardId(118),Term(1),member-3,5,Term(1),Vector(),5) ===
2021-11-29T01:40:09.3033435Z [JVM-3] === Transition: Follower -> Follower ===
2021-11-29T01:40:09.3034572Z [JVM-3] === [Follower] election-timeout after 896 ms ===
2021-11-29T01:40:09.3611293Z [JVM-2] === Transition: Recovering -> Follower ===
2021-11-29T01:40:09.3612802Z [JVM-2] === [Recovering] election-timeout after 1089 ms ===
2021-11-29T01:40:09.3615673Z [JVM-2] === [Follower] accept RequestVote(NormalizedShardId(118),Term(1),member-3,0,Term(0)) ===
2021-11-29T01:40:09.3701289Z [JVM-2] === [Follower] persisting time: 9 ms ===
2021-11-29T01:40:09.3702623Z [JVM-2] === [Follower] election-timeout after 769 ms ===
2021-11-29T01:40:09.3704457Z [JVM-2] === [Follower] append AppendEntries(NormalizedShardId(118),Term(1),member-3,0,Term(0),Vector(LogEntry(1, EntityEvent(None,NoOp), Term(1))),0) ===
2021-11-29T01:40:09.3815983Z [JVM-2] === [Follower] persisting time: 10 ms ===
2021-11-29T01:40:09.3820109Z [JVM-2] === Transition: Follower -> Follower ===
2021-11-29T01:40:09.3821664Z [JVM-2] === [Follower] election-timeout after 765 ms ===
2021-11-29T01:40:09.3823112Z [JVM-2] === [Follower] created an entity (NormalizedEntityId(replication-7)) ===
2021-11-29T01:40:09.3827060Z [JVM-2] === [Follower] append AppendEntries(NormalizedShardId(118),Term(1),member-3,0,Term(0),Vector(LogEntry(1, EntityEvent(None,NoOp), Term(1)), LogEntry(2, EntityEvent(Some(NormalizedEntityId(replication-7)),received), Term(1))),1) ===
2021-11-29T01:40:09.3830936Z [JVM-2] === akka://ReplicationRegionSpec/system/sharding/raft-shard-typeName-6-member-1/118/118/replication-7 started ===
2021-11-29T01:40:09.3888895Z [JVM-2] === [Follower] persisting time: 6 ms ===
2021-11-29T01:40:09.3890569Z [JVM-2] === Transition: Follower -> Follower ===
2021-11-29T01:40:09.3891877Z [JVM-2] === [Follower] election-timeout after 1287 ms ===
2021-11-29T01:40:09.3895106Z [JVM-2] === [Follower] append AppendEntries(NormalizedShardId(118),Term(1),member-3,0,Term(0),Vector(LogEntry(1, EntityEvent(None,NoOp), Term(1)), LogEntry(2, EntityEvent(Some(NormalizedEntityId(replication-7)),received), Term(1)), LogEntry(3, EntityEvent(Some(NormalizedEntityId(replication-7)),received), Term(1))),2) ===
2021-11-29T01:40:09.3967778Z [JVM-2] === [Follower] persisting time: 5 ms ===
2021-11-29T01:40:09.3970336Z [JVM-2] === [Follower] applying received to ReplicationActor ===
2021-11-29T01:40:09.3972891Z [JVM-2] === Transition: Follower -> Follower ===
2021-11-29T01:40:09.3975181Z [JVM-2] === [Follower] election-timeout after 769 ms ===
2021-11-29T01:40:09.3978218Z [JVM-2] === [Follower] append AppendEntries(NormalizedShardId(118),Term(1),member-3,0,Term(0),Vector(LogEntry(1, EntityEvent(None,NoOp), Term(1)), LogEntry(2, EntityEvent(Some(NormalizedEntityId(replication-7)),received), Term(1)), LogEntry(3, EntityEvent(Some(NormalizedEntityId(replication-7)),received), Term(1)), LogEntry(4, EntityEvent(Some(NormalizedEntityId(replication-7)),received), Term(1))),3) ===
2021-11-29T01:40:09.4123649Z [JVM-2] === [Follower] persisting time: 15 ms ===
2021-11-29T01:40:09.4125446Z [JVM-2] === [Follower] applying received to ReplicationActor ===
2021-11-29T01:40:09.4131575Z [JVM-2] === Transition: Follower -> Follower ===
2021-11-29T01:40:09.4134414Z [JVM-2] === [Follower] election-timeout after 857 ms ===
2021-11-29T01:40:09.4140257Z [JVM-2] === [Follower] append AppendEntries(NormalizedShardId(118),Term(1),member-3,0,Term(0),Vector(LogEntry(1, EntityEvent(None,NoOp), Term(1)), LogEntry(2, EntityEvent(Some(NormalizedEntityId(replication-7)),received), Term(1)), LogEntry(3, EntityEvent(Some(NormalizedEntityId(replication-7)),received), Term(1)), LogEntry(4, EntityEvent(Some(NormalizedEntityId(replication-7)),received), Term(1)), LogEntry(5, EntityEvent(Some(NormalizedEntityId(replication-7)),NoOp), Term(1))),4) ===
2021-11-29T01:40:09.4191180Z [JVM-4] === [Leader] Heartbeat after 100 ms ===
2021-11-29T01:40:09.4193789Z [JVM-4] === [Leader] publish AppendEntries(NormalizedShardId(118),Term(1),member-3,4,Term(1),List(LogEntry(5, EntityEvent(Some(NormalizedEntityId(replication-7)),NoOp), Term(1))),5) to member-1 ===
2021-11-29T01:40:09.4196093Z [JVM-4] === [Leader] publish AppendEntries(NormalizedShardId(118),Term(1),member-3,5,Term(1),List(),5) to member-2 ===
2021-11-29T01:40:09.4206443Z [JVM-2] === [Follower] persisting time: 7 ms ===
2021-11-29T01:40:09.4228542Z [JVM-3] === [Follower] append AppendEntries(NormalizedShardId(118),Term(1),member-3,5,Term(1),Vector(),5) ===
2021-11-29T01:40:09.4230243Z [JVM-3] === Transition: Follower -> Follower ===
2021-11-29T01:40:09.4231521Z [JVM-3] === [Follower] election-timeout after 1073 ms ===
2021-11-29T01:40:09.4289174Z [JVM-2] === [Follower] applying received to ReplicationActor ===
2021-11-29T01:40:09.4299810Z [JVM-2] === Transition: Follower -> Follower ===
2021-11-29T01:40:09.4300772Z [JVM-2] === [Follower] election-timeout after 843 ms ===
2021-11-29T01:40:09.4303220Z [JVM-2] === [Follower] append AppendEntries(NormalizedShardId(118),Term(1),member-3,0,Term(0),Vector(LogEntry(5, EntityEvent(Some(NormalizedEntityId(replication-7)),NoOp), Term(1))),5) ===
2021-11-29T01:40:09.4435699Z [JVM-2] === [Follower] persisting time: 16 ms ===

The following AppendEntries should have prevLogIndex: 4 and prevLogTerm: 1 because entries contains a LogEntry which has logEntryIndex: 5, however it has prevLogIndex: 0 and prevLogTerm: 0.

2021-11-29T01:40:09.4672596Z [JVM-2] === [Follower] append AppendEntries(NormalizedShardId(118),Term(1),member-3,0,Term(0),Vector(LogEntry(5, EntityEvent(Some(NormalizedEntityId(replication-7)),NoOp), Term(1))),5) ===
  final case class AppendEntries(
      shardId: NormalizedShardId,
      term: Term,
      leader: MemberIndex,
      prevLogIndex: LogEntryIndex,
      prevLogTerm: Term,
      entries: Seq[LogEntry],
      leaderCommit: LogEntryIndex,
  ) extends RaftRequest
      with ClusterReplicationSerializable

If the follower accepts the command, ReplicatedLog is broken by merging the log entries the command has.

  def merge(thatEntries: Seq[LogEntry], prevLogIndex: LogEntryIndex): ReplicatedLog = {
    val newEntries = this.entries.takeWhile(_.index <= prevLogIndex) ++ thatEntries
    copy(newEntries)
  }

akka-entity-replication/ReplicatedLog.scala at v2.0.0 · lerna-stack/akka-entity-replication

The leader has completed compaction right before sending the AppendEntries.
So the leader has not had enough log entries for sending to the follower.

2021-11-29T01:40:08.6237435Z [JVM-4] [Leader] compaction completed (term: Term(1), logEntryIndex: 5)

In this situation, the leader should send the InstallSnapshot command to the follower.
InstallSnapshot is sent when prevLogTerm is resolved as None.

      val nextIndex    = currentData.nextIndexFor(memberIndex)
      val prevLogIndex = nextIndex.prev()
      val prevLogTerm  = currentData.replicatedLog.termAt(prevLogIndex)

akka-entity-replication/Leader.scala at v2.0.0 · lerna-stack/akka-entity-replication

Probably the prevLogIndex for the follower was LogEntryIndex.initial().
I guess that termAt to find prevLogTerm should return None when LogEntryIndex.initial() is passed.

  def termAt(logEntryIndex: LogEntryIndex): Option[Term] =
    logEntryIndex match {
      case initialLogIndex if initialLogIndex == LogEntryIndex.initial() => Option(Term.initial())
      case `ancestorLastIndex`                                           => Option(ancestorLastTerm)
      case logEntryIndex                                                 => get(logEntryIndex).map(_.term)
    }

akka-entity-replication/ReplicatedLog.scala at v2.0.0 · lerna-stack/akka-entity-replication

PR

@negokaz negokaz added the bug Something isn't working label Dec 1, 2021
@negokaz negokaz changed the title Starting a follower member later than leader completes a compaction may breaks ReplicatedLog of the follower Starting a follower member later than leader completes a compaction may break ReplicatedLog of the follower Dec 3, 2021
@xirc xirc added this to the v2.1.0 milestone Dec 17, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
2 participants