Rollback requires the following information:
- The Raft shard ID to roll back
- Target UNIX timestamp: the rollback tool rolls the Raft shard back to this timestamp (called the rollback point below). Use the leader's timestamp if the clocks of the cluster nodes might be out of sync.
- The leader's Raft role at the rollback point: you can determine the leader by searching INFO logs. See Determine the leader by searching INFO logs for details.
The clock gap between any two nodes should be less than a certain period, which is 10 seconds by default. To configure this period, see Configure the rollback tool for details.
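As a rough illustration of the two time-related inputs above, the following sketch converts a wall-clock rollback point to an `Instant` and checks a set of clock readings against the tolerance. It uses only `java.time`; the object `RollbackPoint` and its methods are hypothetical helpers, not part of akka-entity-replication, and how you sample each node's clock is left out.

```scala
import java.time.{Duration, Instant, LocalDateTime, ZoneId}

// Hypothetical helpers for preparing rollback inputs (not part of the library).
object RollbackPoint {
  // Converts a wall-clock time such as "2021-07-01T00:30:10" in the given zone to an Instant.
  def toInstant(localDateTime: String, zone: ZoneId): Instant =
    LocalDateTime.parse(localDateTime).atZone(zone).toInstant

  // Largest gap between clock readings sampled from all nodes at roughly the same moment.
  def maxClockGap(clockReadings: Seq[Instant]): Duration = {
    val millis = clockReadings.map(_.toEpochMilli)
    if (millis.isEmpty) Duration.ZERO else Duration.ofMillis(millis.max - millis.min)
  }

  // True if the largest gap is strictly less than the tolerance.
  // The default of 10 seconds mirrors the default stated in the text above.
  def withinTolerance(
      clockReadings: Seq[Instant],
      tolerance: Duration = Duration.ofSeconds(10),
  ): Boolean =
    maxClockGap(clockReadings).compareTo(tolerance) < 0
}
```

For example, `RollbackPoint.toInstant("2021-07-01T00:30:10", ZoneId.of("UTC"))` yields a value that can be passed as the target timestamp, provided the wall-clock time was read from the leader's logs in that time zone.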
akka-entity-replication (v2.3.0 or later) supports deleting old events and snapshots. Once events or snapshots have been deleted, a rollback to a timestamp that requires such deleted events or snapshots is impossible. The rollback tool can detect such deletions. If such a timestamp is specified, the rollback tool fails during preparation and doesn't issue any deletion of data for the rollback.
To ensure that the target Raft shard stops during the rollback, deploy a new configuration to disable the shard. Use ClusterReplicationSettings.withDisabledShards to disable the specific shards.
For example, to disable Raft shard 1, use the following code:
import akka.actor.typed.ActorSystem
import lerna.akka.entityreplication.typed._
val system: ActorSystem[_] = ???
val clusterReplication = ClusterReplication(system)
val settings =
  ClusterReplicationSettings(system)
    .withDisabledShards(Set("1"))
val entity = ReplicatedEntity(???)(???).withSettings(settings)
clusterReplication.init(entity)
It is possible to avoid sending a request to an entity on the disabled shards, which reduces unnecessary request timeout waits. For more details, see Avoid sending requests to disabled entities.
CassandraRaftShardRollback can roll back a specific Raft shard to a specific UNIX timestamp as below:
import akka.Done
import akka.actor.typed.ActorSystem
import com.typesafe.config._
import java.time.Instant
import lerna.akka.entityreplication.typed._
import lerna.akka.entityreplication.rollback.cassandra._
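val system: ActorSystem[_] = ???
// Assumption: the rollback API returns Futures, which need an ExecutionContext to be composed.
implicit val ec: scala.concurrent.ExecutionContext = system.executionContext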
val toTimestamp: Instant = ???
val typeName: String = ???
val targetShardId: String = ???
val multiRaftRoles: Set[String] = ClusterReplicationSettings(system).raftSettings.multiRaftRoles
val targetLeaderRaftRole: String = ???
val rollback = CassandraRaftShardRollback(system)
for {
  rollbackSetup <- rollback.prepareRollback(
    typeName,
    targetShardId,
    multiRaftRoles,
    targetLeaderRaftRole,
    toTimestamp,
  )
  _ <- ??? // review the setup if needed
  _ <- rollback.rollback(rollbackSetup)
} yield Done
Note that some events are not committed (not replicated to the majority of nodes) after this rollback. These events will be committed by procedures described in the following sections.
To let the newly elected leader replicate the events that should be committed but are not yet committed, deploy the following configurations:
- Enable the sticky leader for the target Raft shard. Use ClusterReplicationSettings.withStickyLeaders to enable the sticky leader for the specific Raft shards.
- Enable the target Raft shard. Use ClusterReplicationSettings.withDisabledShards, with the target shard removed from the disabled set, to enable the specific Raft shards.
For example, to set Raft role replica-group-1 as the sticky leader for Raft shard 1 and enable the Raft shard, use the following code:
import akka.actor.typed.ActorSystem
import lerna.akka.entityreplication.typed._
val system: ActorSystem[_] = ???
val clusterReplication = ClusterReplication(system)
val settings =
  ClusterReplicationSettings(system)
    .withDisabledShards(Set.empty)
    .withStickyLeaders(Map("1" -> "replica-group-1"))
val entity = ReplicatedEntity(???)(???).withSettings(settings)
clusterReplication.init(entity)
The leader at the rollback point is elected again after this deployment. This election ensures that the leader replicates the events that should be committed. Note that you must deploy with the sticky leader configuration. If you deploy without the sticky leader, the newly elected leader might truncate events that should be committed, which means that some events committed before the rollback will be lost.
After the events that should be committed have been replicated, deploy a new configuration to disable the sticky leader. Use ClusterReplicationSettings.withStickyLeaders to disable the sticky leader for the specific shards.
For example, to disable the sticky leader for Raft shard 1, use the following code:
import akka.actor.typed.ActorSystem
import lerna.akka.entityreplication.typed._
val system: ActorSystem[_] = ???
val clusterReplication = ClusterReplication(system)
val settings =
  ClusterReplicationSettings(system)
    .withStickyLeaders(Map.empty)
val entity = ReplicatedEntity(???)(???).withSettings(settings)
clusterReplication.init(entity)
The rollback tool has configuration properties. Refer to reference.conf for details.
You can also configure the rollback tool programmatically with a custom configuration object, as below:
import akka.actor.typed.ActorSystem
import com.typesafe.config._
import lerna.akka.entityreplication.rollback.cassandra._
val customConfig: Config = ConfigFactory.parseString(
"""
|dry-run = false
|log-progress-every = 200
|clock-out-of-sync-tolerance = 20s
|read-parallelism = 2
|write-parallelism = 2
|cassandra.raft-persistence-plugin-location = "custom.akka.persistence.cassandra.plugin"
|cassandra.raft-eventsourced-persistence-plugin-location = "custom.akka.persistence.cassandra.plugin"
|""".stripMargin)
val settings = CassandraRaftShardRollbackSettings(system, customConfig)
val rollback = CassandraRaftShardRollback(system, settings)
You can search INFO logs to determine the leader at a given point in time.
Suppose that you have the following leader-election logs (for shard 22):
$ zcat application_*.log.gz | grep -iF 'elected' | awk -F'\t' '{ if ($4 ~ /\/22$/) { print $0 } }'
00:19:57.944 INFO lerna.akka.entityreplication.raft.RaftActor akka://System@*.*.*.*:25520/system/sharding/raft-shard-*-replica-group-1/22/22 [Leader] New leader was elected (term: Term(2), lastLogTerm: Term(0), lastLogIndex: 0)
00:30:09.426 INFO lerna.akka.entityreplication.raft.RaftActor akka://System@*.*.*.*:25520/system/sharding/raft-shard-*-replica-group-3/22/22 [Leader] New leader was elected (term: Term(3), lastLogTerm: Term(2), lastLogIndex: 1662)
00:30:15.165 INFO lerna.akka.entityreplication.raft.RaftActor akka://System@*.*.*.*:25520/system/sharding/raft-shard-*-replica-group-3/22/22 [Leader] New leader was elected (term: Term(6), lastLogTerm: Term(3), lastLogIndex: 1663)
00:30:42.244 INFO lerna.akka.entityreplication.raft.RaftActor akka://System@*.*.*.*:25520/system/sharding/raft-shard-*-replica-group-3/22/22 [Leader] New leader was elected (term: Term(10), lastLogTerm: Term(6), lastLogIndex: 1747)
The above logs indicate that:
- The leader's Raft role is replica-group-3 if you want to roll back to 00:30:10.000.
- The leader's Raft role is replica-group-1 if you want to roll back to 00:25:00.000.
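The reasoning above amounts to picking the most recent election logged at or before the rollback point. A minimal sketch of that rule follows, assuming you have already extracted the timestamp and Raft role from each log line. The `Election` case class and `leaderAt` function are hypothetical names, not part of akka-entity-replication. The sketch compares times of day only, so it assumes all entries fall on the same day.

```scala
import java.time.LocalTime

// Hypothetical helper for reading leader-election logs (not part of the library).
object LeaderAtRollbackPoint {
  // One leader-election log entry: when it was logged and which Raft role became leader.
  final case class Election(time: LocalTime, raftRole: String)

  // Returns the Raft role of the latest election logged at or before the rollback point,
  // or None if no election was logged by then.
  def leaderAt(elections: Seq[Election], rollbackPoint: LocalTime): Option[String] =
    elections
      .filter(e => !e.time.isAfter(rollbackPoint))
      .sortBy(_.time.toNanoOfDay)
      .lastOption
      .map(_.raftRole)

  // The four entries from the sample logs above.
  val sampleElections: Seq[Election] = Seq(
    Election(LocalTime.parse("00:19:57.944"), "replica-group-1"),
    Election(LocalTime.parse("00:30:09.426"), "replica-group-3"),
    Election(LocalTime.parse("00:30:15.165"), "replica-group-3"),
    Election(LocalTime.parse("00:30:42.244"), "replica-group-3"),
  )
}
```

With the sample entries, `leaderAt(sampleElections, LocalTime.parse("00:30:10.000"))` selects the election logged at 00:30:09.426, and a rollback point of 00:25:00.000 selects the one logged at 00:19:57.944. Log timestamps come from each node's own clock, so treat results near an election boundary with care if the clocks might be out of sync.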