Rollback Guide

Prerequisites

Rollback requires the following information:

  • The ID of the Raft shard to roll back
  • Target UNIX timestamp: the rollback tool rolls the Raft shard back to this timestamp (called the rollback point below). Use the leader's timestamp if the clocks of the cluster nodes might be out of sync.
  • The leader's Raft role at the rollback point: see Determine the leader by searching INFO logs for how to find it.

The clock gap between any two nodes must be smaller than a configured tolerance, which defaults to 10 seconds. To change this tolerance, see Configure the rollback tool for details.
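The two timestamp-related prerequisites can be checked before running the tool. The following is a minimal sketch that uses only java.time. RollbackTimestamp, toRollbackPoint, and withinTolerance are hypothetical names for illustration and are not part of akka-entity-replication; the 10-second default comes from the text above, and the clock readings are assumed to be samples of the same moment taken on each node.

```scala
import java.time.{ Duration, Instant }

// Hypothetical helpers for illustration; not part of akka-entity-replication.
object RollbackTimestamp {

  // Default tolerance stated in this guide (10 seconds).
  val DefaultTolerance: Duration = Duration.ofSeconds(10)

  // Converts a UNIX timestamp in seconds to the Instant passed as the rollback point.
  def toRollbackPoint(unixSeconds: Long): Instant =
    Instant.ofEpochSecond(unixSeconds)

  // Returns true if the gap between the earliest and latest clock reading
  // is strictly less than the tolerance. An empty input is treated as in tolerance.
  def withinTolerance(nodeClocks: Seq[Instant], tolerance: Duration = DefaultTolerance): Boolean =
    if (nodeClocks.isEmpty) true
    else {
      val earliest = nodeClocks.reduce((a, b) => if (a.isBefore(b)) a else b)
      val latest   = nodeClocks.reduce((a, b) => if (a.isAfter(b)) a else b)
      Duration.between(earliest, latest).compareTo(tolerance) < 0
    }
}
```

If the check fails, either synchronize the clocks first or raise the tolerance as described in Configure the rollback tool.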

WARNING

akka-entity-replication (v2.3.0 or later) supports deleting old events and snapshots. Once events or snapshots have been deleted, rolling back to a timestamp that requires them is impossible. The rollback tool detects such deletions: if you specify such a timestamp, the tool fails during preparation and does not delete any data for the rollback.

Rollback Procedures

1. Disable the target Raft shard

To ensure that the target Raft shard stops during the rollback, deploy a new configuration to disable the shard. Use ClusterReplicationSettings.withDisabledShards to disable the specific shards.

For example, to disable Raft shard 1, use the following code:

import akka.actor.typed.ActorSystem
import lerna.akka.entityreplication.typed._

val system: ActorSystem[_] = ???
val clusterReplication = ClusterReplication(system)

val settings =
  ClusterReplicationSettings(system)
    .withDisabledShards(Set("1"))

val entity = ReplicatedEntity(???)(???).withSettings(settings)
clusterReplication.init(entity)

You can avoid sending requests to entities on disabled shards, which spares callers from waiting for request timeouts. For more details, see Avoid sending requests to disabled entities.

2. Execute rollback for the target shard

CassandraRaftShardRollback rolls back a given Raft shard to a given UNIX timestamp, as shown below:

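import akka.Done
import akka.actor.typed.ActorSystem
import com.typesafe.config._
import java.time.Instant
import lerna.akka.entityreplication.typed._
import lerna.akka.entityreplication.rollback.cassandra._
import scala.concurrent.ExecutionContext

val system: ActorSystem[_] = ???
// The for-comprehension below needs an ExecutionContext in implicit scope.
implicit val executionContext: ExecutionContext = system.executionContext
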
val toTimestamp: Instant = ???
val typeName: String = ???
val targetShardId: String = ???
val multiRaftRoles: Set[String] = ClusterReplicationSettings(system).raftSettings.multiRaftRoles
val targetLeaderRaftRole: String = ???

val rollback = CassandraRaftShardRollback(system)
for {
  rollbackSetup <- rollback.prepareRollback(
    typeName,
    targetShardId,
    multiRaftRoles,
    targetLeaderRaftRole,
    toTimestamp,
  )
  _ <- ??? // review the setup if needed
  _ <- rollback.rollback(rollbackSetup)
} yield Done

Note that after this rollback, some events are not yet committed (not replicated to the majority of nodes). These events are committed by the procedures described in the following sections.

3. Enable the target Raft shard and the sticky leader

So that the newly elected leader replicates the events that should be committed but are not yet committed, deploy the following configurations:

  • Enable the sticky leader for the target Raft shard. Use ClusterReplicationSettings.withStickyLeaders to enable the sticky leader for the specific Raft shards.
  • Enable the target Raft shard. Remove the shard from the set passed to ClusterReplicationSettings.withDisabledShards to enable it again.

For example, to set Raft role replica-group-1 as the sticky leader for Raft shard 1 and enable the Raft shard, use the following code:

import akka.actor.typed.ActorSystem
import lerna.akka.entityreplication.typed._

val system: ActorSystem[_] = ???
val clusterReplication = ClusterReplication(system)

val settings =
  ClusterReplicationSettings(system)
    .withDisabledShards(Set.empty)
    .withStickyLeaders(Map("1" -> "replica-group-1"))

val entity = ReplicatedEntity(???)(???).withSettings(settings)
clusterReplication.init(entity)

The leader at the rollback point is elected again after this deployment. This election ensures that the leader replicates the events that should be committed. You must deploy with the sticky leader configuration. If you deploy without the sticky leader, the newly elected leader might truncate events that should be committed, which means some events committed before the rollback will be lost.

4. Disable the sticky leader

After the events that should be committed have been replicated, deploy a new configuration to disable the sticky leader. Use ClusterReplicationSettings.withStickyLeaders to disable the sticky leader for the specific shards.

For example, to disable the sticky leader for Raft shard 1, use the following code:

import akka.actor.typed.ActorSystem
import lerna.akka.entityreplication.typed._

val system: ActorSystem[_] = ???
val clusterReplication = ClusterReplication(system)

val settings =
  ClusterReplicationSettings(system)
    .withStickyLeaders(Map.empty)

val entity = ReplicatedEntity(???)(???).withSettings(settings)
clusterReplication.init(entity)
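The deployments in steps 1, 3, and 4 differ only in the disabled shards and the sticky leaders. The following is a rough summary of that sequence as plain data. RollbackPlan, ShardSettings, and deployments are hypothetical names for illustration, not part of akka-entity-replication, and the sketch assumes that no other shard is disabled and no other sticky leader is configured.

```scala
// Hypothetical summary model for illustration; not part of akka-entity-replication.
object RollbackPlan {

  // The two settings this guide changes between deployments.
  final case class ShardSettings(disabledShards: Set[String], stickyLeaders: Map[String, String])

  // Settings to deploy at steps 1, 3, and 4 for one shard and its leader role
  // at the rollback point. Step 2 (the rollback itself) runs while the step 1
  // settings are still deployed.
  def deployments(shardId: String, leaderRaftRole: String): Seq[(String, ShardSettings)] =
    Seq(
      "1. disable the shard" -> ShardSettings(Set(shardId), Map.empty),
      "3. enable the shard and the sticky leader" -> ShardSettings(Set.empty, Map(shardId -> leaderRaftRole)),
      "4. disable the sticky leader" -> ShardSettings(Set.empty, Map.empty),
    )
}
```

Each ShardSettings value corresponds to the arguments passed to withDisabledShards and withStickyLeaders in the examples above.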

Configure the rollback tool

The rollback tool has its own configuration properties. Refer to reference.conf for details.

You can also configure the rollback tool programmatically with a custom configuration object, as shown below:

import akka.actor.typed.ActorSystem
import com.typesafe.config._
import lerna.akka.entityreplication.rollback.cassandra._

val system: ActorSystem[_] = ???

val customConfig: Config = ConfigFactory.parseString(
  """
    |dry-run = false
    |log-progress-every = 200
    |clock-out-of-sync-tolerance = 20s
    |read-parallelism = 2
    |write-parallelism = 2
    |cassandra.raft-persistence-plugin-location = "custom.akka.persistence.cassandra.plugin"
    |cassandra.raft-eventsourced-persistence-plugin-location = "custom.akka.persistence.cassandra.plugin"
    |""".stripMargin)

val settings = CassandraRaftShardRollbackSettings(system, customConfig)
val rollback = CassandraRaftShardRollback(system, settings)

Determine the leader by searching INFO logs

You can search INFO logs to determine the leader at the given time point.

Suppose that you have the following leader-election logs (for shard 22):

$ zcat application_*.log.gz | grep -iF 'elected' | awk -F'\t' '{ if ($4 ~ /\/22$/) { print $0 } }'
00:19:57.944 INFO lerna.akka.entityreplication.raft.RaftActor akka://System@*.*.*.*:25520/system/sharding/raft-shard-*-replica-group-1/22/22 [Leader] New leader was elected (term: Term(2), lastLogTerm: Term(0), lastLogIndex: 0) 
00:30:09.426 INFO lerna.akka.entityreplication.raft.RaftActor akka://System@*.*.*.*:25520/system/sharding/raft-shard-*-replica-group-3/22/22 [Leader] New leader was elected (term: Term(3), lastLogTerm: Term(2), lastLogIndex: 1662)
00:30:15.165 INFO lerna.akka.entityreplication.raft.RaftActor akka://System@*.*.*.*:25520/system/sharding/raft-shard-*-replica-group-3/22/22 [Leader] New leader was elected (term: Term(6), lastLogTerm: Term(3), lastLogIndex: 1663)
00:30:42.244 INFO lerna.akka.entityreplication.raft.RaftActor akka://System@*.*.*.*:25520/system/sharding/raft-shard-*-replica-group-3/22/22 [Leader] New leader was elected (term: Term(10), lastLogTerm: Term(6), lastLogIndex: 1747)

The above logs indicate that:

  • The leader's Raft role is replica-group-3 if you want to roll back to 00:30:10.000.
  • The leader's Raft role is replica-group-1 if you want to roll back to 00:25:00.000.
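The rule applied above is that the leader at a given time is the role named in the most recent election at or before that time. The following is a minimal sketch of that rule, using the four log lines from this section as data. LeaderLookup, Election, and leaderAt are hypothetical names for illustration and are not part of akka-entity-replication. The sketch compares times of day only because the sample logs carry no date; real logs spanning several days would need full timestamps.

```scala
import java.time.LocalTime

// Hypothetical model and helper for illustration; not part of akka-entity-replication.
object LeaderLookup {

  // One "New leader was elected" log line, reduced to its time and Raft role.
  final case class Election(time: LocalTime, raftRole: String)

  // The elections for shard 22 taken from the sample logs in this section.
  val shard22: Seq[Election] = Seq(
    Election(LocalTime.parse("00:19:57.944"), "replica-group-1"),
    Election(LocalTime.parse("00:30:09.426"), "replica-group-3"),
    Election(LocalTime.parse("00:30:15.165"), "replica-group-3"),
    Election(LocalTime.parse("00:30:42.244"), "replica-group-3"),
  )

  // Returns the role of the most recent election at or before `at`,
  // or None if no election had been logged by then.
  def leaderAt(elections: Seq[Election], at: LocalTime): Option[String] =
    elections
      .filter(e => !e.time.isAfter(at))
      .sortBy(_.time.toNanoOfDay)
      .lastOption
      .map(_.raftRole)
}
```

A result of None means the logs do not cover the rollback point, in which case older logs need to be searched.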