Add Managed Failover #3558
Conversation
shouldPause = pauseCh.ReceiveAsync(nil)
if shouldPause {
	wfState = wfPaused
	resumeCh.Receive(ctx, nil)
}
So we don't plan to handle the case where a resume signal is sent before a pause, or where multiple pause signals come in before a resume signal?
My current thinking with this simple pause & resume implementation is that an operator can always find a way out if they mistakenly pause or resume.
Another, arguably a benefit: you can pause twice and then resume once to achieve some rate control.
I am open to other suggestions.
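For illustration, a rough sketch of a counting variant in which repeated pauses and an early resume balance out instead of being lost. This is not the PR's code; the helper name, the pointer-based counter, and the channel wiring are assumptions:

import "go.uber.org/cadence/workflow"

// waitIfPaused drains pending pause/resume signals, then blocks while more
// pauses than resumes have been received. Sketch only, for discussion.
func waitIfPaused(ctx workflow.Context, pauseCh, resumeCh workflow.Channel, pauseCount *int) {
	// Drain signals that arrived since the last check, in either order.
	for pauseCh.ReceiveAsync(nil) {
		(*pauseCount)++
	}
	for resumeCh.ReceiveAsync(nil) {
		(*pauseCount)--
	}
	// Block while the workflow is effectively paused.
	for *pauseCount > 0 {
		selector := workflow.NewSelector(ctx)
		selector.AddReceive(pauseCh, func(c workflow.Channel, more bool) {
			c.Receive(ctx, nil)
			(*pauseCount)++
		})
		selector.AddReceive(resumeCh, func(c workflow.Channel, more bool) {
			c.Receive(ctx, nil)
			(*pauseCount)--
		})
		selector.Select(ctx)
	}
}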
From the CLI side, we can check the workflow state before sending a resume signal.
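For reference, a minimal sketch of how the workflow could expose that state through a query handler for the CLI to check. The query name "failover-state" and the state constants are illustrative assumptions, not this PR's identifiers:

import "go.uber.org/cadence/workflow"

const (
	wfRunning = "running" // assumed state values for illustration
	wfPaused  = "paused"
)

// setupStateQuery lets the CLI query the current pause state before
// deciding whether to send a resume signal.
func setupStateQuery(ctx workflow.Context, wfState *string) error {
	return workflow.SetQueryHandler(ctx, "failover-state", func() (string, error) {
		return *wfState, nil
	})
}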
Shall we also add some unit tests for the workflow implementation?
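Something like this rough sketch using the Go client's test environment could work; the signal names and the elided workflow arguments are assumptions:

import (
	"testing"
	"time"

	"github.com/stretchr/testify/require"
	"go.uber.org/cadence/testsuite"
)

// Sketch of a pause/resume unit test against the workflow implementation.
func TestFailoverWorkflow_PauseResume(t *testing.T) {
	var s testsuite.WorkflowTestSuite
	env := s.NewTestWorkflowEnvironment()

	// Pause shortly after start, then resume; signal names are assumed.
	env.RegisterDelayedCallback(func() {
		env.SignalWorkflow("pause", nil)
	}, time.Minute)
	env.RegisterDelayedCallback(func() {
		env.SignalWorkflow("resume", nil)
	}, 2*time.Minute)

	env.ExecuteWorkflow(FailoverWorkflow /* workflow params elided */)

	require.True(t, env.IsWorkflowCompleted())
	require.NoError(t, env.GetWorkflowError())
}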
func init() {
	workflow.RegisterWithOptions(FailoverWorkflow, workflow.RegisterOptions{Name: WorkflowTypeName})
	activity.RegisterWithOptions(FailoverActivity, activity.RegisterOptions{Name: failoverActivityName})
	activity.RegisterWithOptions(GetDomainsActivity, activity.RegisterOptions{Name: getDomainsActivityName})
}
Those registration calls are deprecated, right? Should we use the new registration API?
We should; I have a plan to update all workers to use the new one later.
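For reference, a sketch of the worker-scoped registration with the newer API; the worker construction itself is elided, and this is not the PR's code:

import (
	"go.uber.org/cadence/activity"
	"go.uber.org/cadence/worker"
	"go.uber.org/cadence/workflow"
)

// registerFailover registers the workflow and activities on a specific
// worker instance instead of the deprecated package-level registry.
func registerFailover(w worker.Worker) {
	w.RegisterWorkflowWithOptions(FailoverWorkflow, workflow.RegisterOptions{Name: WorkflowTypeName})
	w.RegisterActivityWithOptions(FailoverActivity, activity.RegisterOptions{Name: failoverActivityName})
	w.RegisterActivityWithOptions(GetDomainsActivity, activity.RegisterOptions{Name: getDomainsActivityName})
}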
	return nil
}

func shouldFailover(domain *shared.DescribeDomainResponse, sourceCluster string) bool {
Also, should we check if it is a global domain?
Cool, added.
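For context, a sketch of what the check could look like against the generated thrift types; this is illustrative only, and the PR's exact logic may differ:

import "github.com/uber/cadence/.gen/go/shared"

// shouldFailover returns true only for global domains currently active in
// the source cluster; local domains cannot be failed over.
func shouldFailover(domain *shared.DescribeDomainResponse, sourceCluster string) bool {
	if domain.IsGlobalDomain == nil || !*domain.IsGlobalDomain {
		return false
	}
	repl := domain.ReplicationConfiguration
	return repl != nil &&
		repl.ActiveClusterName != nil &&
		*repl.ActiveClusterName == sourceCluster
}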
pagesize := int32(200)
var token []byte
for more := true; more; more = len(token) > 0 {
nit: we do have a pagination iterator in the common package
I feel this is simple enough, so I will keep it; thanks for the info though.
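For completeness, a sketch of the full token-driven loop; the frontendClient dependency, the helper name, and the surrounding error handling are assumptions modeled on the shared thrift types:

import (
	"context"

	"github.com/uber/cadence/.gen/go/shared"
	"github.com/uber/cadence/client/frontend"
)

// listAllDomains pages through ListDomains until the server returns an
// empty next-page token, accumulating all domain descriptions.
func listAllDomains(ctx context.Context, frontendClient frontend.Client) ([]*shared.DescribeDomainResponse, error) {
	pagesize := int32(200)
	var token []byte
	var res []*shared.DescribeDomainResponse
	for more := true; more; more = len(token) > 0 {
		resp, err := frontendClient.ListDomains(ctx, &shared.ListDomainsRequest{
			PageSize:      &pagesize,
			NextPageToken: token,
		})
		if err != nil {
			return nil, err
		}
		res = append(res, resp.Domains...)
		token = resp.NextPageToken
	}
	return res, nil
}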
LGTM
What changed?
Implement managed failover as a Cadence workflow, so that the experience would be:
cadence adm cluster failover start --source_cluster <dc> --target_cluster <dc2>
cadence adm cluster failover query --r <rid>
cadence adm cluster failover pause
cadence adm cluster failover resume
cadence adm cluster failover abort
cadence adm cluster failover rollback --r <rid>
Why?
Cadence failover is designed and expected to be executed by Cadence users, but users find it hard to fail over.
As a result, it can take a long time for Cadence users to mitigate issues by failing over domains.
We need to take over failover for eligible users, so that we can mitigate outages quicker and reduce operational overhead for users.
How did you test it?
Local test.
Will also do staging tests.
Potential risks
No direct risk, but aggressively failing over all domains may put the target DC under high load.