
Add Managed Failover #3558

Merged · 21 commits into master from managedfo · Oct 6, 2020
Conversation

@vancexu (Contributor) commented Sep 29, 2020

What changed?

Implement managed failover as a Cadence workflow, so that the experience becomes:

  • Start a failover.
    • cadence adm cluster failover start --source_cluster <dc> --target_cluster <dc2>
    • An additional --domains flag fails over only the provided domains.
  • Show failover progress.
    • cadence adm cluster failover query --r <rid>
  • Pause and resume a failover.
    • cadence adm cluster failover pause
    • cadence adm cluster failover resume
  • Abort a failover.
    • cadence adm cluster failover abort
  • Roll back a failover, with retention.
    • cadence adm cluster failover rollback --r <rid>
  • Show failover history.
    • Use list workflow.
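The pause/resume/abort surface above implies a small lifecycle state machine inside the workflow. A stdlib-only Go sketch of those transitions — all type and method names here are illustrative, not the PR's actual code:

```go
package main

import "fmt"

type wfState int

const (
	wfRunning wfState = iota
	wfPaused
	wfAborted
	wfCompleted
)

// failover tracks the managed-failover lifecycle driven by CLI signals.
type failover struct{ state wfState }

// Pause is only legal while the failover is running.
func (f *failover) Pause() error {
	if f.state != wfRunning {
		return fmt.Errorf("cannot pause in state %d", f.state)
	}
	f.state = wfPaused
	return nil
}

// Resume is only legal while the failover is paused.
func (f *failover) Resume() error {
	if f.state != wfPaused {
		return fmt.Errorf("cannot resume in state %d", f.state)
	}
	f.state = wfRunning
	return nil
}

// Abort is legal any time before completion.
func (f *failover) Abort() error {
	if f.state == wfCompleted {
		return fmt.Errorf("already completed")
	}
	f.state = wfAborted
	return nil
}

func main() {
	f := &failover{}
	fmt.Println(f.Pause() == nil, f.Resume() == nil, f.Abort() == nil) // → true true true
}
```

Rejecting illegal transitions (e.g. resume while running) is what lets the CLI surface a clear error instead of silently ignoring the signal.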

Why?

Cadence failover is designed and expected to be executed by Cadence users, but users find it hard to fail over:

  • They don’t know how. Failover requires knowledge of Lisa and the CLI, with multiple manual steps.
  • They don’t know when. During active data-center outages, they need to reach out to the Cadence team to figure out when it is recommended to fail over, and to where.

As a result, it can take a long time for Cadence users to mitigate issues by failing over domains.

We need to take over failover for eligible users, so that we can mitigate outages more quickly and reduce operational overhead for users.

How did you test it?
Local test; will also do a staging test.

Potential risks
No direct risk, but aggressively failing over all domains at once may put the target DC under high load.

@coveralls commented Sep 29, 2020

Coverage Status

Coverage decreased (-0.09%) to 68.829% when pulling 3010e66 on managedfo into bb2ca3f on master.

@vancexu marked this pull request as ready for review September 30, 2020 01:18
@vancexu requested review from a team and yux0 September 30, 2020 01:18
@vancexu changed the title from "Add Failover Manager" to "Add Managed Failover" Sep 30, 2020
service/worker/failoverManager/starter.go (outdated; resolved)
tools/cli/adminFailoverCommands.go (resolved)
tools/cli/adminFailoverCommands.go (outdated; resolved)
tools/cli/adminFailoverCommands.go (resolved)
Comment on lines 182 to 186
shouldPause = pauseCh.ReceiveAsync(nil)
if shouldPause {
    wfState = wfPaused
    resumeCh.Receive(ctx, nil)
}
Contributor:
So we don't plan to handle the case where a resume signal is sent before pause, or where multiple pause signals come in before a resume signal?

@vancexu (Author) commented Oct 2, 2020:

My current thinking is that with this simple pause & resume implementation, an operator can always find a way out if they mistakenly pause or resume.
Another arguable benefit is that you can pause twice and then resume once to achieve some rate control.

I am open to other suggestions.

Contributor:

From the CLI, it can check the workflow state before sending a resume signal
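One way to make the loop in question robust against repeated or out-of-order signals is to drain the signal channel and collapse whatever is pending into a single boolean, rather than counting signals. A stdlib-only sketch using a plain Go channel to stand in for the workflow signal channel (illustrative only, not the PR's code):

```go
package main

import "fmt"

// drainLatest empties ch and reports whether any signal was pending,
// collapsing repeated pause (or resume) signals into one.
func drainLatest(ch chan struct{}) bool {
	seen := false
	for {
		select {
		case <-ch:
			seen = true
		default: // channel empty: stop without blocking
			return seen
		}
	}
}

func main() {
	pauseCh := make(chan struct{}, 8)
	pauseCh <- struct{}{}
	pauseCh <- struct{}{} // duplicate pause collapses into one
	fmt.Println(drainLatest(pauseCh)) // → true
	fmt.Println(drainLatest(pauseCh)) // → false (already drained)
}
```

With this shape, N queued pause signals behave like one, so the "pause twice then resume once" surprise discussed above cannot accumulate.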

service/worker/failoverManager/workflow.go (outdated; resolved)
service/worker/failoverManager/workflow.go (outdated; resolved)
@yycptt (Contributor) left a comment:

Shall we also add some unit tests for the workflow implementation?

Comment on lines 123 to 127
func init() {
    workflow.RegisterWithOptions(FailoverWorkflow, workflow.RegisterOptions{Name: WorkflowTypeName})
    activity.RegisterWithOptions(FailoverActivity, activity.RegisterOptions{Name: failoverActivityName})
    activity.RegisterWithOptions(GetDomainsActivity, activity.RegisterOptions{Name: getDomainsActivityName})
}
Contributor:

Those are deprecated? Should we use the new register?

@vancexu (Author):

We should. I plan to update all workers to use the new one later.

return nil
}

func shouldFailover(domain *shared.DescribeDomainResponse, sourceCluster string) bool {
@yux0 (Contributor) commented Oct 2, 2020:

Also, check whether it is a global domain?

@vancexu (Author):

cool, added


pagesize := int32(200)
var token []byte
for more := true; more; more = len(token) > 0 {
Contributor:

nit: we do have a pagination iterator in the common package

@vancexu (Author):

I feel this is simple enough, so I'll keep it. Thanks for the info, though.
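For reference, the token-driven pattern in the snippet above generalizes to any paginated API: keep looping while the returned next-page token is non-empty. A self-contained sketch with a fake fetch function standing in for the real list call (illustrative only):

```go
package main

import "fmt"

// fetchPage simulates a token-paginated API: the token encodes the page
// index in a single byte, and an empty token means no more pages.
func fetchPage(token []byte, pageSize int, total int) ([]int, []byte) {
	start := 0
	if len(token) > 0 {
		start = int(token[0]) * pageSize
	}
	end := start + pageSize
	if end > total {
		end = total
	}
	page := make([]int, 0, end-start)
	for i := start; i < end; i++ {
		page = append(page, i)
	}
	if end < total {
		return page, []byte{byte(start/pageSize + 1)} // token for the next page
	}
	return page, nil // last page: empty token ends the loop
}

func main() {
	var all []int
	var token []byte
	// Same loop shape as the PR snippet: run once, then continue while
	// the API keeps handing back a non-empty token.
	for more := true; more; more = len(token) > 0 {
		var page []int
		page, token = fetchPage(token, 200, 450)
		all = append(all, page...)
	}
	fmt.Println(len(all)) // → 450, collected across 3 pages
}
```

The `for more := true; more; more = len(token) > 0` form guarantees at least one fetch even when the first response returns no token.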

@vancexu vancexu requested review from yycptt and yux0 October 3, 2020 01:15
@yycptt (Contributor) left a comment:

LGTM

@vancexu vancexu merged commit 63bdb5d into master Oct 6, 2020
@vancexu vancexu deleted the managedfo branch October 6, 2020 17:34
yux0 pushed a commit that referenced this pull request Oct 6, 2020
mkolodezny pushed a commit to mkolodezny/cadence that referenced this pull request Oct 7, 2020
github-actions bot pushed a commit to vytautas-karpavicius/cadence that referenced this pull request Feb 4, 2021
yux0 pushed a commit to yux0/cadence that referenced this pull request May 4, 2021