Add Managed Failover #3558
Conversation
shouldPause = pauseCh.ReceiveAsync(nil)
if shouldPause {
	wfState = wfPaused
	resumeCh.Receive(ctx, nil)
}
So we don't plan to handle the case where a resume signal is sent before a pause, or where multiple pause signals come in before a resume signal?
My current thinking with this simple pause & resume implementation is that an operator can always find a way out if they mistakenly pause or resume.
Another, arguably a benefit: you can pause twice and then resume once to achieve some rate control.
I am open to other suggestions.
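For illustration, a rough sketch of a counting variant in which repeated pauses and an early resume balance out instead of being lost. This is not the PR's code; the helper name, the pointer-based counter, and the channel wiring are assumptions:

import "go.uber.org/cadence/workflow"

// waitIfPaused drains pending pause/resume signals, then blocks while more
// pauses than resumes have been received. Sketch only, for discussion.
func waitIfPaused(ctx workflow.Context, pauseCh, resumeCh workflow.Channel, pauseCount *int) {
	// Drain signals that arrived since the last check, in either order.
	for pauseCh.ReceiveAsync(nil) {
		(*pauseCount)++
	}
	for resumeCh.ReceiveAsync(nil) {
		(*pauseCount)--
	}
	// Block while the workflow is effectively paused.
	for *pauseCount > 0 {
		selector := workflow.NewSelector(ctx)
		selector.AddReceive(pauseCh, func(c workflow.Channel, more bool) {
			c.Receive(ctx, nil)
			(*pauseCount)++
		})
		selector.AddReceive(resumeCh, func(c workflow.Channel, more bool) {
			c.Receive(ctx, nil)
			(*pauseCount)--
		})
		selector.Select(ctx)
	}
}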
From the CLI side, we can check the workflow state before sending a resume signal.
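For reference, a minimal sketch of how the workflow could expose that state through a query handler for the CLI to check. The query name "failover-state" and the state constants are illustrative assumptions, not this PR's identifiers:

import "go.uber.org/cadence/workflow"

const (
	wfRunning = "running" // assumed state values for illustration
	wfPaused  = "paused"
)

// setupStateQuery lets the CLI query the current pause state before
// deciding whether to send a resume signal.
func setupStateQuery(ctx workflow.Context, wfState *string) error {
	return workflow.SetQueryHandler(ctx, "failover-state", func() (string, error) {
		return *wfState, nil
	})
}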
Shall we also add some unit tests for the workflow implementation?
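Something like this rough sketch using the Go client's test environment could work; the signal names and the elided workflow arguments are assumptions:

import (
	"testing"
	"time"

	"github.com/stretchr/testify/require"
	"go.uber.org/cadence/testsuite"
)

// Sketch of a pause/resume unit test against the workflow implementation.
func TestFailoverWorkflow_PauseResume(t *testing.T) {
	var s testsuite.WorkflowTestSuite
	env := s.NewTestWorkflowEnvironment()

	// Pause shortly after start, then resume; signal names are assumed.
	env.RegisterDelayedCallback(func() {
		env.SignalWorkflow("pause", nil)
	}, time.Minute)
	env.RegisterDelayedCallback(func() {
		env.SignalWorkflow("resume", nil)
	}, 2*time.Minute)

	env.ExecuteWorkflow(FailoverWorkflow /* workflow params elided */)

	require.True(t, env.IsWorkflowCompleted())
	require.NoError(t, env.GetWorkflowError())
}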
func init() {
	workflow.RegisterWithOptions(FailoverWorkflow, workflow.RegisterOptions{Name: WorkflowTypeName})
	activity.RegisterWithOptions(FailoverActivity, activity.RegisterOptions{Name: failoverActivityName})
	activity.RegisterWithOptions(GetDomainsActivity, activity.RegisterOptions{Name: getDomainsActivityName})
}
Those registration calls are deprecated, right? Should we use the new registration API?
We should; I have a plan to update all workers to use the new one later.
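For reference, a sketch of the worker-scoped registration with the newer API; the worker construction itself is elided, and this is not the PR's code:

import (
	"go.uber.org/cadence/activity"
	"go.uber.org/cadence/worker"
	"go.uber.org/cadence/workflow"
)

// registerFailover registers the workflow and activities on a specific
// worker instance instead of the deprecated package-level registry.
func registerFailover(w worker.Worker) {
	w.RegisterWorkflowWithOptions(FailoverWorkflow, workflow.RegisterOptions{Name: WorkflowTypeName})
	w.RegisterActivityWithOptions(FailoverActivity, activity.RegisterOptions{Name: failoverActivityName})
	w.RegisterActivityWithOptions(GetDomainsActivity, activity.RegisterOptions{Name: getDomainsActivityName})
}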
	return nil
}

func shouldFailover(domain *shared.DescribeDomainResponse, sourceCluster string) bool {
Also, should we check if it is a global domain?
Cool, added.
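For context, a sketch of what the check could look like against the generated thrift types; this is illustrative only, and the PR's exact logic may differ:

import "github.com/uber/cadence/.gen/go/shared"

// shouldFailover returns true only for global domains currently active in
// the source cluster; local domains cannot be failed over.
func shouldFailover(domain *shared.DescribeDomainResponse, sourceCluster string) bool {
	if domain.IsGlobalDomain == nil || !*domain.IsGlobalDomain {
		return false
	}
	repl := domain.ReplicationConfiguration
	return repl != nil &&
		repl.ActiveClusterName != nil &&
		*repl.ActiveClusterName == sourceCluster
}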
pagesize := int32(200)
var token []byte
for more := true; more; more = len(token) > 0 {
nit: we do have a pagination iterator in the common package
I feel this is simple enough, so I will keep it; thanks for the info though.
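For completeness, a sketch of the full token-driven loop; the frontendClient dependency, the helper name, and the surrounding error handling are assumptions modeled on the shared thrift types:

import (
	"context"

	"github.com/uber/cadence/.gen/go/shared"
	"github.com/uber/cadence/client/frontend"
)

// listAllDomains pages through ListDomains until the server returns an
// empty next-page token, accumulating all domain descriptions.
func listAllDomains(ctx context.Context, frontendClient frontend.Client) ([]*shared.DescribeDomainResponse, error) {
	pagesize := int32(200)
	var token []byte
	var res []*shared.DescribeDomainResponse
	for more := true; more; more = len(token) > 0 {
		resp, err := frontendClient.ListDomains(ctx, &shared.ListDomainsRequest{
			PageSize:      &pagesize,
			NextPageToken: token,
		})
		if err != nil {
			return nil, err
		}
		res = append(res, resp.Domains...)
		token = resp.NextPageToken
	}
	return res, nil
}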
LGTM
What changed?
Implement managed failover as a Cadence workflow, so that the experience would be:
cadence adm cluster failover start --source_cluster <dc> --target_cluster <dc2>
cadence adm cluster failover query --r <rid>
cadence adm cluster failover pause
cadence adm cluster failover resume
cadence adm cluster failover abort
cadence adm cluster failover rollback --r <rid>
Why?
Cadence failover is designed and expected to be executed by Cadence users, but users find it hard to fail over.
As a result, it can take a long time for Cadence users to mitigate issues by failing over domains.
We need to take over failover for eligible users, so that we can mitigate outages quicker and reduce operational overhead for users.
How did you test it?
Local test.
Will also do staging tests.
Potential risks
No direct risk, but aggressively failing over all domains may put the target DC under high load.