Fix failover error causing child workflows to get stuck #5919
Conversation
```go
if err == errTargetDomainNotActive {
	t.scope.IncCounter(metrics.TaskTargetNotActiveCounterPerDomain)
	t.logger.Error("Dropping 'domain-not-active' error as non-retriable", tag.Error(err))
	return nil
}
```
This is the problem, I think: it's just dropping a task on the floor, which is a problem in a race.
If the domain has failed over to the other region, the tasks can't start the child workflow in this region because this region is NOT active. I think even if you retry the task, you'll still get the same error.
That makes sense @Shaddoll. In that case, should we remove the time-elapsed check altogether for these domain-not-active errors?
The logs show that the error was emitted from the newly active region, in which case I expect it to converge towards being correct on retry.
(For context on the failover: I was failing up-cm-dca over from PHX back to its original location in DCA at 12:43 PST, and the errors were emitted from dca. During the exact moment of failover I don't expect all history hosts to have completely up-to-date information about which domains are active, and during a transfer event I would expect a few tasks to be dispatched and picked up by hosts that are out of date.)
Synced offline, and now I understand that there might be a race condition between failover and domain-cache updates on different history hosts.
Approved, since the risk of the change is very low.
Got it. This is to avoid dropping the task on the newly active side 👍
Merged commit b005d32 into cadence-workflow:master
## What changed?

This mostly* removes the cross-cluster feature.

## Background

The cross-cluster feature was the ability to launch and interact with child workflows in another domain. It included the ability to start child workflows and signal them. The feature allowed child workflows to be launched in the target domain even if it was active in another region.

## Problems

The feature itself was something that very few of our customers needed: few were interested in the problem of launching child workflows in another cluster, and none were unable to simply use an activity to make an RPC call to the other domain, as one would with any normal workflow.

The feature was also quite resource-intensive. It was pull-based, spinning up a polling stack which polled the other cluster for work, similar to the replication stack. This polling behaviour made the latency characteristics fairly unpredictable and used considerable DB resources, to the point that we simply turned it off. The Uber/Cadence team resolved that, were there sufficient demand for the feature in the future, a push-based mechanism would probably be significantly preferable.

The feature also added a nontrivial amount of complexity to the codebase in a few areas, such as task processing and domain error handling, which introduced difficult-to-understand bugs such as the child-workflow-dropping error #5919.

## Decision to deprecate and alternatives

As of the June 2024 releases, the feature will be removed. The Cadence team is not aware of any users of the feature outside Uber (it was broken until mid-2021 anyway), but as an FYI, it will cease to be available. If this behaviour is desirable, an easy workaround is as previously mentioned: use an activity to launch or signal the workflows in the other domain and block as needed.

## PR details

This is a fairly high-risk refactor, so it'll take some time to land. Broadly it:

- Entirely removes the cross-cluster feature and behaviour from workflow execution
- Leaves the API, enums, and persistence layer untouched. The intention is that a follow-up PR will remove the persistence-layer parts of the cross-cluster feature.

Notable callouts:

- This likely fixes a few bugs around failovers, as the current cross-cluster behaviour treats domain-not-active errors as errors to swallow, which is a clear race condition.
- The feature probably contributes to errors between parent/child workflows just due to the sheer complexity of the code added; this is a large simplification.
What changed?
This hopefully fixes some child workflows erroring out during a failover. I was doing some testing and got reports of stuck child workflows. As far as I can tell, the child workflows were being created at high throughput during a failover, and the failover caused a race, presumably between the various history shards, in having up-to-date information about the state of the domain.
Since a domain update is a fairly async process, and there's no clear guarantee that a transfer task to create a child workflow will not arrive before the other history node has updated its domain cache, it seems likely to me that we should never have treated this as a non-retriable error (I wrote this when I first joined the team, and it was probably a bad idea).
My guess is that we will probably back out most of these changes to handle domain-not-active errors for cross-domain code, since the feature itself is likely to be deprecated. For now, though, I think just guarding it more safely should avoid the problem (hopefully).
Reasoning:
I did a failover to this region at precisely the second that this error was emitted: