Fix failover error causing child workflows to get stuck #5919

Conversation

davidporter-id-au (Member) commented Apr 18, 2024

What changed?

This hopefully fixes child workflows erroring out during a failover. While doing some testing, I got reports of stuck child workflows. As far as I can tell, the child workflows were being created at high throughput during a failover, and the failover caused a race between the various history shards, which did not all have up-to-date information about the state of the domain.

Since a domain update is a fairly asynchronous process, and there is no clear guarantee that a transfer task to create a child workflow will not arrive before the other history node has updated its domain cache, it seems likely to me that we should never have treated this as a non-retriable error (I wrote this code when I first joined the team, and it was probably a bad idea).

My guess is that we will probably back out most of these changes to handle domain-not-active errors in the cross-domain code, since the feature itself is likely to be deprecated. For now, though, guarding it more safely should hopefully avoid the problem; a rough sketch of the idea follows below.
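
To make the "guard it more safely" idea concrete, here is a minimal, self-contained sketch of treating the error as retriable until the domain cache has had time to converge. This is illustrative only and not the actual diff in this PR: the grace-window constant, the standalone function, and the task-submit-time parameter are all assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTargetDomainNotActive mirrors the sentinel error discussed in this PR;
// everything else in this sketch is hypothetical.
var errTargetDomainNotActive = errors.New("target domain is not active in this cluster")

// domainCacheConvergenceGrace is an assumed window in which a stale domain
// cache on a history host should have caught up after a failover.
const domainCacheConvergenceGrace = 5 * time.Minute

// handleTargetDomainNotActive returns the error (so the task framework
// retries the task) while the task is young enough that the error may just
// be a stale domain cache; only after the grace window does it drop the task.
func handleTargetDomainNotActive(err error, taskSubmitTime time.Time) error {
	if !errors.Is(err, errTargetDomainNotActive) {
		return err
	}
	if time.Since(taskSubmitTime) < domainCacheConvergenceGrace {
		return err // retriable: the domain cache may still be converging
	}
	// The domain really is active in another cluster; drop the task.
	fmt.Println("dropping 'domain-not-active' error as non-retriable:", err)
	return nil
}

func main() {
	// A task submitted just before the failover keeps being retried.
	err := handleTargetDomainNotActive(errTargetDomainNotActive, time.Now())
	fmt.Println("still retriable?", err != nil)
}
```

In this shape, a task that hits a stale cache keeps being retried and eventually succeeds once the cache refreshes, while a task whose domain genuinely lives elsewhere is still dropped after the grace window.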

Reasoning:

[Screenshot: 2024-04-17 at 8:35:03 PM]

I did a failover to this region at precisely the second that this error was emitted:

[Screenshot: 2024-04-17 at 7:49:21 PM]

coveralls commented Apr 18, 2024

Pull Request Test Coverage Report for Build 018ef6cc-7efe-4e66-a292-55d301b12f3b

Details

  • 4 of 4 (100.0%) changed or added relevant lines in 1 file are covered.
  • 80 unchanged lines in 21 files lost coverage.
  • Overall coverage decreased (-0.02%) to 67.607%

Files with Coverage Reduction | New Missed Lines | %
service/history/queue/timer_queue_processor_base.go | 1 | 77.62%
service/history/task/transfer_standby_task_executor.go | 2 | 87.04%
common/task/weighted_round_robin_task_scheduler.go | 2 | 88.56%
common/persistence/sql/sqlplugin/postgres/task.go | 2 | 73.4%
common/persistence/sql/sqlplugin/postgres/db.go | 2 | 80.0%
common/task/fifo_task_scheduler.go | 2 | 87.63%
common/persistence/sql/sqlplugin/mysql/task.go | 2 | 73.68%
service/matching/db.go | 2 | 73.23%
service/history/task/transfer_active_task_executor.go | 2 | 72.72%
common/util.go | 2 | 91.78%
Totals:
Change from base Build 018ef587-aadc-40ad-bafb-8e5e189f2433: -0.02%
Covered Lines: 99001
Relevant Lines: 146435

💛 - Coveralls

codecov bot commented Apr 18, 2024

Codecov Report

Merging #5919 (4059735) into master (cc0805f) will decrease coverage by 0.12%.
The diff coverage is 100.00%.

❗ Current head 4059735 differs from pull request most recent head 89ca4cd. Consider uploading reports for the commit 89ca4cd to get more accurate results

Additional details and impacted files
Files | Coverage Δ
service/history/task/task.go | 78.71% <100.00%> (-0.32%) ⬇️

... and 19 files with indirect coverage changes



Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cc0805f...89ca4cd.

if err == errTargetDomainNotActive {
	t.scope.IncCounter(metrics.TaskTargetNotActiveCounterPerDomain)
	t.logger.Error("Dropping 'domain-not-active' error as non-retriable", tag.Error(err))
	return nil
}
davidporter-id-au (Member, Author) commented:
This is the problem, I think: it's just dropping a task on the floor, which is a problem in a race.

Shaddoll (Member) commented Apr 18, 2024:
If the domain has failed over to the other region, the tasks are not able to start the child workflow in this region, because this region is NOT active. I think even if you retry the task, you'll still get the same error.

taylanisikdemir (Member) commented Apr 18, 2024:
That makes sense, @Shaddoll. In that case, should we remove the time-elapsed check altogether for these domain-not-active related errors?

davidporter-id-au (Member, Author) commented Apr 18, 2024:
The logs show that the error was emitted from the newly active region, in which case I expect it to converge towards being correct on retry.

(For context: in the failover, I was failing up-cm-dca over from PHX back to its original location in DCA at 12:43 PST, and the errors were emitted from DCA. At the exact moment of failover I don't expect all history hosts to have completely up-to-date information about which domains are active, so during a transfer event I would expect a few tasks to be dispatched and picked up by hosts that are out of date.)

A reviewer (Member) commented:
Synced offline, and now I understand that there might be a race condition between failover and the domain cache update on different history hosts.
Approved, since the risk of the change is very low.

Another reviewer (Member) commented:
Got it. This is to avoid dropping the task on the newly active side 👍
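
To make the race discussed in this thread concrete, here is a small, self-contained Go simulation. It is illustrative only and not Cadence code: the domainCache type, the startChildWorkflow function, and the cluster names are stand-ins for the real per-host domain cache and transfer task executor.

```go
package main

import (
	"errors"
	"fmt"
)

var errTargetDomainNotActive = errors.New("domain is not active in this cluster")

// domainCache is a stand-in for the per-history-host domain cache that can
// briefly lag behind a failover.
type domainCache struct{ activeCluster string }

func (c *domainCache) refresh(latest string) { c.activeCluster = latest }

// startChildWorkflow rejects the request when this host's cache still thinks
// another cluster is active; right after a failover that view may be stale.
func startChildWorkflow(c *domainCache, currentCluster string) error {
	if c.activeCluster != currentCluster {
		return errTargetDomainNotActive
	}
	return nil
}

func main() {
	cache := &domainCache{activeCluster: "phx"} // stale: the domain already failed over to dca

	// First attempt on the newly active (dca) side fails with domain-not-active;
	// dropping the task here would lose the child workflow.
	fmt.Println(startChildWorkflow(cache, "dca"))

	cache.refresh("dca") // the cache converges shortly after the failover

	// A retried task now succeeds, which is why the error should be retriable.
	fmt.Println(startChildWorkflow(cache, "dca"))
}
```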

davidporter-id-au enabled auto-merge (squash) April 18, 2024 21:22
davidporter-id-au merged commit b005d32 into cadence-workflow:master Apr 19, 2024
17 of 18 checks passed
davidporter-id-au added a commit that referenced this pull request Jun 27, 2024
## What changed?

This mostly* removes the cross-cluster feature.

## Background
The cross-cluster feature was the ability to launch and interact with child workflows in another domain: it included starting child workflows and signalling them, and it allowed child workflows to be launched in the target domain even if that domain was active in another region.

## Problems
The feature itself was something that very few of our customers apparently needed: few were interested in the problem of launching child workflows in another cluster, and none were unable to simply use an activity to make an RPC call to the other domain, as one would from any normal workflow.
The feature was also quite resource-intensive: it was pull-based, spinning up a polling stack which polled the other cluster for work, similar to the replication stack. This polling behaviour made the latency characteristics fairly unpredictable and used considerable DB resources, to the point that we simply turned it off. The Uber/Cadence team resolved that, were there sufficient demand for the feature in the future, a push-based mechanism would probably be significantly preferable.
The feature also added a nontrivial amount of complexity to the codebase in a few areas, such as task processing and domain error handling, which introduced difficult-to-understand bugs such as the child-workflow-dropping error #5919.

## Decision to deprecate and alternatives
As of the June 2024 releases, the feature will be removed. The Cadence team is not aware of any users of the feature outside Uber (it was broken until mid-2021 anyway), but as an FYI, it will cease to be available.

If this behaviour is desirable, an easy workaround is as previously mentioned: Use an activity to launch or signal the workflows in the other domain and block as needed.

## PR details
This is a fairly high-risk refactor so it'll take some time to land. Broadly it:

- Entirely removes the cross-cluster feature and behaviour from workflow execution.
- Leaves the API, enums, and persistence layer untouched. The intention is that a follow-up PR will remove the persistence-layer parts of the cross-cluster feature.

## Notable callouts
- This likely fixes a few bugs around failovers, as the current cross-cluster behaviour treats domain-not-active errors as errors to swallow, which is a clear race condition.
- The sheer complexity of the code added probably also contributes to errors between parent and child workflows; this is a large simplification.