Fix failover error causing child workflows to get stuck #5919

Conversation

davidporter-id-au (Member) commented Apr 18, 2024

What changed?

This hopefully fixes child workflows erroring out during a failover. While doing some testing, I got reports of stuck child workflows. As far as I can tell, the child workflows were being created at high throughput during a failover, and the failover caused a race between the various history shards, which did not all have up-to-date information about the state of the domain.

Since a domain update is a fairly asynchronous process, and there is no clear guarantee that a transfer task to create a child workflow will not arrive before the other history node has updated its domain cache, it seems likely to me that we should never have treated this as a non-retriable error (I wrote this code when I first joined the team, and it was probably a bad idea).

My guess is that we will probably back out most of these changes to handle domain-not-active errors in the cross-domain code, since the feature itself is likely to be deprecated. For now, though, guarding it more safely should hopefully avoid the problem; a rough sketch of the idea follows below.
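
To make the "guard it more safely" idea concrete, here is a minimal, self-contained sketch of treating the error as retriable until the domain cache has had time to converge. This is illustrative only and not the actual diff in this PR: the grace-window constant, the standalone function, and the task-submit-time parameter are all assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTargetDomainNotActive mirrors the sentinel error discussed in this PR;
// everything else in this sketch is hypothetical.
var errTargetDomainNotActive = errors.New("target domain is not active in this cluster")

// domainCacheConvergenceGrace is an assumed window in which a stale domain
// cache on a history host should have caught up after a failover.
const domainCacheConvergenceGrace = 5 * time.Minute

// handleTargetDomainNotActive returns the error (so the task framework
// retries the task) while the task is young enough that the error may just
// be a stale domain cache; only after the grace window does it drop the task.
func handleTargetDomainNotActive(err error, taskSubmitTime time.Time) error {
	if !errors.Is(err, errTargetDomainNotActive) {
		return err
	}
	if time.Since(taskSubmitTime) < domainCacheConvergenceGrace {
		return err // retriable: the domain cache may still be converging
	}
	// The domain really is active in another cluster; drop the task.
	fmt.Println("dropping 'domain-not-active' error as non-retriable:", err)
	return nil
}

func main() {
	// A task submitted just before the failover keeps being retried.
	err := handleTargetDomainNotActive(errTargetDomainNotActive, time.Now())
	fmt.Println("still retriable?", err != nil)
}
```

In this shape, a task that hits a stale cache keeps being retried and eventually succeeds once the cache refreshes, while a task whose domain genuinely lives elsewhere is still dropped after the grace window.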

Reasoning:

[Screenshot: 2024-04-17 at 8:35:03 PM]

I did a failover to this region at precisely the second that this error was emitted:

[Screenshot: 2024-04-17 at 7:49:21 PM]

coveralls commented Apr 18, 2024

Pull Request Test Coverage Report for Build 018ef6cc-7efe-4e66-a292-55d301b12f3b

Details

  • 4 of 4 (100.0%) changed or added relevant lines in 1 file are covered.
  • 80 unchanged lines in 21 files lost coverage.
  • Overall coverage decreased (-0.02%) to 67.607%

Files with Coverage Reduction | New Missed Lines | %
service/history/queue/timer_queue_processor_base.go | 1 | 77.62%
service/history/task/transfer_standby_task_executor.go | 2 | 87.04%
common/task/weighted_round_robin_task_scheduler.go | 2 | 88.56%
common/persistence/sql/sqlplugin/postgres/task.go | 2 | 73.4%
common/persistence/sql/sqlplugin/postgres/db.go | 2 | 80.0%
common/task/fifo_task_scheduler.go | 2 | 87.63%
common/persistence/sql/sqlplugin/mysql/task.go | 2 | 73.68%
service/matching/db.go | 2 | 73.23%
service/history/task/transfer_active_task_executor.go | 2 | 72.72%
common/util.go | 2 | 91.78%
Totals:
Change from base Build 018ef587-aadc-40ad-bafb-8e5e189f2433: -0.02%
Covered Lines: 99001
Relevant Lines: 146435

💛 - Coveralls

codecov bot commented Apr 18, 2024

Codecov Report

Merging #5919 (4059735) into master (cc0805f) will decrease coverage by 0.12%.
The diff coverage is 100.00%.

❗ Current head 4059735 differs from pull request most recent head 89ca4cd. Consider uploading reports for the commit 89ca4cd to get more accurate results

Additional details and impacted files
Files | Coverage Δ
service/history/task/task.go | 78.71% <100.00%> (-0.32%) ⬇️

... and 19 files with indirect coverage changes



Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cc0805f...89ca4cd.

if err == errTargetDomainNotActive {
	t.scope.IncCounter(metrics.TaskTargetNotActiveCounterPerDomain)
	t.logger.Error("Dropping 'domain-not-active' error as non-retriable", tag.Error(err))
	return nil
}
davidporter-id-au (Member, Author) commented:
This is the problem, I think: it's just dropping a task on the floor, which is a problem in a race.

Shaddoll (Member) commented Apr 18, 2024:
If the domain has failed over to the other region, the tasks are not able to start the child workflow in this region, because this region is NOT active. I think even if you retry the task, you'll still get the same error.

taylanisikdemir (Member) commented Apr 18, 2024:
That makes sense, @Shaddoll. In that case, should we remove the time-elapsed check altogether for these domain-not-active related errors?

davidporter-id-au (Member, Author) commented Apr 18, 2024:
The logs show that the error was emitted from the newly active region, in which case I expect it to converge towards being correct on retry.

(For context: in the failover, I was failing up-cm-dca over from PHX back to its original location in DCA at 12:43 PST, and the errors were emitted from DCA. At the exact moment of failover I don't expect all history hosts to have completely up-to-date information about which domains are active, so during a transfer event I would expect a few tasks to be dispatched and picked up by hosts that are out of date.)

A reviewer (Member) commented:
Synced offline, and now I understand that there might be a race condition between failover and the domain cache update on different history hosts.
Approved, since the risk of the change is very low.

Another reviewer (Member) commented:
Got it. This is to avoid dropping the task on the newly active side 👍
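
To make the race discussed in this thread concrete, here is a small, self-contained Go simulation. It is illustrative only and not Cadence code: the domainCache type, the startChildWorkflow function, and the cluster names are stand-ins for the real per-host domain cache and transfer task executor.

```go
package main

import (
	"errors"
	"fmt"
)

var errTargetDomainNotActive = errors.New("domain is not active in this cluster")

// domainCache is a stand-in for the per-history-host domain cache that can
// briefly lag behind a failover.
type domainCache struct{ activeCluster string }

func (c *domainCache) refresh(latest string) { c.activeCluster = latest }

// startChildWorkflow rejects the request when this host's cache still thinks
// another cluster is active; right after a failover that view may be stale.
func startChildWorkflow(c *domainCache, currentCluster string) error {
	if c.activeCluster != currentCluster {
		return errTargetDomainNotActive
	}
	return nil
}

func main() {
	cache := &domainCache{activeCluster: "phx"} // stale: the domain already failed over to dca

	// First attempt on the newly active (dca) side fails with domain-not-active;
	// dropping the task here would lose the child workflow.
	fmt.Println(startChildWorkflow(cache, "dca"))

	cache.refresh("dca") // the cache converges shortly after the failover

	// A retried task now succeeds, which is why the error should be retriable.
	fmt.Println(startChildWorkflow(cache, "dca"))
}
```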

davidporter-id-au enabled auto-merge (squash) April 18, 2024 21:22
davidporter-id-au merged commit b005d32 into cadence-workflow:master Apr 19, 2024
17 of 18 checks passed
davidporter-id-au added a commit that referenced this pull request Jun 27, 2024
## What changed?

This mostly* removes the cross-cluster feature.

## Background
The cross-cluster feature was the ability to launch and interact with child workflows in another domain: it included starting child workflows and signalling them, and it allowed child workflows to be launched in the target domain even if that domain was active in another region.

## Problems
The feature itself was something that very few of our customers apparently needed: few were interested in the problem of launching child workflows in another cluster, and none were unable to simply use an activity to make an RPC call to the other domain, as one would from any normal workflow.
The feature was also quite resource-intensive: it was pull-based, spinning up a polling stack which polled the other cluster for work, similar to the replication stack. This polling behaviour made the latency characteristics fairly unpredictable and used considerable DB resources, to the point that we simply turned it off. The Uber/Cadence team resolved that, were there sufficient demand for the feature in the future, a push-based mechanism would probably be significantly preferable.
The feature also added a nontrivial amount of complexity to the codebase in a few areas, such as task processing and domain error handling, which introduced difficult-to-understand bugs such as the child-workflow-dropping error #5919.

## Decision to deprecate and alternatives
As of the June 2024 releases, the feature will be removed. The Cadence team is not aware of any users of the feature outside Uber (it was broken until mid-2021 anyway), but as an FYI, it will cease to be available.

If this behaviour is desirable, an easy workaround is as previously mentioned: Use an activity to launch or signal the workflows in the other domain and block as needed.

## PR details
This is a fairly high-risk refactor so it'll take some time to land. Broadly it:

- Entirely removes the cross-cluster feature and behaviour from workflow execution.
- Leaves the API, enums, and persistence layer untouched. The intention is that a follow-up PR will remove the persistence-layer parts of the cross-cluster feature.

## Notable callouts
- This likely fixes a few bugs around failovers, as the current cross-cluster behaviour treats domain-not-active errors as errors to swallow, which is a clear race condition.
- The sheer complexity of the code added probably also contributes to errors between parent and child workflows; this is a large simplification.