Fix task agressive retry with TwoPhaseRetryPolicy #3369

vancexu · 2020-07-01T20:11:38Z

What changed?
Replace the task processor's retry policy with a new TwoPhaseRetryPolicy.

Why?
Currently, task retry uses the following retry policy:
initial interval 50ms; expiration interval 30s; max interval 10s
In the task processor this policy leads to 12 attempts in 30s, and during an overloaded task outage the policy is repeated intensively (for example, a 5 min outage causes 100+ retries, which makes things worse).

We have to retry forever so that we do not lose tasks, but during an overloaded task outage we don't want meaningless intensive retries.
So this PR adds a TwoPhaseRetryPolicy that retries 3 times very quickly, then retries slowly in a second phase.
This way, failed task retries are limited during an outage (the same 5 min outage will now have 11 retries).

The parameters are based on observed metrics:
p99 task latency is 200ms; we have almost never seen task failures outside of an outage.
We can consider making them configurable later.

How did you test it?
unit test

Potential risks
Task latency may increase in rare failure cases.

yycptt · 2020-07-01T23:03:06Z

service/history/taskProcessor.go

@@ -116,7 +116,7 @@ func newTaskProcessor(
 		domainMetricsScopeCache: shard.GetService().GetDomainMetricsScopeCache(),
 		timeSource:              shard.GetTimeSource(),
 		workerNotificationChans: workerNotificationChans,
-		retryPolicy:             common.CreatePersistanceRetryPolicy(),
+		retryPolicy:             backoff.NewTwoPhaseRetryPolicy(),


Do you plan to use this two-phase retry policy for other components? This task processor will be deprecated soon as we switch to the priority task processor, which uses a very different retry policy for tasks. You can find the new retry policy in common/util.go.

For now, I mainly created it for this processor to alleviate the incident. There is no plan to add it to the priority task processor before the new processor goes live.
It can be used in other components that retry forever, but it would be better to have some metrics before switching to this policy.

vancexu added 2 commits June 30, 2020 16:15

Fix typo

bbce66a

Fix task agressive retry with TwoPhaseRetryPolicy

fa2f28d

vancexu requested review from yycptt and a team July 1, 2020 20:11

Merge branch 'master' into retryf

7b08850

yycptt reviewed Jul 1, 2020

yycptt approved these changes Jul 2, 2020

vancexu merged commit 686f812 into master Jul 2, 2020

vancexu deleted the retryf branch July 2, 2020 21:28

vancexu added a commit that referenced this pull request Jul 2, 2020

Fix task agressive retry with TwoPhaseRetryPolicy (#3369)

70c2881

yux0 pushed a commit to yux0/cadence that referenced this pull request May 4, 2021

Fix task agressive retry with TwoPhaseRetryPolicy (cadence-workflow#3369)

df6e0d6
