Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix task agressive retry with TwoPhaseRetryPolicy #3369

Merged
merged 3 commits into from
Jul 2, 2020
Merged

Fix task agressive retry with TwoPhaseRetryPolicy #3369

merged 3 commits into from
Jul 2, 2020

Conversation

vancexu
Copy link
Contributor

@vancexu vancexu commented Jul 1, 2020

What changed?
Change retry policy for task processor with a new TwoPhaseRetryPolicy.

Why?
Currently, task retry is using retry policy:
initial interval 50ms; expire interval 30s; max internal 10s
such retry in task processor lead to 12 attempts in 30s, and during a overloaded task outage, it will intensively repeat such policy (for example, 5 min outage will cause 100+ retries which make things worse)

We have to retry forever to not lose task, but during overloaded task outage, we don't want meaningless intensive retry.
So this PR add a TwoPhaseRetryPolicy, that support retry 3 time really quick, then slowly retry in second phase.
By this way, failed task retry will be limited during outage (same 5 min outage will now have 11 retries)

The parameter is based on observed metrics:
p99 task latency 200ms; we almost never seen task failures unless outage.
We can consider make it configurable later.

How did you test it?
unit test

Potential risks
task latency increase in rare failure cases.

@vancexu vancexu requested review from yycptt and a team July 1, 2020 20:11
@@ -116,7 +116,7 @@ func newTaskProcessor(
domainMetricsScopeCache: shard.GetService().GetDomainMetricsScopeCache(),
timeSource: shard.GetTimeSource(),
workerNotificationChans: workerNotificationChans,
retryPolicy: common.CreatePersistanceRetryPolicy(),
retryPolicy: backoff.NewTwoPhaseRetryPolicy(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you plan to use this two phase retry policy for other components? This task processor will be deprecated soon as we switch to the priority task processor, which is using a very different retry policy for tasks. You can find the new retry policy in common/util.go

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, major created it for this processor to alleviate incident. No plan to put it to priority task processor before new processor goes alive.
It can be used in other components where retry forever happens, but would be better to have some metrics before switching to this policy.

@vancexu vancexu merged commit 686f812 into master Jul 2, 2020
@vancexu vancexu deleted the retryf branch July 2, 2020 21:28
yux0 pushed a commit to yux0/cadence that referenced this pull request May 4, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants