Fix aggressive task retry with TwoPhaseRetryPolicy #3369
Merged
What changed?
Replace the task processor's retry policy with a new TwoPhaseRetryPolicy.
Why?
Currently, task retry uses the following policy:
initial interval 50ms; expiration interval 30s; max interval 10s
In the task processor, this policy leads to 12 attempts in 30s, and during an overload outage the policy is repeated back to back (for example, a 5 min outage causes 100+ retries, which makes things worse).
We have to retry forever so that no task is lost, but during an overload outage we don't want pointless intensive retries.
So this PR adds a TwoPhaseRetryPolicy, which retries 3 times in quick succession and then retries slowly in a second phase.
This way, retries of a failed task are limited during an outage (the same 5 min outage now causes 11 retries).
The parameters are based on observed metrics:
p99 task latency is 200ms; we almost never see task failures outside of an outage.
We can consider making them configurable later.
How did you test it?
Unit tests.
Potential risks
Task latency increases in rare failure cases.