Skip to content
This repository has been archived by the owner on Apr 1, 2023. It is now read-only.

Fix retry status changing to failed #134

Merged
merged 2 commits into from
Jun 19, 2018
Merged

Fix retry status changing to failed #134

merged 2 commits into from
Jun 19, 2018

Conversation

lazyatom
Copy link
Contributor

Based on the approach in #131, although refactored to hopefully more clearly explain what's going on within Sidekiq, and most importantly with a fix for the intermittent build failure (hopefully!). See the commit messages for more details.

Unlike #131, I found that the retry_count issue was also exhibited by Sidekiq 4, so my fix includes an offset for that too.

Fixes #125

The `retry_count` number that Sidekiq makes available to retried jobs
can be a little confusing. It's actually the number of retries that we
have completed, minus one. It's easiest to understand via an example.

Given a job with `retries: 3`, we'll see the following pattern:

- initial run (fails), middleware was called with `msg['retry_count']`
  not present
- first retry (fails), middleware was called with `msg['retry_count']
  == 0`
- second retry (fails), middleware was called with `msg['retry_count']
  == 1`
- third retry (fails), middleware was called with `msg['retry_count']
  == 2`

The third retry is the final one, after which we _should_ be setting
the status to `:failed`, but the retry count at that point is `2`,
which is less than the value `retry_attempts_from` will return (`3`),
and so the status remained `:retrying`.

Part of the reason this is confusing is because this `retry_count`
value feels like it should reflect the current state of the job after
it fails, but it's actually the same value right even before the job
executes. The right way of thinking about it is something like "this
was retry attempt #0", and the first retry is indexed `0`, somewhat
confusingly.

This behaviour appears to be present in Sidekiq 4 and 5, so to make
things work again, we can add an offset when running under those
versions, to make the comparison work and correctly set the status to
`:failed` after the final retry does not succeed.

I've tried to extract this into private methods that make things a bit
clearer.
After introducing a job that does a retry, we'd occasionally see the
tests fail because of a timeout. This is because of the random factor
involved in the delay of the retry. The retry amount is calculate as

    (count ** 4) + 15 + (rand(30)*(count+1))

and so the maximum delay for the first try would be

    (0 ** 4) + 15 + (29*(0+1)) == 44

This is greater than the timeout we use in tests, causing the build to
fail. Without a way of configuring the exponential backoff, the only
way to fix this while still including tests that do retry a job is by
increasing the timeout.

I've picked 60 as a fairly arbitrary round number that's more than 44.
@kenaniah kenaniah merged commit 881c702 into utgarda:master Jun 19, 2018
@lazyatom lazyatom deleted the fix-retry-status-changing-to-failed branch June 20, 2018 12:05
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Status never seems to set to failed after retries complete
2 participants