Fix retry status changing to failed #134

lazyatom · 2018-06-19T14:39:16Z

Based on the approach in #131, although refactored to hopefully more clearly explain what's going on within Sidekiq, and most importantly with a fix for the intermittent build failure (hopefully!). See the commit messages for more details.

Unlike #131, I found that the retry_count issue was also exhibited by Sidekiq 4, so my fix includes an offset for that too.

Fixes #125

The `retry_count` number that Sidekiq makes available to retried jobs can be a little confusing. It's actually the number of retries that we have completed, minus one. It's easiest to understand via an example. Given a job with `retries: 3`, we'll see the following pattern: - initial run (fails), middleware was called with `msg['retry_count']` not present - first retry (fails), middleware was called with `msg['retry_count'] == 0` - second retry (fails), middleware was called with `msg['retry_count'] == 1` - third retry (fails), middleware was called with `msg['retry_count'] == 2` The third retry is the final one, after which we _should_ be setting the status to `:failed`, but the retry count at that point is `2`, which is less than the value `retry_attempts_from` will return (`3`), and so the status remained `:retrying`. Part of the reason this is confusing is because this `retry_count` value feels like it should reflect the current state of the job after it fails, but it's actually the same value right even before the job executes. The right way of thinking about it is something like "this was retry attempt #0", and the first retry is indexed `0`, somewhat confusingly. This behaviour appears to be present in Sidekiq 4 and 5, so to make things work again, we can add an offset when running under those versions, to make the comparison work and correctly set the status to `:failed` after the final retry does not succeed. I've tried to extract this into private methods that make things a bit clearer.

After introducing a job that does a retry, we'd occasionally see the tests fail because of a timeout. This is because of the random factor involved in the delay of the retry. The retry amount is calculate as (count ** 4) + 15 + (rand(30)*(count+1)) and so the maximum delay for the first try would be (0 ** 4) + 15 + (29*(0+1)) == 44 This is greater than the timeout we use in tests, causing the build to fail. Without a way of configuring the exponential backoff, the only way to fix this while still including tests that do retry a job is by increasing the timeout. I've picked 60 as a fairly arbitrary round number that's more than 44.

lazyatom added 2 commits June 19, 2018 15:17

lazyatom mentioned this pull request Jun 19, 2018

Fix problem of status never set to failed #131

Closed

kenaniah merged commit 881c702 into utgarda:master Jun 19, 2018

lazyatom deleted the fix-retry-status-changing-to-failed branch June 20, 2018 12:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix retry status changing to failed #134

Fix retry status changing to failed #134

lazyatom commented Jun 19, 2018

Fix retry status changing to failed #134

Fix retry status changing to failed #134

Conversation

lazyatom commented Jun 19, 2018