Skip to content
This repository has been archived by the owner on Apr 1, 2023. It is now read-only.

Status never seems to set to failed after retries complete #125

Closed
lazyatom opened this issue Feb 20, 2018 · 2 comments · Fixed by #134
Closed

Status never seems to set to failed after retries complete #125

lazyatom opened this issue Feb 20, 2018 · 2 comments · Fixed by #134

Comments

@lazyatom
Copy link
Contributor

Hi there,

We're having some trouble getting Sidekiq::Status to return a failed status after a job completes its number of retries. I have a test worker as follows:

class TestWorker
  include Sidekiq::Worker
  include Sidekiq::Status::Worker

  sidekiq_options retry: 2

  def perform
    raise "Just fail"
  end
end

I've also added a debug line into the sidekiq_status/lib/server_middleware.rb file, just after an exception is raised:

# ... inside Sidekiq::Status::ServerMiddleware#call ...
rescue Exception
  p {status: status, msg: msg, retry_attempts: retry_attempts_from(msg['retry'], DEFAULT_MAX_RETRY_ATTEMPTS)}
  status = :failed
    if msg['retry']
# ... etc ...

I then queued up a job and monitored the sidekiq server process to see the logging. As expected, the job was tried three times, and here are the extracted debug lines:

First attempt

{:status=>:failed, :msg=>{"class"=>"TestWorker", "args"=>[], "retry"=>2, "queue"=>"default", "jid"=>"e728c12a05bf477ae35ce5e5", "created_at"=>1519129585.42536, "enqueued_at"=>1519129585.428743}, :retry_attempts=>2}

Sidekiq logs warning excerpt just after this point

2018-02-20T12:26:26.269Z 23231 TID-oxaaxbq6c TestWorker JID-e728c12a05bf477ae35ce5e5 DEBUG: Failure! Retry 0 in 27 seconds
2018-02-20T12:26:26.269Z 23231 TID-oxaaxbq6c TestWorker JID-e728c12a05bf477ae35ce5e5 INFO: fail: 0.06 sec
2018-02-20T12:26:26.270Z 23231 TID-oxaaxbq6c WARN: {"context":"Job raised exception","job":{"class":"TestWorker","args":[],"retry":2,"queue":"default","jid":"e728c12a05bf477ae35ce5e5","created_at":1519129585.42536,"enqueued_at":1519129585.428743,"error_message":"Just fail","error_class":"RuntimeError","failed_at":1519129586.2693021,"retry_count":0},"jobstr":"{\"class\":\"TestWorker\",\"args\":[],\"retry\":2,\"queue\":\"default\",\"jid\":\"e728c12a05bf477ae35ce5e5\",\"created_at\":1519129585.42536,\"enqueued_at\":1519129585.428743}"}
2018-02-20T12:26:26.270Z 23231 TID-oxaaxbq6c WARN: RuntimeError: Just fail

Second attempt

{:status=>:failed, :msg=>{"class"=>"TestWorker", "args"=>[], "retry"=>2, "queue"=>"default", "jid"=>"e728c12a05bf477ae35ce5e5", "created_at"=>1519129585.42536, "enqueued_at"=>1519129618.381927, "error_message"=>"Just fail", "error_class"=>"RuntimeError", "failed_at"=>1519129586.2693021, "retry_count"=>0}, :retry_attempts=>2}

Sidekiq log excerpt just after this point

2018-02-20T12:26:58.446Z 23231 TID-oxaaxbv1w TestWorker JID-e728c12a05bf477ae35ce5e5 DEBUG: Failure! Retry 1 in 46 seconds
2018-02-20T12:26:58.446Z 23231 TID-oxaaxbv1w TestWorker JID-e728c12a05bf477ae35ce5e5 INFO: fail: 0.064 sec
2018-02-20T12:26:58.447Z 23231 TID-oxaaxbv1w WARN: {"context":"Job raised exception","job":{"class":"TestWorker","args":[],"retry":2,"queue":"default","jid":"e728c12a05bf477ae35ce5e5","created_at":1519129585.42536,"enqueued_at":1519129618.381927,"error_message":"Just fail","error_class":"RuntimeError","failed_at":1519129586.2693021,"retry_count":1,"retried_at":1519129618.446338},"jobstr":"{\"class\":\"TestWorker\",\"args\":[],\"retry\":2,\"queue\":\"default\",\"jid\":\"e728c12a05bf477ae35ce5e5\",\"created_at\":1519129585.42536,\"enqueued_at\":1519129618.381927,\"error_message\":\"Just fail\",\"error_class\":\"RuntimeError\",\"failed_at\":1519129586.2693021,\"retry_count\":0}"}

2018-02-20T12:26:58.447Z 23231 TID-oxaaxbv1w WARN: RuntimeError: Just fail

Third attempt

{:status=>:failed, :msg=>{"class"=>"TestWorker", "args"=>[], "retry"=>2, "queue"=>"default", "jid"=>"e728c12a05bf477ae35ce5e5", "created_at"=>1519129585.42536, "enqueued_at"=>1519129680.062838, "error_message"=>"Just fail", "error_class"=>"RuntimeError", "failed_at"=>1519129586.2693021, "retry_count"=>1, "retried_at"=>1519129618.446338}, :retry_attempts=>2}

Sidekiq log excerpt just after this point

2018-02-20T12:28:00.332Z 23231 TID-oxaaxbsco TestWorker JID-e728c12a05bf477ae35ce5e5 DEBUG: Retries exhausted for job
2018-02-20T12:28:00.332Z 23231 TID-oxaaxbsco TestWorker JID-e728c12a05bf477ae35ce5e5 INFO: Adding dead TestWorker job e728c12a05bf477ae35ce5e5
2018-02-20T12:28:00.333Z 23231 TID-oxaaxbsco TestWorker JID-e728c12a05bf477ae35ce5e5 INFO: fail: 0.069 sec
2018-02-20T12:28:00.333Z 23231 TID-oxaaxbsco WARN: {"context":"Job raised exception","job":{"class":"TestWorker","args":[],"retry":2,"queue":"default","jid":"e728c12a05bf477ae35ce5e5","created_at":1519129585.42536,"enqueued_at":1519129680.062838,"error_message":"Just fail","error_class":"RuntimeError","failed_at":1519129586.2693021,"retry_count":2,"retried_at":1519129680.332525},"jobstr":"{\"class\":\"TestWorker\",\"args\":[],\"retry\":2,\"queue\":\"default\",\"jid\":\"e728c12a05bf477ae35ce5e5\",\"created_at\":1519129585.42536,\"enqueued_at\":1519129680.062838,\"error_message\":\"Just fail\",\"error_class\":\"RuntimeError\",\"failed_at\":1519129586.2693021,\"retry_count\":1,\"retried_at\":1519129618.446338}"}
2018-02-20T12:28:00.333Z 23231 TID-oxaaxbsco WARN: RuntimeError: Just fail

From this logging, it looks like retry_count is only added after the second attempt, i.e. the first retry, but it's set to zero, rather than one. On the third attempt (second and final retry), our retry_count is 1, and retry_attempts_from still returns 2, and so the comparison in Sidekiq::Status::ServerMiddleware#call succeeds and Sidekiq::Status will set our status back to retrying, even though Sidekiq itself has actually stopped retrying and moved the job to the dead queue.

I don't think we have anything particularly strange going on in our app or config; We're using Sidekiq Pro, Sidekiq::VERSION is 5.0.4 and Sidekiq::Status::VERSION is 1.0.1. We're also using the sidekiq-scheduler gem, although in my tests here I've disabled it so I don't think it's interfering.

Do you have any ideas what's happening?

@kenaniah
Copy link
Collaborator

kenaniah commented Mar 6, 2018

Indeed. Looks like we may need to subtract 1 from the oddly zero-indexed retry count to balance that comparison as you've pointed out.

Earliest I would be able to get around to fixing is this weekend, but I'm more than happy to merge a fresh PR with a test to cover this scenario.

@perryn
Copy link

perryn commented Apr 4, 2018

Hi, Just letting you know we came across this bug recently too and monkey patched sidekiq-status/server_middleware.rb to fix it. The problem is indeed that the first retry is retry '0' which is the actually the second try

we changed the relevant lines in server_middlerware.rb to

if msg['retry']
        try_count = msg['retry_count'] ? (msg['retry_count'] + 1) : 0
        if try_count < retry_attempts_from(msg['retry'], DEFAULT_MAX_RETRY_ATTEMPTS)
          status = :retrying
        end
end

EDIT: my colleague @judywu29 has prepared a pull request (that also deals with sidekiq version differences).

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants