writer: potential deadlocks on cleanup #13

mreiferson · 2013-10-06T04:54:26Z

When Writer cleans up after a connection error it is possible for there to be a race between the cleanup process (which stops the goroutine responsible for receiving on transactionChan) and new concurrent writes (which send to transactionChan), resulting in potential deadlocks until a new successful connection is made.

This is particularly problematic for nsq_to_nsq, where Writer is used alongside Reader. After seeing a write error, Reader enters backoff mode and sends RDY 1. Because a stray write is blocked on sending to transactionChan, its message stays in flight, so the source nsqd (correctly) stops sending additional messages (RDY 1 == in-flight 1).

This, coupled with the fact that Writer only lazily connects to its destination hosts, means that until the in-flight message times out at the source there would be no additional message delivered to force a re-connect and free up the deadlock.

This fix adds accounting for the number of concurrent writers and ensures that transactionCleanup() keeps receiving on transactionChan until that count reaches 0.

RFR @jphines @jehiah

I'll take payment in 🍻


jphines · 2013-10-08T17:56:23Z

So changes LGTM. I've taken the liberty of deploying this to production, and will get back to you if this successfully resolves some of the issues we have been seeing.

mreiferson · 2013-10-08T18:08:47Z

🔥 🚒 😈

jphines · 2013-10-10T21:41:38Z

Rebase/Squash please.

mreiferson · 2013-10-10T21:43:16Z

no

writer: potential deadlocks on cleanup

mreiferson · 2013-10-10T21:45:04Z

❤️ @jphines


mreiferson added 6 commits October 5, 2013 16:33

writer: s/done/finish

c45d867

writer: return error during connection

0f87305

writer: close the connection if IDENTIFY returns an error response

e7e1fe8

writer: dont defer the transaction cleanup so it happens before we signal the waitgroup

0d23819

writer: better logging/cleanup/comments

963a8ae

jphines pushed a commit that referenced this pull request Oct 10, 2013

Merge pull request #13 from mreiferson/writer_deadlock_13

4d7f86d

writer: potential deadlocks on cleanup

jphines merged commit 4d7f86d into nsqio:master Oct 10, 2013

sthulb pushed a commit to HailoOSS/go-nsq that referenced this pull request Sep 14, 2016

Merge pull request nsqio#13 from hailocab/one_pt_oh

e76909f

One pt oh
