Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

writer: potential deadlocks on cleanup #13

Merged
merged 6 commits into from
Oct 10, 2013

Conversation

mreiferson
Copy link
Member

When Writer cleans up after a connection error it is possible for there to be a race between the cleanup process (which stops the goroutine responsible for receiving on transactionChan) and new concurrent writes (which send to transactionChan), resulting in potential deadlocks until a new successful connection is made.

This is particularly problematic for nsq_to_nsq where Writer is used alongside Reader. After seeing an error writing, Reader enters backoff mode and sends RDY 1. Because a stray write is blocked on sending to transactionChan the source nsqd would (correctly) stop sending additional messages (RDY 1 == in-flight 1).

This, coupled with the fact that Writer only lazily connects to its destination hosts, means that until the in-flight message times out at the source there would be no additional message delivered to force a re-connect and free up the deadlock.

This fix adds accounting for the # of concurrent writers and ensures that transactionCleanup() recvs on transactionChan until that count reaches 0.

RFR @jphines @jehiah

I'll take payment in 🍻

When Writer cleans up after a connection error
it is possible for there to be a race between the cleanup
process (which stops the goroutine responsible for receiving
on transactionChan) and new concurrent writes (which send to
transactionChan), resulting in potential deadlocks until
a new successful connection is made.

This is exemplified in nsq_to_nsq, where Writer is used
alongside Reader such that, after seeing an error writing,
Reader enters backoff mode and sends RDY 1...

Because a stray write would be blocked on sending
to transactionChan the source nsqd would (correctly) stop
sending additional messages (RDY 1 == in-flight 1).

This, coupled with the fact that Writer only lazily connects
to its destination hosts, meant that until that in-flight
message timed out at the source nsqd there would be no
additional message delivered to force a re-connect and
free up the deadlock.

This fix adds accounting for the # of concurrent writers
and ensures that transactionCleanup() recvs on transactionChan
until that count reaches 0.
@jphines
Copy link
Contributor

jphines commented Oct 8, 2013

So changes LGTM. I've taken the liberty of deploying this to production, and will get back to you if this successfully resolves some of the issues we have been seeing.

@mreiferson
Copy link
Member Author

🔥 🚒 😈

@jphines
Copy link
Contributor

jphines commented Oct 10, 2013

Rebase/Squash please.

@mreiferson
Copy link
Member Author

no

jphines pushed a commit that referenced this pull request Oct 10, 2013
writer: potential deadlocks on cleanup
@jphines jphines merged commit 4d7f86d into nsqio:master Oct 10, 2013
@mreiferson
Copy link
Member Author

❤️ @jphines

sthulb pushed a commit to HailoOSS/go-nsq that referenced this pull request Sep 14, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants