fix brokerProducer goroutine leak #1442
Thanks for doing this. Looks like there's a bunch of legit failures 🤔
@bai yeah I need to analyze that Travis output because it wasn't deterministic (a rebuild had different results to the previous build). I'll update the PR as I work through them
Every creation of a brokerProducer spins off a goroutine for the run() loop func, but this appeared to just be abandoned if the brokerProducer was shut down, and would manifest as orphaned goroutine(s) with a long select time, leaking over time, e.g.:

```
goroutine 6982 [select, 1868 minutes]:
github.com/Shopify/sarama.(*brokerProducer).run(0xc420c4ac00)
	/home/travis/build/org/repo/vendor/src/github.com/Shopify/sarama/async_producer.go:672 +0x258
github.com/Shopify/sarama.(*brokerProducer).(github.com/Shopify/sarama.run)-fm()
	/home/travis/build/org/repo/vendor/src/github.com/Shopify/sarama/async_producer.go:622 +0x2a
github.com/Shopify/sarama.withRecover(0xc4207aa3f0)
	/home/travis/build/org/repo/vendor/src/github.com/Shopify/sarama/utils.go:45 +0x43
created by github.com/Shopify/sarama.(*asyncProducer).newBrokerProducer
	/home/travis/build/org/repo/vendor/src/github.com/Shopify/sarama/async_producer.go:622 +0x1b8
```

Tidied up some chan waits that weren't checking the ok state, and also added an explicit stopchan for the run loop and a unit test to cover the leak case.

Signed-off-by: Dominic Evans <[email protected]>
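For readers unfamiliar with the pattern, here is a minimal sketch of a run loop that exits both on a closed input channel (by checking the `ok` state) and on an explicit stop channel. The `worker` type, its fields, and `newWorker`/`stop` are hypothetical stand-ins for illustration, not sarama's actual brokerProducer code.

```go
package main

import (
	"fmt"
	"sync"
)

// worker is a hypothetical stand-in for a type that owns a run() goroutine.
type worker struct {
	input    chan int
	stopchan chan struct{}
	wg       sync.WaitGroup
	seen     []int // only touched by run(); read after stop() returns
}

// newWorker starts the run loop, mirroring "every creation spins off a goroutine".
func newWorker() *worker {
	w := &worker{
		input:    make(chan int),
		stopchan: make(chan struct{}),
	}
	w.wg.Add(1)
	go w.run()
	return w
}

// run exits when input is closed (ok == false) or when stopchan is closed.
func (w *worker) run() {
	defer w.wg.Done()
	for {
		select {
		case v, ok := <-w.input:
			if !ok {
				return // input closed: do not spin on zero values
			}
			w.seen = append(w.seen, v)
		case <-w.stopchan:
			return // explicit shutdown: the goroutine is not orphaned
		}
	}
}

// stop signals the run loop and waits for it to finish.
func (w *worker) stop() {
	close(w.stopchan)
	w.wg.Wait()
}

func main() {
	w := newWorker()
	w.input <- 1
	w.input <- 2
	w.stop()
	fmt.Println(w.seen) // prints [1 2]
}
```

Waiting on the WaitGroup in `stop` is one way to make the shutdown observable in a test; whether the real change waits or only signals is not shown here.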
@bai I tracked down the issue (the abandoned channel is intentionally closed to trigger abandonment). I also ended up pulling in github.com/fortytw2/leaktest as a test-only dependency. It is better than my home-rolled NumGoroutine counting because it handles things like status polling and filtering out goroutines it should ignore.
The functional tests seem to have hovered around the 4 minute mark for a while now. I think bumping the timeout up to 6m should make the Travis runs less brittle.
- the existing coverage.txt concatenation included multiple `mode: atomic` lines which fail to parse correctly in the `go tool cover` tooling, so update Makefile so only one modeline exists
- add a call to `go tool cover -func coverage.txt` to output a per-function summary of the coverage in the Travis build log
@@ -652,6 +651,7 @@ func (p *asyncProducer) newBrokerProducer(broker *Broker) *brokerProducer {
 		input:     input,
 		output:    bridge,
 		responses: responses,
+		stopchan:  make(chan struct{}),
I wonder if `context` is better, but I reckon that could be way more work.
No, that's a reasonable suggestion, and I don't think it'd be any more work. Code-wise, the only difference is we'd select on `<-ctx.Done()` rather than `<-stopchan`.
up to you, I am fine either way for now.
@varun06 ok, please go ahead and merge for now then if you're happy. I might revisit under a separate PR putting context everywhere
@@ -5,11 +5,11 @@ default: fmt vet errcheck test lint
 # Taken from https://github.com/codecov/example-go#caveat-multiple-files
 .PHONY: test
 test:
-	echo "" > coverage.txt
+	echo "mode: atomic" > coverage.txt
for my curiosity, why was it required?
@varun06 I'm glad you asked :) it was described in the commit message, but happy to repeat it here:
the existing coverage.txt concatenation included multiple `mode: atomic` lines which fail to parse correctly in the `go tool cover` tooling, so update Makefile so only one modeline exists
🤦♂ I just had my tea :)
Thanks for reiterating.
@@ -9,6 +9,9 @@ import (
 	"sync/atomic"
 	"testing"
 	"time"
+
+	"github.com/fortytw2/leaktest"
TIL
😍
LGTM