Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix brokerProducer goroutine leak #1442

Merged
merged 3 commits into from
Aug 8, 2019
Merged

fix brokerProducer goroutine leak #1442

merged 3 commits into from
Aug 8, 2019

Conversation

dnwe
Copy link
Collaborator

@dnwe dnwe commented Jul 31, 2019

Every create a of a brokerProducer spins off a goroutine for the run()
loop func, but this appeared to just be abandoned if the brokerProducer
was shutdown and would manifest as orphaned goroutine(s) with a long
select time leaking over time:

e.g.,

goroutine 6982 [select, 1868 minutes]:
github.com/Shopify/sarama.(*brokerProducer).run(0xc420c4ac00)
	/home/travis/build/org/repo/vendor/src/github.com/Shopify/sarama/async_producer.go:672 +0x258
github.com/Shopify/sarama.(*brokerProducer).(github.com/Shopify/sarama.run)-fm()
	/home/travis/build/org/repo/vendor/src/github.com/Shopify/sarama/async_producer.go:622 +0x2a
github.com/Shopify/sarama.withRecover(0xc4207aa3f0)
	/home/travis/build/org/repo/vendor/src/github.com/Shopify/sarama/utils.go:45 +0x43
created by github.com/Shopify/sarama.(*asyncProducer).newBrokerProducer
	/home/travis/build/org/repo/vendor/src/github.com/Shopify/sarama/async_producer.go:622 +0x1b8

Tidied up some chan waits that weren't checking the ok state and also
added an explicit stopchan for the run loop and a unittest to cover the
leak case.

@bai
Copy link
Contributor

bai commented Aug 1, 2019

Thanks for doing this. Looks like there's a bunch of legit failures 🤔

@dnwe
Copy link
Collaborator Author

dnwe commented Aug 1, 2019

@bai yeah I need to analyze that Travis output because it wasn't deterministic (a rebuild had different results to the previous build). I'll update the PR as I work through them

Every create a of a brokerProducer spins off a goroutine for the run()
loop func, but this appeared to just be abandoned if the brokerProducer
was shutdown and would manifest as orphaned goroutine(s) with a long
select time leaking over time:

e.g.,
```
goroutine 6982 [select, 1868 minutes]:
github.com/Shopify/sarama.(*brokerProducer).run(0xc420c4ac00)
	/home/travis/build/org/repo/vendor/src/github.com/Shopify/sarama/async_producer.go:672 +0x258
github.com/Shopify/sarama.(*brokerProducer).(github.com/Shopify/sarama.run)-fm()
	/home/travis/build/org/repo/vendor/src/github.com/Shopify/sarama/async_producer.go:622 +0x2a
github.com/Shopify/sarama.withRecover(0xc4207aa3f0)
	/home/travis/build/org/repo/vendor/src/github.com/Shopify/sarama/utils.go:45 +0x43
created by github.com/Shopify/sarama.(*asyncProducer).newBrokerProducer
	/home/travis/build/org/repo/vendor/src/github.com/Shopify/sarama/async_producer.go:622 +0x1b8
```

Tidied up some chan waits that weren't checking the ok state and also
added an explicit stopchan for the run loop and a unittest to cover the
leak case.

Signed-off-by: Dominic Evans <[email protected]>
@dnwe
Copy link
Collaborator Author

dnwe commented Aug 1, 2019

@bai I tracked down the issue (the abandoned channel is intentionally closed to trigger abandonment). I also ended up pulling in github.com/fortytw2/leaktest as a test-only dependency as it was better than my home-rolled NumGoroutine counting as it does things like status polling and filtering out routines it should ignore etc.

@dnwe dnwe closed this Aug 2, 2019
@dnwe dnwe reopened this Aug 2, 2019
dnwe added 2 commits August 2, 2019 13:12
The functional tests seem to have hovered around the 4 minute mark for a
while now. I think bumping the timeout up to 6m should make the Travis
runs less brittle.
- the existing coverage.txt concatenation included multiple `mode:
  atomic` lines which fail to parse correctly in the `go tool cover`
  tooling, so update Makefile so only one modeline exists

- add a call to `go tool cover -func coverage.txt` to output a
  per-function summary of the coverage in the Travis build log
@dnwe dnwe closed this Aug 2, 2019
@dnwe dnwe reopened this Aug 2, 2019
@@ -652,6 +651,7 @@ func (p *asyncProducer) newBrokerProducer(broker *Broker) *brokerProducer {
input: input,
output: bridge,
responses: responses,
stopchan: make(chan struct{}),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if if context is better, but reckon that could be way more work.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No that's a reasonable suggestion, and I don't think it'd be any more work. Code wise the only difference is we'd select on <--ctx.Done() rather than <--stopchan

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

up to you, I am fine either way for now.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@varun06 ok, please go ahead and merge for now then if you're happy. I might revisit under a separate PR putting context everywhere

@@ -5,11 +5,11 @@ default: fmt vet errcheck test lint
# Taken from https://github.com/codecov/example-go#caveat-multiple-files
.PHONY: test
test:
echo "" > coverage.txt
echo "mode: atomic" > coverage.txt
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for my curiosity, why was it required?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@varun06 I'm glad you asked :) it was described in the commit message, but happy to repeat it here:

the existing coverage.txt concatenation included multiple mode: atomic lines which fail to parse correctly in the go tool cover
tooling, so update Makefile so only one modeline exists

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤦‍♂ I just had my tea :)
Thanks for reiterating.

@@ -9,6 +9,9 @@ import (
"sync/atomic"
"testing"
"time"

"github.com/fortytw2/leaktest"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😍

Copy link
Contributor

@varun06 varun06 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@varun06 varun06 merged commit 9f8650a into IBM:master Aug 8, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants