
panic: sync: negative WaitGroup counter #449

Closed · cep21 opened this issue May 14, 2015 · 38 comments
@cep21 commented May 14, 2015

Seeing this occasionally in production:

Revision: 23d5233

panic: sync: negative WaitGroup counter

goroutine 444 [running]:
sync.(*WaitGroup).Add(0xc208e6c510, 0xffffffffffffffff)
        /usr/src/go/src/sync/waitgroup.go:67 +0x96
sync.(*WaitGroup).Done(0xc208e6c510)
        /usr/src/go/src/sync/waitgroup.go:85 +0x31
github.com/Shopify/sarama.(*asyncProducer).returnSuccesses(0xc208e6c4d0, 0xc29ffe9600, 0x3f, 0x40)
        /opt/goproj/Godeps/_workspace/src/github.com/Shopify/sarama/async_producer.go:753 +0xd3
github.com/Shopify/sarama.(*asyncProducer).flusher(0xc208e6c4d0, 0xc20814c540, 0xc2080b9800)
        /opt/goproj/Godeps/_workspace/src/github.com/Shopify/sarama/async_producer.go:560 +0xafe
github.com/Shopify/sarama.func·006()
        /opt/goproj/Godeps/_workspace/src/github.com/Shopify/sarama/async_producer.go:430 +0x43
github.com/Shopify/sarama.withRecover(0xc20802b220)
        /opt/goproj/Godeps/_workspace/src/github.com/Shopify/sarama/utils.go:42 +0x3a
created by github.com/Shopify/sarama.(*asyncProducer).messageAggregator
        /opt/goproj/Godeps/_workspace/src/github.com/Shopify/sarama/async_producer.go:430 +0x205
@wvanbergen (Contributor)

Is this while running normally, or when closing the producer?

@eapache (Contributor) commented May 14, 2015

What configuration are you running with? What do you see in the sarama logs when this occurs? I took a quick look and didn't see any obvious cases where the counter could get out of sync.

@cep21 (Author) commented May 14, 2015

This is between 3 and 8 minutes of running in production normally. We have two tiers of production traffic and only one was seeing this behavior (the one that gets the most traffic).

There is no information in stdout or stderr. There is nothing in the logs other than messages of the form "client/metadata fetching metadata for all topics from broker kafka17.blah.blah:9092", "Connected to broker at kafka18--bbac.blah:9092 (registered as #10)" and "producer/flusher/12 starting up"

We have reverted to version 3534171 and things are working great now

@eapache (Contributor) commented May 14, 2015

  • What values were you setting in the sarama.Config?
  • There is really nothing else in the logs immediately prior to the panic? No lines starting with producer/leader/ or producer/flusher/?
  • What version of Go are you using?

@cep21 (Author) commented May 14, 2015

Config:

    // Not pointers so we can deep copy the config
    conf := f.defaultSaramaConfig

    conf.ClientID = f.clientID

    conf.Net.DialTimeout = time.Second * 30

    conf.ChannelBufferSize = 256

    conf.Producer.RequiredAcks = sarama.WaitForLocal
    conf.Producer.Return.Errors = true
    conf.Producer.Return.Successes = true
    conf.Producer.Partitioner = partitioner
    conf.Producer.Flush.Frequency = time.Millisecond * 250
    conf.Producer.Flush.Messages = 3000
    conf.Producer.Flush.Bytes = 83886080

I see earlier lines of the form producer/flusher/12 starting up, but none matching producer/leader/.

< go version
go version go1.4.2 darwin/amd64
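
For reference, a config like the one above would typically be handed straight to sarama.NewAsyncProducer; a minimal sketch (the client ID and broker address are hypothetical, and only a subset of the settings above is repeated):

    package main

    import (
        "log"

        "github.com/Shopify/sarama"
    )

    func main() {
        conf := sarama.NewConfig()
        conf.ClientID = "example-client" // hypothetical
        conf.Producer.RequiredAcks = sarama.WaitForLocal
        conf.Producer.Return.Errors = true
        conf.Producer.Return.Successes = true

        producer, err := sarama.NewAsyncProducer([]string{"kafka1.example.com:9092"}, conf)
        if err != nil {
            log.Fatalln("failed to start producer:", err)
        }
        defer producer.AsyncClose() // shut down when done (no messages sent in this sketch)
    }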

@eapache (Contributor) commented May 14, 2015

I have gone through the producer code and double-checked every case where we touch the wait group in question; they all appear sane.

It would not terribly surprise me if there were some subtle bug in the retry logic (that is a fairly complex state machine), but anything touching that code path would generate logs which you are not seeing. The happy path in this case is much simpler and easily verified.

I suppose this could be a bug with Go's waitgroup code, but that seems unlikely; it is widely used, and you are using a normal stable version of Go.

How much traffic is seen at the tier where this occurs? We are running 68beb72 (v1.4.0) built with Go 1.4.2 on several hundred nodes, some of which are handling fairly substantial message throughput. If this was a common issue for you I'm very surprised we haven't run into it yet.

I'm at a bit of a loss; could you provide the traces of any other sarama goroutines from the panic output? Maybe they contain a clue...

@wvanbergen (Contributor)

In your sarama logs, do you see the producer retrying messages? A log snippet of a run that ends up with a crash may be useful.

@cep21 (Author) commented May 14, 2015

For what it's worth, the new version was working on some of our kafka tiers, just not one of them (where it got the panic).

I can email you our full stack trace from the panic, but don't want to put it on a public ticket.

@eapache (Contributor) commented May 14, 2015

Please email me then. My email is on my github profile.

@eapache (Contributor) commented May 14, 2015

OK, I'm gonna take a step back and walk through my logic: the counter going negative must mean one of the following three cases:

  • a bug in the counter itself or a memory corruption or something like that
  • we are decrementing it when we shouldn't (e.g. multiple times for a single message)
  • we are not incrementing it when we should (e.g. missing a message that is input)

The first case is not one we can realistically debug, and it is so unlikely I'm just going to ignore it.

For the second case, there are a number of cases I have checked, but it's not trivially verifiable. I'll come back to this one.

For the third case, the logic is pretty simple. The only case we might theoretically be missing a message is if it is submitted to the input channel with the retries counter already non-zero. How are you generating your ProducerMessages, just with a literal or new call? Do you reuse ones that come out as successes or errors, or are all input messages allocated fresh?

@cep21 (Author) commented May 14, 2015

Hi, I emailed you

@eapache (Contributor) commented May 14, 2015

Email received, thanks. I'm updating my comment (which I accidentally submitted only half-typed) with a few other thoughts.

@cep21 (Author) commented May 14, 2015

I never reuse a ProducerMessage.

Callers of my API put messages on a channel and return.

One goroutine drains that channel as fast as it can and puts messages onto Kafka via k.producer.Input() <- messageToSend. Another goroutine drains k.producer.Errors() and k.producer.Successes() to process results.
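
A sketch of that wiring, for concreteness (a hypothetical reconstruction of the pattern described, not the actual application code; topic name and broker address are placeholders):

    package main

    import (
        "log"

        "github.com/Shopify/sarama"
    )

    // feed drains the application's channel and hands each message to the
    // producer. A fresh ProducerMessage is allocated per send and never reused.
    func feed(producer sarama.AsyncProducer, msgs <-chan string) {
        for m := range msgs {
            producer.Input() <- &sarama.ProducerMessage{
                Topic: "example-topic", // hypothetical topic name
                Value: sarama.StringEncoder(m),
            }
        }
    }

    // drainResults processes both result channels in a single goroutine.
    func drainResults(producer sarama.AsyncProducer) {
        for {
            select {
            case err := <-producer.Errors():
                log.Println("produce failed:", err)
            case <-producer.Successes():
                // record the success
            }
        }
    }

    func main() {
        conf := sarama.NewConfig()
        conf.Producer.Return.Successes = true // errors are returned by default
        producer, err := sarama.NewAsyncProducer([]string{"kafka1.example.com:9092"}, conf)
        if err != nil {
            log.Fatalln(err)
        }
        msgs := make(chan string, 256)
        go feed(producer, msgs)
        go drainResults(producer)
        msgs <- "hello"
        select {}
    }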

@eapache (Contributor) commented May 14, 2015

OK, a few notes from diving into the logs:

@eapache (Contributor) commented May 14, 2015

I assume you haven't seen anything come out of the Errors channel immediately before the panic?

@eapache (Contributor) commented May 14, 2015

I'm afraid I'm out of coherent ideas on this one... maybe I'll have a brainwave overnight.

@cep21 (Author) commented May 14, 2015

Nothing in the logs from the error channel. We do Kafka rebalances at times. Would a misbehaving server cause this issue?

@eapache (Contributor) commented May 14, 2015

Would a misbehaving server cause this issue?

I wouldn't think so, all the error cases for malformed responses etc. are handled.

I'm honestly baffled by this issue; I've not been able to reproduce it, and I can't even think of additional logging or checks that might help me track it down. If I think of anything I'll let you know.

@cep21 (Author) commented May 18, 2015

I've been thinking about this code some more and wonder if the WaitGroup logic should be reverted. I've only used WaitGroup when the use case was dead simple: an Add followed by a goroutine that defers Done() (sketched below). That shape is easy to follow and prove correct.
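
For illustration, that dead-simple shape is just the following (standard library only; process is a hypothetical work function):

    package main

    import (
        "fmt"
        "sync"
    )

    func process(n int) { fmt.Println("processed", n) } // hypothetical work

    func main() {
        var wg sync.WaitGroup
        for i := 0; i < 10; i++ {
            wg.Add(1) // increment before the goroutine starts
            go func(n int) {
                defer wg.Done() // exactly one Done per Add, even if process panics
                process(n)
            }(i)
        }
        wg.Wait() // the counter can never go negative in this shape
    }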

The WaitGroup use inside async_producer is not so simple to follow. Any error in this logic, whether introduced now or six months later after the code has been refactored a few times, results in an unrecoverable panic that kills the user's entire application, and the wait group code isn't strictly necessary. Given that, what are your thoughts on not using WaitGroup at all?

On another note, most WaitGroup usage relies on defer Done(), which makes sure Done is called even during a panic. Is it possible something panicked somewhere and caused either an Add() not to happen or a Done() to happen twice?

@eapache (Contributor) commented May 18, 2015

The WaitGroup logic was a fairly elegant fix for #419 and as per the comments in #420 it is actually simpler to reason about than the code that was present before (which is not to say it is simple, just simpler).

I'd be happy to find an even simpler solution which is not so fragile, but I don't know of one off the top of my head. When you say "the wait group code isn't strictly necessary" you seem to have one in mind - what is it?

Is it possible something panic() somewhere and caused either an Add() not to happen or Done() to happen twice?

No; unless you set sarama.PanicHandler we don't rescue any panics. If some earlier panic had happened, your program would have crashed at that point and the WaitGroup being out-of-sync wouldn't have even come up.
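
For reference, the hook mentioned here is the package-level sarama.PanicHandler; installing one looks roughly like this (a sketch, with logging chosen arbitrarily):

    package main

    import (
        "log"

        "github.com/Shopify/sarama"
    )

    func main() {
        // When PanicHandler is non-nil, sarama's withRecover wrapper hands any
        // recovered panic value to it instead of letting the goroutine crash.
        sarama.PanicHandler = func(p interface{}) {
            log.Printf("recovered panic from a sarama goroutine: %v", p)
        }
        // ... construct clients/producers as usual ...
    }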

@eapache (Contributor) commented Sep 5, 2015

@cep21 I'm curious if you've continued to see this, or if you've simply avoided any release after 1.3? We still have not seen anything like it and I still have no clue how it could be happening.

@cep21 (Author) commented Sep 7, 2015

We haven't upgraded b/c of this issue. I'm going to fork sarama and replace the waitgroup with a debug log that doesn't panic. I'll reply back if I can get it to log in production again.

@cep21 (Author) commented Sep 21, 2015

The fork we're deploying this week is at https://github.com/cep21/sarama (https://github.com/cep21/sarama/commit/da1ad6b748a2e3fbcda969e1e10ea1c9b2b16272) I will update if I can see it logging again.

@wvanbergen (Contributor)

Thanks for investigating this 👍

@cep21 (Author) commented Sep 23, 2015

This is still an issue.

Using https://github.com/cep21/sarama/commit/da1ad6b748a2e3fbcda969e1e10ea1c9b2b16272 within minutes of a deploy I see Had to recover from sarama state machine error: sync: negative WaitGroup counter

Do you want me to email another stack trace?

@cep21 (Author) commented Sep 23, 2015

The problem isn't reproducible on small amounts of traffic, only in production. I wonder if there is some kind of race or gotcha when it suddenly sees a large amount of traffic. Something else special about our deployment: we are using a larger-than-normal number of partitions (128).

@eapache (Contributor) commented Sep 24, 2015

Hmmm... what happens if you build and run the sample program at https://play.golang.org/p/t54O0eW2DI with your production settings (# brokers, # partitions, go compiler, etc)? It simulates the producer's use of the WaitGroup in a way that is trivially correct, so if it panics for you as well then we will know there is something wrong deeper in your toolchain. It should just run forever with no output.
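
The playground program itself isn't reproduced here, but the kind of simulation described looks roughly like this sketch: one goroutine per partition increments the WaitGroup before handing a message to a "flusher" goroutine that decrements it exactly once. It should run forever with no output; the partition count matches the deployment discussed above.

    package main

    import "sync"

    func main() {
        const partitions = 128
        for p := 0; p < partitions; p++ {
            var wg sync.WaitGroup
            ch := make(chan int, 256)
            go func() { // "dispatcher": Add before handing the message off
                for i := 0; ; i++ {
                    wg.Add(1)
                    ch <- i
                }
            }()
            go func() { // "flusher": exactly one Done per message received
                for range ch {
                    wg.Done()
                }
            }()
        }
        select {} // run forever; a negative counter would panic one of the goroutines
    }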

@eapache (Contributor) commented Sep 24, 2015

I don't think another stack trace would be too useful, but I'd be interested in seeing the complete sarama logs for a panicking run (right from Initializing new client at the beginning up to the Had to recover from sarama state machine error...).

@cep21 (Author) commented Sep 24, 2015

Building the playground link with our production settings on our production machines shows no output. They are AWS instances. It feels unlikely that only this use of WaitGroup is broken due to a toolchain issue. I'll look at getting logs.

@cep21 (Author) commented Sep 24, 2015

The logs show nothing interesting (this paste may appear a bit out of order because our logs have second resolution, so they sometimes don't sort correctly).

Connected to broker at kafka22--bbaa.int.ourcorp.com:9092 (unregistered)
 client/metadata fetching metadata for all topics from broker kafka22--bbaa.int.ourcorp.com:9092
 Initializing new client
client/brokers registered new broker #10 at kafka26--bbac.int.ourcorp.com:9092
client/brokers registered new broker #6 at kafka25--bbab.int.ourcorp.com:9092
client/brokers registered new broker #20 at kafka33--bbab.int.ourcorp.com:9092
client/brokers registered new broker #14 at kafka22--bbaa.int.ourcorp.com:9092
client/brokers registered new broker #9 at kafka17--bbab.int.ourcorp.com:9092
client/brokers registered new broker #21 at kafka34--bbac.int.ourcorp.com:9092
client/brokers registered new broker #12 at kafka20--bbab.int.ourcorp.com:9092
client/brokers registered new broker #17 at kafka30--bbab.int.ourcorp.com:9092
client/brokers registered new broker #13 at kafka24--bbac.int.ourcorp.com:9092
client/brokers registered new broker #18 at kafka31--bbac.int.ourcorp.com:9092
client/brokers registered new broker #7 at kafka36--bbac.int.ourcorp.com:9092
client/brokers registered new broker #19 at kafka32--bbaa.int.ourcorp.com:9092
client/brokers registered new broker #8 at kafka16--bbaa.int.ourcorp.com:9092
client/brokers registered new broker #16 at kafka29--bbaa.int.ourcorp.com:9092
Successfully initialized new client
client/brokers registered new broker #15 at kafka37--bbaa.int.ourcorp.com:9092
Initializing new client
client/brokers registered new broker #14 at kafka22--bbaa.int.ourcorp.com:9092
client/brokers registered new broker #9 at kafka17--bbab.int.ourcorp.com:9092
client/brokers registered new broker #18 at kafka31--bbac.int.ourcorp.com:9092
client/metadata fetching metadata for all topics from broker kafka24--bbac.int.ourcorp.com:9092
 client/brokers registered new broker #20 at kafka33--bbab.int.ourcorp.com:9092
client/brokers registered new broker #13 at kafka24--bbac.int.ourcorp.com:9092
client/brokers registered new broker #17 at kafka30--bbab.int.ourcorp.com:9092
Connected to broker at kafka24--bbac.int.ourcorp.com:9092 (unregistered)
 client/brokers registered new broker #6 at kafka25--bbab.int.ourcorp.com:9092
client/brokers registered new broker #12 at kafka20--bbab.int.ourcorp.com:9092
client/brokers registered new broker #10 at kafka26--bbac.int.ourcorp.com:9092
client/brokers registered new broker #21 at kafka34--bbac.int.ourcorp.com:9092
client/brokers registered new broker #7 at kafka36--bbac.int.ourcorp.com:9092
Successfully initialized new client
client/brokers registered new broker #16 at kafka29--bbaa.int.ourcorp.com:9092
client/brokers registered new broker #8 at kafka16--bbaa.int.ourcorp.com:9092
client/brokers registered new broker #19 at kafka32--bbaa.int.ourcorp.com:9092
Initializing new client
client/brokers registered new broker #15 at kafka37--bbaa.int.ourcorp.com:9092
client/metadata fetching metadata for all topics from broker kafka24--bbac.int.ourcorp.com:9092
 Connected to broker at kafka24--bbac.int.ourcorp.com:9092 (unregistered)
 client/brokers registered new broker #10 at kafka26--bbac.int.ourcorp.com:9092
client/brokers registered new broker #14 at kafka22--bbaa.int.ourcorp.com:9092
client/brokers registered new broker #6 at kafka25--bbab.int.ourcorp.com:9092
client/brokers registered new broker #20 at kafka33--bbab.int.ourcorp.com:9092
client/brokers registered new broker #21 at kafka34--bbac.int.ourcorp.com:9092
client/brokers registered new broker #9 at kafka17--bbab.int.ourcorp.com:9092
client/brokers registered new broker #13 at kafka24--bbac.int.ourcorp.com:9092
client/brokers registered new broker #17 at kafka30--bbab.int.ourcorp.com:9092
client/brokers registered new broker #7 at kafka36--bbac.int.ourcorp.com:9092
client/brokers registered new broker #12 at kafka20--bbab.int.ourcorp.com:9092
client/brokers registered new broker #16 at kafka29--bbaa.int.ourcorp.com:9092
client/brokers registered new broker #18 at kafka31--bbac.int.ourcorp.com:9092
client/brokers registered new broker #15 at kafka37--bbaa.int.ourcorp.com:9092
client/brokers registered new broker #19 at kafka32--bbaa.int.ourcorp.com:9092
client/brokers registered new broker #8 at kafka16--bbaa.int.ourcorp.com:9092
Successfully initialized new client
producer/flusher/17 starting up
 producer/flusher/21 starting up
 producer/flusher/18 starting up
 producer/flusher/16 starting up
 producer/flusher/20 starting up
 producer/flusher/19 starting up
 Connected to broker at kafka29--bbaa.int.ourcorp.com:9092 (registered as #16)
 Connected to broker at kafka31--bbac.int.ourcorp.com:9092 (registered as #18)
 Connected to broker at kafka34--bbac.int.ourcorp.com:9092 (registered as #21)
 Connected to broker at kafka30--bbab.int.ourcorp.com:9092 (registered as #17)
 Connected to broker at kafka33--bbab.int.ourcorp.com:9092 (registered as #20)
 Connected to broker at kafka32--bbaa.int.ourcorp.com:9092 (registered as #19)
 sync: negative WaitGroup counter

@cep21 (Author) commented Sep 24, 2015

There is about a two minute delay between Connected to broker at kafka32--bbaa.int.ourcorp.com:9092 (registered as #19) and sync: negative WaitGroup counter

@eapache (Contributor) commented Sep 24, 2015

Do you see any messages out the Errors() channel during one of these runs?

edit: turns out I asked this back in May and you said no, so I assume it hasn't changed.

@cep21 (Author) commented Sep 24, 2015

We have a few attempted Kafka messages that are unencodable. They return an error from Encode() ([]byte, error) on purpose, and we see those errors come out the other end and log them.

@eapache (Contributor) commented Sep 24, 2015

We have a few attempted Kafka messages that are unencodable. They return an error from Encode() ([]byte, error) on purpose, and we see those errors come out the other end and log them.

That's what I needed to know. The bug is blindingly obvious now. I'll have a fix shortly.

@cep21 (Author) commented Sep 24, 2015

I see it too, I think: parseResponse() could see a message that already had its error sent on the error channel when an error happened earlier?

@eapache (Contributor) commented Sep 24, 2015

Ya, buildRequest doesn't remove messages which fail encoding from the batch.
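
A toy model of that bug class, to make the double decrement concrete (this is not Sarama's actual code, just an illustration): every message adds one to the WaitGroup on the way in and must be completed exactly once, so an unencodable message that is reported as an error but left in the batch gets a second Done when the batch response is handled.

    package main

    import (
        "fmt"
        "sync"
    )

    type message struct{ encodable bool }

    // buildRequest reports unencodable messages as errors and, crucially,
    // filters them out of the batch. Skipping that filtering is the shape of
    // the bug: the same message is later completed again in handleResponse.
    func buildRequest(batch []*message, wg *sync.WaitGroup) []*message {
        kept := batch[:0]
        for _, m := range batch {
            if !m.encodable {
                wg.Done() // completed once, as an encoding error
                continue
            }
            kept = append(kept, m)
        }
        return kept
    }

    func handleResponse(batch []*message, wg *sync.WaitGroup) {
        for range batch {
            wg.Done() // completed once, as a success (or broker error)
        }
    }

    func main() {
        var wg sync.WaitGroup
        batch := []*message{{encodable: true}, {encodable: false}, {encodable: true}}
        wg.Add(len(batch)) // one Add per message entering the producer
        batch = buildRequest(batch, &wg)
        handleResponse(batch, &wg)
        wg.Wait()
        fmt.Println("every message completed exactly once")
    }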

eapache added a commit that referenced this issue on Sep 24, 2015: "Pending a proper refactor and tests etc."
@cep21 (Author) commented Sep 24, 2015

I think it would be a bug for returnError() or returnSuccesses() to operate on a msg twice. Do you have ideas on how to prevent this kind of thing in the future? Is there a pattern or way to structure the code such that returnError() or returnSuccess() end the lifetime of a message? Ideally the message itself would enter a state machine that terminates in one of those two nodes, but that may be difficult with all the batch processing.

@eapache (Contributor) commented Sep 24, 2015

For the most part the message already flows through a state machine that terminates on one of those nodes. It's only in the final step where we wrap messages up into requests when that is no longer true (since as you point out that requires batching), and that's where the bug was :/

I'm going to have to sit down and think about the architecture; I think the final "correct" solution may involve more refactoring than I'd like.
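
One defensive pattern along those lines (a sketch, not what Sarama actually does) is to give each message an explicit, idempotence-checked completion step, so a second returnError/returnSuccess is caught at the offending call site instead of silently corrupting a shared counter:

    package main

    import (
        "log"
        "sync"
        "sync/atomic"
    )

    // completionGuard is a hypothetical per-message helper: completing the
    // message a second time is detected rather than decrementing twice.
    type completionGuard struct {
        done int32
    }

    func (c *completionGuard) complete(wg *sync.WaitGroup) {
        if !atomic.CompareAndSwapInt32(&c.done, 0, 1) {
            // Second completion: log it (as in the fork mentioned above that
            // logs instead of panicking), or panic in tests.
            log.Println("message completed twice")
            return
        }
        wg.Done()
    }

    func main() {
        var wg sync.WaitGroup
        wg.Add(1)
        g := &completionGuard{}
        g.complete(&wg) // decrements the counter
        g.complete(&wg) // caught and logged; the counter stays consistent
        wg.Wait()
    }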
