panic: sync: negative WaitGroup counter #449
Seeing this occasionally in production:

Revision: 23d5233

Comments
Is this while running normally, or when closing the producer?
What configuration are you running with? What do you see in the sarama logs when this occurs? I took a quick look and didn't see any obvious cases where the counter could get out of sync.
This happens between 3 and 8 minutes into running normally in production. We have two tiers of production traffic and only one was seeing this behavior (the one that gets the most traffic). There is no information in stdout or stderr. There is nothing in the logs other than messages of the form "client/metadata fetching metadata for all topics from broker kafka17.blah.blah:9092", "Connected to broker at kafka18--bbac.blah:9092 (registered as #10)", and "producer/flusher/12 starting up". We have reverted to version 3534171 and things are working great now.
Config:
I see earlier lines of the form
I have gone through the producer code and double-checked every case where we touch the wait group in question; they all appear sane. It would not terribly surprise me if there were some subtle bug in the retry logic (that is a fairly complex state machine), but anything touching that code path would generate logs which you are not seeing. The happy path in this case is much simpler and easily verified. I suppose this could be a bug with Go's waitgroup code, but that seems unlikely; it is widely used, and you are using a normal stable version of Go.

How much traffic is seen at the tier where this occurs? We are running 68beb72 (v1.4.0) built with Go 1.4.2 on several hundred nodes, some of which are handling fairly substantial message throughput. If this was a common issue for you I'm very surprised we haven't run into it yet.

I'm at a bit of a loss; could you provide the traces of any other sarama goroutines from the panic output? Maybe they contain a clue...
In your sarama logs, do you see the producer retrying messages? A log snippet of a run that ends up with a crash may be useful.
For what it's worth, the new version was working on some of our kafka tiers, just not one of them (where it got the panic). I can email you our full stack trace from the panic, but don't want to put it on a public ticket.
Please email me then. My email is on my GitHub profile.
OK, I'm gonna take a step back and walk through my logic: the counter going negative must mean one of the following three cases:
The first case is not one we can realistically debug, and it is so unlikely I'm just going to ignore it. For the second case, there are a number of cases I have checked, but it's not trivially verifiable. I'll come back to this one. For the third case, the logic is pretty simple. The only case where we might theoretically be missing a message is if it is submitted to the input channel with the …
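For reference, the panic in the issue title is what sync.WaitGroup raises whenever its counter is driven below zero, i.e. Done is called more times than Add accounted for. A minimal standalone illustration (not sarama code):

```go
package main

import "sync"

func main() {
	var wg sync.WaitGroup
	wg.Add(1)
	wg.Done()
	wg.Done() // counter would go to -1: panic: sync: negative WaitGroup counter
}
```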
Hi, I emailed you.
Email received, thanks. I'm updating my comment (which I accidentally submitted only half-typed) with a few other thoughts.
I never reuse a ProducerMessage. Callers of my API put messages on a channel and return. One goroutine drains that channel as fast as it can and puts messages on Kafka.
OK, a few notes from diving into the logs:
I assume you haven't seen anything come out of the error channel?
I'm afraid I'm out of coherent ideas on this one... maybe I'll have a brainwave overnight.
Nothing in the logs from the error channel. We do Kafka rebalances at times. Would a misbehaving server cause this issue?
I wouldn't think so; all the error cases for malformed responses etc. are handled. I'm honestly baffled by this issue; I've not been able to reproduce it, and I can't even think of additional logging or checks that might help me track it down. If I think of anything I'll let you know.
I've been thinking about this code some more and wonder if the WaitGroup logic should be reverted. I've only used WaitGroup when the use case was dead simple: an Add followed by a goroutine that does defer Done(). That is easy to follow and prove. The WaitGroup use inside async_producer is not so simple to follow. Given that any errors in this logic, either now or 6 months later when the code has been refactored a few times, will result in unrecoverable panics that kill the entire application for the user, and the wait group code isn't strictly necessary, what are your thoughts on not using WaitGroup at all?

On another note, most WaitGroup usage uses defer Done(), which makes sure Done is called even during a panic. Is it possible something called panic() somewhere and caused either an Add() not to happen or Done() to happen twice?
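For comparison, the "dead simple" pattern referred to above looks roughly like this (a generic sketch, not code from sarama):

```go
package main

import "sync"

func main() {
	var wg sync.WaitGroup
	for _, job := range []int{1, 2, 3} {
		wg.Add(1) // one Add per goroutine, before it starts
		go func(job int) {
			defer wg.Done() // Done runs even if the work panics
			_ = job * job   // stand-in for the real work
		}(job)
	}
	wg.Wait()
}
```

The invariant is easy to audit because every Add is paired with exactly one deferred Done in the goroutine it guards.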
The WaitGroup logic was a fairly elegant fix for #419 and as per the comments in #420 it is actually simpler to reason about than the code that was present before (which is not to say it is simple, just simpler). I'd be happy to find an even simpler solution which is not so fragile, but I don't know of one off the top of my head. When you say "the wait group code isn't strictly necessary" you seem to have one in mind - what is it?
No; unless you set …
@cep21 I'm curious if you've continued to see this, or if you've simply avoided any release after 1.3? We still have not seen anything like it and I still have no clue how it could be happening.
We haven't upgraded b/c of this issue. I'm going to fork sarama and replace the waitgroup with a debug log that doesn't panic. I'll reply back if I can get it to log in production again.
The fork we're deploying this week is at https://github.com/cep21/sarama (https://github.com/cep21/sarama/commit/da1ad6b748a2e3fbcda969e1e10ea1c9b2b16272). I will update if I can see it logging again.
Thanks for investigating this 👍
This is still an issue. Using https://github.com/cep21/sarama/commit/da1ad6b748a2e3fbcda969e1e10ea1c9b2b16272, within minutes of a deploy I see the debug log fire. Do you want me to email another stack trace?
The problem isn't reproducible on small amounts of traffic: only in production. I wonder if there is some kind of race or gotcha when it suddenly sees a large amount of traffic. Something else special about our deployment: we are using a larger-than-normal number of partitions (128).
Hmmm... what happens if you build and run the sample program at https://play.golang.org/p/t54O0eW2DI with your production settings (# brokers, # partitions, go compiler, etc)? It simulates the producer's use of the WaitGroup in a way that is trivially correct, so if it panics for you as well then we will know there is something wrong deeper in your toolchain. It should just run forever with no output.
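The linked playground program is not reproduced here, but a trivially-correct simulation in the spirit described above, with matched Add/Done calls fanned out across many goroutines, might look roughly like this (a sketch with made-up parameters such as the channel size, not the actual linked code):

```go
package main

import (
	"math/rand"
	"sync"
	"time"
)

func main() {
	const partitions = 128 // mirrors the 128-partition deployment; adjust as needed
	var wg sync.WaitGroup
	work := make(chan struct{}, 1024)

	// One consumer per "partition" draining the channel and calling Done.
	for i := 0; i < partitions; i++ {
		go func() {
			for range work {
				time.Sleep(time.Duration(rand.Intn(100)) * time.Microsecond)
				wg.Done()
			}
		}()
	}

	// Producer side: Add strictly before handing each item off.
	// The counter can never go negative, so this should run forever silently;
	// a panic here would point at the toolchain rather than the producer logic.
	for {
		wg.Add(1)
		work <- struct{}{}
	}
}
```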
I don't think another stack trace would be too useful, but I'd be interested in seeing the complete sarama logs for a panicking run (right from startup).
Building the playground link with our production settings on our production machine shows no output. They are AWS instances. It feels unlikely that only this use of WaitGroup is broken due to a toolchain issue. I'll look at getting logs.
The logs are nothing interesting (this paste may appear a bit out of order b/c our logs are second resolution so they sometimes don't sort right).
There is about a two-minute delay between the last of those log lines and the panic.
Do you see any messages out of the error channel? Edit: turns out I asked this back in May and you said no, so I assume it hasn't changed.
We have a few attempted Kafka messages that are unencodable. They return an error during encoding.
That's what I needed to know. The bug is blindingly obvious now. I'll have a fix shortly.
I see it too, I think: parseResponse() could see a message that already had an error sent on its channel when an error happened earlier?
Ya, …
Pending a proper refactor and tests etc.
I think it would be a bug for …
For the most part the message already flows through a state machine that terminates on one of those nodes. It's only in the final step where we wrap messages up into requests when that is no longer true (since as you point out that requires batching), and that's where the bug was :/ I'm going to have to sit down and think about the architecture; I think the final "correct" solution may involve more refactoring than I'd like.
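To make the failure mode described above concrete, here is a deliberately simplified sketch (hypothetical names, not sarama's actual code) of the double-return: a message that fails to encode has its error returned immediately, but it is never removed from the in-flight batch, so the response-handling path returns it a second time and Done fires twice for a single Add.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// message stands in for a produced message tracked by a WaitGroup.
type message struct {
	encodable bool
	err       error
}

var inFlight sync.WaitGroup

// returnError hands the error back to the caller and marks the message done.
func returnError(m *message, err error) {
	m.err = err
	inFlight.Done()
}

func main() {
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered:", r) // prints: sync: negative WaitGroup counter
		}
	}()

	m := &message{encodable: false}
	inFlight.Add(1) // one Add when the message enters the producer

	// Step 1: encoding fails, so the error path returns the message early...
	if !m.encodable {
		returnError(m, errors.New("encode failed"))
	}

	// Step 2: ...but it was never dropped from the batch, so the response
	// path returns it again: a second Done for a single Add, and a panic.
	returnError(m, errors.New("broker returned an error"))

	inFlight.Wait()
}
```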