Memory leak sometimes when restarting kafka partitions #560
The obvious candidate for this kind of issue would be the
For some, I see repeated messages like
For others, I see logs like
Some other partition log messages
That should be normal; partitions which are on other brokers (ones that are not restarted) won't log much, if anything. The thing that is curious to me is the partitions (e.g. 89 in your sample logs) repeatedly selecting and then abandoning the same broker. It looks to me like the Kafka cluster might be confused at this point: every time we ask it who leads the partition it says broker 21, but every time we try to send a message to broker 21 it returns an error. I'll dig a little deeper.
This surprises me a lot: given the logs posted above I'd have expected a steady stream of messages on the error channel from partition 89 as each message to it is retried 3 times and then fails.
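For anyone following along, this is roughly what consuming that error channel looks like with the async producer. A minimal sketch, assuming an already-constructed sarama.AsyncProducer and the Shopify-era import path:

```go
package producerutil

import (
	"log"

	"github.com/Shopify/sarama"
)

// drainErrors logs every message the producer gives up on after exhausting
// its retries. It should run in its own goroutine for the lifetime of the
// producer; if nothing reads this channel, the producer eventually backs up.
func drainErrors(producer sarama.AsyncProducer) {
	for perr := range producer.Errors() {
		// perr is a *sarama.ProducerError: the original message plus the final error.
		log.Printf("dropped message for topic=%s partition=%d: %v",
			perr.Msg.Topic, perr.Msg.Partition, perr.Err)
	}
}
```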
I think I know why this is happening, and agree that ideally it shouldn't, but it's almost more of a symptom than a bug. Sarama buffers internally while retrying partitions, and in this case, because we are retrying in an apparent infinite loop, we are also buffering infinitely. If we can fix those partitions then the memory problem should effectively go away, because Sarama will be able to flush its buffers properly. The next time you hit a situation like this, it is worth checking the kafka broker and controller logs, and also the contents of zookeeper, for the partitions having trouble. Perhaps one of the brokers is serving stale metadata.
Hmm, reminds me a bit of https://issues.apache.org/jira/browse/KAFKA-2082 in which brokers can serve incorrect/stale metadata if they are temporarily isolated from the zookeeper cluster. Could something like that be happening here?
You're right about error messages. I rechecked the log and I see only these messages on the error channel
I could understand kafka getting into some strange state, but the weird thing for us is that restarting the server that is sending points to kafka fixes the problem. Even when we stop seeing messages about "Failed to produce message to topic" and stop getting messages on the error channel, we still see the same slope of increased memory usage. Attached is a chart with memory usage on one axis and # of error messages on the other axis. ![memory_usage](https://cloud.githubusercontent.com/assets/20358/10912580/0ea7f30c-81ff-11e5-86bc-509993aa8c61.png)
That is super-weird. This is theoretically explainable because Sarama picks a random broker on startup to get metadata from; if only one of the brokers is serving bad metadata, restarting sarama will likely pick a different one and "fix" the problem.
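For context on that: the pool of brokers the client can fetch metadata from is just the address list it was constructed with, and metadata is also re-fetched on a timer. A rough sketch, assuming a reasonably current sarama of that era, with illustrative broker addresses and the library's default refresh interval written out:

```go
package producersetup

import (
	"time"

	"github.com/Shopify/sarama"
)

// newProducer builds an async producer against a hypothetical broker list.
// The client picks one of these brokers for metadata requests and re-fetches
// metadata on the RefreshFrequency interval, so stale metadata from a single
// bad broker can persist until the next refresh or retry.
func newProducer() (sarama.AsyncProducer, error) {
	config := sarama.NewConfig()
	config.Metadata.RefreshFrequency = 10 * time.Minute // library default, shown explicitly

	brokers := []string{"kafka1:9092", "kafka2:9092", "kafka3:9092"} // illustrative
	return sarama.NewAsyncProducer(brokers, config)
}
```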
To be sure I understand: after the brokers restart, the final "steady" state of the producer is:
I am unable to reconcile the repeated state changes with the lack of messages coming out on the error channel. Or do the state change logs stop at the same time as the errors stop? If they do both stop together: what is the final state change for an affected partition? If the state changes continue even after the errors stop: somehow, messages that are at the configured retry limit (3 by default) in the

Tangent 1: Is the

Tangent 2: I assume based on your initial report that the SHA you are running is 7e88a99 or thereabouts.
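For reference, the retry limit referred to above lives in the producer config; the values below are just the library defaults written out, not the reporter's actual settings:

```go
package producersetup

import (
	"time"

	"github.com/Shopify/sarama"
)

// retryConfig spells out the producer retry knobs discussed above. A message
// that still fails after Retry.Max attempts is emitted on the Errors() channel
// rather than being buffered further.
func retryConfig() *sarama.Config {
	config := sarama.NewConfig()
	config.Producer.Retry.Max = 3                          // library default
	config.Producer.Retry.Backoff = 100 * time.Millisecond // library default
	return config
}
```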
Here is a histogram of log messages coming from the sarama logger. The short burst at around 18:50 is
pprof inuse space at the time
pprof inuse objects
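For readers wondering how heap profiles like these are captured from a running producer: the standard approach in Go is to expose the net/http/pprof endpoints and point `go tool pprof` at them. A minimal sketch (the listen address is arbitrary):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// Serve profiling endpoints; inspect a live process with e.g.
	//   go tool pprof http://localhost:6060/debug/pprof/heap
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	select {} // stand-in for the real application
}
```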
The actual version I'm using is https://github.com/cep21/sarama which is a wrapper around 4faee61
Those pprof profiles don't seem to indicate the many hundreds of MBs of usage that is shown in your graph. If you ask for allocated instead (
That's just before I fixed the WaitGroup bug, not after...?
I filtered the pprof to sarama functions. If I remove the sarama function filter it shows the objects I'm putting into sarama's channel.
Sorry, right, the one just before you fixed it.
I've written some tests and tried a few things here and I haven't found anything suspicious: buffers are always getting flushed appropriately for me. Since memory is consistently increasing even after the logs and errors stop: what is the very last set of
I was able to create a test which approximates some of the symptoms you're seeing, I think, but it still seems somewhat different. I pushed it as branch

However, when I run a memory profile, the result shows all the memory being properly freed by the end of the test (once the error channel is drained), so it doesn't really look like a leak. I'd be curious whether, if you run this test yourself and compare it to the details of your production system, it looks like a similar or different problem to the one you're actually seeing.
I emailed you the sarama log output. I'll try running the test on a production machine.
Without more precise log ordering it is difficult to get any useful information out of these logs. There are e.g. hundreds of entries at 27:33 which might be informative or might not, depending on what order the state changes happen in.
Would it be useful if there were internal statistics about the sarama state machine that I could track or even report when issues happen? For example, if sarama had a
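In the meantime, the application can at least publish its own counters for what crosses the producer's public channels. A sketch using expvar; the metric names are made up, and Successes() only delivers anything when Producer.Return.Successes is enabled:

```go
package producerstats

import (
	"expvar"

	"github.com/Shopify/sarama"
)

var (
	successes = expvar.NewInt("producer_successes")
	failures  = expvar.NewInt("producer_errors")
)

// track counts acknowledgements and failures coming back from the producer.
// It returns once both channels have been closed (i.e. after producer shutdown).
func track(producer sarama.AsyncProducer) {
	succ, errs := producer.Successes(), producer.Errors()
	for succ != nil || errs != nil {
		select {
		case _, ok := <-succ:
			if !ok {
				succ = nil // a nil channel blocks forever, so it drops out of the select
				continue
			}
			successes.Add(1)
		case _, ok := <-errs:
			if !ok {
				errs = nil
				continue
			}
			failures.Add(1)
		}
	}
}
```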
It might be useful, but accumulating a consistent snapshot across all of sarama's goroutines would be quite difficult without some sort of global stop-the-world lock. One thing I have noticed in the logs you sent is that partition 65 transitions several times to state

I've done a bit of playing with the state machine and haven't come up with any explanations yet, but there's a lot of ground to cover. Any further details you might be able to provide on the behaviour of that partition in particular could be useful (e.g. if you have broker-side msgs/sec metrics broken out for that partition they might be informative).
One kafka broker failed and we replaced the broker. It kept the same broker ID. And then we restarted the tier. The leak happened on the restart, but only after a few were restarted. We were able to restart a good number before things started increasing in memory usage.
One thing unique about our production environment is that our message encoding sometimes returns an error. Could that somehow contribute?
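For context, sarama encodes message values lazily through its Encoder interface, so an encode error surfaces inside the producer itself rather than at send time. A purely illustrative failing encoder; the type and helper names here are made up:

```go
package encoders

import (
	"encoding/json"

	"github.com/Shopify/sarama"
)

// jsonEncoder marshals a value up front so Length() and Encode() agree; a
// marshalling failure is reported from Encode(), which is exactly the path
// that can error inside the producer.
type jsonEncoder struct {
	data []byte
	err  error
}

var _ sarama.Encoder = (*jsonEncoder)(nil)

func newJSONEncoder(v interface{}) *jsonEncoder {
	data, err := json.Marshal(v)
	return &jsonEncoder{data: data, err: err}
}

func (e *jsonEncoder) Encode() ([]byte, error) { return e.data, e.err }
func (e *jsonEncoder) Length() int             { return len(e.data) }
```

A value like this would be set as ProducerMessage.Value, so the error only appears once the producer tries to build the request.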
Well, the failing encode was the cause of #449, and you're not actually running with that fixed, just with the work-around you implemented to rescue the panic. I would certainly try moving to vanilla v1.6.1 in order to eliminate that as the problem.
I'll probably update my current branch to v1.6.1, but still wrap sync.WaitGroup inside sarama, since I'm still hesitant to run with sync.WaitGroup's fatal stop-the-world behavior if there is another bug that we can't currently identify. However, I don't see how this could be the cause of the current memory leak behavior. I just mentioned it to see if it could cause a memory leak in sarama during kafka restarts.
After some investigation, I am reasonably sure this bug is just another side-effect of the encoding-error bug (#449) and that updating to 1.6.1 (with or without the safety wrapper) will fix it.

In detail: the encoding bug I fixed in 1.6.1 was that when a message failed to encode we would return it out the error channel immediately, but leave it in the buffer to be returned again when the broker response arrived (typically as a success). Since the message had been added only once but was being removed twice (once as a failure, then again incorrectly as a success), this threw the wait-group out of sync. However, if the broker response was not a success but a retriable error (e.g. when the cluster is restarting), then the failed message would be retried with the rest of the batch. Because we clear the retry metadata when we return an error, the retried message would have incorrect retry metadata.

The state machine is reasonably robust and I'm not actually sure how this would have caused a memory leak rather than just mis-ordering certain messages, but it definitely caused problems, and it clearly explains the weird state changes on partition 65 I noted above. If this still happens with a 1.6.1-based release then please reopen and I'll take another look.
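The wait-group symptom described here is easy to reproduce in isolation; a toy illustration of the added-once/removed-twice accounting:

```go
package main

import "sync"

func main() {
	var wg sync.WaitGroup
	wg.Add(1) // the message is counted as in-flight once...
	wg.Done() // ...removed once when it fails to encode...
	wg.Done() // ...and removed again when the broker response arrives:
	// this second Done panics with "sync: negative WaitGroup counter"
}
```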
I know this isn't a lot to go on, but I hope you can give me pointers to debug this.
Restarting kafka brokers occasionally produces ever-increasing memory usage until the go process dies OOM. This isn't consistent, but it especially happens when we have to restart multiple kafka nodes at once. When I run pprof/heap I see the memory used by kafka messages, but I cannot tell where inside the sarama code the memory is being held onto :( It only happens occasionally and only on our production tier. I also don't see increased goroutine numbers. We're running with 256 brokers and about 12 kafka hosts, using the sarama revision just after you fixed the WaitGroup bug. The log has messages like:
```
Failed to connect to XYZ: getsockopt: connection refused
client/brokers deregistering XYZ
client/brokers deregistered broker #-1 at kafkaXYZ:9092
...
```
Even when kafka is stable and the messages go away, memory usage still increases forever until we OOM. After a restart, things work great.
Here is the config we use
The biggest issue for us is that even when kafka recovers, memory usage still increases forever in this straight line upward.
When kafka failures happen, is it possible that multiple buffers could be created internally that end up buffering points?
Note we also don't see increases in messages on the error channel.