Rare non-resumption of message delivery after connection interruption #4916
Comments
I believe we've just seen this happen too, due to a connection interruption during routine maintenance. I have never seen a partial failure like this; usually it's all or nothing. For what it's worth, we're using confluent-kafka-python / librdkafka 2.6.1 and AWS MSK 2.6.2. I have a topic with 48 partitions across 3 brokers, and 6 consumers of this topic. 3 of the consumers were affected, while the other 3 happily continued processing messages from all subscribed partitions. After receiving a sum offset lag alert from my monitoring, I manually restarted the affected consumers, and things have been working fine since. But I do believe it should have recovered from this automatically. In the 3 that were affected, I see the below message in the logs following a handful of (expected) "connection refused" messages. This message did not appear in the unaffected consumers' logs.
Also for what it's worth, here is the output of kafka-consumer-groups during the incident. Most of the partitions have a normal lag in single or double digits, but affected partitions have a lag in the thousands (and were climbing until restarted).
We have been seeing this for topics that have a replication factor of less than 2; it has been happening since upgrading to 1.8.2, as far as I know.
I also encounter this problem when my topic has only 1 replica and the broker is restarted (killed with -9).
Description
In rare (~1/50) instances, a connection interruption causes librdkafka to stop returning messages from subscribed topics in consume calls. It appears to be caused by the partition count changing to 0 and then never being reset, even after subsequent resubscribe calls. Once in this state, the consumer fails to resume message consumption for hours at least (at which point we restart the consumer to fix the issue).
How to reproduce
This is a difficult issue to reproduce. We have only ever seen it after an interruption of network connectivity (e.g. by physically powering down a network switch, waiting some time, and then powering it back on) or after simultaneous broker restarts (e.g. by using a single broker and restarting that Apache Kafka instance), and even then it only happens infrequently. So we have not yet been able to collect full debug logs during an event.
However, what we do know is that most of the time the consumer automatically recovers. In cases where it recovers, we see log lines like:
However, infrequently (we estimate 1 in 50 times) after a physical network interruption, the consumer permanently stops processing messages, in which case we see logs like:
I've left the application's message log info from just before and after in place, to show that in most cases librdkafka clearly reconnects and processing continues. HOWEVER, in rare cases the partition counts are set to 0, and the consumer then never receives further messages, even after a resubscribe is executed.
For each network event, which consumer(s) are affected appears random -- most of the time a given consumer does not exhibit this issue, and the set of consumers that do exhibit this behavior after a network outage is different each time. There is no apparent correlation with which topic(s) are subscribed to.
Speculations
This smells like some kind of race condition, but of what?
Additional Details
All of our producers and consumers are written in python, and use the confluent-kafka-python wrapper for librdkafka. This is being reported here because it looks like a bug in librdkafka itself, not an artifact from the wrapper library. In most cases, the consumers have a very simple structure:
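The snippet itself did not survive here, but as a rough sketch of that structure using confluent-kafka-python (the broker address, group id, topic name, and process() helper below are placeholders, not our real configuration):

```python
# Rough sketch of the consumer structure described above, using
# confluent-kafka-python. Broker address, group id, and topic name are
# placeholders, not the actual production configuration.
from confluent_kafka import Consumer


def process(payload):
    """Application-specific message handling (stub for illustration)."""
    print(payload)


consumer = Consumer({
    "bootstrap.servers": "broker1:9092",  # placeholder
    "group.id": "example-group",          # placeholder
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["example-topic"])     # placeholder topic

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue
        if msg.error():
            # Log and keep polling; librdkafka normally reconnects by itself.
            print(f"consumer error: {msg.error()}")
            continue
        process(msg.value())
finally:
    consumer.close()
```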
Since we see the application logging lines for the resubscription process, we infer that this is not a hang, but instead for some reason librdkafka does not bring topics set to 0 partitions back later (or maybe erroneously sets them to 0 partitions in the first place?).
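For reference, a minimal hypothetical sketch of the kind of resubscribe call meant here (an unsubscribe followed by subscribe on the same Consumer instance; illustrative only, not our exact code):

```python
# Hypothetical sketch of the resubscribe step referred to above; the real
# application also logs before and after this call, which is how we know
# the resubscribe actually ran.
def resubscribe(consumer, topics):
    consumer.unsubscribe()
    consumer.subscribe(topics)
```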
Checklist
Provide logs (with debug=.. as necessary) from librdkafka: see above.