Slow consumers, subscription stuck problem #1897
Comments
Can't we wait for all subscriptions to happen before starting?
@dim please help out here.
@alok87 sorry, I have no idea what's going on here tbh. I think that there is a rebalancing issue in your cluster and your handlers cannot exit the loop quickly enough. Can you try and reduce the work you do inside your `for msg := range claim.Messages() { ... }` loop and see if this fixes things? If it does, you can try and increase your timeout configs (this is usually the culprit). If it doesn't, you will need to do some lower-level debugging yourself (sorry), as it's really impossible for me to go through your specific use case. I am more than happy to review the PR if you are able to isolate the problem.
@dim I have been trying to figure out why only a few topics out of many receive this message.
My consumers are really slow; which values would require tuning? I also keep seeing these messages: https://github.com/Shopify/sarama/blob/master/consumer.go#L826
Sorry, I haven't worked with Kafka for quite some time, but the timeouts are documented, e.g. in https://github.com/Shopify/sarama/blob/master/config.go#L261, and you can find more information at https://kafka.apache.org/documentation/, as the server and client settings need to be aligned. Obviously, there are various timeouts; for example, all consumers within a group must issue a heartbeat within the configured session timeout or they are evicted and a rebalance is triggered. Hope that helps!
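For reference, a hedged sketch of the Sarama config fields those timeout docs cover; the durations below are placeholders, not recommendations, and need to stay within the ranges the broker allows:

```go
package example

import (
	"time"

	"github.com/IBM/sarama" // older releases use github.com/Shopify/sarama
)

// slowConsumerConfig sketches the timeout-related settings that usually need
// tuning when message processing is slow. The values are illustrative only.
func slowConsumerConfig() *sarama.Config {
	cfg := sarama.NewConfig()
	cfg.Version = sarama.V2_6_0_0

	// The broker evicts a member that has not heartbeated within this window.
	cfg.Consumer.Group.Session.Timeout = 30 * time.Second
	// Heartbeat interval; keep it well below the session timeout.
	cfg.Consumer.Group.Heartbeat.Interval = 3 * time.Second
	// Time members get to re-join the group during a rebalance.
	cfg.Consumer.Group.Rebalance.Timeout = 60 * time.Second
	// If the Messages() channel is not drained within this time, the
	// partition stops fetching until the consumer catches up.
	cfg.Consumer.MaxProcessingTime = 5 * time.Minute

	return cfg
}
```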
@dim Thank you for the TL;DR and for taking the time to explain this; it means a lot. I have tried to debug this issue. Below is the log for one of the brokers:
This is happening for all 3 brokers we have. Please suggest if you see a problem here.
Found that the second batch of the buffer is not being set because of the previous one. Why is it stuck there? 😢
@dim I found the problem. I have slow consumers, and this slows down consumption if the subscription did not happen for all of them at once.
Sarama updated from 1.27.2 to master. Add changes to use a ticker instead (IBM/sarama#1899) to solve #1897. Fixes IBM/sarama#1897. Fixes #160
Thanks @dim again for the help.
Use the session.Context().Done() channel to decide when to stop processing. For long-running processes, check the context's Done() channel between each step. This is the only way processing goroutines can preemptively exit on a rebalance. If a goroutine doesn't exit, keep in mind that the same message can be processed in parallel, because its offset was not committed.
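A minimal sketch of that pattern, assuming a hypothetical validate/transform/store pipeline; the import path is the current IBM/sarama module (older versions use github.com/Shopify/sarama):

```go
package example

import (
	"github.com/IBM/sarama"
)

type handler struct{}

func (handler) Setup(sarama.ConsumerGroupSession) error   { return nil }
func (handler) Cleanup(sarama.ConsumerGroupSession) error { return nil }

// ConsumeClaim checks session.Context().Done() between steps so the goroutine
// can exit promptly when a rebalance (or shutdown) ends the session.
func (handler) ConsumeClaim(session sarama.ConsumerGroupSession, claim sarama.ConsumerGroupClaim) error {
	for msg := range claim.Messages() {
		for _, step := range []func(*sarama.ConsumerMessage) error{validate, transform, store} {
			select {
			case <-session.Context().Done():
				// Exit without marking the message: its offset was not
				// committed, so it may be processed again after the rebalance.
				return nil
			default:
			}
			if err := step(msg); err != nil {
				return err
			}
		}
		session.MarkMessage(msg, "")
	}
	return nil
}

// validate, transform and store stand in for long-running processing steps.
func validate(*sarama.ConsumerMessage) error  { return nil }
func transform(*sarama.ConsumerMessage) error { return nil }
func store(*sarama.ConsumerMessage) error     { return nil }
```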
I also have a long-running process that takes 5 minutes, and MaxProcessingTime is adjusted accordingly (set to 10 minutes). Our Kafka consumer group consumes the same message again and again when our application crashes/panics and Kubernetes restarts it. It generally works fine if we do a normal restart of the application, but if the application is in a crash state, it keeps replaying the same last message.
Upgraded to use Kafka 3.0.0. Not using the fork, as we are running separate consumer groups for every table and the IBM/sarama#1897 issue does not happen to us. We are not impacted by this bug anymore: IBM/sarama#1897
The first few fetches from Kafka may only fetch data from one or two partitions, starving the rest for a very long time (depending on message size / processing time).

Once a member joins the consumer group and receives its partitions, they are fed into the "subscription manager" from different goroutines. The subscription manager then performs batching and executes a fetch for all the partitions. However, it seems like the batching logic in `subscriptionManager` is faulty, perhaps assuming that `case:` order prioritizes which `case` should be handled when all are signaled, which is not the case according to the Go spec (https://golang.org/ref/spec#Select_statements):

```
If one or more of the communications can proceed, a single one that can proceed is chosen via a uniform pseudo-random selection. Otherwise, if there is a default case, that case is chosen. If there is no default case, the "select" statement blocks until at least one of the communications can proceed.
```

For example, if you receive 64 partitions, each will be handled in its own goroutine, which sends the partition information to the `bc.input` channel. After an iteration there is a race between `case event, ok := <-bc.input`, which will batch the request, and `case bc.newSubscriptions <- buffer`, which will trigger an immediate fetch of the 1 or 2 partitions that made it into the batch.

This issue only really affects slow consumers with short messages. If the condition happens with 1 partition being in the batch (even though 63 extra partitions have been claimed but didn't make it into the batch), the fetch will ask for 1MB (by default) of messages from that single partition. If the messages are only a few bytes long and processing time is minutes, you will not perform another fetch for hours.

Contributes-to: #1608 #1897

Co-authored-by: Dominic Evans <[email protected]>
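For illustration only (this is not Sarama's code), a self-contained sketch of the select semantics described above: once the buffer is non-empty, the flush case can win the race against the remaining queued claims, so only a small fraction of the 64 claimed partitions typically makes it into the first batch:

```go
package main

import "fmt"

func main() {
	input := make(chan int, 64)             // stand-in for bc.input
	newSubscriptions := make(chan []int, 1) // stand-in for bc.newSubscriptions

	// Pretend 64 partition goroutines have already queued their claims.
	for p := 0; p < 64; p++ {
		input <- p
	}

	var buffer []int
	flushed := false
	for !flushed {
		if len(buffer) == 0 {
			buffer = append(buffer, <-input) // nothing batched yet, wait for a claim
			continue
		}
		select {
		case p := <-input:
			buffer = append(buffer, p) // keep batching
		case newSubscriptions <- buffer:
			// Even with most claims still queued on input, this send can win:
			// select chooses uniformly at random among the ready cases, it
			// does not prefer the case listed first.
			flushed = true
		}
	}
	fmt.Printf("flushed a batch of %d of 64 claimed partitions\n", len(buffer))
}
```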
This should now be fixed on main.
Our consumer group is handling 100+ topics (each with only one partition, partition 0, for all 100 topics).
For example, the loader handler processes in batches: it batches and processes based on message count and also on time. The loop keeps going to the ticker case, and since the batch size is 0 (nothing was ever added to the batch), nothing ever gets processed. We are stuck in this loop (a sketch of such a loop follows below).
Out of 100+ topics, only 57 topics got the message
`[sarama] added subscription to`
The rest (43) never got subscribed, so they are stuck in an endless loop waiting for messages to arrive on the read channel. Please suggest whether this is expected behaviour and how we can fix it. Is there some configuration I am missing?
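For context, a minimal sketch of the kind of batching loop described above, assuming hypothetical maxBatch/ticker values and a hypothetical bulkInsert step (this is not the actual loader code):

```go
package example

import (
	"time"

	"github.com/IBM/sarama"
)

type loader struct{}

func (loader) Setup(sarama.ConsumerGroupSession) error   { return nil }
func (loader) Cleanup(sarama.ConsumerGroupSession) error { return nil }

// ConsumeClaim flushes a batch either when maxBatch messages have accumulated
// or when the ticker fires. If the partition never gets subscribed, Messages()
// never delivers anything, so the loop only ever hits the ticker case with an
// empty batch and nothing is processed.
func (loader) ConsumeClaim(session sarama.ConsumerGroupSession, claim sarama.ConsumerGroupClaim) error {
	const maxBatch = 100
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()

	var batch []*sarama.ConsumerMessage
	flush := func() {
		if len(batch) == 0 {
			return // nothing buffered; the stuck case described above
		}
		bulkInsert(batch)
		for _, m := range batch {
			session.MarkMessage(m, "")
		}
		batch = batch[:0]
	}

	for {
		select {
		case msg, ok := <-claim.Messages():
			if !ok {
				flush()
				return nil
			}
			batch = append(batch, msg)
			if len(batch) >= maxBatch {
				flush()
			}
		case <-ticker.C:
			flush()
		case <-session.Context().Done():
			flush()
			return nil
		}
	}
}

func bulkInsert([]*sarama.ConsumerMessage) {} // hypothetical batch insert
```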