["outputs.kafka"] did not complete within its flush interval after kafka broker reboot (after upgrade from telegraf:1.20.2-alpine to telegraf:1.22.1-alpine) and does not recover without telegraf restart #11427
Comments
Hi, please enable debug mode in your config and provide the Telegraf logs during a failure.
This message alone is not necessarily a bad thing. It only means that a flush happened and it took longer than expected. It does not mean that Telegraf is never able to recover or connect; in that case, an error from the output about not being able to connect would appear. The kafka plugin works using the sarama library, and given the complexity of a kafka deployment, the additional debug logs from the sarama library are required to understand what else is going on. Also understand that sarama only uses the nodes you specify initially to collect metadata about the cluster. After that point, it uses the metadata it receives, which tells it the leader and other node information. Thanks
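For reference, debug logging can be enabled in the [agent] section of telegraf.conf. A minimal sketch (the logfile path here is only an illustration; leave it unset to keep logging to stderr):

    [agent]
      debug = true      # emit debug-level messages, including flush timing details
      quiet = false     # keep info-level messages as well
      logfile = "/var/log/telegraf/telegraf.log"   # illustrative path; empty means stderr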
Hello and thanks for your answer. I understand that the log by itself is OK if it happens sometimes. I will try to capture debug logs and provide them. BR
Here are the debug logs. A couple of minutes later the Kafka broker recovered, but Telegraf still seems to fill up its buffer and never recovers. It recovers only if I restart the containers. It did not happen to all the agents in my environment, but to around 30-40% of them.
Based on the logs, it does not seem like the node goes down for very long. Then, after it gets shut down a couple of times, sarama gives up trying to connect to that broker. There also never appears to be an election for a new leader when the node first goes down. It would be good to run with debugging on the other nodes as well to understand how some are recovering. Right now there is not enough info in this bug to go on beyond further guessing. Can you collect debug logs on other nodes as well and provide one from a node that recovers?
I attach logs from a node that recovers after the Kafka reboot (still, some other Telegraf containers did not recover). We are planning to downgrade Telegraf to an older version, since this is critical for our ability to monitor our production VMs (where we never had such a problem).
Thanks for those logs. Telegraf v1.20.2 had v1.27.2 of sarama. v1.32.0 of sarama was superseded due to IBM/sarama#2150, a producer hanging issue. Given Telegraf has since released v1.23.1, which includes v1.34.1 of sarama, can I ask if you can reproduce using the latest sarama library, with debug still enabled? I really do appreciate the back and forth. There is not a lot of kafka-specific code, and the sarama library is used for the connection and retries, so it would be really good to try out a new version. Thanks!
I tested with telegraf v1.23.1-alpine and I did not see any producer hanging.
Awesome, thank you for trying that out! And yes, I do think it is related to the sarama issue. The PR that resolved the issue spelled out the following scenario:
This would affect any version of Telegraf that uses v1.32.0 of sarama, which I think is v1.22.1-v1.23.0. The resolution is to move to the later v1.34.1 of the sarama library in v1.23.1 of Telegraf, which contains the fix. As a result, I am going to close this issue, but if you do come across something related to this, please feel free to comment. Again, I greatly appreciate you gathering those logs and trying the different versions!
Relevant telegraf.conf
Logs from Telegraf
System info
telegraf:1.22.1-alpine, docker engine 20.10.16
Docker
We have a cluster with around 100 VMs managed with Docker Swarm. On each VM we have several Telegraf agents to monitor different metrics (microservice- or VM-specific), reporting to a certain Kafka topic. Our Kafka is composed of three VMs for HA.
After upgrading from 1.20.2-alpine to 1.22.1-alpine, it seems that when we reboot a Kafka VM (never all of them at the same time), e.g. for maintenance reasons, some of the Telegraf agents get stuck on the following log:
"["outputs.kafka"] did not complete within its flush interval"
from which they never recover without restarting the relevant container. During this time we don't have any metrics for these VMs.
The same happens for other plugins we use as well, like elastic or postgres (but not for all Telegraf agents; it seems to me that those that report the most data hit the problem more easily).
I tried to mitigate the issue by increasing the flush_interval, but it seems that I have to use values greater than the time the rebooting Kafka VM takes to come up, which is not acceptable for our metrics. The same never happened with the 1.20.2-alpine version of the Docker image.
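For illustration, our setup looks roughly like the sketch below (broker hostnames, the topic name, and interval values are placeholders, not our actual configuration):

    [agent]
      interval = "10s"
      flush_interval = "10s"   # the setting I tried increasing as a workaround

    [[outputs.kafka]]
      ## three Kafka VMs for HA; per the comment above, sarama only uses these for the initial metadata bootstrap
      brokers = ["kafka-1:9092", "kafka-2:9092", "kafka-3:9092"]
      topic = "telegraf-metrics"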
I reproduced the same issue with a Kafka reboot on our dev environment as well, which is at a much smaller scale.
Steps to reproduce
1. Telegraf reporting data to a topic in Kafka.
2. Restart one of the Kafka VMs.
Expected behavior
I would expect Telegraf to send metrics to the other two VMs while one is rebooting and unavailable, or at least to recover after the VM has come back up.
Actual behavior
Some Telegraf containers stop reporting metrics and keep producing the following message in their logs:
"["outputs.kafka"] did not complete within its flush interval"
Additional info
No response