Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sync Producer stalled - Unable to send messages to Kafka after a write tcp: broken pipe error #2173

Closed
shweta-fourkites opened this issue Mar 10, 2022 · 18 comments

Comments

@shweta-fourkites
Copy link

Versions

Please specify real version numbers or git SHAs, not just "Latest" since that changes fairly regularly.

Sarama | Go | Kafka
v1.31.1 | 1.16.6 | 2.2.1

Configuration

What configuration values are you using for Sarama and Kafka?

config := sarama.NewConfig()
config.Producer.Return.Successes = true
config.Version = sarama.V2_2_1_0

Logs

When filing an issue please provide logs from Sarama and Kafka if at all
possible. You can set sarama.Logger to a log.Logger to capture Sarama debug
output.

logs: CLICK ME

2022/03/09 21:04:40.888730 sarama.go:122: client/metadata fetching metadata for all topics from broker test-broker-endpoint1:9092
2022/03/09 21:04:41.189656 sarama.go:122: client/metadata fetching metadata for all topics from broker test-broker-endpoint:9092
2022/03/09 21:04:41.295781 sarama.go:122: client/metadata fetching metadata for all topics from broker test-broker-endpoint:9092
2022/03/09 21:04:41.295838 client.go:929: client/metadata got error from broker 77 while fetching metadata: write tcp test-endpoint->test-endpoint: write: broken pipe
2022/03/09 21:04:41.295914 sarama.go:122: Closed connection to broker test-broker-endpoint:9092
2022/03/09 21:04:41.295927 sarama.go:122: client/brokers deregistered broker #77 at test-broker-endpoint:9092
2022/03/09 21:04:41.295944 sarama.go:122: client/metadata fetching metadata for all topics from broker test-broker-endpoint2:9092
2022/03/09 21:04:41.489721 sarama.go:122: client/brokers registered new broker #77 at test-broker-endpoint:9092
2022/03/09 21:14:40.889372 sarama.go:122: client/metadata fetching metadata for all topics from broker test-kafka-endpoint:9092
2022/03/09 21:14:41.189707 config.go:559: ClientID is the default of 'sarama', you should consider setting it to something application-specific.
2022/03/09 21:14:41.189739 sarama.go:122: client/metadata fetching metadata for all topics from broker test-broker-endpoint:9092
2022/03/09 21:14:41.192789 sarama.go:122: Connected to broker at test-broker-endpoint:9092 (registered as #77)
2022/03/09 21:14:41.295503 config.go:559: ClientID is the default of 'sarama', you should consider setting it to something application-specific.
2022/03/09 21:14:41.295542 sarama.go:122: client/metadata fetching metadata for all topics from broker test-broker-endpoint3:9092.compute-1.amazonaws.com:9092
2022/03/09 21:14:41.300616 sarama.go:122: Connected to broker at test-broker-endpoint3:9092.compute-1.amazonaws.com:9092 (registered as #74)
2022/03/09 21:24:40.888707 sarama.go:122: client/metadata fetching metadata for all topics from broker test-kafka-endpoint:9092
2022/03/09 21:24:41.189268 sarama.go:122: client/metadata fetching metadata for all topics from broker test-broker-endpoint3:9092.compute-1.amazonaws.com:9092
2022/03/09 21:24:41.189316 client.go:929: client/metadata got error from broker 74 while fetching metadata: write tcp test-endpoint->test-endpoint: write: broken pipe
2022/03/09 21:24:41.189360 sarama.go:122: Closed connection to broker test-broker-endpoint3:9092.compute-1.amazonaws.com:9092
2022/03/09 21:24:41.189370 sarama.go:122: client/brokers deregistered broker #74 at test-broker-endpoint3:9092.compute-1.amazonaws.com:9092
2022/03/09 21:24:41.189376 sarama.go:122: client/metadata fetching metadata for all topics from broker test-broker-endpoint2:9092
2022/03/09 21:24:41.189396 client.go:929: client/metadata got error from broker 47 while fetching metadata: write tcp test-endpoint->test-endpoint: write: broken pipe
2022/03/09 21:24:41.189427 sarama.go:122: Closed connection to broker test-broker-endpoint2:9092
2022/03/09 21:24:41.189436 sarama.go:122: client/brokers deregistered broker #47 at test-broker-endpoint2:9092
2022/03/09 21:24:41.189441 sarama.go:122: client/metadata fetching metadata for all topics from broker test-broker-endpoint:9092
2022/03/09 21:24:41.232417 sarama.go:122: client/brokers registered new broker #47 at test-broker-endpoint2:9092
2022/03/09 21:24:41.232435 sarama.go:122: client/brokers registered new broker #74 at test-broker-endpoint3:9092.compute-1.amazonaws.com:9092
2022/03/09 21:24:41.295667 sarama.go:122: client/metadata fetching metadata for all topics from broker test-broker-endpoint3:9092.compute-1.amazonaws.com:9092
2022/03/09 21:34:40.888928 sarama.go:122: client/metadata fetching metadata for all topics from broker test-kafka-endpoint:9092
2022/03/09 21:34:41.189283 sarama.go:122: client/metadata fetching metadata for all topics from broker test-broker-endpoint4.compute-1.amazonaws.com:9092
2022/03/09 21:34:41.189340 client.go:929: client/metadata got error from broker 48 while fetching metadata: write tcp test-endpoint->test-endpoint: write: broken pipe
2022/03/09 21:34:41.189405 sarama.go:122: Closed connection to broker test-broker-endpoint4.compute-1.amazonaws.com:9092
2022/03/09 21:34:41.189414 sarama.go:122: client/brokers deregistered broker #48 at test-broker-endpoint4.compute-1.amazonaws.com:9092
2022/03/09 21:34:41.189420 sarama.go:122: client/metadata fetching metadata for all topics from broker test-broker-endpoint:9092
2022/03/09 21:34:41.235015 sarama.go:122: client/brokers registered new broker #48 at test-broker-endpoint4.compute-1.amazonaws.com:9092
2022/03/09 21:34:41.295952 sarama.go:122: client/metadata fetching metadata for all topics from broker test-broker-endpoint5.compute-1.amazonaws.com:9092
2022/03/09 21:34:41.296034 client.go:929: client/metadata got error from broker 75 while fetching metadata: write tcp test-endpoint->test-endpoint: write: broken pipe
2022/03/09 21:34:41.296111 sarama.go:122: Closed connection to broker test-broker-endpoint5.compute-1.amazonaws.com:9092
2022/03/09 21:34:41.296129 sarama.go:122: client/brokers deregistered broker #75 at test-broker-endpoint5.compute-1.amazonaws.com:9092
2022/03/09 21:34:41.296143 config.go:559: ClientID is the default of 'sarama', you should consider setting it to something application-specific.
2022/03/09 21:34:41.296194 sarama.go:122: client/metadata fetching metadata for all topics from broker test-broker-endpoint:9092
2022/03/09 21:34:41.299010 sarama.go:122: Connected to broker at test-broker-endpoint:9092 (registered as #77)
2022/03/09 21:34:41.346014 sarama.go:122: client/brokers registered new broker #75 at test-broker-endpoint5.compute-1.amazonaws.com:9092
2022/03/09 21:44:40.889425 sarama.go:122: client/metadata fetching metadata for all topics from broker test-kafka-endpoint:9092

Problem Description

We are using SyncProducer to push messages to a destination kafka topic.
Sometimes, we see the write tcp host:47562->host:9092: write: broken pipe error in logs which is printed after the timeout.
Once this happens, the metadata fetch happens every 10 mins (default), nothing else is happening between the fetches.
The producer is never able to send messages after this. The producer is just struck.

On restarting the application, the producer is able to obtain connection again and start sending the messages (Until this happens again!)

Has anybody faced this? Does it have to do with any producer configuration that we missed overriding?
The expectation is for the producer to connect back to broker and start sending messages automatically without requiring an app restart.

@shweta-fourkites
Copy link
Author

shweta-fourkites commented Mar 10, 2022

@dnwe @bai @slaunay

@slaunay
Copy link
Contributor

slaunay commented Mar 10, 2022

@shweta-fourkites if you are using Samara 1.31.1, this is likely a duplicate of #2129 (producer deadlock) that was fixed in #2133.
The fix is available in 1.32.0 released recently but not yet listed in the CHANGELOG.md.

There is also another similar issue #2150 that has been difficult to reproduce where the producer would hang and a record with a nil key and a nil value is produced to the target topic.

@shweta-fourkites
Copy link
Author

@slaunay Thank you for checking this.
I will update the version to 1.32.0 and see if that helps..

@shweta-fourkites
Copy link
Author

@slaunay ,
with version v1.32.0

2022/03/14 15:54:42.498472 async_producer.go:611: producer/leader/testtopic/4 state change to [retrying-1]
2022/03/14 15:54:42.498480 async_producer.go:621: producer/leader/testtopic/4 abandoning broker 47
2022/03/14 15:54:42.498651 async_producer.go:809: producer/broker/47 input chan closed
2022/03/14 15:54:42.498691 async_producer.go:899: producer/broker/47 shut down
2022/03/14 15:54:42.599000 sarama.go:122: client/metadata fetching metadata for [testtopic] from broker testendpoint
2022/03/14 15:57:39.017057 sarama.go:122: client/metadata fetching metadata for all topics from broker testendpoint
2022/03/14 15:57:39.315335 sarama.go:122: client/metadata fetching metadata for all topics from broker testendpoint1
2022/03/14 15:57:39.623175 sarama.go:122: client/metadata fetching metadata for all topics from broker testendpoint
2022/03/14 16:07:39.017423 sarama.go:122: client/metadata fetching metadata for all topics from broker testendpoint
2022/03/14 16:07:39.315637 sarama.go:122: client/metadata fetching metadata for all topics from broker testendpoint1
2022/03/14 16:07:39.623326 sarama.go:122: client/metadata fetching metadata for all topics from broker testendpoint

The connection is not re-established unless i restart my application

@shweta-fourkites
Copy link
Author

@slaunay @dnwe is this expected even after the fix in V1.32.0? After upgrading to 1.32.0 I see the frequency of this has reduced but still it is happening for few consumers in our cluster.
If this is expected, is there a plan to fix this in the near future?

@slaunay
Copy link
Contributor

slaunay commented Mar 19, 2022

@shweta-fourkites I was able to reproduce what looks like #2150 yesterday.

There seem to be a race condition during retries where:

  • a brokerProducer is shutting down after getting an I/O error and the retry logic is triggered
  • when a retry is seen by the partitionProducer a message with the fin flag is sent to the "unheathly" brokerProducer and that brokerProducer is abandoned (closed)
  • a new "healthy" brokerProducer is then created
  • if the "unheathy" brokerProducer (being closed) fails again then the retry logic is triggered again
  • when a retry is seen by the partitionProducer a message with the fin flag is sent again but this time to the "heathly" brokerProducer
  • such "healthy" brokerProducer treat the fin flag as a regular message and use it in the next ProduceRequest
  • the fin message is not bubbling up to the partitionerProducer that will buffer new message till it sees that fin message
  • as the fin message escaped to the target Kafka topic the partitionProducer is not producing anymore (just buffering)

It's not clear to me if this was also possible before honouring produce request pipelining.

Can you confirm that when the producer stops producing to that particular topic partition, a Kafka record with a null key and value was produced in the target Kafka topic?

I was going to submit a simple test case to reproduce it with a fix but there seem to be a fix already in #2182.

@dnwe Should we add another retract directive for v1.32.0 referencing #2150 (and marking that issue as fixed) when v1.32.1 is released and if #2182 gets merged?

@shweta-fourkites
Copy link
Author

Hi @slaunay @niamster @dnwe , I patched the changes in #2182 and I am no longer facing the issue! Are we planning to get this merged and release a new version with the fix anytime soon?

@slaunay
Copy link
Contributor

slaunay commented Mar 25, 2022

Thanks a lot for testing the fix @shweta-fourkites and confirming that the issue is fixed.

I'm not a maintainer for Sarama but I believe the goal is indeed to have the fix merged, add a retract for v1.32.0 and release a new version.

@softwarespot
Copy link

For now we have reverted back to 1.30.1 until this has been released

@mkevac
Copy link
Contributor

mkevac commented Apr 6, 2022

Decided to downgrade to 1.30.1 too as I have stumbled upon same bug.

@eafzali
Copy link

eafzali commented May 10, 2022

We are also still seeing the issue happening in 1.32.0, it's happening less often than 1.31.0 but still happens from time to time. Is there any plans for releasing the fix?

@dnwe
Copy link
Collaborator

dnwe commented May 10, 2022

@eafzali we are basically ready to cut a release, but we were waiting for more confirmation that github.com/Shopify/Sarama@main fixed the issue for people before cutting the release. Have you managed to test with main?

@eafzali
Copy link

eafzali commented May 10, 2022

Not yet, we have switched back to 1.30.1 and so far we didn't have a problem. We can try out main but it's really hard for us to "confirm" anything as it randomly happens for us once or twice a week. I guess related to the stability of our Kafka Cluster :)

@shweta-fourkites
Copy link
Author

@dnwe any idea when the changes from #2182 can be released to a stable branch for use in production?

@david-bergman
Copy link

experiencing the same issue with sarama v1.37.2 connecting to Confluent cloud kafka instances. write: broken pipe

@shmilyoo
Copy link

shmilyoo commented Dec 5, 2022

same error in v1.37.2
producer on error callback : kafka: Failed to produce message to topic apm-aidc-aggs-sync: write tcp ip:port->kafka-ip:9092: write: broken pipe

@david-bergman
Copy link

issue continues with sarama v1.38.1... write tcp ... write: broken pipe

@david-bergman
Copy link

perhaps with the backing of IBM this issue can finally be addressed ? using v1.38.2-0.20230327141928-9127f1c432c0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants