
Fix issues in otelsarama.WrapAsyncProducer #754

Merged
merged 13 commits into open-telemetry:main from patch-1 on Jun 30, 2021

Conversation

@pellared pellared (Member) commented Apr 20, 2021

Why

Using the current Sarama instrumentation can lead to deadlocks and data races. These issues do not occur when the instrumentation is not applied (I have created tests covering the deadlock scenarios).

The messages returned by the Errors() channel do not contain the original Metadata. (I can split this into a separate PR if needed.)

What

  1. Fix deadlocks by changing the concurrency design (a sketch of the resulting pattern follows this list).

    • Use dedicated goroutines for consuming messages and for producing messages. This prevents the deadlocks and also makes the implementation less dependent on Sarama's internal implementation.
    • Use dedicated channels for closing instead of reusing the input channel.
    • Redesign the "spans cleanup" so that it does NOT depend on Sarama's implementation details (which may change). See: Possible race condition in Sarama async producer #435
    • All changes were made in a TDD way; however, there may still be edge cases that I have not thought of.
  2. Restore the messages' Metadata for the wrapped Errors channel.
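
To illustrate the resulting pattern, here is a simplified sketch (not the actual code in this PR; the asyncProducerSketch and wrapSketch names are made up for the example). One goroutine is the only writer to the inner Input channel and the only place that triggers the inner shutdown, while a second goroutine drains the inner Successes/Errors channels and is the only one that closes the wrapped channels. The real wrapper additionally starts/ends a span per message and restores the caller's Metadata.

package sketch

import (
	"sync"

	"github.com/Shopify/sarama"
)

// asyncProducerSketch embeds the inner producer so unrelated methods are
// promoted; only the channel-related methods are overridden here.
type asyncProducerSketch struct {
	sarama.AsyncProducer
	input     chan *sarama.ProducerMessage
	successes chan *sarama.ProducerMessage
	errors    chan *sarama.ProducerError
	closed    chan struct{}
}

func (w *asyncProducerSketch) Input() chan<- *sarama.ProducerMessage     { return w.input }
func (w *asyncProducerSketch) Successes() <-chan *sarama.ProducerMessage { return w.successes }
func (w *asyncProducerSketch) Errors() <-chan *sarama.ProducerError      { return w.errors }

// AsyncClose only closes the wrapper's own input channel; the producing
// goroutine below propagates the shutdown to the inner producer.
func (w *asyncProducerSketch) AsyncClose() { close(w.input) }

// Close is simplified here: the real wrapper distinguishes Close from
// AsyncClose via a dedicated signal channel and returns the inner error.
func (w *asyncProducerSketch) Close() error {
	w.AsyncClose()
	<-w.closed
	return nil
}

func wrapSketch(inner sarama.AsyncProducer) sarama.AsyncProducer {
	w := &asyncProducerSketch{
		AsyncProducer: inner,
		input:         make(chan *sarama.ProducerMessage),
		successes:     make(chan *sarama.ProducerMessage),
		errors:        make(chan *sarama.ProducerError),
		closed:        make(chan struct{}),
	}

	// Producing goroutine: the only writer to the inner Input channel and
	// the only place that asks the inner producer to shut down.
	go func() {
		for msg := range w.input {
			// The real wrapper starts a span here and remembers the
			// caller's Metadata so it can be restored later.
			inner.Input() <- msg
		}
		inner.AsyncClose()
	}()

	// Consuming goroutine: drains the inner Successes and Errors channels
	// and is the only writer to (and closer of) the wrapped ones.
	go func() {
		var wg sync.WaitGroup
		wg.Add(2)
		go func() {
			defer wg.Done()
			for msg := range inner.Successes() {
				// The real wrapper ends the span and restores Metadata here.
				w.successes <- msg
			}
		}()
		go func() {
			defer wg.Done()
			for pe := range inner.Errors() {
				// The real wrapper also restores the original Metadata on pe.Msg.
				w.errors <- pe
			}
		}()
		wg.Wait()
		close(w.successes)
		close(w.errors)
		close(w.closed)
	}()

	return w
}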

Important Notice

Kudos to @MadVikingGod for a lot of help and for having concerns with the initial fixes 👍 🥇

Still, a very careful review is needed.

Example deadlock

Reproducing a sample deadlock:

package main

import (
	"fmt"
	"log"

	"github.com/Shopify/sarama"
	"go.opentelemetry.io/contrib/instrumentation/github.com/Shopify/sarama/otelsarama"
)

/*
	EXECUTE:

docker network create confluent

docker run --rm -d \
	--name zookeeper \
	--network confluent \
	-p 2181:2181 \
	-e ZOOKEEPER_CLIENT_PORT=2181 \
	confluentinc/cp-zookeeper:5.0.0

docker run --rm -d \
	--name kafka \
	--network confluent \
	-p 9092:9092 \
	-e KAFKA_ZOOKEEPER_CONNECT=zookeeper:2181 \
	-e KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://localhost:9092 \
	-e KAFKA_LISTENERS=PLAINTEXT://0.0.0.0:9092 \
	-e KAFKA_CREATE_TOPICS=gotest:1:1 \
	-e KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR=1 \
	confluentinc/cp-kafka:5.0.0

go run -race .

*/

func main() {
	cfg := sarama.NewConfig()
	cfg.Version = sarama.V0_11_0_0 // minimum version that supports headers which are required for tracing

	producer, err := sarama.NewAsyncProducer([]string{"localhost:9092"}, cfg)
	if err != nil {
		panic(err)
	}
	defer producer.Close()

	producer = otelsarama.WrapAsyncProducer(cfg, producer)

	msg := &sarama.ProducerMessage{
		Topic: "some-topic",
		Value: sarama.StringEncoder("Hello World"),
	}

	// close the producer
	if err := producer.Close(); err != nil {
		log.Fatalln(err)
	}

	// ensure that WrapAsyncProducer has exited
	if _, ok := <-producer.Successes(); ok {
		log.Fatalln("should not receive anything")
	}

	log.Println("Sending message", msg)

	defer func() {
		if r := recover(); r != nil {
			fmt.Println("RECOVERED FROM MAIN:", r)
			return
		}
	}()

	sarama.PanicHandler = func(r interface{}) { // sarama has a data race for PanicHandler...
		fmt.Println("RECOVERED IN SARAMA:", r)
	}

	producer.Input() <- msg // hangs forever (should panic from HERE - not internally in sarama)

	log.Println("Message sent", msg)
}


codecov bot commented Apr 20, 2021

Codecov Report

Merging #754 (e938f1b) into main (d41f1f8) will increase coverage by 0.1%.
The diff coverage is 98.2%.

Impacted file tree graph

@@           Coverage Diff           @@
##            main    #754     +/-   ##
=======================================
+ Coverage   78.8%   79.0%   +0.1%     
=======================================
  Files         62      62             
  Lines       2707    2726     +19     
=======================================
+ Hits        2135    2154     +19     
  Misses       440     440             
  Partials     132     132             
Impacted Files Coverage Δ
...n/github.com/Shopify/sarama/otelsarama/producer.go 95.7% <98.2%> (+0.8%) ⬆️

Aneurysm9 previously approved these changes Apr 29, 2021
@pellared pellared requested a review from MadVikingGod May 4, 2021 10:29
@pellared pellared changed the title from "Fix possible deadlock in sarama instrumentation WrapAsyncProducer" to "Fix possible deadlocks in sarama instrumentation WrapAsyncProducer" May 4, 2021
@pellared pellared (Member, Author) commented May 4, 2021

After great tips from @MadVikingGod, this ended up being almost a complete rewrite.
It also solves: #755

Careful review is needed...

@pellared pellared changed the title from "Fix possible deadlocks in sarama instrumentation WrapAsyncProducer" to "Fix concurrency issues in sarama instrumentation WrapAsyncProducer" May 4, 2021
@pellared pellared changed the title from "Fix concurrency issues in sarama instrumentation WrapAsyncProducer" to "Fix concurrency issues in sarama instrumentation" May 4, 2021
@pellared pellared requested a review from Aneurysm9 May 4, 2021 20:27
@pellared pellared marked this pull request as draft May 5, 2021 08:13
@pellared pellared marked this pull request as ready for review May 5, 2021 09:24
@pellared pellared marked this pull request as draft May 5, 2021 09:24
@pellared pellared closed this May 5, 2021
@pellared pellared reopened this May 5, 2021
@pellared pellared changed the title from "Fix concurrency issues in sarama instrumentation" to "[WIP] Fix concurrency issues in sarama instrumentation" May 5, 2021
@pellared pellared closed this May 5, 2021
@pellared pellared reopened this May 5, 2021
@pellared pellared changed the title from "[WIP] Fix concurrency issues in sarama instrumentation" to "Fix concurrency issues in sarama instrumentation" May 5, 2021
@pellared pellared marked this pull request as ready for review May 5, 2021 10:29
@pellared pellared (Member, Author) commented May 5, 2021

@Aneurysm9 @MadVikingGod Ready for review

@Aneurysm9 Aneurysm9 dismissed their stale review May 5, 2021 15:44

PR changed significantly, will re-review.

@pellared pellared marked this pull request as draft June 25, 2021 06:12
@pellared pellared changed the title from "Fix concurrency issues in otelsarama.WrapAsyncProducer" to "Fix issues in otelsarama.WrapAsyncProducer" Jun 25, 2021
@pellared pellared requested review from lizthegrey and Aneurysm9 June 25, 2021 08:48
@pellared pellared marked this pull request as ready for review June 25, 2021 08:48
@pellared pellared changed the title from "Fix issues in otelsarama.WrapAsyncProducer" to "Fix issues in otelsarama.WrapAsyncProducer" Jun 25, 2021
@pellared pellared requested a review from Aneurysm9 June 25, 2021 15:44
@MadVikingGod MadVikingGod (Contributor) left a comment

I haven't finished, but so far it looks good. The producers of the channels are the ones closing them; there is overall good hygiene around the close semantics. I will try to get to the tests later, but consider this a soft +1.
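
(As an aside, here is a tiny, generic illustration of the "the sender closes the channel" hygiene referred to above; this is not code from the PR.)

package main

import "fmt"

func main() {
	results := make(chan int)

	// The sending goroutine owns the channel and is the only one to close it,
	// so receivers can safely range over it until it is closed.
	go func() {
		defer close(results)
		for i := 0; i < 3; i++ {
			results <- i * i
		}
	}()

	for r := range results {
		fmt.Println(r)
	}
}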

@MadVikingGod MadVikingGod (Contributor) left a comment

LGTM

@MrAlias MrAlias merged commit b34e69b into open-telemetry:main Jun 30, 2021
@pellared pellared deleted the patch-1 branch June 30, 2021 21:31
garrettwegan referenced this pull request in garrettwegan/opentelemetry-go-contrib Jul 13, 2021
* Fix deadlocks and data race in WrapAsyncProducer

* Refine changelog

* Improve TestAsyncProducer_ConcurrencyEdgeCases

* Add TestWrapAsyncProducer_DrainsSuccessesAndErrorsChannels

* Refine WrapAsyncProducer to not depend on its implementation details

* Restore message metadata for errors

* Update changelog

* Remove empty line

* Remove empty line

* Refactor spans cleanup

* Update producer_test.go

Co-authored-by: Tyler Yahn <[email protected]>
garrettwegan referenced this pull request in garrettwegan/opentelemetry-go-contrib Jul 22, 2021
@Aneurysm9 Aneurysm9 mentioned this pull request Jul 26, 2021
plantfansam referenced this pull request in plantfansam/opentelemetry-go-contrib Mar 18, 2022
Replace sdktrace.WithResourceAttributes() with WithResource()
@pellared pellared added this to the untracked milestone Nov 8, 2024