fix: prevent deadlock between subscription manager and consumer goroutines #2194
Conversation
…tionConsumer threads. Since the subscription manager was improved to batch subscriptions (see IBM@dadcd80), it created a deadlock in the case when new subscriptions are added after a rebalance.
I was trying to troubleshoot another bug and wrote this:

import (
	"context"
	"fmt"
	"math/rand"
	"strconv"
	"testing"
	"time"

	"github.com/stretchr/testify/require"
)

type inlineHandler func(sess ConsumerGroupSession, claim ConsumerGroupClaim) error

func (inlineHandler) Setup(_ ConsumerGroupSession) error   { return nil }
func (inlineHandler) Cleanup(_ ConsumerGroupSession) error { return nil }
func (h inlineHandler) ConsumeClaim(sess ConsumerGroupSession, claim ConsumerGroupClaim) error {
	return h(sess, claim)
}

func TestConsumerGroupRebalance(t *testing.T) {
	const topic = "test-topic"

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	config := NewConfig()
	config.Producer.Return.Successes = true
	config.Consumer.Offsets.Initial = OffsetOldest

	c, err := NewClient([]string{"localhost:9092"}, config)
	require.NoError(t, err)

	admin, err := NewClusterAdminFromClient(c)
	require.NoError(t, err)
	err = admin.DeleteTopic(topic)
	t.Log(err)
	err = admin.CreateTopic(topic, &TopicDetail{NumPartitions: 4, ReplicationFactor: 1}, false)
	t.Log(err)

	p, err := NewSyncProducerFromClient(c)
	require.NoError(t, err)

	g, err := NewConsumerGroupFromClient("group", c)
	require.NoError(t, err)

	timer := time.NewTimer(1 * time.Minute)
	defer timer.Stop()

	resets := make(chan struct{})
	go func() {
		for {
			select {
			case <-ctx.Done():
				return
			case <-timer.C:
				timer.Reset(time.Duration(rand.Float64()*100_000) * time.Microsecond)
				resets <- struct{}{}
			}
		}
	}()

	ch := make(chan string)
	for i := 0; i < 10; i++ {
		go func(g ConsumerGroup) {
			for {
				if ctx.Err() != nil { // context was cancelled
					t.Log("consumer-group stopped")
					break
				}
				err := g.Consume(ctx, []string{topic}, inlineHandler(func(sess ConsumerGroupSession, claim ConsumerGroupClaim) error {
					for {
						select {
						case <-resets:
							return nil
						case <-ctx.Done():
							return ctx.Err()
						case message, ok := <-claim.Messages():
							if !ok {
								return nil
							}
							ch <- string(message.Value)
							sess.MarkMessage(message, "")
						}
					}
				}))
				t.Logf("consumer-group disconnected: %s", err)
			}
		}(g)
	}

	time.Sleep(2 * time.Second)
	t.Log("booted")
	timer.Reset(0)

	n := 1000
	go func() {
		for i := 0; i < n; i++ {
			time.Sleep(time.Duration(rand.Float64()*1_000) * time.Microsecond)
			m := &ProducerMessage{
				Topic: topic,
				Value: StringEncoder(strconv.FormatInt(int64(i), 10)),
			}
			_, _, err := p.SendMessage(m)
			require.NoError(t, err)
			//p.Input() <- m
		}
	}()

	m := make(map[string]struct{}, n)
	for len(m) < n {
		line := <-ch
		m[line] = struct{}{}
		t.Logf("received: %s, count: %d", line, len(m))
	}

	cancel()
	err = g.Close()
	require.NoError(t, err)
	err = p.Close()
	require.NoError(t, err)
	err = c.Close()
	require.NoError(t, err)
}

It introduces chaos, randomly sleeping and exiting claims to trigger rebalances.
When broken, it just gets stuck in a disconnect loop:
I verified that:
However, I tried applying this patch and it didn't seem to fix it; it just gets stuck in the "disconnected" loop.
Thanks @lavoiesl for writing a unit test ❤️. I will try to reproduce the problem with it and see what is wrong.
@lavoiesl If I understand correctly… Since I'm back to this PR, I will try to write a functional test that illustrates the original problem I was trying to fix.
Unfortunately I was not able to create a simple test case that could robustly reproduce the issue with consumer groups. With Kafka launched with the help of docker-compose, if I run
I can reproduce the deadlock:
Fair point, here's a similar version, but with different clients:

p, err := NewSyncProducerFromClient(c)
require.NoError(t, err)

n := 1000
ch := make(chan string, n)
for i := 0; i < 27; i++ {
	go func(i int) {
		time.Sleep(time.Duration(i) * time.Second)
		for {
			if ctx.Err() != nil { // context was cancelled
				t.Log("consumer-group stopped")
				break
			}
			config := NewConfig()
			config.ClientID = fmt.Sprintf("test-%d", i)
			config.Consumer.Offsets.Initial = OffsetOldest
			config.Consumer.Group.Heartbeat.Interval = 100 * time.Millisecond
			config.Consumer.Group.Rebalance.Timeout = 100 * time.Millisecond
			g, err := NewConsumerGroup([]string{"sarama.railgun:9092"}, "group", config)
			require.NoError(t, err)
			alive := true
			for alive {
				err = g.Consume(ctx, []string{topic}, inlineHandler(func(sess ConsumerGroupSession, claim ConsumerGroupClaim) error {
					for {
						select {
						case <-ctx.Done():
							return ctx.Err()
						case message, ok := <-claim.Messages():
							if !ok {
								return nil
							}
							if rand.Float64() < 0.05 {
								alive = false
								return errors.New("reset")
							}
							ch <- fmt.Sprintf("%d-%s", i, message.Value)
							sess.MarkMessage(message, "")
						}
					}
				}))
				t.Logf("consumer-group disconnected: %s", err)
			}
			require.NoError(t, g.Close())
		}
	}(i)
}

time.Sleep(5 * time.Second)
t.Log("booted")

go func() {
	for i := 0; i < n; i++ {
		m := &ProducerMessage{
			Topic: topic,
			Value: StringEncoder(strconv.FormatInt(int64(i), 10)),
		}
		_, _, err := p.SendMessage(m)
		t.Logf("send message %d: %s", i, err)
	}
}()

m := make(map[string]struct{}, n)
for len(m) < n {
	line := <-ch
	m[line] = struct{}{}
	t.Logf("received: %s, count: %d", line, len(m))
}

cancel()
err = p.Close()
require.NoError(t, err)
err = c.Close()
require.NoError(t, err)

It eventually succeeds, but with zillions of disconnects. No, I do not have a real-world example of this happening; I don't have any production apps that use sarama's unreleased versions.
@lavoiesl thanks! I think the extra delay I've introduced (250ms) causes the test to run slower. I think it's fair to lower the delay to, say, 100ms.
@lavoiesl I think I've spotted another regression: the loop creates a new timer each time a new subscription is added, so the overall "wait" time is extended by another 250ms per subscription.
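To illustrate the shape of that regression, here is a minimal, hypothetical sketch (collectBatch and batchWindow are made-up names, not Sarama's actual subscription-manager code): starting the timer once per batch bounds the extra wait to a single window, whereas resetting or recreating it on every new subscription keeps pushing the deadline out.

// Hypothetical illustration only; collectBatch and batchWindow are not part of Sarama.
package main

import (
	"fmt"
	"time"
)

// collectBatch gathers subscriptions that arrive within a single batch window.
func collectBatch(subscriptions <-chan string, batchWindow time.Duration) []string {
	first := <-subscriptions
	batch := []string{first}

	// Start the window once, when the first subscription arrives.
	deadline := time.NewTimer(batchWindow)
	defer deadline.Stop()

	for {
		select {
		case sub := <-subscriptions:
			batch = append(batch, sub)
			// Regression shape: calling deadline.Reset(batchWindow) here (or creating a
			// new timer) on every subscription extends the wait by another window each
			// time, so a steady stream of subscriptions delays the batch indefinitely.
		case <-deadline.C:
			return batch
		}
	}
}

func main() {
	subs := make(chan string, 3)
	subs <- "topic-a/0"
	subs <- "topic-a/1"
	subs <- "topic-b/0"
	fmt.Println(collectBatch(subs, 100*time.Millisecond))
}

In this sketch the worst-case extra latency is one batchWindow regardless of how many subscriptions arrive during it.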
21d1414 to e0c263a: Fixing golangci-lint warnings
LGTM - the changes look reasonable to me now. Let's go ahead and merge this and continue to perform testing on main prior to cutting the next release.
Thanks a lot, Dominic.
Since the subscription manager was improved to batch the subscriptions (see dadcd80),
it created a deadlock in the case when new subscriptions are added after a rebalance.
After a rebalance, the subscription is stuck on pushing to the wait channel,
which is drained only if there are no active subscriptions.
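As a rough, self-contained sketch of that deadlock shape (the manager type, active counter, and wait channel below are illustrative stand-ins, not Sarama's actual implementation): the subscribing goroutine blocks sending on wait, while the manager loop only drains wait when there are no active subscriptions, so neither side can make progress.

// A rough sketch of the deadlock described in the commit message above.
// manager, active, and wait are illustrative names, not Sarama's actual types.
package main

import (
	"fmt"
	"time"
)

type manager struct {
	active int           // subscriptions currently being served
	wait   chan struct{} // unbuffered: new subscriptions signal the manager here
}

// run drains wait only while there are no active subscriptions.
func (m *manager) run(stop <-chan struct{}) {
	for {
		if m.active == 0 {
			select {
			case <-m.wait: // accept a new subscription only when idle
			case <-stop:
				return
			}
			continue
		}
		select {
		case <-stop:
			return
		default:
			time.Sleep(10 * time.Millisecond) // busy serving existing subscriptions
		}
	}
}

// subscribe is what a new claim does after a rebalance: it announces itself by
// pushing to wait, and blocks until the manager drains the channel.
func (m *manager) subscribe() {
	m.wait <- struct{}{}
}

func main() {
	// One subscription is already active when the rebalance hands over a new
	// partition, so the manager never becomes idle and never drains wait.
	m := &manager{active: 1, wait: make(chan struct{})}
	stop := make(chan struct{})
	go m.run(stop)

	done := make(chan struct{})
	go func() {
		m.subscribe() // the new subscription is stuck here
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("subscription registered")
	case <-time.After(500 * time.Millisecond):
		fmt.Println("deadlock: subscribe() is still blocked pushing to wait")
	}
	close(stop)
}

Under these assumptions the new claim never gets registered, which matches the "stuck on pushing to the wait channel" behaviour described in the commit message.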