fix: race condition (may panic) when closing consumer group #2331
Background
We encountered a data race when closing a consumer group.
Versions
Please specify real version numbers or git SHAs, not just "Latest" since that changes fairly regularly.
Logs
logs: CLICK ME
Investigation Results
After investigation, we found the root cause.
handleError is used to collect errors from heartBeatLoop, partitionConsumer, offsetManager, etc. Some goroutines are also spawned for error collecting, including example1, example1, example3 ....
When c.config.Consumer.Return.Errors=true, those errors are sent to a collective channel, c.errors. The ordering of c.errors <- err and close(c.errors) is non-deterministic and can, in theory, cause a panic. One possible sequence: after handleError checks that the consumer group is not closed, the scheduler switches to another goroutine, which calls close(c.errors). When the original goroutine resumes, c.errors <- err panics because it sends an error on a closed channel. (The flow is shown in a screenshot.)
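The interleaving described above can be replayed deterministically. The following is a minimal sketch, not code from the library: unsafeGroup, isClosed, and replayRace are hypothetical names, and the three steps run sequentially in one goroutine purely to make the outcome reproducible.

```go
package main

import "errors"

// unsafeGroup is a hypothetical stand-in for the consumer group. It keeps
// only the two channels involved in the race.
type unsafeGroup struct {
	closed chan struct{} // closed when shutdown starts
	errors chan error    // collective error channel
}

func newUnsafeGroup() *unsafeGroup {
	return &unsafeGroup{
		closed: make(chan struct{}),
		errors: make(chan error, 1),
	}
}

// isClosed reports whether shutdown has started.
func (g *unsafeGroup) isClosed() bool {
	select {
	case <-g.closed:
		return true
	default:
		return false
	}
}

// replayRace performs the problematic interleaving step by step and returns
// the recovered panic message, or "" if nothing panicked.
func replayRace(g *unsafeGroup) (msg string) {
	defer func() {
		if r := recover(); r != nil {
			if e, ok := r.(error); ok {
				msg = e.Error()
			}
		}
	}()

	// Step 1: the sender checks the group and sees it as still open.
	if g.isClosed() {
		return ""
	}

	// Step 2: the scheduler switches to the closing goroutine.
	close(g.closed)
	close(g.errors)

	// Step 3: the sender resumes and sends on the now-closed channel.
	g.errors <- errors.New("heartbeat failed")
	return ""
}
```

Calling replayRace(newUnsafeGroup()) returns "send on closed channel", the runtime panic message. The point of the sketch is that the closed check and the send are two separate steps, so nothing prevents the close from landing between them.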
Solution
To avoid the deadlock issue raised in #1581, I create a separate, dedicated lock instead of reusing the existing c.lock.
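A dedicated lock of this kind might look roughly like the sketch below. This illustrates the idea under assumptions and is not the actual patch: safeGroup, errorsLock, handleError's bool return value, shutdown, and errDemo are all hypothetical names chosen for the example. Senders hold a read lock across the closed check and the send, and the closer takes the write lock before closing the channel, so the close can no longer land between the two steps.

```go
package main

import (
	"errors"
	"sync"
)

// errDemo is a placeholder error used only for illustration.
var errDemo = errors.New("demo error")

// safeGroup is a hypothetical stand-in for the consumer group.
type safeGroup struct {
	closed chan struct{} // closed when shutdown starts
	errors chan error    // collective error channel

	// errorsLock guards only sends to and the close of errors. It is kept
	// separate from any general-purpose lock so that error reporting never
	// waits on unrelated critical sections.
	errorsLock sync.RWMutex
}

func newSafeGroup() *safeGroup {
	return &safeGroup{
		closed: make(chan struct{}),
		errors: make(chan error, 1),
	}
}

// handleError forwards err to the errors channel and reports whether it was
// delivered. Many senders may hold the read lock at the same time.
func (g *safeGroup) handleError(err error) bool {
	g.errorsLock.RLock()
	defer g.errorsLock.RUnlock()

	select {
	case <-g.closed:
		return false // shutdown has started; drop the error
	default:
	}

	select {
	case g.errors <- err:
		return true
	default:
		return false // nobody is listening and the buffer is full; drop
	}
}

// shutdown must be called at most once. The write lock waits for in-flight
// senders to finish, so errors is closed only when no send is in progress.
func (g *safeGroup) shutdown() {
	close(g.closed)
	g.errorsLock.Lock()
	defer g.errorsLock.Unlock()
	close(g.errors)
}
```

The send inside handleError is non-blocking, so a sender never holds the read lock while waiting for a receiver. That keeps shutdown from stalling on the write lock when nobody is draining the channel.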