Fix Deadlock in groupbytrace #1847
Conversation
You've been a hero, @chris-smith-zocdoc! Your report helped me with #1842, which didn't originally include a fix for #1811. But once I spotted the problem, fixing it was actually quite easy. Could you give it a try? I think the solution there would be preferable to this one here. What do you think? By the way, the merge conflicts on this one weren't caused by me :-)
@jpkrohling that may be one source of the deadlocks, but it is not the only one. I tried with just that change and my tests still deadlock; you can view them here: https://github.com/open-telemetry/opentelemetry-collector/compare/master...Zocdoc:issue_1811_storage_fix_only?expand=1 I'll update this PR to include the defer fix as well, but I believe this more invasive change is necessary to address the other deadlock I mentioned in the PR description and in my comments on the issue. I'm in the process of fixing the merge conflicts.
Here is a small example that illustrates what is happening more clearly. This test will deadlock:

package example

import (
    "log"
    "testing"

    "github.com/stretchr/testify/assert"
)

func TestExampleDeadlock(t *testing.T) {
    c := make(chan struct{}, 1)
    // 1 item in channel, buffer is now full
    c <- struct{}{}
    for range c { // dequeue first item, buffer empty
        log.Printf("enqueuing first item from consumer")
        // enqueue, buffer full now
        c <- struct{}{}
        log.Printf("success")
        log.Printf("enqueuing second item from consumer")
        // deadlock: the consumer is trying to produce events on a full buffer
        c <- struct{}{}
        // will never reach this line
        log.Printf("success")
        break
    }
    assert.True(t, true)
}
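For contrast, here is a minimal sketch of one shape that cannot wedge (this is not the fix this PR takes, and it assumes the same package and imports as the test above): when the consumer must re-enqueue work, handing the send to a separate goroutine keeps the consumer free to keep draining the channel.

func TestExampleNoDeadlock(t *testing.T) {
    c := make(chan struct{}, 1)
    c <- struct{}{}
    count := 0
    for range c {
        count++
        if count == 3 {
            break
        }
        // hand the send to another goroutine so the consumer keeps draining
        go func() { c <- struct{}{} }()
    }
    assert.Equal(t, 3, count)
}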
…d by the groupbytrace processor There are only two events now: one for adding spans and one for releasing the trace. This removes the prior cases where the processor would need to emit events while already processing an event, which would trigger a deadlock when the buffer was full. This change also modifies the behavior for incomplete evicted traces: they are now passed to the downstream processors early instead of being discarded.
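A hedged sketch of that two-event design follows; the names (eventMachine, traceReceived, traceExpired) and the map-of-strings storage are illustrative assumptions, not the processor's real API. The point is that all state is touched only by the single consumer goroutine, and handling one event never emits another.

package example

type eventType int

const (
    traceReceived eventType = iota
    traceExpired
)

type event struct {
    typ     eventType
    traceID string
    spans   []string // stand-in for real span data
}

// eventMachine owns all trace state; only the consumer goroutine below touches it.
type eventMachine struct {
    events chan event
    traces map[string][]string
}

func (em *eventMachine) start() {
    go func() {
        for e := range em.events {
            switch e.typ {
            case traceReceived:
                em.traces[e.traceID] = append(em.traces[e.traceID], e.spans...)
            case traceExpired:
                // release the trace downstream, even if incomplete, then drop it
                delete(em.traces, e.traceID)
            }
        }
    }()
}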
…sor with a very short wait duration
force-pushed from 820ad63 to ab9f1c2
…detection can be used
Stabilize the tests by using synchronization rather than time.Sleep calls
Wait on in-progress goroutines when shutting the processor down
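A generic illustration of the time.Sleep replacement (a pattern sketch, not this processor's actual test code): block on a channel that the work closes, with a timeout as a safety net, so the test waits exactly as long as needed and no longer.

package example

import (
    "testing"
    "time"
)

func TestWaitWithoutSleep(t *testing.T) {
    done := make(chan struct{})
    go func() {
        // ... the work the test previously polled for with time.Sleep ...
        close(done)
    }()
    select {
    case <-done:
        // proceed to assertions as soon as the work has finished
    case <-time.After(time.Second):
        t.Fatal("timed out waiting for the goroutine")
    }
}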
Codecov Report
@@            Coverage Diff             @@
##           master    #1847      +/-   ##
==========================================
- Coverage   91.15%   91.07%   -0.09%
==========================================
  Files         272      272
  Lines       16256    16230      -26
==========================================
- Hits        14818    14781      -37
- Misses       1010     1016       +6
- Partials     428      433       +5
==========================================
Continue to review full report at Codecov.
@jpkrohling please review this PR. Also, a friendly reminder: you agreed to move groupbytrace to contrib :-)
@tigrannajaryan, I believe @chris-smith-zocdoc will close this in favor of #1842 :-) You can assign his other PR to me, though: #1864
Closing in favor of #1842. Will open new issues in the contrib repo once that is moved.
Description:
Fixes #1811
The deadlock in the groupbytrace processor was caused by the consumer goroutine (the one consuming from the event machine) trying to write data to the channel. This would sometimes work because of the channel buffer, but it would deadlock once the buffer became full. I've refactored the design so that the event machine is much simpler and has only two events, traceReceived and traceExpired. By doing this I've removed the cases where the reader would need to write events, removing the possibility of deadlocks. This also moved all modifications of the data structures onto the consumer goroutine, which fixes some other synchronization bugs where concurrent access let the storage get out of sync with the ring buffer.

A second deadlock was discovered in the shutdown procedure, where the channel and mutex locks could be acquired in the inverse order. I replaced the mutex with a channel and a select, which avoids this; see the sketch below.
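A minimal sketch of that channel-and-select idea (the machine type and its method names are hypothetical, not the processor's real API): every sender selects on a done channel alongside the send, so shutdown never competes with a sender for a lock and a full buffer can no longer wedge anyone.

package example

import "errors"

type event struct{} // minimal stand-in; the real event carries trace data

type machine struct {
    events chan event
    done   chan struct{}
}

// fire cannot block forever on a full buffer: once shutdown starts,
// the done branch wins and the caller gets an error instead of a deadlock.
func (m *machine) fire(e event) error {
    select {
    case m.events <- e:
        return nil
    case <-m.done:
        return errors.New("machine is shutting down")
    }
}

func (m *machine) shutdown() {
    close(m.done) // wakes every blocked sender at once
}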
Testing:
Added test cases
@jpkrohling