
Fix Deadlock in groupbytrace #1847

Closed
wants to merge 9 commits

Conversation

chris-smith-zocdoc (Contributor) commented Sep 24, 2020

Description:
Fixes #1811

The deadlock in the groupbytrace processor was caused by the consumer goroutine (the one consuming from the event machine) trying to write data back into the same channel. This sometimes worked because of the channel buffer, but it deadlocked once the buffer became full. I've refactored the design so that the event machine is much simpler and only has two events, traceReceived and traceExpired. This removes the cases where the reader would need to write events, eliminating the possibility of deadlock. It also moves all modifications of the data structures into the consumer goroutine, which fixes some other synchronization bugs where concurrent access caused the storage to fall out of sync with the ring buffer.
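
As a rough sketch of that shape (not the processor's actual code; the types, channel size, and handler bodies are placeholders, only the event names come from the description above), the key property is that the consumer goroutine only ever reads from the channel:

package main

import "fmt"

type eventType int

const (
	traceReceived eventType = iota
	traceExpired
)

type event struct {
	typ     eventType
	traceID string
}

// consume is the single consumer goroutine: it reads events and mutates the
// internal data structures, but it never writes back into the events channel,
// so it can never block on its own full buffer.
func consume(events <-chan event, done chan<- struct{}) {
	defer close(done)
	for e := range events {
		switch e.typ {
		case traceReceived:
			fmt.Println("add spans for trace", e.traceID)
		case traceExpired:
			fmt.Println("release trace", e.traceID)
		}
	}
}

func main() {
	events := make(chan event, 10)
	done := make(chan struct{})
	go consume(events, done)

	// Producers (the ConsumeTraces path, an expiration timer, ...) are the
	// only writers to the channel.
	events <- event{typ: traceReceived, traceID: "abc"}
	events <- event{typ: traceExpired, traceID: "abc"}

	close(events)
	<-done
}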

A second deadlock was discovered in the shutdown procedure, where the channel and mutex locks could be acquired in the inverse order. I replaced the mutex with a channel and a select, which avoids this.
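
A minimal sketch of that idea, with hypothetical names rather than the processor's real shutdown code: the consumer watches a shutdown channel and the event channel in one select, so stopping never involves a second lock that could be taken in the wrong order.

package main

import "fmt"

// consume returns as soon as the shutdown channel is closed; there is no
// mutex to acquire, so no lock ordering to get wrong.
func consume(events <-chan string, shutdown <-chan struct{}, done chan<- struct{}) {
	defer close(done)
	for {
		select {
		case <-shutdown:
			return
		case e := <-events:
			fmt.Println("handling", e)
		}
	}
}

func main() {
	events := make(chan string, 1)
	shutdown := make(chan struct{})
	done := make(chan struct{})
	go consume(events, shutdown, done)

	events <- "trace received"

	// Shutdown is just a channel close; it cannot deadlock against the
	// consumer the way a mutex acquired in the opposite order can.
	close(shutdown)
	<-done
}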

Testing:
Added test cases

@jpkrohling

jpkrohling (Member) commented Sep 24, 2020

You've been a hero, @chris-smith-zocdoc! Your report has helped me with #1842, which didn't originally have a fix for #1811. But then, once I spotted the problem, fixing it was actually quite easy: just defer st.RUnlock() in the in-memory storage and it would just work :-)
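
For illustration only (this is not the collector's actual storage_memory.go; the method name and field types are placeholders), the defer-based fix being referred to looks roughly like this: the read lock is released on every return path instead of relying on an explicit unlock before each return.

package main

import "sync"

// memoryStorage stands in for the processor's in-memory storage.
type memoryStorage struct {
	sync.RWMutex
	content map[string][]string
}

// get takes the read lock and defers the unlock, so the lock is released no
// matter which branch returns; skipping an RUnlock() on an early return would
// leave the lock held and eventually block writers.
func (st *memoryStorage) get(traceID string) []string {
	st.RLock()
	defer st.RUnlock()

	spans, ok := st.content[traceID]
	if !ok {
		return nil
	}
	return spans
}

func main() {
	st := &memoryStorage{content: map[string][]string{"abc": {"span-1"}}}
	_ = st.get("abc")
	_ = st.get("missing")
}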

Could you give it a try? I think that the solution there would be preferable to this one here. What do you think?

By the way, the merge conflicts to this one weren't caused by me :-)

chris-smith-zocdoc (Contributor, Author) commented

@jpkrohling that may be one source of the deadlocks, but it is not the only one. I tried with just that change and my tests still deadlock; you can see it here: https://github.com/open-telemetry/opentelemetry-collector/compare/master...Zocdoc:issue_1811_storage_fix_only?expand=1

I'll update this PR to include the defer fix as well, but I believe this more invasive change is necessary to address the other deadlock I mentioned in the PR description and in the comments on the issue.

I'm in the process of fixing the merge conflicts.

chris-smith-zocdoc (Contributor, Author) commented

Here is a small example that illustrates what is happening more clearly. This test will deadlock:

package groupbytraceprocessor

import (
	"log"
	"testing"

	"github.com/stretchr/testify/assert"
)

func TestExampleDeadlock(t *testing.T) {
	c := make(chan struct{}, 1)

	// 1 item in channel, buffer is now full
	c <- struct{}{}

	for range c { // dequeue first item, buffer empty
		log.Printf("enqueuing first item from consumer")
		// enqueue, buffer full now
		c <- struct{}{}
		log.Printf("success")

		log.Printf("enqueuing second item from consumer")
		// deadlock: the consumer is trying to produce events on a full buffer
		c <- struct{}{}
		// will never reach this line
		log.Printf("success")
		break
	}

	assert.True(t, true)
}

…d by the groupbytrace processor

There are only two events now, one for adding spans and one for releasing the trace. This removes the prior cases where the processor would need to emit events while processing an event, which would trigger a deadlock when the buffer was full. This change also modifies the behavior for incomplete evicted traces: these traces are now passed to the downstream processors early instead of being discarded.
…detection can be used

Stabilize the tests by using synchronization rather than time.Sleep calls
Wait on in-progress goroutines when shutting the processor down
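
As a rough, hypothetical illustration of those two commits (none of these names come from the PR), waiting on in-progress goroutines with a sync.WaitGroup removes the need for time.Sleep in tests:

package main

import (
	"fmt"
	"sync"
)

type worker struct {
	wg   sync.WaitGroup
	jobs chan int
}

func (w *worker) start() {
	w.wg.Add(1)
	go func() {
		defer w.wg.Done()
		for j := range w.jobs {
			fmt.Println("processed job", j)
		}
	}()
}

// shutdown closes the job channel and blocks until the in-progress goroutine
// has drained it, so callers (and tests) can assert on results immediately.
func (w *worker) shutdown() {
	close(w.jobs)
	w.wg.Wait()
}

func main() {
	w := &worker{jobs: make(chan int, 4)}
	w.start()
	w.jobs <- 1
	w.jobs <- 2
	w.shutdown()
}
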
codecov bot commented Sep 25, 2020

Codecov Report

Merging #1847 into master will decrease coverage by 0.08%.
The diff coverage is 67.56%.


@@            Coverage Diff             @@
##           master    #1847      +/-   ##
==========================================
- Coverage   91.15%   91.07%   -0.09%     
==========================================
  Files         272      272              
  Lines       16256    16230      -26     
==========================================
- Hits        14818    14781      -37     
- Misses       1010     1016       +6     
- Partials      428      433       +5     
Impacted Files Coverage Δ
processor/groupbytraceprocessor/event.go 88.23% <61.53%> (-11.77%) ⬇️
processor/groupbytraceprocessor/processor.go 93.45% <66.66%> (-6.55%) ⬇️
processor/groupbytraceprocessor/storage_memory.go 100.00% <100.00%> (ø)
translator/internaldata/resource_to_oc.go 89.36% <0.00%> (+2.12%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8690937...eab394b. Read the comment docs.

tigrannajaryan (Member) commented

@jpkrohling please review this PR. Also, friendly reminder, you agreed to move groupbytrace to contrib :-)

jpkrohling (Member) commented Sep 29, 2020

@tigrannajaryan, I believe @chris-smith-zocdoc will close this in favor of #1842 :-) You can assign his other PR to me, though: #1864

chris-smith-zocdoc (Contributor, Author) commented

Closing in favor of #1842

Will open new issues in the contrib repo once that is moved

MovieStoreGuy pushed a commit to atlassian-forks/opentelemetry-collector that referenced this pull request Nov 11, 2021