
Fix Deadlock in groupbytrace #1847

Closed
wants to merge 9 commits

Conversation

chris-smith-zocdoc (Contributor) commented Sep 24, 2020

Description:
Fixes #1811

The deadlock in the groupbytrace processor was caused by the consumer goroutine (the one consuming from the event machine) trying to write data back into the same channel. This sometimes worked because of the channel buffer, but it deadlocked once the buffer became full. I've refactored the design so that the event machine is much simpler and only has two events, traceReceived and traceExpired. This removes the cases where the reader would need to write events, eliminating the possibility of deadlock. It also moves all modifications of the data structures into the consumer goroutine, which fixes some other synchronization bugs where concurrent access caused the storage to fall out of sync with the ring buffer.
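
As a rough sketch of that shape (not the processor's actual code; the types, channel size, and handler bodies are placeholders, only the event names come from the description above), the key property is that the consumer goroutine only ever reads from the channel:

package main

import "fmt"

type eventType int

const (
	traceReceived eventType = iota
	traceExpired
)

type event struct {
	typ     eventType
	traceID string
}

// consume is the single consumer goroutine: it reads events and mutates the
// internal data structures, but it never writes back into the events channel,
// so it can never block on its own full buffer.
func consume(events <-chan event, done chan<- struct{}) {
	defer close(done)
	for e := range events {
		switch e.typ {
		case traceReceived:
			fmt.Println("add spans for trace", e.traceID)
		case traceExpired:
			fmt.Println("release trace", e.traceID)
		}
	}
}

func main() {
	events := make(chan event, 10)
	done := make(chan struct{})
	go consume(events, done)

	// Producers (the ConsumeTraces path, an expiration timer, ...) are the
	// only writers to the channel.
	events <- event{typ: traceReceived, traceID: "abc"}
	events <- event{typ: traceExpired, traceID: "abc"}

	close(events)
	<-done
}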

A second deadlock was discovered in the shutdown procedure, where the channel and mutex locks could be acquired in the inverse order. I replaced the mutex with a channel and a select, which avoids this.
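
A minimal sketch of that idea, with hypothetical names rather than the processor's real shutdown code: the consumer watches a shutdown channel and the event channel in one select, so stopping never involves a second lock that could be taken in the wrong order.

package main

import "fmt"

// consume returns as soon as the shutdown channel is closed; there is no
// mutex to acquire, so no lock ordering to get wrong.
func consume(events <-chan string, shutdown <-chan struct{}, done chan<- struct{}) {
	defer close(done)
	for {
		select {
		case <-shutdown:
			return
		case e := <-events:
			fmt.Println("handling", e)
		}
	}
}

func main() {
	events := make(chan string, 1)
	shutdown := make(chan struct{})
	done := make(chan struct{})
	go consume(events, shutdown, done)

	events <- "trace received"

	// Shutdown is just a channel close; it cannot deadlock against the
	// consumer the way a mutex acquired in the opposite order can.
	close(shutdown)
	<-done
}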

Testing:
Added test cases

@jpkrohling

jpkrohling (Member) commented Sep 24, 2020

You've been a hero, @chris-smith-zocdoc! Your report has helped me with #1842, which didn't originally have a fix for #1811. But then, once I spotted the problem, fixing it was actually quite easy: just defer st.RUnlock() in the in-memory storage and it would just work :-)
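
For illustration only (this is not the collector's actual storage_memory.go; the method name and field types are placeholders), the defer-based fix being referred to looks roughly like this: the read lock is released on every return path instead of relying on an explicit unlock before each return.

package main

import "sync"

// memoryStorage stands in for the processor's in-memory storage.
type memoryStorage struct {
	sync.RWMutex
	content map[string][]string
}

// get takes the read lock and defers the unlock, so the lock is released no
// matter which branch returns; skipping an RUnlock() on an early return would
// leave the lock held and eventually block writers.
func (st *memoryStorage) get(traceID string) []string {
	st.RLock()
	defer st.RUnlock()

	spans, ok := st.content[traceID]
	if !ok {
		return nil
	}
	return spans
}

func main() {
	st := &memoryStorage{content: map[string][]string{"abc": {"span-1"}}}
	_ = st.get("abc")
	_ = st.get("missing")
}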

Could you give it a try? I think that the solution there would be preferable to this one here. What do you think?

By the way, the merge conflicts to this one weren't caused by me :-)

chris-smith-zocdoc (Contributor, Author) commented

@jpkrohling that may be one source of the deadlocks, but it is not the only one. I tried with just that change and my tests still deadlock; you can see it here: https://github.com/open-telemetry/opentelemetry-collector/compare/master...Zocdoc:issue_1811_storage_fix_only?expand=1

I'll update this PR to include the defer fix as well, but I believe this more invasive change is necessary to address the other deadlock I mentioned in the PR description and in the comments on the issue.

I'm in the process of fixing the merge conflicts.

chris-smith-zocdoc (Contributor, Author) commented

Here is a small example that illustrates what is happening more clearly. This test will deadlock:

package groupbytraceprocessor

import (
	"log"
	"testing"

	"github.com/stretchr/testify/assert"
)

func TestExampleDeadlock(t *testing.T) {
	c := make(chan struct{}, 1)

	// 1 item in channel, buffer is now full
	c <- struct{}{}

	for range c { // dequeue first item, buffer empty
		log.Printf("enqueuing first item from consumer")
		// enqueue, buffer full now
		c <- struct{}{}
		log.Printf("success")

		log.Printf("enqueuing second item from consumer")
		// deadlock: the consumer is trying to produce events on a full buffer
		c <- struct{}{}
		// will never reach this line
		log.Printf("success")
		break
	}

	assert.True(t, true)
}

…d by the groupbytrace processor

There are only two events now, one for adding spans and one for releasing the trace. This removes the prior cases where the processor would need to emit events while processing an event, which would trigger a deadlock when the buffer was full. This change also modifies the behavior for incomplete evicted traces: these traces are now passed to the downstream processors early instead of being discarded.
…detection can be used

Stabilize the tests by using synchronization rather than time.Sleep calls
Wait on in-progress goroutines when shutting the processor down
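
As a rough, hypothetical illustration of those two commits (none of these names come from the PR), waiting on in-progress goroutines with a sync.WaitGroup removes the need for time.Sleep in tests:

package main

import (
	"fmt"
	"sync"
)

type worker struct {
	wg   sync.WaitGroup
	jobs chan int
}

func (w *worker) start() {
	w.wg.Add(1)
	go func() {
		defer w.wg.Done()
		for j := range w.jobs {
			fmt.Println("processed job", j)
		}
	}()
}

// shutdown closes the job channel and blocks until the in-progress goroutine
// has drained it, so callers (and tests) can assert on results immediately.
func (w *worker) shutdown() {
	close(w.jobs)
	w.wg.Wait()
}

func main() {
	w := &worker{jobs: make(chan int, 4)}
	w.start()
	w.jobs <- 1
	w.jobs <- 2
	w.shutdown()
}
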
codecov bot commented Sep 25, 2020

Codecov Report

Merging #1847 into master will decrease coverage by 0.08%.
The diff coverage is 67.56%.


@@            Coverage Diff             @@
##           master    #1847      +/-   ##
==========================================
- Coverage   91.15%   91.07%   -0.09%     
==========================================
  Files         272      272              
  Lines       16256    16230      -26     
==========================================
- Hits        14818    14781      -37     
- Misses       1010     1016       +6     
- Partials      428      433       +5     
Impacted Files Coverage Δ
processor/groupbytraceprocessor/event.go 88.23% <61.53%> (-11.77%) ⬇️
processor/groupbytraceprocessor/processor.go 93.45% <66.66%> (-6.55%) ⬇️
processor/groupbytraceprocessor/storage_memory.go 100.00% <100.00%> (ø)
translator/internaldata/resource_to_oc.go 89.36% <0.00%> (+2.12%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8690937...eab394b. Read the comment docs.

tigrannajaryan (Member) commented

@jpkrohling please review this PR. Also, friendly reminder, you agreed to move groupbytrace to contrib :-)

jpkrohling (Member) commented Sep 29, 2020

@tigrannajaryan, I believe @chris-smith-zocdoc will close this in favor of #1842 :-) You can assign his other PR to me, though: #1864

chris-smith-zocdoc (Contributor, Author) commented

Closing in favor of #1842

Will open new issues in the contrib repo once that is moved

MovieStoreGuy pushed a commit to atlassian-forks/opentelemetry-collector that referenced this pull request Nov 11, 2021