Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Roslyn failing to clean up listeners #48816

Closed
runfoapp bot opened this issue Oct 21, 2020 · 11 comments
Closed

Roslyn failing to clean up listeners #48816

runfoapp bot opened this issue Oct 21, 2020 · 11 comments
Assignees
Labels
4 - In Review A fix for the issue is submitted for review. Area-IDE Test Test failures in roslyn-CI
Milestone

Comments

@runfoapp
Copy link

runfoapp bot commented Oct 21, 2020

Runfo Tracking Issue: Roslyn failing to clean up listeners

Build Definition Kind Run Name

Build Result Summary

Day Hit Count Week Hit Count Month Hit Count
0 0 0
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added Area-Infrastructure untriaged Issues and PRs which have not yet been triaged by a lead labels Oct 21, 2020
@sharwell
Copy link
Member

This problem will go away as soon as either #44514 or #48662 is merged. I would consider #44514 to be the "true" fix for it.

@sharwell sharwell added 4 - In Review A fix for the issue is submitted for review. Test Test failures in roslyn-CI and removed untriaged Issues and PRs which have not yet been triaged by a lead labels Oct 21, 2020
@sharwell sharwell self-assigned this Oct 21, 2020
@jaredpar
Copy link
Member

One of those PRs has been open for five months, the other six days. If we can't merge a fix today then we need to push foward with disabling the assert. This is the number 1 blocker for our builds. It's been a known issue for sometime and hasn't gotten the necessary attention.

@jaredpar
Copy link
Member

Having a little issue with the live stats function but you can see the damage here.

https://runfo.azurewebsites.net/search/tests/?bq=definition%3Aroslyn-ci++started%3A%7E7&tq=message%3A%22Failed+to+clean+up+listeners%22

@sharwell
Copy link
Member

sharwell commented Oct 21, 2020

Historically there were many other causes for this. Here are some of the other PRs that helped:

#48077
#47995
#47994

It was difficult to demonstrate the need for #44514 when more acute problems remained, but now that it's clear and the complex ConfiguredYieldAwaitable was reviewed as a separate change, it's set to auto-merge and should be merged today.

@jaredpar
Copy link
Member

The live issue is now fixed so stats are updating. The data only goes back ~15 days for error messages though. I didn't start tracking that until yesterday and only ran a back fill push for last 200 or so builds.

@jaredpar
Copy link
Member

@sharwell the PR you listed has merged, the error still persists. This is still our #1 failure in Roslyn. It's holding back all our other work.

@jaredpar
Copy link
Member

Part of the issue here appears to be that the code is using a one minute timeout to gate cancellation. The IDE team also made a change to CI to over utilize cores on all of our test legs. That isn't a reasonable combination. The timeout can and will get hit for normal starvation reasons on a regular basis. We need to pull this timeout entirely if we want to have stable CI

@sharwell
Copy link
Member

sharwell commented Oct 27, 2020

Part of the issue here appears to be that the code is using a one minute timeout to gate cancellation

Tests are expected to cancel all pending operations before the end of the test. The cleanup code has fast-path handling for this case to guarantee the timeout will not come into play for correctly written code. All failures of this check are either a true product bug or a true test bug for the test that timed out.

Some known product bugs remain. For example, #44522 is a case where we fail to cancel asynchronous operations when the owning object is disposed. We just haven't reached a point where we could demonstrate that particular code path was responsible for a failure because the investigations keep revealing other cases that we fix.

@jaredpar
Copy link
Member

So how do we move forward here? This is the primary issue causing instability in CI.

My instinct is to say that we should delete the assert until we get fixes for the underlyin gproduct bugs.

@sharwell
Copy link
Member

sharwell commented Oct 27, 2020

The most recent two changes that are likely to reduce occurrences of this are #48662 (c8eecdb) and #44514 (0b6ff9b). I need a list of failures which included both of these changes to investigate what remains.

@jinujoseph jinujoseph added this to the 16.9 milestone Oct 27, 2020
@jinujoseph jinujoseph modified the milestones: 16.9, 16.10 Mar 28, 2021
@jinujoseph jinujoseph modified the milestones: 16.10, Backlog Jul 16, 2021
@CyrusNajmabadi
Copy link
Member

Closing out due to lack of movement in 4 years. IF we need to do something here, we shoudl do it.

@CyrusNajmabadi CyrusNajmabadi closed this as not planned Won't fix, can't repro, duplicate, stale Nov 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
4 - In Review A fix for the issue is submitted for review. Area-IDE Test Test failures in roslyn-CI
Projects
None yet
Development

No branches or pull requests

5 participants