Upgrade SDK to 6.0.100-rc.1.21411.13 #57143
Conversation
Both the AzDO and the Core-Eng teams believe that the issue is on our side and was caused by a thread pool regression. The assumption stands that we need to update to a newer SDK which contains the fix for the thread pool hang.

Pros:
- AzDO and Core-Eng believe that this will mitigate the AzDO feed restore issues.

Cons:
- We will upgrade to an unsigned SDK build. Arcade and other repos already did the same to work around the issue.
- That SDK build isn't officially released and won't be until RC 1 ships. This means that developers need to install that build via the nightly channel (from https://github.com/dotnet/installer) if they want to use their globally installed SDK in combination with dotnet/runtime.
- Even though this is a breaking change, we can't wait for the next monthly infrastructure rollout.
Meaning #56346? If that's true, code somewhere is doing sync-over-async, blocking while waiting for a task. Do they know where? Can we fix that as well?
I couldn't figure out the best area label to add to this PR. If you have write-permissions please help me learn by adding exactly one area label.
Tagging subscribers to this area: @dotnet/runtime-infrastructure. Issue details: Both the AzDO and the Core-Eng teams believe that the issue is on our side and was caused by a thread pool regression. The assumption stands that we need to update to a newer SDK which contains the fix for the thread pool hang.
My guess would be the sync-over-async is somewhere near the border of MSBuild and NuGet, since MSBuild tasks can't be async.
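To make that pattern concrete, here is a minimal, hypothetical sketch of what sync-over-async at that boundary tends to look like; the type and member names (RestorePackagesTask, DownloadAsync) are illustrative stand-ins, not the actual MSBuild or NuGet code:

```csharp
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical stand-in for a synchronous MSBuild task that wraps async NuGet/HTTP work.
public class RestorePackagesTask
{
    private static readonly HttpClient s_client = new HttpClient();

    // MSBuild task entry points are synchronous, so the async work gets blocked on.
    public bool Execute()
    {
        // Sync-over-async: this blocks the calling thread until the async download
        // finishes. If that thread is a thread-pool thread and enough of them are
        // blocked at once, the continuations that would unblock them may not get
        // scheduled in time, which is the kind of hang the SDK update is meant to fix.
        string index = DownloadAsync("https://api.nuget.org/v3/index.json")
            .GetAwaiter().GetResult();
        return !string.IsNullOrEmpty(index);
    }

    private static async Task<string> DownloadAsync(string url)
    {
        using HttpResponseMessage response = await s_client.GetAsync(url);
        return await response.Content.ReadAsStringAsync();
    }
}
```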
Yes, and "normal" types of failure (HTTP 429, 503, etc.) are likely still worth pursuing with them, since they'll have telemetry. After much investigation, though, it seemed like the connections weren't even getting through, and after a lot of chatting with @wfurt, a change from @kouvel seemed relevant. Empirically, ASP.NET has had significant improvement from taking the new SDK as well. (Edit: I think if you're wanting to follow up with a partner team, it's the NuGet client, not Azure DevOps, for their usage of .NET APIs.)
cc @johnterickson (AzDO packaging) EDIT: ok, it seems like this isn't AzDO related and is most likely a client issue on NuGet's side. Sorry for the cc, John.
@steveisok @akoeplinger who would be the best person to look into this issue in the wasm legs?
@stephentoub we would need to approach the product team that we believe owns the faulty code. If we believe that's NuGet, we should figure out how to approach them. One way I could imagine would be to help them review their code to find such patterns.
I think you may be looking to engage with the last few git-blame folks from https://github.com/NuGet/NuGet.Client/blob/0ae80f5495c8d6fad2ec6e2c708727675ff08fb6/src/NuGet.Core/NuGet.Protocol/HttpSource/HttpSource.cs#L321-L350, though it's very infrequently updated.
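For whoever ends up doing that review, the usual red flags are synchronous waits on tasks (.Result, .Wait(), .GetAwaiter().GetResult()) inside otherwise-async call paths. Below is a hedged sketch of those patterns plus one common mitigation when a synchronous API surface can't simply be removed (doing the unavoidable blocking wait on a dedicated thread so thread-pool workers stay free to run continuations). The names here are illustrative, not NuGet's actual code:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class SyncOverAsyncReview
{
    // Patterns worth flagging in review: each one parks a thread until the task finishes.
    static void RedFlags(Task<int> task)
    {
        int viaResult = task.Result;                     // blocks; wraps exceptions in AggregateException
        task.Wait();                                     // blocks
        int viaAwaiter = task.GetAwaiter().GetResult();  // blocks; rethrows the original exception
    }

    // If a sync API truly has to stay, one mitigation is to block on a dedicated
    // (non-thread-pool) thread, so pool workers remain available for continuations.
    static T RunSync<T>(Func<Task<T>> asyncFunc)
    {
        T result = default!;
        Exception error = null;
        var thread = new Thread(() =>
        {
            try { result = asyncFunc().GetAwaiter().GetResult(); }
            catch (Exception e) { error = e; }
        });
        thread.Start();
        thread.Join();
        if (error != null) throw error;
        return result;
    }
}
```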
@ViktorHofer the issue is #56974 (comment). I'll pull the changes I added there into this PR.
Do we have a dump from any of the processes that hung?
@lambdageek why are the browser AOT/EAT lanes failing on hot reload here?
@lewing Did something change about how wasm tests are built & executed in RC1? Two possibilities:
@@ -5,6 +5,7 @@
  <AspNetCoreHostingModel>InProcess</AspNetCoreHostingModel>
  <OutputType>Exe</OutputType>
  <EnableDefaultCompileItems>false</EnableDefaultCompileItems>
+ <StaticWebAssetsEnabled>false</StaticWebAssetsEnabled>
What is this?
cc @javiercn
@lewing thanks for the heads up. It's fixed in dotnet/sdk#19482
If it's fixed in the SDK do we need to remove these now?
yes, I'll revert the workarounds
It does seem to be passed to the app:
This was a recent
This?
<ItemGroup Condition="'$(Configuration)' == 'Debug' and '@(_MonoComponent->Count())' == 0">
  <_MonoComponent Include="hot_reload;debugger" />
</ItemGroup>
Ah, but the EAT/AOT tests look like they're Release configuration. I set the
runtime/src/libraries/System.Runtime.Loader/tests/ApplyUpdate/Directory.Build.props, lines 16 to 18 in 53b7ba9
maybe we should set
that's the same for the regular tests too.
Are runtime components used with these tests at all? IIUC, they are available only through workload packs. And looking at the binlog, I don't see any. Either way, what changed in this PR that is causing the failure? Also (not necessarily in this PR), do we need to add some public item to allow setting/overriding these components?
As the rest of the stack has already moved onto the RC1 SDK, we want to get this in ASAP. Otherwise we will be stuck with outdated versions of our dependencies, i.e. hotreload-utils (#56974 (comment)).
I think we should just delete the IsSupported2 test. It's calling
I've removed IsSupported2; let's see how things look now.
I'm not sure what is causing the coreclr failures, but they seem to be choking on some questionable command line + argument parsing. cc @danmoseley
About to push an update to the test that should handle this (I had been testing on Linux, but it appears this only fails on Windows). That did point out that the logic in On non-Windows, this is saving only I'll push the update in a sec.
After that, @agocke, we still have AppHostUsedWithSymbolicLinks.Put_app_directory_behind_symlink_and_use_dotnet_run. Could you please ask someone to take a look at that or disable it? Ignoring the xunit crash, that might be it. I see a JIT failure with no log; we will see if it happens again.
I didn't pull down the dump for the JSON crash, which seems sporadic. If I have time later today, I will try.
@dotnet/dnceng what could cause there to be only a single line in the console output? Seems a JIT test failed but there is no info: https://dev.azure.com/dnceng/public/_build/results?buildId=1290308&view=ms.vss-test-web.build-test-results-tab&runId=38085910&resultId=103042&paneView=dotnet-dnceng.dnceng-build-release-tasks.helix-test-information-tab
Taking a peek
@danmoseley I am trying to get on the machine to check directly, but I think what happened is that the taskkill.exe call that is line 1 of execute.cmd crashed with -1073741502 (STATUS_DLL_INIT_FAILED) faster than it could record standard output, which would indicate a funky machine. I'll continue to investigate, but I'd also add that you should now be able to safely remove the taskkill invocations entirely, as we now have reliable leaked process cleanup in the Helix client.
@agocke here's more context for the failing test showing the similar passing tests.
I'll take a look -- I didn't see that failure the first time.
Without wanting to put too much pressure on us, Arcade's test retry feature is currently blocked on getting this in. So ideally we get this in today so that we can update our dependencies later today or tomorrow.
I've removed the test -- it's dependent on SDK-specific functionality (
@dotnet/jit-contrib could one of you comment on this? #57143 (comment). More importantly, are you comfortable with us merging with that failure, or should we resolve it first? We are pushing on this SDK ingestion because it should improve infra reliability.
Execute.cmd is auto-generated by the Arcade SDK. Check out HelixPre/PostCommands in your projects, e.g.
Right, https://github.com/dotnet/runtime/search?q=taskkill lists the ones that Matt just posted.
Sounds like that was a faulty machine state and likely isn't related to this change.
While it does still seem like a weird crash happened, whatever it was didn't reproduce with the same payload on the same machine (I really wanted to make sure since I'm pushing this update). It does use the updated SDK for test execution, but I saw no evidence this is caused by the update.
I see a couple of Installer issues just popped up. @agocke can you please take a look?
@ViktorHofer These weren't in the previous runs. Did something change?
Looks like the tests passed and the crash happened later? I'd be fine merging with this.
Thanks, folks, for the team effort merging this; let's see whether it helps CI stability.
Amazing 👋👋👋