Test failure tracing\\eventpipe\\pauseonstart\\pauseonstart\\pauseonstart.cmd #54469
Comments
Here is the actual error message:
This looks to be some kind of AV in the class loader?
Adding @janvorli
This looks like some kind of memory corruption. The crash happens in the following loop: runtime/src/coreclr/utilcode/loaderheap.cpp Lines 2168 to 2174 in 213600c
The pBlock at the time of the crash is NULL, but it is never supposed to be NULL: the last element in the list should have m_pNext pointing to the m_FirstBlock. The m_pNext is always assigned the m_pFirstBlock when a new block is added, and the m_pFirstBlock is then set to point to the newly added block. So there is no legitimate way of setting that to NULL.
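To make that invariant concrete, here is a minimal sketch of the list shape described above. It is not the runtime's code: `Block`, `Tracker`, `AddBlock` and `FreeBlocks` are hypothetical names, and it assumes that `m_FirstBlock` is a block embedded in the tracker whose address terminates the list.

```cpp
#include <cstddef>

// Hypothetical, simplified stand-in for a tracker block.
struct Block {
    Block* m_pNext = nullptr;
};

// Hypothetical tracker. Assumption: m_FirstBlock is embedded in the tracker
// and its address marks the end of the list.
struct Tracker {
    Block  m_FirstBlock;
    Block* m_pFirstBlock = &m_FirstBlock;  // head of the list

    Tracker() = default;
    Tracker(const Tracker&) = delete;             // the head points into *this
    Tracker& operator=(const Tracker&) = delete;

    // A new block always links to the old head, then becomes the head,
    // so no link on the path to m_FirstBlock is ever NULL.
    void AddBlock() {
        Block* pNew = new Block;
        pNew->m_pNext = m_pFirstBlock;
        m_pFirstBlock = pNew;
    }

    // Teardown frees blocks one by one until it reaches the embedded block.
    // Returns the number of heap blocks freed.
    std::size_t FreeBlocks() {
        std::size_t freed = 0;
        Block* pBlock = m_pFirstBlock;
        while (pBlock != &m_FirstBlock) {
            Block* pNext = pBlock->m_pNext;  // faults here if pBlock is NULL
            delete pBlock;
            pBlock = pNext;
            ++freed;
        }
        m_pFirstBlock = &m_FirstBlock;
        return freed;
    }
};
```

In this model the only way for the loop to see a NULL `pBlock` is for something outside the tracker to overwrite a link, which is why the crash points at memory corruption.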
Unfortunately it is not possible to see what the linked list looked like before the crash, as the loop we are crashing in destroys the blocks one by one, so the previous blocks leading to the one with the |
Interesting. Thanks, Jan. Looks like the Helix run has dumps for this failure, so I'll try to look when I can get access to a Windows machine. As far as I can tell, this has only happened in this run (according to the AzDO history).
Failed again in runtime-coreclr outerloop 20210628.5 Failed test:
Error message:
I have this looping on my machine and maybe we'll get a repro. If not, my plan is to add some extra instrumentation so we can catch the failure before it has deleted the linked list, and then wait for CI to hit it again.
@noahfalk I looked over the past 30 days and found no such failing test.
Thanks for taking a look @hoyosjs! I've still got the test looping with no repro yet. I'm going to leave it running overnight, and assuming it still hasn't reproed tomorrow, I can PR in a bit of instrumentation and resolve the bug for now. We may never see it again, but if it does show up in the future we'll have a few more clues to work with.
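One possible shape for that kind of instrumentation is a read-only walk of the list before anything is freed. This is only a sketch under assumptions, not the change that was actually made: `Block`, `ListIsWellFormed` and `CheckBeforeTeardown` are hypothetical names, and the list is assumed to end at a known sentinel block.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, simplified stand-in for a tracker block.
struct Block {
    Block* m_pNext = nullptr;
};

// Walks the list without modifying it. Returns true only if the sentinel is
// reached without meeting a null link within maxBlocks steps.
bool ListIsWellFormed(const Block* head, const Block* sentinel,
                      std::size_t maxBlocks) {
    const Block* p = head;
    for (std::size_t i = 0; i <= maxBlocks; ++i) {
        if (p == sentinel) return true;   // clean termination
        if (p == nullptr) return false;   // broken link
        p = p->m_pNext;
    }
    return false;  // too many blocks: possible cycle or corrupted link
}

// Intended to run before the destructive loop, so that a failure stops the
// process while the whole list is still present in the dump.
void CheckBeforeTeardown(const Block* head, const Block* sentinel) {
    assert(ListIsWellFormed(head, sentinel, 1u << 20));
}
```

The step bound is arbitrary; it only exists so that a cycle created by corruption ends the walk instead of hanging it.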
We are seeing some likely memory corruption in dotnet#54469 and these changes hopefully will help better diagnose it
My overnight run of this test in a loop did not repro the problem.
Failed again in runtime-coreclr jitstress-isas-x86 20210731.1 Failed test:
Error message:
Link to dump
We are trying to do ExternalMethodFixup on System.Number.UInt32ToDecStr(UInt32), MethodDef=0x6001513
Loading System.Number, TypeDef 0x2000175
The instrumentation is still in PR, so unfortunately, other than another data point for which type is being loaded, we don't have any more info on the corrupted memory.
I went ahead and merged #56648 so that we'll get more clues the next time this fails. Given that we got a repro recently, maybe we'll get another, so I'm leaving the issue open a few days. If we don't see it soon then I'll close and wait for another hit, because there is little we can do until then.
I've looked at the core dump. It looks like a memory corruption of the memory that the m_pFirstBlock points to. It points to something that seems to be a valid heap-allocated block, but its contents don't make sense for the AllocMemTrackerBlock. Looking at the memory around that structure, there is an interesting pattern: it is filled with 32-bit numbers incremented by one:
The m_pFirstBlock points to 0x32694e8. I wonder if those values might ring a bell.
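For anyone who wants to check other dumps for the same pattern, a small helper along these lines could scan a copied-out memory range for runs of 32-bit values that each exceed the previous one by one. It is a sketch only: `LongestIncrementingRun` is a hypothetical name and the buffer is assumed to have already been read out of the dump by other means.

```cpp
#include <cstddef>
#include <cstdint>

// Returns the length of the longest run in which every 32-bit word equals
// the previous word plus one (with unsigned wraparound).
std::size_t LongestIncrementingRun(const std::uint32_t* words,
                                   std::size_t count) {
    if (count == 0) return 0;
    std::size_t best = 1;
    std::size_t current = 1;
    for (std::size_t i = 1; i < count; ++i) {
        // Extend the run if this word continues the sequence, else restart.
        current = (words[i] == words[i - 1] + 1u) ? current + 1 : 1;
        if (current > best) best = current;
    }
    return best;
}
```

A long run covering the block that m_pFirstBlock points to would suggest that some other component wrote a counter-like array over it.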
Based on the beginning of that call stack:
The disable for this test shouldn't have any filter_data, but is it possible this is some heap corruption being caused by the issue I fixed in #56104? It seems unlikely. Odd to find an array of sequential numbers like that in memory. They don't look familiar. There does appear to be a memory address at 0x03269504+0xc.
Sigh. My query was wrong (odd parsing of the names of coreclr wrappers).
TestResults
| where Result == "Fail" and Type contains "pauseonstart"
| join (WorkItems | where Queued > ago(30d)) on JobName, $left.WorkItemName == $right.Name
| join Jobs on $left.JobName == $right.Name
Looks like no tiered compilation on Win x86 and Linux x64 are most likely to repro.
I'll be out on vacation so reassigning to @josalem
The existing failures are R2R infra and android installation issues. Closing until new occurrences of this resurface.
Run: runtime-coreclr outerloop 20210618.10
Failed test:
Error message: