System.Runtime.Intrinsics.Tests crashing on Linux x64 #83917

Closed
jkotas opened this issue Mar 25, 2023 · 14 comments · Fixed by #83922 or #83927
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI

Comments

@jkotas
Member

jkotas commented Mar 25, 2023

/datadisks/disk1/work/B226096B/w/B20309D9/e /datadisks/disk1/work/B226096B/w/B20309D9/e
  Discovering: System.Runtime.Intrinsics.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.Runtime.Intrinsics.Tests (found 1023 test cases)
  Starting:    System.Runtime.Intrinsics.Tests (parallel test collections = on, max threads = 2)
./RunTests.sh: line 168: 15658 Segmentation fault      (core dumped) "$RUNTIME_PATH/dotnet" exec --runtimeconfig System.Runtime.Intrinsics.Tests.runtimeconfig.json --depsfile System.Runtime.Intrinsics.Tests.deps.json xunit.console.dll System.Runtime.Intrinsics.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing $RSP_FILE
/datadisks/disk1/work/B226096B/w/B20309D9/e
----- end Sat Mar 25 04:26:45 UTC 2023 ----- exit code 139 ----------------------------------------------------------
exit code 139 means SIGSEGV Illegal memory access. Deref invalid pointer, overrunning buffer, stack overflow etc. Core dumped.

Full log: https://helixre107v0xdcypoyl9e7f.blob.core.windows.net/dotnet-runtime-refs-pull-83881-merge-e7a7337a0eee400295/System.Runtime.Intrinsics.Tests/1/console.edf1d97e.log?helixlogtype=result

Failed in #83881 (failing in all PRs that run this leg)
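For readers unfamiliar with the exit code in the log above: shells conventionally report a process killed by signal N as 128 + N, so 139 corresponds to signal 11, which is SIGSEGV on Linux. A minimal sketch of that arithmetic follows; the helper name signalFromExitCode is hypothetical and is not part of RunTests.sh.

```cpp
#include <cassert>
#include <csignal>

// Hypothetical helper: recover the signal number from a shell exit code.
// Shell convention: a process terminated by signal N exits with 128 + N.
// Returns 0 when the exit code does not indicate death by signal.
inline int signalFromExitCode(int exitCode)
{
    assert(exitCode >= 0);
    return exitCode > 128 ? exitCode - 128 : 0;
}
```

On Linux x64, signalFromExitCode(139) yields the same value as the SIGSEGV constant from csignal.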

@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Mar 25, 2023
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Mar 25, 2023
@jkotas
Member Author

jkotas commented Mar 25, 2023

Null pointer dereference at:

 # Call Site
00 libclrjit!LinearScan::RegisterSelection::try_SPILL_COST
01 libclrjit!LinearScan::RegisterSelection::select
02 libclrjit!LinearScan::allocateReg
03 libclrjit!LinearScan::allocateRegisters
04 libclrjit!LinearScan::doLinearScan
05 libclrjit!Compiler::compCompile::$_2::operator()
06 libclrjit!ActionPhase<(lambda at /__w/1/s/src/coreclr/jit/compiler.cpp:5097:28)>::DoPhase
07 libclrjit!Phase::Run
08 libclrjit!DoPhase<(lambda at /__w/1/s/src/coreclr/jit/compiler.cpp:5097:28)>
09 libclrjit!Compiler::compCompile
0a libclrjit!Compiler::compCompileHelper
0b libclrjit!Compiler::compCompile::$_3::operator()
0c libclrjit!Compiler::compCompile
0d libclrjit!jitNativeCode::$_5::operator()::{lambda(jitNativeCode(CORINFO_METHOD_STRUCT_ *, CORINFO_MODULE_STRUCT_ *, ICorJitInfo *, CORINFO_METHOD_INFO *, void **, unsigned int *, JitFlags *, void *)::$_5::operator()(jitNativeCode(CORINFO_METHOD_STRUCT_ *, CORINFO_MODULE_STRUCT_ *, ICorJitInfo *, CORINFO_METHOD_INFO *, void **, unsigned int *, JitFlags *, void *)::__JITParam *)::__JITParam *)#1}::operator()
0e libclrjit!jitNativeCode::$_5::operator()
0f libclrjit!jitNativeCode
10 libclrjit!CILJit::compileMethod
11 libcoreclr!invokeCompileMethodHelper
12 libcoreclr!invokeCompileMethod
13 libcoreclr!UnsafeJitFunction
14 libcoreclr!MethodDesc::JitCompileCodeLocked
15 libcoreclr!MethodDesc::JitCompileCodeLockedEventWrapper
16 libcoreclr!MethodDesc::JitCompileCode
17 libcoreclr!MethodDesc::PrepareILBasedCode
18 libcoreclr!CodeVersionManager::PublishVersionableCodeIfNecessary
19 libcoreclr!MethodDesc::DoPrestub
1a libcoreclr!PreStubWorker
1b libcoreclr!ThePreStub
1c libcoreclr!CallDescrWorkerInternal
1d libcoreclr!CallDescrWorkerWithHandler
1e libcoreclr!RuntimeMethodHandle::InvokeMethod

@jkotas jkotas added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Mar 25, 2023
@ghost

ghost commented Mar 25, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.


@jkotas
Member Author

jkotas commented Mar 25, 2023

The method being JITed is System.Runtime.Intrinsics.Tests.Vectors.Vector512Tests.Vector512Int64ExtractMostSignificantBitsTest().

cc @tannergooding

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Mar 25, 2023
@tannergooding
Member

CC. @kunalspathak since this is in LSRA

@tannergooding
Member

Working on getting a local repro. Based on the failing method, this is probably some issue with the KMASK registers.

@tannergooding
Member

Looks to be an issue that only repros with a release clrjit. It doesn't repro with the debug/checked clrjit.
-- Confirmed this isn't due to anything like the build configuration of S.P.Corelib or coreclr.dll

@tannergooding
Member

tannergooding commented Mar 25, 2023

We're seeing physRegRecord->assignedInterval is nullptr. This is for a reg=REG_K1, type=TYP_MASK so it is related to KMASK as suspected.


@tannergooding
Member

Looks like part of this may come down to incomplete handling of the mask register type.

In general, we have many places that check varTypeUsesFloatReg and then assume false means integer. We'll probably want to add a varTypeUsesIntReg and make it the default (most common) check, then fall back to checking varTypeUsesFloatReg on xarch and asserting it on other platforms. On xarch we can then have varTypeUsesMaskReg be the last path.
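A rough sketch of the three-way check described above, using stand-in types rather than the JIT's real var_types. The VarType and RegClass enums and the regClassFor helper are hypothetical and only illustrate the proposed ordering (integer first, then float, mask last); this is not the code from the eventual fix.

```cpp
#include <cassert>

// Hypothetical stand-ins for the JIT's type and register-class notions.
enum class VarType { Int, Long, Ref, Float, Double, Simd16, Simd64, Mask };
enum class RegClass { Integer, Float, Mask };

inline bool usesFloatReg(VarType t)
{
    return t == VarType::Float || t == VarType::Double ||
           t == VarType::Simd16 || t == VarType::Simd64;
}

inline bool usesMaskReg(VarType t)
{
    return t == VarType::Mask;
}

// Explicit integer check, instead of inferring "integer" from !usesFloatReg.
inline bool usesIntReg(VarType t)
{
    return !usesFloatReg(t) && !usesMaskReg(t);
}

// Most common class first, float second, mask as the last path.
inline RegClass regClassFor(VarType t)
{
    if (usesIntReg(t))
    {
        return RegClass::Integer;
    }
    if (usesFloatReg(t))
    {
        return RegClass::Float;
    }
    assert(usesMaskReg(t));
    return RegClass::Mask;
}
```

With only a two-way check, a Mask value would fall into the "not float, so integer" branch; the explicit usesIntReg test avoids that.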

It's not clear yet what's causing this to "work" on debug/checked JITs. Still working on figuring out which #if DEBUG path is allowing things to "work".

@ghost ghost removed in-pr There is an active PR which will close this issue when it is merged untriaged New issue has not been triaged by the area owner labels Mar 25, 2023
@tannergooding
Member

Going to keep this open until the underlying issue is resolved such that EVEX can be enabled by default

@tannergooding tannergooding reopened this Mar 25, 2023
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Mar 25, 2023
@tannergooding
Member

tannergooding commented Mar 25, 2023

debug and checked are working because there is a path here: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/lsra.cpp#L12046

This basically does an early exit of IF_FOUND_GOTO_DONE which causes it to skip the try_* for the various lsra_score.h entries.

In release, we have no freeCandidates because resetAvailableRegs doesn't include availableMaskRegs. This is in addition to potential other issues related to mask registers not having their proper support "in general"
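To illustrate why leaving one register class out of the reset produces an empty free set, here is a small sketch with made-up bit layouts. The Allocator struct, the constants, and both reset functions are assumptions for illustration; the real resetAvailableRegs in lsra is not shown in this thread and may look different.

```cpp
#include <cassert>
#include <cstdint>

using RegMask = std::uint64_t;

// Hypothetical bit layout: one disjoint bit range per register class.
constexpr RegMask kIntRegs   = 0x000000000000FFFFull;
constexpr RegMask kFloatRegs = 0x0000FFFFFFFF0000ull;
constexpr RegMask kMaskRegs  = 0x00FF000000000000ull;

struct Allocator
{
    RegMask availableIntRegs   = kIntRegs;
    RegMask availableFloatRegs = kFloatRegs;
    RegMask availableMaskRegs  = kMaskRegs;
    RegMask available          = 0;

    // Mirrors the described bug: mask registers are never made available.
    void resetWithoutMaskRegs()
    {
        available = availableIntRegs | availableFloatRegs;
    }

    // Includes every register class in the reset.
    void resetWithMaskRegs()
    {
        available = availableIntRegs | availableFloatRegs | availableMaskRegs;
    }

    // Free candidates are the requested registers that are currently available.
    RegMask freeCandidates(RegMask candidates) const
    {
        return candidates & available;
    }
};
```

After resetWithoutMaskRegs, a request restricted to kMaskRegs has no free candidates, so selection would have to fall through to the busy-register heuristics.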

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Mar 25, 2023
@tannergooding
Member

Put up #83927

@SingleAccretion SingleAccretion removed the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Mar 25, 2023
@kunalspathak
Member

kunalspathak commented Mar 26, 2023

debug and checked are working because there is a path here:

That path exists even for release, in IF_FOUND_GOTO_DONE and IF_FOUND_GOTO_DONE, but I recently found out while working on #80297 that we extract the assignedInterval and use it, and only later have code that checks whether it is nullptr:
RefPosition* recentRefPosition = assignedInterval != nullptr ? assignedInterval->recentRefPosition : nullptr;

This usually happens when we have free candidates but still execute the busy register selection heuristic methods. I was running into it for consecutive-register cases, and I can see why it was hit for mask registers: we were not adding them to the available registers. In #80297, I have put up ae2e633 for it. However, here, we should at least add an assert(assignedInterval != nullptr) so we can catch such cases early in checked/debug builds.
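A minimal sketch of the suggested guard, with simplified stand-in structs. RefPosition, Interval, and RegRecord here are hypothetical reductions of the JIT types, and spillCandidateLocation is an invented name; only the null-checked expression is taken from the line quoted above.

```cpp
#include <cassert>

// Hypothetical, heavily reduced versions of the LSRA data structures.
struct RefPosition { int location; };
struct Interval    { RefPosition* recentRefPosition; };
struct RegRecord   { Interval* assignedInterval; };

// Null-tolerant lookup, matching the expression quoted above.
inline RefPosition* recentRefPositionOf(const RegRecord& reg)
{
    Interval* assignedInterval = reg.assignedInterval;
    return assignedInterval != nullptr ? assignedInterval->recentRefPosition : nullptr;
}

// A busy-register heuristic expects an occupied register, so it asserts
// up front rather than dereferencing first and checking later.
inline int spillCandidateLocation(const RegRecord& reg)
{
    assert(reg.assignedInterval != nullptr);
    RefPosition* recent = recentRefPositionOf(reg);
    return recent != nullptr ? recent->location : -1;
}
```

In a build with assertions enabled, calling spillCandidateLocation on a register with no assigned interval stops at the assert instead of crashing later on a null dereference.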

@tannergooding
Member

yes both paths have an IF_FOUND_GOTO_DONE

The difference is DEBUG will always do it first. That is, REGSELECT_HEURISTIC_COUNT is defined to be 17 today:

#define REGSELECT_HEURISTIC_COUNT 17

That means this loop always executes at least once:

runtime/src/coreclr/jit/lsra.cpp

Lines 12044 to 12068 in 95dbfc2

#ifdef DEBUG
    HeuristicFn fn;
    for (int orderId = 0; orderId < REGSELECT_HEURISTIC_COUNT; orderId++)
    {
        IF_FOUND_GOTO_DONE

        RegisterScore heuristicToApply = RegSelectionOrder[orderId];
        if (mappingTable->Lookup(heuristicToApply, &fn))
        {
            (this->*fn)();
            if (found)
            {
                *registerScore = heuristicToApply;
            }

#if TRACK_LSRA_STATS
            INTRACK_STATS_IF(found, linearScan->updateLsraStat(linearScan->getLsraStatFromScore(heuristicToApply),
                                                               refPosition->bbNum));
#endif // TRACK_LSRA_STATS
        }
        else
        {
            assert(!"Unexpected heuristic value!");
        }
    }

and the first thing that loop does is IF_FOUND_GOTO_DONE

Release, on the other hand, always executes some try_* method first, and that's where the failure was occurring, because no mask registers were tracked as available.
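The ordering difference can be shown with a toy model. This is a simplified sketch, not the lsra.cpp logic: Selection, Heuristic, and both select functions are hypothetical, and it assumes a register has already been marked as found before the loop starts.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical selection state: whether a register was found, and how many
// heuristics were executed along the way.
struct Selection
{
    bool found         = false;
    int  heuristicsRun = 0;
};

using Heuristic = std::function<void(Selection&)>;

// Debug-style ordering: test "found" before running each heuristic
// (the break stands in for IF_FOUND_GOTO_DONE).
inline Selection selectCheckFirst(bool alreadyFound, const std::vector<Heuristic>& heuristics)
{
    Selection s;
    s.found = alreadyFound;
    for (const Heuristic& h : heuristics)
    {
        if (s.found)
        {
            break;
        }
        h(s);
        ++s.heuristicsRun;
    }
    return s;
}

// Release-style ordering as described above: a heuristic runs first and
// "found" is only tested afterwards.
inline Selection selectRunFirst(bool alreadyFound, const std::vector<Heuristic>& heuristics)
{
    Selection s;
    s.found = alreadyFound;
    for (const Heuristic& h : heuristics)
    {
        h(s);
        ++s.heuristicsRun;
        if (s.found)
        {
            break;
        }
    }
    return s;
}
```

Under that assumption, the check-first variant never calls a heuristic, while the run-first variant calls exactly one, which is the kind of call where a null assignedInterval could be dereferenced.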

@kunalspathak
Member

Realized that this was uncovered from my #83569 where I am skipping the free register selection heuristics and directly going to the busy selection heuristics.

@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Mar 27, 2023
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label Mar 27, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Apr 26, 2023