-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Win64 test failures due to #9266 #9366
Comments
Possibly a clue? If I run the numbers test with
|
i'm not sure why your julia-debug is picking up sys.dll, that's an error and will definitely cause serious issues since i'm having issues locally with my build, can you try adding a noinline annotation to the |
but definitely, from reading the assembly, save_stack isn't saving enough of the stack on win64 when it gets inlined |
You mean like this Lines 251 to 255 in e93672d
save_stack ? Sure, trying that now.
|
Very nice, looks like that worked locally. Should I commit it or do you want to? Anything else to be worried about for now? |
That should be it. I'm away from a computer until late. If I commit, i would also trim the base offset (remove the 0x10 and reduce the 0x40 back to 0x20), but I can do that in a separate commit. AppVeyor is helpful here. In the past, I've knowingly left that build broken for weeks until someone finally complained that I needed to complete and push my local changes :). And this one I wouldn't have even have know was broken. The noinline change also needs to be backported |
Let's see if appveyor agrees with my local machine. Thanks for figuring out how to fix this! |
I'm violating my usual policy of wanting to have something on master for at least a few days before backporting, done now in 1588014. We should be testing release-0.3 fairly actively over the next week, so if something comes up either there or on master we should have time to fix it. |
Damn, appveyor's still unhappy. https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.597/job/gc9in944d1toas7g |
Reproduced on my local machine, but only if I do
|
Is there perhaps a mismatch in the register load/store locations in the setjmp/longjmp code or a register missing from that list that is supposed to be callee saved according to the win64 calling convention? |
Do your machines have YMM registers? |
My laptop is a Sandy Bridge, so I think so. I should throw in a |
Can you force the tests to pass by setting JULIA_CPU_TARGET to something older? |
Nope, julia/contrib/windows/msys_build.sh Line 41 in 592f3d5
|
Does it work locally if you disable the inline assembly version (in task.jl)? |
Can you run the tests in gdb and get a backtrace and observe the pointers getting passed to task_done_hook? |
Do you mean |
Yes that one. It nice when someone can interpret for me. It's always bad when I open the wrong one and can't find the expected declarations at all. |
Sometimes. But when you say stuff like:
there's a distinct whooshing sound. Disabling |
Try disabling the stack unwind code also (rtlstackwalk). It might be getting confused |
Not sure if I did this right. With this diff
it fails the backtrace, profile, and meta tests, but if I remove those from
|
@tkelman are you on IRC now? have you managed to get it to fail in gcc (put a breakpoint on the "fatal: error thrown" line and record the backtrace? |
gdb log from irc, so far https://gist.github.com/tkelman/dcbcac708c188ae308e3 |
proposed patch. from said irc session. diff --git a/src/init.c b/src/init.c
index 2681e8e..01c5b1f 100644
--- a/src/init.c
+++ b/src/init.c
@@ -232,7 +232,8 @@ void restore_signals(void)
void jl_throw_in_ctx(jl_value_t *excpt, CONTEXT *ctxThread, int bt)
{
assert(excpt != NULL);
- bt_size = bt ? rec_backtrace_ctx(bt_data, MAX_BT_SIZE, ctxThread) : 0;
+ CONTEXT Ctx = *ctxThread;
+ bt_size = bt ? rec_backtrace_ctx(bt_data, MAX_BT_SIZE, &Ctx) : 0;
jl_exception_in_transit = excpt;
#if defined(_CPU_X86_64_)
ctxThread->Rip = (DWORD64)&jl_rethrow; |
it should also now be possible to trim the stack back down to the bare necessity: diff --git a/src/task.c b/src/task.c
index 245638a..a1c4165 100644
--- a/src/task.c
+++ b/src/task.c
@@ -346,10 +346,10 @@ static void ctx_switch(jl_task_t *t, jl_jmp_buf *where)
restore_stack(t, where, NULL);
} else {
#ifdef ASM_COPY_STACKS
- void *stackbase = jl_stackbase - 0x10;
+ void *stackbase = jl_stackbase;
#ifdef _CPU_X86_64_
#ifdef _OS_WINDOWS_
- stackbase -= 0x40;
+ stackbase -= 0x20;
#endif
asm(" movq %0, %%rsp;\n"
" xorq %%rbp, %%rbp;\n" |
yes, that was certainly possible. what was happening with the old logic is we were unwinding the stack before choosing our RSP pointer, so we were picking the top of the stack to place our jl_rethrow frame, rather than the bottom. that meant it could overwrite meaningful frames with garbage and potentially confuse the stack unwinder. with my recent changes, the top of the stack was now a meaningful stack frame from julia and not some internal bootstrapping frame, so we could see the error more prominently. |
maybe the underlying cause of #9185 also? |
I'll try to remember to test that (undo my mitigating change to the suitesparse compilation flags) when we get around to backporting the fix, since it was only happening on recent release-0.3. |
Nah, looks like #9185 is still a problem if I build SuiteSparse with
|
pass a stack-copy of ctxThread in jl_throw_in_ctx to rec_backtrace_ctx (cherry picked from commit 0b44aca) fix #7395, and copy just the useful parts of CONTEXT in jl_throw_in_ctx, before rec_backtrace_ctx destroys the recorded stack pointer (cherry picked from commit ed5a8fc) Conflicts: src/init.c fix 32 bit oops from ed5a8fc (cherry picked from commit a2ff1ea) fix compiler warning on win32 (cherry picked from commit 0050b72)
cc @vtjnash opening this to centralize discussion on what's wrong and how to fix it. On AppVeyor the current status is a MemoryError in the parallel test. Locally, a few different things are happening.
numbers test dies due to SIGILL (I think the SIGFPE from gmp is normal):
https://gist.github.com/tkelman/36e33256183f54284780
spawn test segfaults (can't get gdb to tell me anything more), but only with sys.dll present:
https://gist.github.com/tkelman/a49de09a22146915aa25
parallel test segfaults similarly (also only when sys.dll is present): https://gist.github.com/tkelman/bac065e719cad36b527d
Note that if you don't have an LLVM 3.3 Win64 build handy, you can do a separate fresh clone and run the same script as on AppVeyor,
env ARCH=x86_64 contrib/windows/msys_build.sh
.It looks like part of the answer might be delete sys.dll until we're ready with llvm 3.5. That hasn't been necessary for tests to pass on AppVeyor so far, and there's still the issue with the numbers test even without sys.dll present.
The text was updated successfully, but these errors were encountered: