Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Win64 test failures due to #9266 #9366

Closed
tkelman opened this issue Dec 15, 2014 · 31 comments
Closed

Win64 test failures due to #9266 #9366

tkelman opened this issue Dec 15, 2014 · 31 comments
Labels
regression Regression in behavior compared to a previous version system:windows Affects only Windows

Comments

@tkelman
Copy link
Contributor

tkelman commented Dec 15, 2014

cc @vtjnash opening this to centralize discussion on what's wrong and how to fix it. On AppVeyor the current status is a MemoryError in the parallel test. Locally, a few different things are happening.

numbers test dies due to SIGILL (I think the SIGFPE from gmp is normal):
https://gist.github.com/tkelman/36e33256183f54284780

spawn test segfaults (can't get gdb to tell me anything more), but only with sys.dll present:
https://gist.github.com/tkelman/a49de09a22146915aa25

parallel test segfaults similarly (also only when sys.dll is present): https://gist.github.com/tkelman/bac065e719cad36b527d

Note that if you don't have an LLVM 3.3 Win64 build handy, you can do a separate fresh clone and run the same script as on AppVeyor, env ARCH=x86_64 contrib/windows/msys_build.sh.

It looks like part of the answer might be delete sys.dll until we're ready with llvm 3.5. That hasn't been necessary for tests to pass on AppVeyor so far, and there's still the issue with the numbers test even without sys.dll present.

@tkelman tkelman added system:windows Affects only Windows regression Regression in behavior compared to a previous version labels Dec 15, 2014
@tkelman
Copy link
Contributor Author

tkelman commented Dec 15, 2014

Possibly a clue? If I run the numbers test with sys.dll present, but not linked to -lssp, then I get this:

$ usr/bin/julia test/runtests.jl numbers
     * numbers
ERROR: `task_done_hook` has no method matching task_done_hook(::UInt8)
 in wait at no file
while loading D:\cygwin64\home\Tony\julia\test\runtests.jl, in expression starting on line 42

@vtjnash
Copy link
Member

vtjnash commented Dec 15, 2014

i'm not sure why your julia-debug is picking up sys.dll, that's an error and will definitely cause serious issues

since i'm having issues locally with my build, can you try adding a noinline annotation to save_stack in task.c and see if that help? failing that, try also adding a large (0x80) offset to nb in save_stack?

the -lssp change is a red-herring. you are simply enabling/disabling detection of the error at an earlier point

@vtjnash
Copy link
Member

vtjnash commented Dec 15, 2014

but definitely, from reading the assembly, save_stack isn't saving enough of the stack on win64 when it gets inlined

@tkelman
Copy link
Contributor Author

tkelman commented Dec 15, 2014

You mean like this

julia/src/task.c

Lines 251 to 255 in e93672d

#if defined(_OS_WINDOWS_) && !defined(_COMPILER_MINGW_)
static void __declspec(noinline)
#else
static void __attribute__((noinline))
#endif
in front of save_stack? Sure, trying that now.

@tkelman
Copy link
Contributor Author

tkelman commented Dec 15, 2014

Very nice, looks like that worked locally. Should I commit it or do you want to? Anything else to be worried about for now?

@vtjnash
Copy link
Member

vtjnash commented Dec 15, 2014

That should be it. I'm away from a computer until late. If I commit, i would also trim the base offset (remove the 0x10 and reduce the 0x40 back to 0x20), but I can do that in a separate commit.

AppVeyor is helpful here. In the past, I've knowingly left that build broken for weeks until someone finally complained that I needed to complete and push my local changes :). And this one I wouldn't have even have know was broken.

The noinline change also needs to be backported

@tkelman
Copy link
Contributor Author

tkelman commented Dec 15, 2014

Let's see if appveyor agrees with my local machine. Thanks for figuring out how to fix this!

tkelman added a commit that referenced this issue Dec 15, 2014
@tkelman
Copy link
Contributor Author

tkelman commented Dec 15, 2014

I'm violating my usual policy of wanting to have something on master for at least a few days before backporting, done now in 1588014. We should be testing release-0.3 fairly actively over the next week, so if something comes up either there or on master we should have time to fix it.

@tkelman
Copy link
Contributor Author

tkelman commented Dec 15, 2014

@tkelman tkelman reopened this Dec 15, 2014
@tkelman
Copy link
Contributor Author

tkelman commented Dec 15, 2014

Reproduced on my local machine, but only if I do make testall1:

     * parallel
ERROR: `task_done_hook` has no method matching task_done_hook(
 in wait at no filefatal: error thrown and no exception handler available.
Makefile:9: recipe for target 'all' failed
make[1]: *** [all] Error 1
Makefile:441: recipe for target 'testall1' failed
make: *** [testall1] Error 2

@vtjnash
Copy link
Member

vtjnash commented Dec 15, 2014

Is there perhaps a mismatch in the register load/store locations in the setjmp/longjmp code or a register missing from that list that is supposed to be callee saved according to the win64 calling convention?

@vtjnash
Copy link
Member

vtjnash commented Dec 15, 2014

Do your machines have YMM registers?

@tkelman
Copy link
Contributor Author

tkelman commented Dec 15, 2014

My laptop is a Sandy Bridge, so I think so. I should throw in a versioninfo() to ask what AppVeyor's running on.

@vtjnash
Copy link
Member

vtjnash commented Dec 15, 2014

Can you force the tests to pass by setting JULIA_CPU_TARGET to something older?

@tkelman
Copy link
Contributor Author

tkelman commented Dec 15, 2014

Nope, JULIA_CPU_TARGET=core2 didn't help. I'm also setting MARCH = x86-64 on AppVeyor for consistency with how we're building binaries

echo "override MARCH = x86-64" >> Make.user

@vtjnash
Copy link
Member

vtjnash commented Dec 15, 2014

Does it work locally if you disable the inline assembly version (in task.jl)?

@vtjnash
Copy link
Member

vtjnash commented Dec 15, 2014

Can you run the tests in gdb and get a backtrace and observe the pointers getting passed to task_done_hook?

@tkelman
Copy link
Contributor Author

tkelman commented Dec 15, 2014

disable the inline assembly version (in task.jl)?

Do you mean task.c? I'm trying that first, I'll try in gdb next.

@vtjnash
Copy link
Member

vtjnash commented Dec 15, 2014

Yes that one. It nice when someone can interpret for me. It's always bad when I open the wrong one and can't find the expected declarations at all.

@tkelman
Copy link
Contributor Author

tkelman commented Dec 15, 2014

It nice when someone can interpret for me.

Sometimes. But when you say stuff like:

Is there perhaps a mismatch in the register load/store locations in the setjmp/longjmp code or a register missing from that list that is supposed to be callee saved according to the win64 calling convention?

there's a distinct whooshing sound.

Disabling ASM_COPY_STACKS is making the numbers test take a really long time.

@vtjnash
Copy link
Member

vtjnash commented Dec 15, 2014

Try disabling the stack unwind code also (rtlstackwalk). It might be getting confused

@tkelman
Copy link
Contributor Author

tkelman commented Dec 15, 2014

Not sure if I did this right. With this diff

diff --git a/src/task.c b/src/task.c
index 245638a..cca5eb1 100644
--- a/src/task.c
+++ b/src/task.c
@@ -157,9 +157,6 @@ jl_gcframe_t *jl_pgcstack = NULL;
 #ifdef COPY_STACKS
 static jl_jmp_buf * volatile jl_jmp_target;

-#if defined(_CPU_X86_64_) || defined(_CPU_X86_)
-#define ASM_COPY_STACKS
-#endif
 void *jl_stackbase;

 #ifndef ASM_COPY_STACKS
@@ -568,7 +565,7 @@ DLLEXPORT size_t rec_backtrace(ptrint_t *data, size_t maxsize)
     CONTEXT Context;
     memset(&Context, 0, sizeof(Context));
     jl_in_stackwalk = 1;
-    RtlCaptureContext(&Context);
+    //RtlCaptureContext(&Context);
     jl_in_stackwalk = 0;
     return rec_backtrace_ctx(data, maxsize, &Context);
 }

it fails the backtrace, profile, and meta tests, but if I remove those from test/runtests.jl then I get slightly different behavior from make testall1

     * examples
     * parallel
    SUCCESS
fatal: error thrown and no exception handler available.
Makefile:9: recipe for target 'all' failed
make[1]: *** [all] Error 1
Makefile:441: recipe for target 'testall1' failed
make: *** [testall1] Error 2

@vtjnash
Copy link
Member

vtjnash commented Dec 16, 2014

@tkelman are you on IRC now? have you managed to get it to fail in gcc (put a breakpoint on the "fatal: error thrown" line and record the backtrace?

@tkelman
Copy link
Contributor Author

tkelman commented Dec 16, 2014

gdb log from irc, so far https://gist.github.com/tkelman/dcbcac708c188ae308e3

@vtjnash
Copy link
Member

vtjnash commented Dec 16, 2014

proposed patch. from said irc session.

diff --git a/src/init.c b/src/init.c
index 2681e8e..01c5b1f 100644
--- a/src/init.c
+++ b/src/init.c
@@ -232,7 +232,8 @@ void restore_signals(void)
 void jl_throw_in_ctx(jl_value_t *excpt, CONTEXT *ctxThread, int bt)
 {
     assert(excpt != NULL);
-    bt_size = bt ? rec_backtrace_ctx(bt_data, MAX_BT_SIZE, ctxThread) : 0;
+    CONTEXT Ctx = *ctxThread;
+    bt_size = bt ? rec_backtrace_ctx(bt_data, MAX_BT_SIZE, &Ctx) : 0;
     jl_exception_in_transit = excpt;
 #if defined(_CPU_X86_64_)
     ctxThread->Rip = (DWORD64)&jl_rethrow;

@vtjnash
Copy link
Member

vtjnash commented Dec 16, 2014

it should also now be possible to trim the stack back down to the bare necessity:

diff --git a/src/task.c b/src/task.c
index 245638a..a1c4165 100644
--- a/src/task.c
+++ b/src/task.c
@@ -346,10 +346,10 @@ static void ctx_switch(jl_task_t *t, jl_jmp_buf *where)
             restore_stack(t, where, NULL);
         } else {
 #ifdef ASM_COPY_STACKS
-            void *stackbase = jl_stackbase - 0x10;
+            void *stackbase = jl_stackbase;
 #ifdef _CPU_X86_64_
 #ifdef _OS_WINDOWS_
-            stackbase -= 0x40;
+            stackbase -= 0x20;
 #endif
             asm(" movq %0, %%rsp;\n"
                 " xorq %%rbp, %%rbp;\n"

@tkelman
Copy link
Contributor Author

tkelman commented Dec 16, 2014

I'm wondering what the chances are that this was the underlying cause of #7942 (bummer, still get this one) and/or #8895

@vtjnash
Copy link
Member

vtjnash commented Dec 16, 2014

yes, that was certainly possible. what was happening with the old logic is we were unwinding the stack before choosing our RSP pointer, so we were picking the top of the stack to place our jl_rethrow frame, rather than the bottom. that meant it could overwrite meaningful frames with garbage and potentially confuse the stack unwinder. with my recent changes, the top of the stack was now a meaningful stack frame from julia and not some internal bootstrapping frame, so we could see the error more prominently.

@vtjnash
Copy link
Member

vtjnash commented Dec 16, 2014

maybe the underlying cause of #9185 also?

@tkelman
Copy link
Contributor Author

tkelman commented Dec 16, 2014

I'll try to remember to test that (undo my mitigating change to the suitesparse compilation flags) when we get around to backporting the fix, since it was only happening on recent release-0.3.

@tkelman
Copy link
Contributor Author

tkelman commented Dec 16, 2014

Nah, looks like #9185 is still a problem if I build SuiteSparse with MARCH = pentium4, even on latest master apparently.

     * linalg/umfpack

Please submit a bug report with steps to reproduce this fault, and any error messages that follow (in their entirety). Thanks.
Exception: EXCEPTION_ACCESS_VIOLATION at 0x1411ab68 -- unknown function (ip: 336702312)
umfpack_zi_numeric at D:\cygwin64\home\Tony\julia32\usr\bin\libumfpack.DLL (unknown line)
RtlInitUnicodeString at C:\Windows\SysWOW64\ntdll.dll (unknown line)
RtlInitUnicodeString at C:\Windows\SysWOW64\ntdll.dll (unknown line)
Unwind_SjLj_Unregister at D:\cygwin64\home\Tony\julia32\usr\bin\libgcc_s_sjlj-1.dll (unknown line)
RtlInitUnicodeString at C:\Windows\SysWOW64\ntdll.dll (unknown line)
RtlAllocateHeap at C:\Windows\SysWOW64\ntdll.dll (unknown line)
Unwind_SjLj_Register at D:\cygwin64\home\Tony\julia32\usr\bin\libgcc_s_sjlj-1.dll (unknown line)
Znaj at D:\cygwin64\home\Tony\julia32\usr\bin\libstdc++-6.dll (unknown line)
free at C:\Windows\syswow64\msvcrt.dll (unknown line)
free at C:\Windows\syswow64\msvcrt.dll (unknown line)
Makefile:9: recipe for target 'all' failed
make[1]: *** [all] Error 1
Makefile:441: recipe for target 'testall1' failed
make: *** [testall1] Error 2

vtjnash added a commit that referenced this issue Dec 21, 2014
pass a stack-copy of ctxThread in jl_throw_in_ctx to rec_backtrace_ctx

(cherry picked from commit 0b44aca)

fix #7395, and copy just the useful parts of CONTEXT in jl_throw_in_ctx, before rec_backtrace_ctx destroys the recorded stack pointer

(cherry picked from commit ed5a8fc)

Conflicts:
	src/init.c

fix 32 bit oops from ed5a8fc
(cherry picked from commit a2ff1ea)

fix compiler warning on win32

(cherry picked from commit 0050b72)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
regression Regression in behavior compared to a previous version system:windows Affects only Windows
Projects
None yet
Development

No branches or pull requests

2 participants