on Windows, (quit) sometimes hangs #24
Comments
It's also happening to me, even though I am building with MSYS2 (on Windows 10). It seems to happen very late in the game, post-exit, when the last thread is shut down and the kernel waits indefinitely. I got a gdb backtrace here. More when I know more.
I've confirmed this on 64-bit Windows 10 with 64-bit Clozure CL 1.11-r16635. The first time (quit) is entered, Clozure hangs and the command line window has to be closed manually with its Close ("X") button. When Clozure is started again in a new command line window, (quit) works fine and closes the window as Clozure exits. On subsequent attempts, starting Clozure and using (quit) hung the command line window 3 times in a row before the 4th try closed the window.
As a temporary workaround, you can use Credit to |
This is also happening to me with the current version (Clozure Common Lisp Version 1.11.5/v1.11.5 (WindowsX8664)). It has only happened intermittently, when I have created background threads and then want to exit (I'm using bordeaux-threads for threads, by the way). If I have not started a thread, (quit) always works with no problem.
(quit) hangs in Windows cmd windows, but quits normally in the Cygwin environment.
I am experiencing the same on Win64, always, in both Git Bash and DOS prompts, whether installed manually or via Roswell. The only way out is to type
I rebuild CCL via
Interrupting CCL with Ctrl+C simply kills the process instead of dropping into the debugger. This issue was reproduced on Travis: https://travis-ci.com/phoe-trash/ccl/jobs/250105562 I am attaching the full compilation log.
@johnfredcee I was able to reproduce the stacktrace using Windows Process Explorer.
The interesting thing here is that frame
Warning, amateur C debugging ahead. My stack looks like this:
At frame 4, we have the Lines 1701 to 1707 in dd5622e
So it seems to attempt to grab the lock for the TCR area, and that attempt hangs indefinitely. This is weird, since this thread is currently the only thread on the system (aside from the debugger thread that, I assume, was created by gdb). Is it possible that another thread did not release that lock properly before dying, which in turn would deadlock the main thread?
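The suspicion above, that a thread died while still holding the lock, can be illustrated outside CCL. What follows is a minimal Python sketch of that general failure mode, not CCL's kernel code: `tcr_area_lock` and `dying_thread` are hypothetical names, and a timeout stands in for the indefinite wait so that the example terminates.

```python
import threading

# Hypothetical stand-in for the lock guarding the TCR area; not CCL code.
tcr_area_lock = threading.Lock()


def dying_thread():
    # Acquire the lock and return without releasing it,
    # mimicking a thread that dies while still holding the lock.
    tcr_area_lock.acquire()


def main_thread_can_lock(timeout=0.2):
    # Try to take the lock the way the last remaining thread would at exit.
    # The timeout replaces the indefinite wait so the sketch terminates.
    acquired = tcr_area_lock.acquire(timeout=timeout)
    if acquired:
        tcr_area_lock.release()
    return acquired


worker = threading.Thread(target=dying_thread)
worker.start()
worker.join()  # the worker is gone, but the lock it took stays held

print(worker.is_alive())       # False
print(main_thread_can_lock())  # False
```

Without the timeout, the second call would block forever even though no other thread is alive, which matches the shape of the hang described in the comment, assuming the lock really was left held.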
Given its name, I can infer that `lazarus` is meant to resurrect a dying thread if it still has a TCR: if the TCR exists, it calls start_lisp; if it doesn't, it does nothing.

start_lisp, in turn, is an assembly function that has the following comment inside it: This is called from C code when a thread (including the initial thread) starts execution. (Historically, it also provided a primitive way of "resettting" a thread in the event of catastrophic failure, but this hasn't worked in a long time.)

I assume that lazarus() is the function mentioned here. It does seem to try to "reset" a thread when that thread makes an exit call (lazarus is bound to atexit() calls in the main function). If that is true, I infer that a possible fix would be to remove this function from the CCL codebase, along with its atexit() bindings. This should solve GitHub issue Clozure#24, where an exit call that then calls lazarus() sometimes deadlocks when (ccl:quit) is called.
I am testing the above concept on Travis. I told it to repeatedly rebuild CCL using itself; this way, only the first rebuild (the one that uses the original bootstrapping binaries) has a risk of hanging. Commit: phoe-trash@ce4a854
Using Travis, I have recompiled CCL 130+ times on Windows (half 32-bit and half 64-bit), and I have not seen this issue reappear. (Logs are in the above comment.) I think that I have likely fixed this bug. Everyone affected: please cherry-pick phoe-trash@ce4a854, test it yourself on Windows, and tell me if you can observe this issue appearing again. @xrme: please tell me if this change might have any adverse effects on CCL as a whole. Once someone else can confirm the results, I'll submit a PR.
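For anyone who wants to repeat this kind of test locally, a small harness can run a command many times and count the runs that never finish. This is a rough Python sketch, not the Travis configuration used above: `count_hangs` is a hypothetical helper, and the stand-in commands use the Python interpreter instead of a real CCL build, so the command list needs to be replaced with whatever starts and quits CCL on your machine.

```python
import subprocess
import sys


def count_hangs(command, runs, timeout):
    """Run `command` `runs` times; return how many runs exceeded `timeout` seconds."""
    hangs = 0
    for _ in range(runs):
        try:
            # subprocess.run kills the child and raises if the timeout expires.
            subprocess.run(command, timeout=timeout, check=False,
                           stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        except subprocess.TimeoutExpired:
            hangs += 1
    return hangs


# Stand-in commands (assumptions for illustration, not real CCL invocations).
quick = [sys.executable, "-c", "pass"]
stuck = [sys.executable, "-c", "import time; time.sleep(60)"]

print(count_hangs(quick, runs=3, timeout=30))
print(count_hangs(stuck, runs=2, timeout=1))
```

A count of zero over many runs does not prove the hang is gone, since the bug is intermittent, but a nonzero count is a quick way to show it is still there.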
It is impossible for me to test on Travis because Windows jobs frequently hang; I have already had to manually restart https://travis-ci.com/phoe-trash/ccl/jobs/252126627?utm_medium=notification&utm_source=email four times, and it still fails to run correctly. Please review and merge #233.
http://trac.clozure.com/ccl/ticket/1345
http://trac.clozure.com/ccl/ticket/1393
http://trac.clozure.com/ccl/ticket/1408
http://trac.clozure.com/ccl/ticket/1409
possibly related:
http://trac.clozure.com/ccl/ticket/1142
For some reason, (quit) sometimes hangs on Windows. It seems to happen with both 32- and 64-bit versions of CCL, and on both 32- and 64-bit versions of Windows.
http://trac.clozure.com/ccl/ticket/1345 has the most details.