on Windows, (quit) sometimes hangs #24

Closed
xrme opened this issue Mar 4, 2017 · 12 comments · Fixed by #233
@xrme
Member

xrme commented Mar 4, 2017

http://trac.clozure.com/ccl/ticket/1345
http://trac.clozure.com/ccl/ticket/1393
http://trac.clozure.com/ccl/ticket/1408
http://trac.clozure.com/ccl/ticket/1409

possibly related:
http://trac.clozure.com/ccl/ticket/1142

For some reason, (quit) sometimes hangs on Windows. It seems to happen with both 32- and 64-bit versions of CCL, and on both 32- and 64-bit versions of Windows.

http://trac.clozure.com/ccl/ticket/1345 has the most details.

I've noticed that I can avoid the problem quite reliably when I change the compatibility setting of the wx86cl64.exe binary to Windows7. (This setting can be changed in the properties dialogue for the file wx86cl64.exe)



@xrme xrme added the windows label Mar 4, 2017
@johnfredcee
Contributor

johnfredcee commented Mar 29, 2017

It's also happening to me, even though I am building with MSYS2 (on Windows 10). It seems to happen very late, after exit has been called: the last thread is being shut down and the kernel waits indefinitely. I got a gdb backtrace, shown here. More when I know more.

By default, when a single location is given, display ten lines.
This can be changed using "set listsize", and the current value
can be shown using "show listsize".
(gdb) info threads
  Id   Target Id         Frame
  1    Thread 16756.0x2594 0x00007ff9c9d84ed4 in ntdll!ZwWaitForSingleObject () from C:\WINDOWS\SYSTEM32\ntdll.dll
* 2    Thread 16756.0x55dc 0x00007ff9c9d886a1 in ntdll!DbgBreakPoint () from C:\WINDOWS\SYSTEM32\ntdll.dll
(gdb) thread 1
[Switching to thread 1 (Thread 16756.0x2594)]
#0  0x00007ff9c9d84ed4 in ntdll!ZwWaitForSingleObject () from C:\WINDOWS\SYSTEM32\ntdll.dll
(gdb) bt
#0  0x00007ff9c9d84ed4 in ntdll!ZwWaitForSingleObject () from C:\WINDOWS\SYSTEM32\ntdll.dll
#1  0x00007ff9c68875ff in WaitForSingleObjectEx () from C:\WINDOWS\System32\KernelBase.dll
#2  0x0000000000035a4c in sem_wait_forever (s=0x1a0) at ../thread_manager.c:462
#3  lock_recursive_lock (m=0x10c6ec0, tcr=tcr@entry=0x22b00620) at ../thread_manager.c:275
#4  0x0000000000027644 in lazarus () at ../pmcl-kernel.c:1693
#5  0x00007ff9c8c1aa8f in msvcrt!_initterm_e () from C:\WINDOWS\System32\msvcrt.dll
#6  0x00007ff9c8be799a in msvcrt!_wfindnexti64 () from C:\WINDOWS\System32\msvcrt.dll
#7  0x00007ff9c9d09d9f in ntdll!RtlDeactivateActivationContextUnsafeFast () from C:\WINDOWS\SYSTEM32\ntdll.dll
#8  0x00007ff9c9ce806b in ntdll!LdrShutdownProcess () from C:\WINDOWS\SYSTEM32\ntdll.dll
#9  0x00007ff9c9ce7d94 in ntdll!RtlExitUserProcess () from C:\WINDOWS\SYSTEM32\ntdll.dll
#10 0x00007ff9c6923dff in _exit () from C:\WINDOWS\System32\KernelBase.dll
#11 0x0000000000026cfd in _SPffcall () at ../x86-spentry64.s:4276
#12 0x0000000022b00620 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) thread 2
[Switching to thread 2 (Thread 16756.0x55dc)]
#0  0x00007ff9c9d886a1 in ntdll!DbgBreakPoint () from C:\WINDOWS\SYSTEM32\ntdll.dll
(gdb) bt
#0  0x00007ff9c9d886a1 in ntdll!DbgBreakPoint () from C:\WINDOWS\SYSTEM32\ntdll.dll
#1  0x00007ff9c9db0b2a in ntdll!DbgUiRemoteBreakin () from C:\WINDOWS\SYSTEM32\ntdll.dll
#2  0x00007ff9c93f8364 in KERNEL32!BaseThreadInitThunk () from C:\WINDOWS\System32\kernel32.dll
#3  0x00007ff9c9d45e91 in ntdll!RtlUserThreadStart () from C:\WINDOWS\SYSTEM32\ntdll.dll
#4  0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb)

@ninejaguar

I've confirmed this on 64-bit Windows 10 with 64-bit Clozure CL 1.11-r16635.

The first time (quit) is entered, Clozure hangs and the command line window needs to be exited manually by clicking on its Close button that's labeled with an "X".

When Clozure is started and loaded into memory again with a new command line window, (quit) works fine and closes the command line window as Clozure is stopped and unloaded from memory.

Subsequent tries starting Clozure and using (quit) hung the command line window 3 times in a row until the 4th try worked in closing the window.

@cmoore

cmoore commented Oct 11, 2017

As a temporary workaround, you can use (#__exit 0). Note that this calls the underlying Windows C runtime _exit() function directly, so do any cleanup you need beforehand.

Credit to phoe_ on #lisp
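
A minimal sketch of why this workaround would avoid the hang, assuming that (#__exit 0) resolves to the C `_exit` function and that the hang sits in a handler registered with atexit() (as the gdb backtrace in this thread suggests). In C, exit() runs atexit() handlers while _exit() skips them. The sketch uses POSIX fork() and pipe() so the effect can be observed from a parent process; it is not CCL code, and the names `on_exit_handler` and `handler_runs` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Write end of the pipe the child uses to report that its handler ran. */
static int report_fd = -1;

/* Hypothetical exit-time handler; stands in for whatever atexit() would run. */
static void on_exit_handler(void)
{
    char mark = 'x';
    ssize_t n = write(report_fd, &mark, 1);
    (void)n;
}

/* Returns 1 if the atexit() handler ran in a child process, 0 if it did not,
   and -1 if the child could not be set up. */
int handler_runs(int use_underscore_exit)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {                 /* child */
        close(fds[0]);
        report_fd = fds[1];
        atexit(on_exit_handler);
        if (use_underscore_exit)
            _exit(0);               /* terminates without running handlers */
        exit(0);                    /* runs handlers, then terminates */
    }

    /* parent: one byte means the handler ran, end-of-file means it did not */
    close(fds[1]);
    char buf;
    ssize_t got = read(fds[0], &buf, 1);
    close(fds[0]);

    int status = 0;
    waitpid(pid, &status, 0);
    assert(WIFEXITED(status));      /* both paths end the child normally */
    (void)status;
    return got == 1 ? 1 : 0;
}
```

Calling handler_runs(0) exercises the exit() path and handler_runs(1) the _exit() path. The same property is the reason for the "do any cleanup beforehand" warning: anything that relies on exit-time handlers is skipped as well.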

@defunkydrummer

This is also happening to me with the current version (Clozure Common Lisp Version 1.11.5/v1.11.5 (WindowsX8664)). It has only happened intermittently, when I have created background threads and then want to exit. (I'm using bordeaux-threads for threads, by the way.)

If I have not started a thread, (quit) always works with no problem.

@hufengtao

(quit) hangs in a Windows cmd window, but quits normally in a Cygwin environment.

@bcalco

bcalco commented Jan 13, 2019

I am experiencing the same on Win64: always, in both Git Bash and DOS prompts, whether installed manually or via Roswell. The only way out is to type (:kill 1) to kill the listener thread (or whichever one describes itself as the listener when you inquire with (:proc)). It looks hung, but CTRL+C gets you to a place where you can type (quit) and it exits. I imagine, then, that the issue is the main listener thread not letting go. Hope this helps someone fix this issue.

@phoe
Contributor

phoe commented Oct 28, 2019

I rebuild CCL via (progn (ccl:rebuild-ccl :full t :verbose t) (ccl:quit)). Sometimes, at the end of the building process, the newest CCL master hangs with 0% CPU usage shown in top:

...
;Loading ./bin/ccl-export-syms.wx64fsl
;Loading ./l1-fasls/version.wx64fsl
;Loading H:/tools/msys64/home/emiherd/ccl/bin/jp-encode.wx64fsl
;Loading H:/tools/msys64/home/emiherd/ccl/bin/cn-encode.wx64fsl
;Loading H:/tools/msys64/home/emiherd/ccl/library/lispequ.wx64fsl
;Loading H:/tools/msys64/home/emiherd/ccl/library/sockets.wx64fsl
his file and in "ccl:l1;l1-files.lisp.newest"

Interrupting CCL with Ctrl+C simply kills the process instead of dropping into the debugger.

This issue was reproduced on Travis: https://travis-ci.com/phoe-trash/ccl/jobs/250105562

Attaching full compilation log.

@phoe
Contributor

phoe commented Oct 28, 2019

@johnfredcee I was able to reproduce the stacktrace using Windows Process Explorer.

0x0000000000000000
ntdll.dll!ZwWaitForSingleObject+0x14
KERNELBASE.dll!WaitForSingleObjectEx+0x8f
wx86cl64_old.exe+0x25b9c
wx86cl64_old.exe+0x175f4
msvcrt.dll!initterm_e+0x1bf
msvcrt.dll!wfindnexti64+0x6aa
ntdll.dll!RtlDeactivateActivationContextUnsafeFast+0x1bf
ntdll.dll!LdrShutdownProcess+0x14b
ntdll.dll!RtlExitUserProcess+0xb4
KERNELBASE.dll!exit+0xcf
wx86cl64_old.exe+0x16cad
0x0000000000000000

The interesting thing here is that frame wx86cl64_old.exe+0x25b9c directly calls the hanging KERNELBASE.dll!WaitForSingleObjectEx+0x8f. I am not a Windows developer or proficient with Windows debugging in any way, but this suggests that it might be CCL that is somehow to blame.

@phoe
Contributor

phoe commented Oct 28, 2019

Warning, amateur C debugging ahead.

My stack looks like this:

#0  0x00007ffba66e5ac4 in ntdll!ZwWaitForSingleObject ()
   from C:\Windows\SYSTEM32\ntdll.dll
#1  0x00007ffba3824abf in WaitForSingleObjectEx ()
   from C:\Windows\System32\KernelBase.dll
#2  0x0000000000035b9c in sem_wait_forever (s=0x290)
    at ../thread_manager.c:462
#3  lock_recursive_lock (m=0x8d7700, tcr=tcr@entry=0x22a07a0)
    at ../thread_manager.c:275
#4  0x00000000000275f4 in lazarus () at ../pmcl-kernel.c:1707
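
A small worked check, under one assumption that the thread never states: that the executable image is loaded at 0x10000. With that base, the module-relative offsets in the Process Explorer trace (wx86cl64_old.exe+0x25b9c and +0x175f4) translate to the absolute addresses gdb reports for sem_wait_forever/lock_recursive_lock and lazarus. The function names below are hypothetical helpers, not part of CCL or gdb.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: image base of the executable. Inferred from the two traces,
   not stated anywhere in the thread. */
#define ASSUMED_IMAGE_BASE UINT64_C(0x10000)

/* Translate a module-relative offset (Process Explorer style) into an
   absolute address (gdb style). */
uint64_t offset_to_address(uint64_t image_base, uint64_t module_offset)
{
    assert(module_offset <= UINT64_MAX - image_base); /* no wrap-around */
    return image_base + module_offset;
}

/* Count how many frame pairs quoted in the thread agree under the assumed base. */
int count_matching_frames(void)
{
    static const struct {
        uint64_t module_offset; /* from "wx86cl64_old.exe+0x..." */
        uint64_t gdb_address;   /* from the gdb backtrace */
    } frames[] = {
        { UINT64_C(0x25b9c), UINT64_C(0x35b9c) }, /* sem_wait_forever / lock_recursive_lock */
        { UINT64_C(0x175f4), UINT64_C(0x275f4) }, /* lazarus */
    };

    int matches = 0;
    for (size_t i = 0; i < sizeof frames / sizeof frames[0]; i++) {
        if (offset_to_address(ASSUMED_IMAGE_BASE, frames[i].module_offset)
                == frames[i].gdb_address)
            matches++;
    }
    return matches;
}
```

Both pairs differ by exactly 0x10000, which is consistent with the two traces showing the same hang, provided the assumed base is right.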

At frame 4, we have the lazarus function. From its name I assumed that it is called when Lisp wants to close itself. I'm wrong about that one, but it doesn't really matter here. It does the following:

ccl/lisp-kernel/pmcl-kernel.c

Lines 1701 to 1707 in dd5622e

void
lazarus()
{
  TCR *tcr = get_tcr(false);
  if (tcr) {
    /* Some threads may be dying; no threads should be created. */
    LOCK(lisp_global(TCR_AREA_LOCK),tcr);

So it seems to attempt to grab the lock for the TCR area, and that attempt hangs indefinitely. This is weird, since this thread is currently the only thread in the process (aside from the debugger thread that, I assume, was created by gdb).

Is it possible that another thread has not released that lock properly before dying, which in turn would deadlock the main thread?
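
A minimal sketch of that hypothesis, using a POSIX mutex as a hypothetical stand-in for the TCR area lock (CCL's real lock is a recursive lock built on a Windows semaphore, so this only illustrates the general failure mode). A worker thread takes the lock and ends without releasing it; the only remaining thread then finds the lock still held. The probe uses trylock so the sketch itself cannot hang; a blocking lock call in the same place would wait forever.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for the lock behind lisp_global(TCR_AREA_LOCK). */
static pthread_mutex_t area_lock = PTHREAD_MUTEX_INITIALIZER;

/* A worker that takes the lock and then ends without releasing it. */
static void *leaky_worker(void *unused)
{
    (void)unused;
    pthread_mutex_lock(&area_lock);
    return NULL; /* the thread is gone, but the lock is still held */
}

/* What an exit-time handler would find: 1 if the lock is free, 0 if it is
   still held. trylock is used so that the demo cannot deadlock. */
int lock_is_free_at_exit(void)
{
    if (pthread_mutex_trylock(&area_lock) == 0) {
        pthread_mutex_unlock(&area_lock);
        return 1;
    }
    return 0;
}

/* Runs the worker to completion, then probes the lock from the only
   remaining thread. Call this once: a second worker would block forever. */
int orphaned_lock_demo(void)
{
    pthread_t worker;
    int rc = pthread_create(&worker, NULL, leaky_worker, NULL);
    assert(rc == 0);
    (void)rc;
    pthread_join(worker, NULL);
    return lock_is_free_at_exit();
}
```

If orphaned_lock_demo() returns 0, the lock outlived its owner, which is the situation in which a handler that blocks on that lock would never return, even though no other thread is left to release it.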

phoe added a commit to phoe-trash/ccl that referenced this issue Oct 28, 2019
Given its name, I can infer that `lazarus` is meant to resurrect a dying thread if it still has a TCR: if the TCR exists, then it calls start_lisp; if it doesn't, then it does nothing. start_lisp, in turn, is an assembly function that has the following comment inside it:

  This is called from C code when a thread (including the initial thread) starts execution.  (Historically, it also provided a primitive way of "resettting" a thread in the event of catastrophic failure, but this hasn't worked in a long time.)

I assume that lazarus() is the function mentioned here. It does seem to try to "reset" a thread in the event of it doing an exit call (lazarus is registered via atexit() calls in the main function).

If that is true, I infer that a possible fix would be to remove this function from the CCL codebase, along with its atexit() bindings. This should solve GitHub issue Clozure#24, where a call that then calls lazarus() sometimes deadlocks when (ccl:quit) is called.
@phoe
Contributor

phoe commented Oct 28, 2019

@phoe
Contributor

phoe commented Oct 28, 2019

Using Travis, I have recompiled CCL 130+ times on Windows (half of the builds 32-bit and half 64-bit), and I have not seen this issue reappear. (Logs are in the above comment.)

I think that I have likely fixed this bug.

Everyone affected: please cherry-pick phoe-trash@ce4a854, test it yourself on Windows, and tell me if you can observe this issue appearing again.

@xrme: please tell me if this change might have any adverse effects on CCL as a whole.

Once someone else can confirm the results, I'll submit a PR.

phoe added a commit to phoe-trash/ccl that referenced this issue Oct 31, 2019
@phoe
Contributor

phoe commented Nov 2, 2019

It is impossible for me to test on Travis because Windows jobs frequently hang; I have already needed to manually restart https://travis-ci.com/phoe-trash/ccl/jobs/252126627?utm_medium=notification&utm_source=email four times, and it still fails to run correctly.

Please review and merge #233.

@xrme xrme closed this as completed in #233 Nov 2, 2019
xrme pushed a commit that referenced this issue Aug 11, 2023