Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

macos ci build times out every now and again #2837

Open
dankamongmen opened this issue Jan 11, 2025 · 18 comments
Open

macos ci build times out every now and again #2837

dankamongmen opened this issue Jan 11, 2025 · 18 comments
Assignees
Labels
bug Something isn't working input readin' dem bytes macOS
Milestone

Comments

@dankamongmen
Copy link
Owner

See for instance https://github.com/dankamongmen/notcurses/actions/runs/12726507286/job/35474975934#step:6:44 (a 3.0.12-dirty buid):

est project /Users/runner/work/notcurses/notcurses/build
12      Start  1: notcurses-tester
13 1/10 Test  #1: notcurses-tester .................***Timeout 1500.06 sec
14      Start  2: notcurses-info
15 2/10 Test  #2: notcurses-info ...................   Passed    0.06 sec
16      Start  3: ncpp_build
17 3/10 Test  #3: ncpp_build .......................   Passed    0.05 sec
18      Start  4: ncpp_build_exceptions
19 4/10 Test  #4: ncpp_build_exceptions ............   Passed    0.04 sec
20      Start  5: input-devnull
21 5/10 Test  #5: input-devnull ....................   Passed    0.31 sec
22      Start  6: input-text
23 6/10 Test  #6: input-text .......................   Passed    0.36 sec
24      Start  7: sgr-direct
25 7/10 Test  #7: sgr-direct .......................   Passed    0.06 sec
26      Start  8: sgr-full
27 8/10 Test  #8: sgr-full .........................   Passed    0.05 sec
28      Start  9: rgb
29 9/10 Test  #9: rgb ..............................   Passed    0.05 sec
30      Start 10: rgbbg
3110/10 Test #10: rgbbg ............................   Passed    0.05 sec
32
33 90% tests passed, 1 tests failed out of 10
34
35 Total Test time (real) = 1501.18 sec
36
37 The following tests FAILED:
38	  1 - notcurses-tester (Timeout)
39 Errors while running CTest
40 Output from these tests are in: /Users/runner/work/notcurses/notcurses/build/Testing/Temporary/LastTest.log
41 Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.
42 make: *** [test] Error 8
43 Error: Process completed with exit code 2.

so there are two problems here: we ought rerun with --output-on-failure in the CI, to have more info on this kind of thing. two, we're seeing this timeout. i think i've seen it locally, too, though i thought it was one of the input test programs. maybe it's a general problem? either way, let's get this figured out.

@dankamongmen dankamongmen added the bug Something isn't working label Jan 11, 2025
@dankamongmen dankamongmen added this to the 3.1.0 milestone Jan 11, 2025
@dankamongmen dankamongmen self-assigned this Jan 11, 2025
@dankamongmen
Copy link
Owner Author

i've made the following change on the macos ci:

-          make test
-          #ctest --output-on-failure
+          make test || { ctest --output-on-failure ; false ; }

so we'll still fail out if make test failed, but we'll also rerun.

@dankamongmen
Copy link
Owner Author

huh we could just run ctest --output-on-failure instead of make test, like we're doing for the linux build, no? that way we always get the error output for the failure.

@dankamongmen
Copy link
Owner Author

it's important to note that this does not happen all the time, so a successful build doesn't mean it's fixed.

@dankamongmen
Copy link
Owner Author

i just had a notcurses-demo from 3.0.13 hang on exit. i got stack traces:

(gdb) thread apply all bt

Thread 2 (Thread 0x7f57fab966c0 (LWP 2761558) "notcurses-demo"):
#0  futex_wait (futex_word=0x5653640b1d18, expected=2, private=0) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait (futex=futex@entry=0x5653640b1d18, private=0) at ./nptl/lowlevellock.c:49
#2  0x00007f57ffc503a2 in lll_mutex_lock_optimized (mutex=0x5653640b1d18) at ./nptl/pthread_mutex_lock.c:48
#3  ___pthread_mutex_lock (mutex=mutex@entry=0x5653640b1d18) at ./nptl/pthread_mutex_lock.c:93
#4  0x00007f57ffdd52ef in load_ncinput (ictx=ictx@entry=0x5653640adc90, tni=tni@entry=0x7f57fab95ac0) at ./src/lib/in.c:522
#5  0x00007f57ffdd5560 in kitty_kbd_txt (ictx=ictx@entry=0x5653640adc90, val=113, mods=<optimized out>, mods@entry=1, txt=txt@entry=0x0, evtype=3) at ./src/lib/in.c:839
#6  0x00007f57ffdd5dff in kitty_kbd (ictx=0x5653640adc90, val=<optimized out>, mods=1, evtype=<optimized out>) at ./src/lib/in.c:844
#7  kitty_cb_complex (ictx=0x5653640adc90) at ./src/lib/in.c:1151
#8  0x00007f57ffdd6220 in process_escape (ictx=ictx@entry=0x5653640adc90, buf=buf@entry=0x5653640afc90 "\033[113;1:3uu:afaf/afaf/d7d7\033\\\033]4;147;rgb:afaf/afaf/ffff\033\\\033]4;148;rgb:afaf/d7d7/0000\033\\\033]4;149;rgb:afaf/d7d7/5f5f\033\\\033]4;150;rgb:afaf/d7d7/8787\033\\\033]4;151;rgb:afaf/d7d7/afaf\033\\\033]4;152;rgb:afaf/d7d7/d7d7\033\\\033]4;"..., buflen=10) at ./src/lib/in.c:2238
#9  0x00007f57ffdd77a8 in process_escape (ictx=0x5653640adc90, buf=<optimized out>, buflen=<optimized out>) at ./src/lib/in.c:2214
#10 process_melange (ictx=0x5653640adc90, buf=0x5653640afc90 "\033[113;1:3uu:afaf/afaf/d7d7\033\\\033]4;147;rgb:afaf/afaf/ffff\033\\\033]4;148;rgb:afaf/d7d7/0000\033\\\033]4;149;rgb:afaf/d7d7/5f5f\033\\\033]4;150;rgb:afaf/d7d7/8787\033\\\033]4;151;rgb:afaf/d7d7/afaf\033\\\033]4;152;rgb:afaf/d7d7/d7d7\033\\\033]4;"..., bufused=0x5653640b1cd0) at ./src/lib/in.c:2397
#11 0x00007f57ffdd80f9 in process_ibuf (ictx=<optimized out>) at ./src/lib/in.c:2451
#12 input_thread (vmarshall=0x5653640adc90) at ./src/lib/in.c:2622
#13 0x00007f57ffc4d083 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#14 0x00007f57ffccb7b8 in __GI___clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 1 (Thread 0x7f57fb779880 (LWP 2761557) "notcurses-demo"):
#0  0x00007f57ffc49a1e in __futex_abstimed_wait_common64 (private=128, futex_word=0x7f57fab96990, expected=2761558, op=265, abstime=0x0, cancel=true) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (futex_word=futex_word@entry=0x7f57fab96990, expected=2761558, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=128, cancel=cancel@entry=true) at ./nptl/futex-internal.c:87
#2  0x00007f57ffc49a9b in __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7f57fab96990, expected=<optimized out>, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=128) at ./nptl/futex-internal.c:139
#3  0x00007f57ffc4ebc3 in __pthread_clockjoin_ex (threadid=threadid@entry=140015845336768, thread_return=thread_return@entry=0x0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, block=block@entry=true) at ./nptl/pthread_join_common.c:102
#4  0x00007f57ffc4ea6f in ___pthread_join (threadid=threadid@entry=140015845336768, thread_return=thread_return@entry=0x0) at ./nptl/pthread_join.c:24
#5  0x00007f57ffdda471 in cancel_and_join (name=0x7f57ffe1c754 "input", res=0x0, tid=140015845336768) at ./src/lib/internal.h:1840
#6  stop_inputlayer (ti=ti@entry=0x5653640a5660) at ./src/lib/in.c:2651
#7  0x00007f57ffe0c082 in free_terminfo_cache (ti=ti@entry=0x5653640a5660) at ./src/lib/termdesc.c:198
#8  0x00007f57ffde9dce in notcurses_stop (nc=0x5653640a5360) at ./src/lib/notcurses.c:1472
#9  0x00005653471d4ddb in ?? ()
#10 0x00007f57ffbe4d68 in __libc_start_call_main (main=main@entry=0x5653471d3d60, argc=argc@entry=3, argv=argv@entry=0x7ffdc548da18) at ../sysdeps/nptl/libc_start_call_main.h:58
#11 0x00007f57ffbe4e25 in __libc_start_main_impl (main=0x5653471d3d60, argc=3, argv=0x7ffdc548da18, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffdc548da08) at ../csu/libc-start.c:360
#12 0x00005653471d50c1 in ?? ()
(gdb) 

@dankamongmen
Copy link
Owner Author

hrmmm so input thread is blocking on a lock, not in fgetc(), interesting! that ought make things a lot simpler.

@dankamongmen
Copy link
Owner Author

wow, that locks hard, too -- neither ^C nor ^Z break it

@dankamongmen
Copy link
Owner Author

hrmmm it's not obvious how the input thread could be blocking on taking ilock.

@dankamongmen
Copy link
Owner Author

huh pthread_cond_wait() holds the lock even when returning an error, huh

@dankamongmen
Copy link
Owner Author

so i'm seeing this when running notcurses-demo and pressing q to exit pretty reliably. in that case, we've just read from the input layer...but we shouldn't have it locked by any means. who could be holding this lock?

@dankamongmen
Copy link
Owner Author

hrmm i can't reproduce the notcurses-demo lockup on arbitrary terminals, only on ghostty. very strange!

@dankamongmen
Copy link
Owner Author

i've added some debug code. normally when we see taking lock, we see got lock immediately afterwards:

process_escape:2246:walk result on 49 (1): 0 273
process_escape:2246:walk result on 58 (:): 0 290
process_escape:2246:walk result on 51 (3): 0 291
kitty_kbd_txt:808:v/m/e 1115121 1 3
load_ncinput:522:taking lock
load_ncinput:532:got lock
mark_pipe_ready:475:wrote to readiness pipe
load_ncinput:554:unlocking
process_escape:2246:walk result on 117 (u): 2 292
block_on_input:2484:blocking on input availability
...
process_escape:2246:walk result on 49 (1): 0 275
process_escape:2246:walk result on 49 (1): 0 275
process_escape:2246:walk result on 51 (3): 0 275
kitty_kbd_txt:808:v/m/e 113 0 0
load_ncinput:522:taking lock
load_ncinput:532:got lock
mark_pipe_ready:475:wrote to readiness pipe
load_ncinput:554:unlocking
process_escape:2246:walk result on 117 (u): 2 280
block_on_input:2484:blocking on input availability
block_on_input:2541:waiting on 1 fds (ibuf: 0/8192)
internal_get:2738:draining event readiness pipe 0
ncplane_destroy:1030:destroying 5x25 plane "hud" @ 33x32
ncplane_destroy:1030:destroying 6x72 plane "fps

but then we don't:

process_escape:2246:walk result on 58 (:): 0 290
process_escape:2246:walk result on 51 (3): 0 291
kitty_kbd_txt:808:v/m/e 113 1 3
load_ncinput:522:taking lock
notcurses_rasterize_inner:1276:pile 0x55989260b210 ymax: 45 xmax: 90
notcurses_rasterize_inner:1282:sprixel phase 1
clean_sprixels:892:phase 1 sprixel 551486 state 5 loc 13/43
clean_sprixels:892:phase 1 sprixel 551486 state 3 loc 13/43
sprite_redraw:733:sprixel 551486 state 3
kitty_draw:1180:dumping 683013b for 551486 at 0 0
notcurses_rasterize_inner:1287:glyph phase 1
notcurses_rasterize_inner:1291:sprixel phase 2

@dankamongmen
Copy link
Owner Author

looks like maybe a path in internal_get()? possibly it's getting cancelled in the fgetc() and not unlocking? quite possible.

@dankamongmen
Copy link
Owner Author

dankamongmen commented Jan 24, 2025

also, if you get cancelled in a condition variable wait, don't you reaquire the lock? i think you do. yeah:

CANCELLATION
pthread_cond_wait and pthread_cond_timedwait are cancelation points. If a thread is cancelled while suspended in one of these functions, the thread immediately resumes execution, then locks again the mutex argument to thread_cond_wait and pthread_cond_timedwait, and finally executes the cancelation. Consequently, cleanup handlers are assured that mutex is locked when they are called.

so we need a cleanup handler there, too.

@dankamongmen
Copy link
Owner Author

yep, looks like that got it, w00t! not sure if this is the same thing responsible for the macos ci hang but i suspect it is.

@dankamongmen
Copy link
Owner Author

looks like we just hit this again at https://github.com/dankamongmen/notcurses/actions/runs/13054111564 arrrrrr

@dankamongmen
Copy link
Owner Author

ok the macos issue does not necessarily look input related. i've got a (huge) logfile.

2025-01-30T14:09:50.1344880Z [100%] Linking CXX executable notcurses-tester                                                         
2025-01-30T14:09:50.2092680Z [100%] Built target notcurses-tester                                                                   
2025-01-30T14:09:50.2245860Z ##[group]Run cd build                                                                                  
2025-01-30T14:09:50.2246040Z ^[[36;1mcd build^[[0m                                                                                  
2025-01-30T14:09:50.2246180Z ^[[36;1mctest --output-on-failure^[[0m                                                                 
2025-01-30T14:09:50.2341630Z shell: /bin/bash -e {0}                                                                                
2025-01-30T14:09:50.2341800Z env:                                                                                                   
2025-01-30T14:09:50.2341930Z   COLORTERM: truecolor                                                                                 
2025-01-30T14:09:50.2342060Z   NPROC: 2                                                                                             
2025-01-30T14:09:50.2342160Z   TERM: xterm                                                                                          
2025-01-30T14:09:50.2342280Z ##[endgroup]                                                                                           
2025-01-30T14:09:50.3114940Z Test project /Users/runner/work/notcurses/notcurses/build                                              
2025-01-30T14:09:50.3132490Z       Start  1: notcurses-tester                                                                       
2025-01-30T14:34:50.4544970Z  1/10 Test  #1: notcurses-tester .................***Timeout 1500.06 sec                               
2025-01-30T14:34:50.5266270Z Running with LANG=en_US.UTF-8                                                                          
2025-01-30T14:34:50.5266700Z Running with TERM=xterm                                                                                
2025-01-30T14:34:50.5270240Z ^[[39;49m^[(B^[[m^[[?25lDetected cell geometry: 80x24                                                  
2025-01-30T14:34:50.5270760Z ^[[39;49m^[(B^[[m^[[?1l^[>^[[?12l^[[?25h[doctest] doctest version is "2.4.11"                          
2025-01-30T14:34:50.5283770Z [doctest] run with "--help" for options                                                                
2025-01-30T14:34:50.5512830Z ^[[39;49m^[(B^[[m^[[?25l^[[39;49m^[(B^[[m^[[?1l^[>^[[?12l^[[?25h^[[39;49m^[(B^[[m^[[?25l^[[39;49m^[(B^[
2025-01-30T14:34:50.5538910Z ^[[23m^[[39;49m^[(B^[[m^[[1mDirectMode *bold*!                                                         
2025-01-30T14:34:50.5539200Z ^[[22m^[[39;49m^[(B^[[m^[[4mDirectMode *underline*!                                                    
2025-01-30T14:34:50.5539470Z ^[[24m^[[39;49m^[(B^[[m^[[39;49m^[(B^[[m^[[9mDirectMode *struck*!                                      
2025-01-30T14:34:50.5539740Z ^[[29m^[[39;49m^[(B^[[m^[[48;2;136;136;136mtest background                                             
2025-01-30T14:34:50.5539990Z ^[[38;2;255;0;255m^[[49m/\^[[2D^K\/   

@dankamongmen
Copy link
Owner Author

so can we not run with output-on-err or whatever it is here?

also, we get some compiler warnings on the macos build, it seems:

Z /Users/runner/work/notcurses/notcurses/src/lib/unixsig.c:7:34: warning: macro 'ATOMIC_VAR_INIT' has bee
2025-01-30T14:09:15.3206940Z static void* _Atomic signal_nc = ATOMIC_VAR_INIT(NULL);                                                
2025-01-30T14:09:15.3327660Z                                  ^                                                                     
2025-01-30T14:09:15.3391600Z /Applications/Xcode_15.4.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/15.0.
2025-01-30T14:09:15.3392270Z [ 21%] Building C object CMakeFiles/notcurses-core-static.dir/src/lib/util.c.o                         
2025-01-30T14:09:15.3458170Z #pragma clang deprecated(ATOMIC_VAR_INIT)                                                              
2025-01-30T14:09:15.3526420Z                                         ^                                                              
2025-01-30T14:09:15.3526730Z [ 21%] Building C object CMakeFiles/notcurses-core.dir/src/lib/util.c.o                                
2025-01-30T14:09:15.3543930Z /Users/runner/work/notcurses/notcurses/src/lib/unixsig.c:7:34: warning: macro 'ATOMIC_VAR_INIT' has bee
2025-01-30T14:09:15.3569590Z static void* _Atomic signal_nc = ATOMIC_VAR_INIT(NULL);                                                
2025-01-30T14:09:15.3636340Z                                  ^                                                                     
2025-01-30T14:09:15.3666930Z /Applications/Xcode_15.4.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/15.0.
2025-01-30T14:09:15.3669020Z #pragma clang deprecated(ATOMIC_VAR_INIT)                                                              
2025-01-30T14:09:15.3670650Z                                         ^                                                              
2025-01-30T14:09:15.3671320Z 1 warning generated.                                                                                   
2025-01-30T14:09:15.3671950Z 1 warning generated.   

@dankamongmen
Copy link
Owner Author

2025-01-30T14:09:38.8099230Z /Users/runner/work/notcurses/notcurses/src/tests/metric.cpp:95:5: warning: 'sprintf' is deprecated: This function is provided for compatibility reasons only.  Due to security concerns inherent in the design of sprintf(3), it is highly recommended that you use snprintf(3) instead. [
2025-01-30T14:09:38.8202020Z     sprintf(gold, "%.2fEi", ((double)(INTMAX_MAX - 1ull)) / (1ull << 60));                                                                                                                                                                                                                
2025-01-30T14:09:38.8303760Z     ^                                                                                                                                                                                                                                                                                     
2025-01-30T14:09:38.8335010Z /Applications/Xcode_15.4.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.5.sdk/usr/include/stdio.h:180:1: note: 'sprintf' has been explicitly marked deprecated here                                                                                             
2025-01-30T14:09:38.8336200Z __deprecated_msg("This function is provided for compatibility reasons only.  Due to security concerns inherent in the design of sprintf(3), it is highly recommended that you use snprintf(3) instead.")                                                                                  
2025-01-30T14:09:38.8336820Z ^                                                                                                                                                                                                                                                                                         
2025-01-30T14:09:38.8337410Z /Applications/Xcode_15.4.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.5.sdk/usr/include/sys/cdefs.h:218:48: note: expanded from macro '__deprecated_msg'                                                                                                      
2025-01-30T14:09:38.8338140Z         #define __deprecated_msg(_msg) __attribute__((__deprecated__(_msg)))                                                                                                                                                                                                              
2025-01-30T14:09:38.8459120Z                                                       ^                                                                                                                                                                                                                                   
2025-01-30T14:09:38.8460040Z /Users/runner/work/notcurses/notcurses/src/tests/metric.cpp:98:5: warning: 'sprintf' is deprecated: This function is provided for compatibility reasons only.  Due to security concerns inherent in the design of sprintf(3), it is highly recommended that you use snprintf(3) instead. [
                                                                                                                                                                                                                                                     

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working input readin' dem bytes macOS
Projects
None yet
Development

No branches or pull requests

1 participant