-
Notifications
You must be signed in to change notification settings - Fork 154
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sighup fix #1
Sighup fix #1
Conversation
This is a fix for an issue which was reported to the ast-developers list. The details can be found at https://www.mail-archive.com/[email protected]/msg01887.html The fix ensures that the HUP signal is sent to the process group only if there is some process with the same process group id whose P_DONE is not set, i.e which hasn't yet completed execution and "wait"ed upon yet. Fix provided by [email protected]
Besides the obvious style issues (e.g., the extraneous parens in statements like |
* being disconnected. | ||
*/ | ||
int | ||
job_hup(struct process *pw, int sig) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can remove the sig
argument if it's dedicated to SIGHUP
handling.
#endif /* SHOPT_COSHELL */ | ||
|
||
job_lock(); | ||
if (pw->p_pgrp != 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has been tested above. This condition can not be false
.
job_unstop(pw); | ||
} | ||
} | ||
for (; pw != NULL && pw->p_pgrp == 0; pw = pw->p_nxtproc) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This loop will never execute as we return from this function if pw->p_pgrp == 0
above.
pid_t pid; | ||
int r; | ||
|
||
if (pw->p_pgrp == 0 || (pw->p_flag & P_DISOWN)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am curious how SIGHUP
will be handled for processes in current process group if we return from here.
After further reflection I recommend rejecting this change. The fact that the current code is sending a signal to a process that is no longer a child process, other than in response to an explicit |
FWIW, This change is from https://github.com/oracle/solaris-userland/blob/master/components/ksh93/patches/260-22964338.patch. |
For now I am closing this pull request. As @krader1961 mentioned we need to have a deeper look at the job management code. |
Signed-off-by: Igor Gnatenko <[email protected]>
This fixes the following bug filed with Solaris: "22964338 ksh93 appears to send SIGHUP to unrelated processes on occasion". It is fixed by applying this patch by Lijo George from the Solaris repo: https://github.com/oracle/solaris-userland/blob/master/components/ksh93/patches/260-22964338.patch The ksh2020 upstream rejected this, but if it's in production use in Solaris, Solaris, it's probably good enough for 93u+m. If any breakage is left, it can be fixed later. att#1 src/cmd/ksh93/include/jobs.h, src/cmd/ksh93/sh/fault.c, src/cmd/ksh93/sh/jobs.c: - Use a new job_hup() function instead of job_kill() to send SIGHUP to job processes on termination. The new function checks if a job is in fact still live before issuing SIGHUP to it.
`builtin -d` should not delete special builtins
ksh segfaults in job_chksave after receiving SIGCHLD https://bugs.launchpad.net/ubuntu/+source/ksh/+bug/1697501 Eric Desrochers wrote on 2017-06-12: [Impact] * The compiler optimization dropped parts from the ksh job locking mechanism from the binary code. As a consequence, ksh could terminate unexpectedly with a segmentation fault after it received the SIGCHLD signal. [Test Case] Unfortunately, there is no clear and easy way to reproduce the segfault. * But the original reporter of this bug can randomly reproduce the problem using an in-house ksh script that only works inside his infrastructure as follow : "ksh <in-house-script.ksh>" and then once in a while ksh will segfault as follow : (gdb) bt #0 job_chksave (pid=pid@entry=19003) at /build/ksh-6IEHIC/ksh-93u+20120801/src/cmd/ksh93/sh/jobs.c:1948 att#1 0x00000000004282ab in job_reap (sig=17) at /build/ksh-6IEHIC/ksh-93u+20120801/src/cmd/ksh93/sh/jobs.c:428 att#2 <signal handler called> ... [Regression Potential] * Regression risk : low/none expected, the package has been highly/intensively tested by a user who run over 18M ksh scripts a day on each of their clusters. [...] * The fix has been written by RH and has been proven to work for them for the last 3 years. * A test package including the RH fix has been intensively tested and verified (pre-SRU) by an affected user with positive feedbacks using a reproducer that segfault without the RH patch. * Test package (pre-SRU) feedbacks : https://bugs.launchpad.net/ubuntu/xenial/+source/ksh/+bug/1697501/comments/7 [Other Info] * Details about the RH bug : - https://bugzilla.redhat.com/show_bug.cgi?id=1123467 - https://bugzilla.redhat.com/show_bug.cgi?id=1112306 - https://access.redhat.com/solutions/1253243 - http://rhn.redhat.com/errata/RHBA-2014-1015.html - ksh.spec * Fri Jul 25 2014 Michal Hlavinka <email address hidden> - 20120801-10.8 * job locking mechanism did not survive compiler optimization (#1123467) - patch * ksh-20120801-locking.patch Debian bug: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=867181 [Original Description] # gdb [New LWP 3882] Core was generated by `/bin/ksh <KSH_SCRIPT>.ksh'. Program terminated with signal SIGSEGV, Segmentation fault. #0 job_chksave (pid=pid@entry=19385) at /build/ksh-6IEHIC/ksh-93u+20120801/src/cmd/ksh93/sh/jobs.c:1948 1948 if(jp->pid==pid) (gdb) p *jp Cannot access memory at address 0xb (gdb) p *jp->pid Cannot access memory at address 0x13 (gdb) p pid $2 = 19385 (gdb) p *jpold $1 = {next = 0xb, pid = -604008960, exitval = 11124} The struct is corrupted at some point looking at the next,pid and exitval struct members values which isn't valid data. # assembly code => 0x0000000000427159 <+41>: cmp %edi,0x8(%rdx) (gdb) p $edi ## pid variable $1 = 19385 (gdb) p *($rdx + 8) ## jp->pid struct Cannot access memory at address 0x13 -- ksh is segfaulting because it can't access struct "jp" ($rdx) thus cannot de-reference the struct member "jp>pid" ($rdx + 8) at line : src/cmd/ksh93/sh/jobs.c:1948 when looking if jp->pid is equal to pid ($edi) variable.
Hopefully this doesn't introduce new bugs, but it does fix at least the following: 1. When whence -v/-a found an "undefined" (i.e. autoloadable) function in $FPATH, it actually loaded the function as a side effect of reporting on its existence (!). Now it only reports. 2. 'whence' will now canonicalise paths properly. Examples: $ whence ///usr/lib/../bin//./env /usr/bin/env $ (cd /; whence -v dev/../usr/bin//./env) dev/../usr/bin//./env is /usr/bin/env 3. 'whence' no longer prefixes a spurious double slash when doing something like 'cd / && whence bin/echo'. On Cygwin, an initial double slash denotes a network server, so this was not just a cosmetic problem. 4. 'whence -a' now reports a "tracked alias" (a.k.a. hash table entry, i.e. cached $PATH search) even if an actual alias by the same name exists. This needed fixing because in fact the hash table entry continues to be used when bypassing the alias. Aliases and "tracked aliases" are not remotely the same thing; confusing nomenclature is not a reason to report wrong results. 5. When using 'hash' or 'alias -t' on a command that is also a builtin to force caching a $PATH search for the external command, 'whence -a' double-reported the path: $ hash printf; whence -a printf printf is a shell builtin printf is /usr/bin/printf printf is a tracked alias for /usr/bin/printf This is now fixed so that the second output line is gone. Plus, if there were multiple versions of the command on $PATH, the tracked alias was reported at the end, which is the wrong order. This is also fixed. src/cmd/ksh93/bltins/whence.c: whence(): - Refactor the do...while loop that handles whence -v/-a for path searches in such a way that the code actually makes sense and stops looking like higher esotericism. Just doing this fixed att#2, att#4 and att#5 above (the latter two before I even noticed them). For instance, the path_fullname() call to canonicalise paths was already there; it was just never used. - Remove broken 'notrack' flaggery for deciding whether to report a hash table entry a.k.a. "tracked alias"; instead, check the hash table (shp->track_tree). src/cmd/ksh93/sh/path.c: - path_search(): Re att#3: When prefixing the PWD, first check if we're in '/' and if so, don't prefix it; otherwise, adding the next slash causes an initial double slash. (Since '/' is the only valid single-character absolute path, all we need to do is check if the second character pwd[1] is non-null.) - path_search(): Re att#1: Stop autoloading when called by 'whence': * The 'flag==2' check to avoid autoloading a function was broken. The flag value is 2 on the first whence() loop iteration, but 3 on subsequent ones. Change to 'flag >= 2'. * However, this only fixes it if the function file does not have the x permission bit, as executable files are handled by path_absolute() which unconditionally autoloads functions! So, pass on our flag parameter when callling path_absolute(). - path_absolute(): Re att#1: Add flag parameter. Do not autoload functions if flag >= 2. src/cmd/ksh93/include/path.h, src/cmd/ksh93/bltins/typeset.c, src/cmd/ksh93/sh/main.c, src/cmd/ksh93/sh/xec.c: - Re att#1: Update path_absolute() calls, adding a 0 flag parameter. src/cmd/ksh93/include/name.h: - Remove now-unused pathcomp member from union Value. It was introduced in 9906535 to allow examining the value of a tracked alias. This commit uses nv_getval() instead. src/cmd/ksh93/tests/builtins.sh, src/cmd/ksh93/tests/path.sh: - Add and tweak various related tests. Fixes: ksh93#84
For one Red Hat customer, the following reproducer consistently crashed, tough I was not able to reproduce it and neither was RH. However, the crash analysis is sound (see below). function dlog { fc -ln -0 } trap dlog DEBUG >/tmp/blah Original patch: https://src.fedoraproject.org/rpms/ksh/blob/642af4d6/f/ksh-20140801-arraylen.patch The Red Hat bug thread is closed to the public as it also contains some correspondence with their customer. But it has an excellent crash analysis from Thomas Gardner which I'm including here for the record (the line numbers are for their ksh at the time, not 93u+m). ===begin analysis=== > The creation of an empty file instead of a command that executes > anything causes the coredump. [...] > Here is my analysis on the core that was provided by the customer: > > (gdb) bt > #0 sh_fmtq (string=0x1 <Address 0x1 out of bounds>) > at /usr/src/debug/ksh-20120801/src/cmd/ksh93/sh/string.c:340 > att#1 0x0000000000457e96 in out_string (cp=<value optimized out>, c=32, > quoted=<value optimized out>, iop=<value optimized out>) > at /usr/src/debug/ksh-20120801/src/cmd/ksh93/sh/xec.c:444 > att#2 0x000000000045804c in sh_debug (shp=0x76d180, trap=0x7f2f13a821e0 "dlog", > name=<value optimized out>, subscript=<value optimized out>, > argv=0x76e070, flags=<value optimized out>) > at /usr/src/debug/ksh-20120801/src/cmd/ksh93/sh/xec.c:548 > att#3 0x000000000045a867 in sh_exec (t=0x7f2f13aafad0, flags=4) > at /usr/src/debug/ksh-20120801/src/cmd/ksh93/sh/xec.c:1265 > [...need go no further...] > > In frame 2, we can see it cycling through your classic > (char **)argv array like: > > 543 while(cp = *argv++) > 544 { > 545 if((flags&ARG_EXP) && argv[1]==0) > 546 out_pattern(iop, cp,' '); > 547 else > 548 out_string(iop, cp,' ',n?0: (flags&(ARG_RAW|ARG_NOGLOB))||*argv); > 549 } > 550 if(flags&ARG_ASSIGN) > 551 sfputc(iop,')'); > 552 else if(iop==stkstd) > > (we seg-fault after going down the out_string function in line > 548 up there). The string pointer that points to = 0x1 up in > frame #0 (sh_fmtq) traces back to the "cp" variable in line 548 > up there. The "argv" variable being referenced up there just gets > passed in as the fifth argument to this function. > > In frame att#3 (sh_exec, line 1265), the line that makes the call > that takes us to frame 2 is: > > 1265 int n = sh_debug(shp,trap,(char*)0,(char*)0, com, ARG_R AW); > > so "com" (the fifth argument) is what's going wrong as it > descends down through these calls. Looking at where it comes > from, well, it's assigned here: > > 1241 if(argn==0) > 1242 { > 1243 /* fake 'true' built-in */ > 1244 np = SYSTRUE; > 1245 *argv = nv_name(np); > 1246 com = argv; > 1247 } > > because as we can see: > > (gdb) f 3 > att#3 0x000000000045a867 in sh_exec (t=0x7f2f13aafad0, flags=4) > at /usr/src/debug/ksh-20120801/src/cmd/ksh93/sh/xec.c:1265 > 1265 int n = sh_debug(shp,trap,(char*)0,(char*)0, com, ARG_RAW); > (gdb) p argn > $2 = 0 > (gdb) > > argn is == 0 here. The tip-off here is that nv_name clearly > returns a simple pointer to an array of characters, not an array > of pointers to arrays of characters as is evidenced by the fact > that the assignment is "*argv = nv_name(np);" not "argv = > nv_name(np);". Looking at the function nv_name proves that it > does indeed return a single pointer to an array of characters, > not a pointer to an array of pointers to arrays of characters. > Now, com is defined as a 'char **': > > 1002 char *cp=0, **com=0, *comn; > > (as it is expected to be in the calls that follow) also, that > argv is also defined as the effective equivalent a 'char **': > > 1237 static char *argv[1]; > > Yup, argv is actually an array of pointers (char ** equivalent), > but that array is restricted to having exactly one element. > Recalling the assignment in the previously quoted line: > > 1245 *argv = nv_name(np); > > we see that the one and only element in that argv array is > getting assigned a pointer to an array of characters here. > Nothing necessarily wrong with that, but remember the loop we > looked at earlier in frame att#2 (sh_debug). It went like: > > 543 while(cp = *argv++) > 544 { > 545 if((flags&ARG_EXP) && argv[1]==0) > 546 out_pattern(iop, cp,' '); > 547 else > 548 out_string(iop, cp,' ',n?0: (flags&(ARG_RAW|ARG_NOGLOB))||*argv); > 549 } > > which is clearly expecting argv in this context (com in frame 3, > which really points to that static local single element array > that is also pointed to by argv in frame 2) to be an array of > pointers of indefinite size, each element being a pointer, but > whose last element will be a null pointer. Well, in frame 3 it is > clearly an array with only a single element, and that one element > is NOT pointing to null. Watch this: > > (gdb) f 3 > att#3 0x000000000045a867 in sh_exec (t=0x7f2f13aafad0, flags=4) > at /usr/src/debug/ksh-20120801/src/cmd/ksh93/sh/xec.c:1265 > 1265 int n = sh_debug(shp,trap,(char*)0,(char*)0, com, ARG_RAW); > (gdb) p com > $8 = (char **) 0x76e060 > (gdb) p &argv > $9 = (char *(*)[1]) 0x76e060 > (gdb) p com[0] > $11 = 0x5009c6 "true" > (gdb) p com[1] > $10 = 0x1 <Address 0x1 out of bounds> > (gdb) p argv[0] > $12 = 0x5009c6 "true" > (gdb) p argv[1] > $13 = 0x1 <Address 0x1 out of bounds> > (gdb) > > So, as expected, com and &argv point to the same place, the first > element points to the constant string "true", but since the array > is defined as having only one element, when you refer to a second > element in that array, you get well, whatever random crap happens > to be in that memory location. When we try to reproduce this > problem, apparently we're getting 0 there (or we're not quite > following this same code path, which is also possible), but the > customer happens to have a "1" in that memory location. ===end analysis=== src/cmd/ksh93/sh/xec.c: sh_exec(): - When processing TCOM (simple command) with an empty/null command, increase the size of the static dummy argv[1] array to argv[2], ensuring a terminating NULL element so that 'while(cp = *argv++)' loops don't crash. (Note that static objects are automatically initialised to zero in C.) src/cmd/ksh93/tests/io.sh: - Adapt the reproducer, testing a null-command redirection 1000x.
Original patch: https://src.fedoraproject.org/rpms/ksh/blob/642af4d6/f/ksh-20140801-diskfull.patch Prior discussion: https://www.mail-archive.com/[email protected]/msg01037.html https://www.mail-archive.com/[email protected]/msg01038.html https://www.mail-archive.com/[email protected]/msg01042.html https://bugzilla.redhat.com/1212992 On Fri, 08 May 2015 14:37:45 -0700, Paulo Andrade wrote: > I have a user with a ksh crashing problem, and that has > some "Write error: No space left on device" messages > in /var/log/messages. > > After some debugging, and creating a chroot on a file > disk image, and a test user, and slowly filling the > "on file" filesystem, e.g. > > dd if=/dev/zero of=/mnt/tmp/zerosN bs=1M count=1024 > dd if=/dev/zero of=/mnt/tmp/zerosN bs=1K count=2 > > until leaving just around 12K, I managed to reproduce the > problem, and be able to debug it with valgrind and vgdb; > debugging on these conditions is tricky, as cannot tell > valgrind to spawn gdb, because then gdb itself would fail > to start. > > So, after following the code enough, I learned that at places > it handles SH_JMPEXIT, there was almost non existing > handling of SH_JMPERREXIT. > > ksh would evently cause a crash due to the struct > subshell allocated on stack, in sh/subshell.c:sh_subshell > kept set to the global subshell_data, after it siglongjmp > back the stack due to, not fully handling the out of disk > space errors. It would print a few messages, everytime > a pipe was created, e.g.: > > /etc/profile: line 28: write to 3 failed [No space left on device] > > until eventually crashing due to corrupted memory; e.g. the > references to stack data from sh_subsell in the global > subshell_data. One strange thing to me in coredump analysis > was that subshell_data prev field was pointing to itself when > it eventually crashed, what later was understood and expected... > > The attached patch handles SH_JMPERREXIT in the code > paths SH_JMPEXIT is handled, and the failed login, on > full disk, ends in a pause() call: > > ---terminal 1--- > $ valgrind -q --leak-check=full --free-fill=0x5a --vgdb=full > --vgdb-error=0 /bin/ksh -l > ==17730== (action at startup) vgdb me ... > ==17730== > ==17730== TO DEBUG THIS PROCESS USING GDB: start GDB like this > ==17730== /path/to/gdb /bin/ksh > ==17730== and then give GDB the following command > ==17730== target remote | /usr/lib64/valgrind/../../bin/vgdb --pid=17730 > ==17730== --pid is optional if only one valgrind process is running > ==17730== > ==17730== Syscall param mount(type) points to unaddressable byte(s) > ==17730== at 0x563377A: mount (in /usr/lib64/libc-2.17.so) > ==17730== by 0x493E58: fs3d_mount (fs3d.c:115) > ==17730== by 0x493C8B: fs3d (fs3d.c:57) > ==17730== by 0x423E41: sh_init (init.c:1302) > ==17730== by 0x405CD3: sh_main (main.c:141) > ==17730== by 0x405B84: main (pmain.c:45) > ==17730== Address 0x0 is not stack'd, malloc'd or (recently) free'd > ==17730== > ==17730== (action on error) vgdb me ... > ==17730== Continuing ... > /etc/profile: line 28: write to 3 failed [No space left on device] > ---8<--- > > ---terminal 2--- > (gdb) c > Continuing. > ^C > Program received signal SIGTRAP, Trace/breakpoint trap. > 0x00000000055fa470 in __pause_nocancel () from /lib64/libc.so.6 > (gdb) bt > #0 0x00000000055fa470 in __pause_nocancel () from /lib64/libc.so.6 > att#1 0x000000000041e73d in sh_done (ptr=0x793360 <sh>, sig=255) at > /home/pcpa/rhel/ksh/ksh-20120801/src/cmd/ksh93/sh/fault.c:665 > att#2 0x0000000000407407 in exfile (shp=0x4542, iop=0xff, fno=0) at > /home/pcpa/rhel/ksh/ksh-20120801/src/cmd/ksh93/sh/main.c:604 > att#3 0x0000000000405c43 in sh_source (shp=0x793360 <sh>, iop=0x0, > file=0x524804 <e_sysprofile> "/etc/profile") > at /home/pcpa/rhel/ksh/ksh-20120801/src/cmd/ksh93/sh/main.c:109 > att#4 0x00000000004060e4 in sh_main (ac=2, av=0xfff000498, userinit=0x0) > at /home/pcpa/rhel/ksh/ksh-20120801/src/cmd/ksh93/sh/main.c:202 > att#5 0x0000000000405b85 in main (argc=2, argv=0xfff000498) at > /home/pcpa/rhel/ksh/ksh-20120801/src/cmd/ksh93/sh/pmain.c:45 > (gdb) > ---8<---
No description provided.