Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
jobs.c: refactor SIGHUP handling; document bug fixed (re: 62cf88d)
There is quite a bit of no-op code in the job_hup() function due to conditions that always test false. This commit removes that code and clarifies the rest, making the purpose of this function clear. job_hup() (before 62cf88d: job_terminate()) is called via job_walk() by sh_done() in fault.c to issue SIGHUP, the "hang up" signal, to every background job's process group when the current session is ungracefully disconnected. (One way to trigger such a disconnection is to forcibly terminate a ssh session by typing '~.' on a new prompt.) The bug that Solaris patch 260-22964338 fixed is that ksh then killed all non-disowned jobs' process groups without considering that ksh still remembers a job even when all its processes are finished (have the P_DONE flag). In that condition, the process group ID may well be reused by another process by now, so it is dangerous to killpg() it; we risk killing unrelated processes! This is *not* a hypothetical problem; the Solaris patch exists because this happened to a Solaris customer. However, the bug exists on all operating systems. It's rarely triggered but serious, and it's more likely to occur on heavy workloads that re-use process/group IDs a lot. And it's on every currently released non-Solaris version of ksh93. Eesh. src/cmd/ksh93/sh/jobs.c: src/cmd/ksh93/include/jobs.h: - Remove job_terminate() which was unused as of 62cf88d. It could have been fixed instead of replaced. Oh well. - Refactor job_hup(): - Remove code that will never be executed because, at those points, it is known that pw->p_pgrp != 0. - Simplify the loop that checks that there is at least one non-P_DONE process so it doesn't need a flag. For documentation purposes, below is a reproducer for the bug before the Solaris patch. It is rather involved. 1. Compile the C program below (cpid). 2. In one terminal, 'ssh localhost'. 3. Within the ssh session: - 'exec -a-ksh /path/to/buggy/ksh' to get a ksh login shell. - 'sleep 1 &' and let it finish. Note down the reported PID. That is the one we will reuse. Let's say 26650. 4. In another terminal, run: ./cpid 26650 (the PID from the previous step). Now wait until it says "PID 26650 is ready"; it has now succeeded at re-using that PID, and will just sit there. This process will never voluntarily terminate. If we have the bug, the termination of this process will be the symptom. 5. In the first terminal, forcibly terminate the ssh session by typing, on a new prompt: ~. (tilde, dot). This triggers the buggy routine to issue SIGHUP to all of ksh's background jobs. 6. In the second terminal, the bug is reproduced if cpid has been terminated, reporting 'waitpid return 26650, status 0x0001', so ksh just killed this process that it had nothing to do with. (Note that status 0x0001 refers to being killed by signal 1 which is SIGHUP.) cpid.c follows (written by George Lijo, tweaked by me): #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <signal.h> #include <sys/wait.h> int main(int argc, char *argv[]) { pid_t pid, rpid, opid; int i, status, npid; if (argc != 2) { fprintf(stderr, "Usage: cpid <PID to re-use>\n"); exit(1); } rpid = atoi(argv[1]); opid = getpid(); for (;;) { if ((pid = fork()) == 0) { setpgrp(); pause(); _exit(0); } if (pid == rpid) break; kill(pid, SIGKILL); waitpid(pid, NULL, 0); if (opid < rpid && pid > rpid) printf("Cannot create PID %d\n", rpid); opid = pid; } printf("PID %d is ready\n", pid); i = waitpid(pid, &status, 0); printf("waitpid return %d, status 0x%4.4x\n", i, status); return status; }
- Loading branch information
6d3796c
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ping @lijog – this may be of interest.