jobs.c: refactor SIGHUP handling; document bug fixed (re: 62cf88d)

There is quite a bit of no-op code in the job_hup() function due to conditions that always test false. This commit removes that code and clarifies the rest, making the purpose of this function clear. job_hup() (before 62cf88d: job_terminate()) is called via job_walk() by sh_done() in fault.c to issue SIGHUP, the "hang up" signal, to every background job's process group when the current session is ungracefully disconnected. (One way to trigger such a disconnection is to forcibly terminate a ssh session by typing '~.' on a new prompt.) The bug that Solaris patch 260-22964338 fixed is that ksh then killed all non-disowned jobs' process groups without considering that ksh still remembers a job even when all its processes are finished (have the P_DONE flag). In that condition, the process group ID may well be reused by another process by now, so it is dangerous to killpg() it; we risk killing unrelated processes! This is *not* a hypothetical problem; the Solaris patch exists because this happened to a Solaris customer. However, the bug exists on all operating systems. It's rarely triggered but serious, and it's more likely to occur on heavy workloads that re-use process/group IDs a lot. And it's on every currently released non-Solaris version of ksh93. Eesh. src/cmd/ksh93/sh/jobs.c: src/cmd/ksh93/include/jobs.h: - Remove job_terminate() which was unused as of 62cf88d. It could have been fixed instead of replaced. Oh well. - Refactor job_hup(): - Remove code that will never be executed because, at those points, it is known that pw->p_pgrp != 0. - Simplify the loop that checks that there is at least one non-P_DONE process so it doesn't need a flag. For documentation purposes, below is a reproducer for the bug before the Solaris patch. It is rather involved. 1. Compile the C program below (cpid). 2. In one terminal, 'ssh localhost'. 3. Within the ssh session: - 'exec -a-ksh /path/to/buggy/ksh' to get a ksh login shell. - 'sleep 1 &' and let it finish. Note down the reported PID. That is the one we will reuse. Let's say 26650. 4. In another terminal, run: ./cpid 26650 (the PID from the previous step). Now wait until it says "PID 26650 is ready"; it has now succeeded at re-using that PID, and will just sit there. This process will never voluntarily terminate. If we have the bug, the termination of this process will be the symptom. 5. In the first terminal, forcibly terminate the ssh session by typing, on a new prompt: ~. (tilde, dot). This triggers the buggy routine to issue SIGHUP to all of ksh's background jobs. 6. In the second terminal, the bug is reproduced if cpid has been terminated, reporting 'waitpid return 26650, status 0x0001', so ksh just killed this process that it had nothing to do with. (Note that status 0x0001 refers to being killed by signal 1 which is SIGHUP.) cpid.c follows (written by George Lijo, tweaked by me): #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <signal.h> #include <sys/wait.h> int main(int argc, char *argv[]) { pid_t pid, rpid, opid; int i, status, npid; if (argc != 2) { fprintf(stderr, "Usage: cpid <PID to re-use>\n"); exit(1); } rpid = atoi(argv[1]); opid = getpid(); for (;;) { if ((pid = fork()) == 0) { setpgrp(); pause(); _exit(0); } if (pid == rpid) break; kill(pid, SIGKILL); waitpid(pid, NULL, 0); if (opid < rpid && pid > rpid) printf("Cannot create PID %d\n", rpid); opid = pid; } printf("PID %d is ready\n", pid); i = waitpid(pid, &status, 0); printf("waitpid return %d, status 0x%4.4x\n", i, status); return status; }
ksh93 · Nov 25, 2021 · 6d3796c · 6d3796c · McDutchie · Nov 26, 2021
1 parent c9d7310
commit 6d3796c
Show file tree

Hide file tree

Showing 2 changed files with 9 additions and 38 deletions.
diff --git a/src/cmd/ksh93/include/jobs.h b/src/cmd/ksh93/include/jobs.h
@@ -178,7 +178,6 @@ extern void	job_chldtrap(Shell_t*, const char*,int);
 	extern void	job_init(Shell_t*,int);
 	extern int	job_close(Shell_t*);
 	extern int	job_list(struct process*,int);
-	extern int	job_terminate(struct process*,int);
 	extern int	job_hup(struct process *, int);
 	extern int	job_switch(struct process*,int);
 	extern void	job_fork(pid_t);

diff --git a/src/cmd/ksh93/sh/jobs.c b/src/cmd/ksh93/sh/jobs.c
@@ -891,16 +891,6 @@ int job_walk(Sfio_t *file,int (*fun)(struct process*,int),int arg,char *joblist[
 	return(r);
 }
 
-/*
- * send signal <sig> to background process group if not disowned
- */
-int job_terminate(register struct process *pw,register int sig)
-{
-	if(pw->p_pgrp && !(pw->p_flag&P_DISOWN))
-		job_kill(pw,sig);
-	return(0);
-}
-
 /*
  * list the given job
  * flag JOB_LFLAG for long listing
@@ -1113,8 +1103,8 @@ int job_kill(register struct process *pw,register int sig)
 }
 
 /*
- * Similar to job_kill, but dedicated to SIGHUP handling when session is
- * being disconnected.
+ * Kill process group with SIGHUP when the session is being disconnected.
+ * (As this is called via job_walk(), it must accept the 'sig' argument.)
  */
 int job_hup(struct process *pw, int sig)
 {
@@ -1123,37 +1113,19 @@ int job_hup(struct process *pw, int sig)
 	if(pw->p_pgrp == 0 || (pw->p_flag & P_DISOWN))
 		return(0);
 	job_lock();
-	if(pw->p_pgrp != 0)
+	/*
+	 * Only kill process group if we still have at least one process. If all the processes are P_DONE,
+	 * then our process group is already gone and its p_pgrp may now be used by an unrelated process.
+	 */
+	for(px = pw; px; px = px->p_nxtproc)
 	{
-		int	palive = 0;
-		for(px = pw; px != NULL; px = px->p_nxtproc)
-		{
-			if((px->p_flag & P_DONE) == 0)
-			{
-				palive = 1;
-				break;
-			}
-		}
-		/*
-		 * If all the processes have died, there is no guarantee that
-		 * p_pgrp is still the valid process group that we made, i.e.,
-		 * the PID may have been recycled and the same p_pgrp may have
-		 * been assigned to unrelated processes.
-		 */
-		if(palive)
+		if(!(px->p_flag & P_DONE))
 		{
 			if(killpg(pw->p_pgrp, SIGHUP) >= 0)
 				job_unstop(pw);
+			break;
 		}
 	}
-	for(; pw != NULL && pw->p_pgrp == 0; pw = pw->p_nxtproc)
-	{
-		if(pw->p_flag & P_DONE)
-			continue;
-		if(kill(pw->p_pid, SIGHUP) >= 0)
-			(void)kill(pw->p_pid, SIGCONT);
-		pw = pw->p_nxtproc;
-	}
 	job_unlock();
 	return(0);
 }