Skip to content

Commit

Permalink
jobs.c: refactor SIGHUP handling; document bug fixed (re: 62cf88d)
Browse files Browse the repository at this point in the history
There is quite a bit of no-op code in the job_hup() function due
to conditions that always test false. This commit removes that code
and clarifies the rest, making the purpose of this function clear.

job_hup() (before 62cf88d: job_terminate()) is called via
job_walk() by sh_done() in fault.c to issue SIGHUP, the "hang up"
signal, to every background job's process group when the current
session is ungracefully disconnected. (One way to trigger such a
disconnection is to forcibly terminate a ssh session by typing '~.'
on a new prompt.)

The bug that Solaris patch 260-22964338 fixed is that ksh then
killed all non-disowned jobs' process groups without considering
that ksh still remembers a job even when all its processes are
finished (have the P_DONE flag). In that condition, the process
group ID may well be reused by another process by now, so it is
dangerous to killpg() it; we risk killing unrelated processes!

This is *not* a hypothetical problem; the Solaris patch exists
because this happened to a Solaris customer. However, the bug
exists on all operating systems. It's rarely triggered but serious,
and it's more likely to occur on heavy workloads that re-use
process/group IDs a lot. And it's on every currently released
non-Solaris version of ksh93. Eesh.

src/cmd/ksh93/sh/jobs.c:
src/cmd/ksh93/include/jobs.h:
- Remove job_terminate() which was unused as of 62cf88d.
  It could have been fixed instead of replaced. Oh well.
- Refactor job_hup():
  - Remove code that will never be executed because, at
    those points, it is known that pw->p_pgrp != 0.
  - Simplify the loop that checks that there is at least
    one non-P_DONE process so it doesn't need a flag.

For documentation purposes, below is a reproducer for the bug
before the Solaris patch. It is rather involved.

1. Compile the C program below (cpid).
2. In one terminal, 'ssh localhost'.
3. Within the ssh session:
   - 'exec -a-ksh /path/to/buggy/ksh' to get a ksh login shell.
   - 'sleep 1 &' and let it finish. Note down the reported PID.
     That is the one we will reuse. Let's say 26650.
4. In another terminal, run: ./cpid 26650 (the PID from the
   previous step). Now wait until it says "PID 26650 is ready"; it
   has now succeeded at re-using that PID, and will just sit there.
   This process will never voluntarily terminate. If we have the
   bug, the termination of this process will be the symptom.
5. In the first terminal, forcibly terminate the ssh session by
   typing, on a new prompt: ~. (tilde, dot). This triggers the
   buggy routine to issue SIGHUP to all of ksh's background jobs.
6. In the second terminal, the bug is reproduced if cpid has been
   terminated, reporting 'waitpid return 26650, status 0x0001', so
   ksh just killed this process that it had nothing to do with.
   (Note that status 0x0001 refers to being killed by signal 1
   which is SIGHUP.)

cpid.c follows (written by George Lijo, tweaked by me):

 #include <stdio.h>
 #include <stdlib.h>
 #include <unistd.h>
 #include <signal.h>
 #include <sys/wait.h>

 int main(int argc, char *argv[])
 {
	pid_t	pid, rpid, opid;
	int	i, status, npid;
	if (argc != 2)
	{
		fprintf(stderr, "Usage: cpid <PID to re-use>\n");
		exit(1);
	}
	rpid = atoi(argv[1]);
	opid = getpid();
	for (;;)
	{
		if ((pid = fork()) == 0)
		{
			setpgrp();
			pause();
			_exit(0);
		}
		if (pid == rpid)
			break;
		kill(pid, SIGKILL);
		waitpid(pid, NULL, 0);
		if (opid < rpid && pid > rpid)
			printf("Cannot create PID %d\n", rpid);
		opid = pid;
	}
	printf("PID %d is ready\n", pid);
	i = waitpid(pid, &status, 0);
	printf("waitpid return %d, status 0x%4.4x\n", i, status);
	return status;
 }
  • Loading branch information
McDutchie committed Nov 25, 2021
1 parent c9d7310 commit 6d3796c
Show file tree
Hide file tree
Showing 2 changed files with 9 additions and 38 deletions.
1 change: 0 additions & 1 deletion src/cmd/ksh93/include/jobs.h
Original file line number Diff line number Diff line change
Expand Up @@ -178,7 +178,6 @@ extern void job_chldtrap(Shell_t*, const char*,int);
extern void job_init(Shell_t*,int);
extern int job_close(Shell_t*);
extern int job_list(struct process*,int);
extern int job_terminate(struct process*,int);
extern int job_hup(struct process *, int);
extern int job_switch(struct process*,int);
extern void job_fork(pid_t);
Expand Down
46 changes: 9 additions & 37 deletions src/cmd/ksh93/sh/jobs.c
Original file line number Diff line number Diff line change
Expand Up @@ -891,16 +891,6 @@ int job_walk(Sfio_t *file,int (*fun)(struct process*,int),int arg,char *joblist[
return(r);
}

/*
* send signal <sig> to background process group if not disowned
*/
int job_terminate(register struct process *pw,register int sig)
{
if(pw->p_pgrp && !(pw->p_flag&P_DISOWN))
job_kill(pw,sig);
return(0);
}

/*
* list the given job
* flag JOB_LFLAG for long listing
Expand Down Expand Up @@ -1113,8 +1103,8 @@ int job_kill(register struct process *pw,register int sig)
}

/*
* Similar to job_kill, but dedicated to SIGHUP handling when session is
* being disconnected.
* Kill process group with SIGHUP when the session is being disconnected.
* (As this is called via job_walk(), it must accept the 'sig' argument.)
*/
int job_hup(struct process *pw, int sig)
{
Expand All @@ -1123,37 +1113,19 @@ int job_hup(struct process *pw, int sig)
if(pw->p_pgrp == 0 || (pw->p_flag & P_DISOWN))
return(0);
job_lock();
if(pw->p_pgrp != 0)
/*
* Only kill process group if we still have at least one process. If all the processes are P_DONE,
* then our process group is already gone and its p_pgrp may now be used by an unrelated process.
*/
for(px = pw; px; px = px->p_nxtproc)
{
int palive = 0;
for(px = pw; px != NULL; px = px->p_nxtproc)
{
if((px->p_flag & P_DONE) == 0)
{
palive = 1;
break;
}
}
/*
* If all the processes have died, there is no guarantee that
* p_pgrp is still the valid process group that we made, i.e.,
* the PID may have been recycled and the same p_pgrp may have
* been assigned to unrelated processes.
*/
if(palive)
if(!(px->p_flag & P_DONE))
{
if(killpg(pw->p_pgrp, SIGHUP) >= 0)
job_unstop(pw);
break;
}
}
for(; pw != NULL && pw->p_pgrp == 0; pw = pw->p_nxtproc)
{
if(pw->p_flag & P_DONE)
continue;
if(kill(pw->p_pid, SIGHUP) >= 0)
(void)kill(pw->p_pid, SIGCONT);
pw = pw->p_nxtproc;
}
job_unlock();
return(0);
}
Expand Down

1 comment on commit 6d3796c

@McDutchie
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ping @lijog – this may be of interest.

Please sign in to comment.