Skip to content
This repository has been archived by the owner on Sep 9, 2020. It is now read-only.

Use gentler subprocess termination semantics #857

Merged
merged 4 commits into from
Jul 21, 2017

Conversation

sdboyer
Copy link
Member

@sdboyer sdboyer commented Jul 20, 2017

What does this do / why do we need it?

This should help address the overwhelming majority of reports of cache corruption, which occur as a result of Ctrl-c causing git to exit abruptly and leaving a corrupted repository state on disk. Given how common it is for people to Ctrl-c during their first dep init while dep is cloning down bunches of repos, this could help a lot. As @natefinch put it in #823:

Even though dep is actually doing something, it's a long process, and thus it looks like it is hanging. Because it looks hung, people hit ctrl-C, which then corrupts the dep cache, and screws up further dep init attempts.

What should your reviewer look out for in this PR?

Note that this PR sidesteps the problem for Windows. I'm hoping one of our Windows folks might be able to do a follow-up on that side - @ChrisHines? @mattn? (I did see this)

Do you need help or clarification on anything?

I really don't like checking cmd.ProcessState != nil as a way of determining if the process has exited, but that was the most obvious solution that jumped out at me.

If folks could give this a shot to see if their issues go away, that'd be great. Of course, proving there ISN'T an problem through empirical tests is not fully possible, so just best-effort would be fine.

Which issue(s) does this PR fix?

fixes #469
fixes #790
fixes #409
fixes #823 (well enough for now, at least)

This allows us to rely on gentler termination semantics (os.Interrupt)
for UNIX-y systems, as such interrupts do not work on Windows.
Terminations are triggered by e.g. ctrl-C interrupts. Relying generally
on Process.Kill() for such purposes was causing git to often exit in
unclean ways, resulting in a dirty cache dir and putting users in an
awkward position.

We maintain the invariant that we ensure the process IS either exited,
or forcibly Process.Kill()ed, before exiting monitoredCmd.run(). This
allows us to ensure we do not have orphan subprocesses that exist beyond
the lifetime of the parent sourceMgr.
@natefinch
Copy link

Not sure if this was discussed elsewhere, but would it be easier to just retain kill semantics and simply delete the folder git was working on? Then you don't have to worry about what state git leaves the repo in, you just assume if it's not done, it's not in a good state.

@mattn
Copy link
Member

mattn commented Jul 20, 2017

If you hope to terminate all of child processes, you've better to use taskkill with /T

exec.Command("taskkill", "/F", "/T", "/PID", fmt.Sprint(cmd.Process.Pid)).Run()

Or call TerminateProcess recursibly.

https://github.com/mattn/psutil/blob/master/psutil_windows.go#L53

@sdboyer
Copy link
Member Author

sdboyer commented Jul 20, 2017

Not sure if this was discussed elsewhere, but would it be easier to just retain kill semantics and simply delete the folder git was working on? Then you don't have to worry about what state git leaves the repo in, you just assume if it's not done, it's not in a good state.

I think git is quite careful to ensure it cleans up after itself - possibly even incrementally, so that progress is saved? idk. anyway. It might end up being safer to blow away dirs with in-process activity in them when a cancellation is received. But this isn't simple:

  1. The monitoredCmd framework is used to manage ALL subprocess invocations. As such, innocuous things, like e.g. git ls-remote, which literally cannot change any local state, could end up being causes for killing a local repo. That's potentially a lot of collateral damage.
  2. We don't really have much in the way of code right now for detecting this kind of thing on the return trip back out from a command invocation, so it'd be new code paths to check it and do the removal. Probably not a huge deal, but it's also a new conceptual span of time in which work might be done, which adds more to complexity than the simplicity of the operation might suggest on its own. So, in that particular sense, it's not really "easier," though it may be in the bigger picture.

@mattn cool, i'll have a look tomorrow. any options for terminating just a specific child process, or does it have to be all of them at once?

@ChrisHines
Copy link
Contributor

I've seen it be the case that trying to call TerminateProcess recursively is racy.

As the name implies, syscall.CreateToolhelp32Snapshot is a snapshot and will miss child processes created after the snapshot is taken, but before all the processes are killed. I've seen this happen in real life with C++ code that looked a lot like what @mattn linked to. Maybe @mattn's implementation avoids this, but I'm not sure.

I once had a need to ensure process trees were fully killed on Windows and the solution that worked for me was to use Windows Job Objects to get the OS to do all the hard work. That API is not currently wrapped by the standard library (golang/go#17608), but @alexbrainman has a package that has (almost?) everything we need at github.com/alexbrainman/ps (and already with a Go license).

Depending on exactly how we want this feature to work, github.com/alexbrainman/ps may need some additions. It is missing a wrapper for SetInformationJobObject which we might need to enable the setting of the JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE flag, but maybe we don't need that for dep.

@mattn
Copy link
Member

mattn commented Jul 20, 2017

AFAIK, SetInformationJobObject is not required. Simply create job handle with ps.CreateJobObject("", nil), and assign process with jo.AssignProcess(hProcess). To terminal job, jo.Terminate(0).

Note: As you know, using Job Object may remain files or directories that should be deleted.

@sdboyer sdboyer changed the title Use gentler termination semantics on subprocesses Use gentler subprocess termination semantics Jul 20, 2017
@sdboyer
Copy link
Member Author

sdboyer commented Jul 20, 2017

great discussion guys, thank you. i'm gonna ask we move it to #862, so that we can wrap up the UNIX-related parts here.

@sdboyer
Copy link
Member Author

sdboyer commented Jul 20, 2017

oh, no wonder i was uncomfortable with the ProcessStatus check - that creates data races!

@carolynvs
Copy link
Collaborator

@sdboyer I pulled down this PR and added 4 fmt.Println statements so I could tell which case was being selected in the process monitoring loop.

Copy link
Collaborator

@carolynvs carolynvs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was hoping that pure channel magic could have handled this, but I see why you used the atomic package.

I tested this out and beat the hell out of it. I was able to reproduce other bugs which aren't covered by this PR: the canary panic, existing locks, etc. But didn't have any trouble with linger git processes or corrupt caches.

@@ -90,6 +130,7 @@ func (c *monitoredCmd) combinedOutput(ctx context.Context) ([]byte, error) {
return c.stderr.buf.Bytes(), err
}

// FIXME(sdboyer) this is not actually combined output
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😂

// until a message comes through the channel indicating that the command has
// exited.
//
// TODO(sdboyer) if the signaling process errored (resulting in a
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's possible. Not sure what to do about it if a hard kill doesn't do its job. Seems like at that point it's out of our hands?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, i'd tend to agree. i still want to leave the TODO just as a reminder, and in the event that someone who knows more about this sorta thing can come along and say something definitive. but i agree, it seems mostly out of our hands for now.

@sdboyer
Copy link
Member Author

sdboyer commented Jul 21, 2017

awesome, thank you for testing! yeah, i maybe could have accomplished this just with channels, but it would've been tremendously awkward. i have no problem with using sync/atomic when it's called for, and this rather seemed like one of those situations. even though it makes the killProcess() signature kinda ugly, passing an *int32 and all.

merging...

@sdboyer sdboyer merged commit 3781a6f into golang:master Jul 21, 2017
@ibrasho
Copy link
Collaborator

ibrasho commented Jul 22, 2017

@sdboyer 👍

Did you have time to look at #779 ? That was a minor case of failure for the TestMonitoredCmd too.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
7 participants