Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure child creation time is after parent creation time #1

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

nebel
Copy link

@nebel nebel commented Oct 1, 2023

Ensure child creation time is after parent creation time to avoid linking unrelated processes into the kill chain (caused by reused PPids).

See jesseduffield/lazygit#3008

I've marked this as a draft because I've never used go before and wasn't able to test with lazygit, but running this code in isolation seemed to fix the issue for me.

if err != nil {
return
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be a defer CloseHandle(handle) here? I think this might leak otherwise.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this! I added a call to closeHandle(HANDLE(handle)) to reuse the existing function. I also noticed that windows.CloseHandle is available with the new import, though it can also return an error. Not sure if that way would be better or not.

Copy link

@Jim-with-a-J Jim-with-a-J Oct 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not even a beginner with Go, so I'm not qualified to comment on best practice :) The only thing that stands out to me is that you never return anything explicitly, while the rest of the code does. I assume that's purely a stylistic choice and Go does something sensible with the returns (editing previous dumb comment, does it create local variables with the names in the trailing part of the function definition and return those?)

Copy link
Author

@nebel nebel Oct 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume that's purely a stylistic choice and Go does something sensible with the returns (editing previous dumb comment, does it create local variables with the names in the trailing part of the function definition and return those?)

Yes, I was also surprised at how this works but the names before the types in the function return type are local variables which will be used as return values when doing a blank return. Seems to be commonly used to avoid having to supply the full (ok, err) tuple at each return, and is used by GetProcs in this code for example. But I'm certainly not qualified to comment on best practices either.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for the record I'm not a fan of naked returns (when you specify the variable name as part of the function signature) because they confuse everybody (myself included) who comes across them. I am aware this is my repo so it's strange that the GetProcs function uses naked returns but in truth I hadn't formed this opinion before the PR was originally made (it wasn't made by me)

@jesseduffield
Copy link
Owner

I'm not much of a windows guy (as you can tell by the fact that the existing code kills random processes) but what are your thoughts on the approach taken here? https://gist.github.com/hallazzang/76f3970bfc949831808bbebc8ca15209?permalink_comment_id=2948162#gistcomment-2948162

@Jim-with-a-J
Copy link

I'm not much of a windows guy (as you can tell by the fact that the existing code kills random processes) but what are your thoughts on the approach taken here? https://gist.github.com/hallazzang/76f3970bfc949831808bbebc8ca15209?permalink_comment_id=2948162#gistcomment-2948162

Using job objects does seem a lot cleaner - children are added to the job by default if the parent is, and you don't have to walk through the processes yourself. I had a go (ha!) at doing something like this, but I kept looking for a way to use cmd.process.handle, which is not accessible (I see the gist code "cheats" with unsafe operations to use that, while @nebel takes a more API-appropriate path and gets the handle again). It's a bit unfortunate that there needs to be some co-operation between the CreateProcess code and the killing code to support this though. With the spawning and killing handled in separate modules, it seems a bit ugly.

I also wonder if it's possible to have a race condition this way - if a child is spawned before the new process gets added to the job group, it might never get terminated. With the time it takes to spawn a process, this is probably not a huge concern though.

My thought was to start the process suspended until the job object was created and the new process added, and also use the PID to generate a job name (e.g. jduffield_kill_[pid]) so at least the kill interface didn't need to change. It could try to open a job object with that name, and if it failed fall back on the existing behaviour.

@nebel
Copy link
Author

nebel commented Oct 2, 2023

@Jim-with-a-J pretty much summed up my thoughts as well. I actually did test the code in the gist you linked, @jesseduffield, and confirmed that it does work--with the rather large caveat that to kill the process tree we don't actually invoke Kill (which works the same as usual for processes even in such a job) but rather need to call windows.CloseHandle on the job handle (which will kill the entire tree).

In addition to the race condition Jim mentioned (which I was also concerned about, however unlikely it seems) there is also the fact that in Windows 7 and earlier a process can't be in multiple jobs. I don't know if we're concerned about supporting Windows 7 but for example if lazygit itself was in a job (which is actually quite possible as the terminal could be in a job), trying to add an additional job to a child process of lazygit would fail. So in that case we might need to fall back to the old kill method.

But my main concern is just the overhead in additional code of doing things the correct way, as Jim mentioned. Probably we would need to wrap the mechanisms for both launching (to set the job) and killing (to close the job handle) in OS-specific code, and pass around e.g. a WrappedCmd which contains the necessary job handle on Windows. It's certainly not impossible but it's no longer a library-only change and would necessitate some larger changes in lazygit itself.

That said, I could try to implement it if you're interested. Being a novice at Go I don't know if my design would be the best, though, and it's also possible we'd need to keep this time-checking tree-diving version in here as a fallback for when job creation fails for some reason.

Let me know what you think.

@Jim-with-a-J
Copy link

Jim-with-a-J commented Oct 2, 2023

Is there any way to know how many other projects are using the kill library? Since co-operation is needed from the calling code (and there's still the potential for the job object approach to fail), I think probably this approach using the running time should still be added as the default behaviour. Any client that doesn't want to/can't immediately switch to using job objects doesn't have to, but they still benefit from not killing unrelated processes, and some extra time can be taken to figure out the best way to introduce job objects for Windows.

@nebel
Copy link
Author

nebel commented Oct 3, 2023

Here's an example of what an API for using jobs to kill the process tree might look like:

https://gist.github.com/nebel/ff407aa88e5a7b261f5d3c673340a884

It will fall back to the current kill approach if job creation or adding processes to a job fails.

Running the test will run cmd.exe from which you can run notepad.exe manually to check how things are being killed. And you can uncomment out various kill strategies here:

	//go waitKill(wrapped.cmd)
	//go waitKillTree(wrapped.cmd)
	go waitTerminate(&wrapped)

From my testing both waitKillTree and waitTerminate will terminate both processes while waitKill (simple kill) will not, for example.

I'm not sure what the non-Windows alternative would do. Should we set Setpgid? I noticed the current function to do that in the Linux side is not used, but I don't know why.

I will say though that I still don't particularly like how unwieldy this approach is, but if it's preferable then I think it could be used. Again though we'd still need to fix the existing kill method as we probably need it as a fallback.

@jesseduffield
Copy link
Owner

Great progress! Sorry, I haven't had much time to dive into the details here. But I have just now done some more googling to see if we've missed anything. Some assorted links:

Go issue about support windows job groups:
golang/go#17608

How buildkite handles killing child processes in windows:
buildkite/agent#879

Pulumi dealing with the same issue:
pulumi/pulumi#8696

Situation in lazygit where we failed to kill processes due to a race condition:
jesseduffield/lazygit#1265 (comment)

Also for the record I'm happy to ignore windows 7; I doubt many people are still using it. I'm also willing to require a change in lazygit in order to have the best outcome, though it is regrettable.

@nebel
Copy link
Author

nebel commented Oct 11, 2023

Thanks for researching. I took a look at those issues. At first the buildkite method seemed good since it works based on the PID alone (no need for wrapping) but after testing it, it seems like it relies on all subprocesses reacting to the break and shutting themselves down, which may work but I don't know if it's correct to rely on it. It's really more about broadcasting a signal to so- called console processes rather than actually killing them.

The pulumi situation is also a little different in that they control their subprocesses so they just ended up polling them via gRPC I guess.

At this point I feel like the job solution with a fallback of killing the tree "manually" (but checking process creation time) is probably still the most robust solution, even though it does mean having a wrapper for cmd.

I also saw this library https://github.com/kolesnikovae/go-winjob which has examples in which they take the extra step of starting a process suspended then resuming after the job is set, in order to guarantee it can't exit or spawn new processes before the job is assigned. This is probably an even more robust approach, but also increases the complexity further (not so much in the suspend part but certainly in the resume part).

@nebel
Copy link
Author

nebel commented Oct 11, 2023

I would also like to add that this PR in its current state, while not a perfect solution, should dramatically improve the upon the current situation. At least it should make the "killing unrelated long-running processes" problem extremely unlikely--without making any other existing problems any worse.

@sibouras
Copy link

any news on this

@nebel
Copy link
Author

nebel commented Oct 4, 2024

Note that I have been using a custom build including these changes on Windows for several months without issue.

@nebel nebel marked this pull request as ready for review October 4, 2024 05:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants