-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Linux-specific pidfd process extensions #77168
Conversation
r? @dtolnay (rust_highfive has picked a reviewer for you, use r? to override) |
This currently uses my fork of |
/// A pidfd will only be created if it is possible to do so | ||
/// in a guaranteed race-free manner (e.g. if the `clone3` system call is | ||
/// supported). Otherwise, [`ChildExit::pidfd`] will return an error. | ||
fn create_pidfd(&mut self, val: bool) -> &mut process::Command; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does the user need to request it explicitly? As long as clone3 is available is there any cost to always obtaining it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Always opening a pidfd would cause it to be leaked, since we don't ever close it in the standard library. This would eventually exhaust all of the file descriptors for the process with leaked pidfds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, it could be closed when the Child
is dropped and one would have to take
it out if one wants to keep using it, but yeah that would indeed be an extra cost.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think ownership and moving would be appropriate here since someone has to be responsible for closing it and if one can retrieve copies it becomes unclear who is responsible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Without seeing this discussion, I would presume that Process owns the pidfd and will close it on drop.
- The pidfd should be closed when clone3 succeeds but exec fails, since a user has no ability to do so on their own, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without seeing this discussion, I would presume that Process owns the pidfd and will close it on drop.
If we do that, then I think we should add a way to prevent this behavior (other then calling mem::forget
on the Child
). Otherwise, users who just need the pidfd will be forced to potentially leak memory (if Child
ever heap allocates internally), or keep a Child
around when it's not actually needed.
The pidfd should be closed when clone3 succeeds but exec fails, since a user has no ability to do so on their own, right?
The pidfd only exists in the parent process. Once clone3
succeeds, the pidfd is perfectly useable, regardless of what happens in the child process from that point on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pidfd only exists in the parent process. Once clone3 succeeds, the pidfd is perfectly useable, regardless of what happens in the child process from that point on.
My point was that spawn returns an error in that case, so the user cannot access pidfd.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we do that, then I think we should add a way to prevent this behavior (other then calling mem::forget on the Child).
To keep things RAII we should introduce a new wrapper type akin to other fd-wrapping IO types such as File
, TcpStream
and so on. E.g. struct ProcessDescriptor(RawFd)
which could implement Drop
.
Then Child
could have
take_pidfd(&mut self) -> ProcessDescriptor
pidfd(&self) -> &ProcessDescriptor
ProcessDescriptor
then could implement IntoRawFd
and AsRawFd
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My point was that spawn returns an error in that case, so the user cannot access pidfd.
Oh, that's a good point. I'll fix that.
/// some other error occured. | ||
/// | ||
/// See `man pidfd_open` for more details about pidfds. | ||
fn pidfd(&self) -> Result<i32>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not Result<RawFd>
?
// or if the pidfd could not be created for some reason | ||
// (e.g. the `clone3` syscall was not available). | ||
#[cfg(target_os = "linux")] | ||
pidfd: libc::c_int, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider using RawFd
for this.
I'd love to have this! Thanks for implementing it. |
let args_ptr = &mut args as *mut clone_args; | ||
let args_size = crate::mem::size_of::<clone_args>(); | ||
|
||
let res = cvt(unsafe { clone3(args_ptr, args_size) }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This won't run atfork
callbacks. This is for example used by rand
to force all ReseedingRng
instances to reseed themselves. (https://github.com/rust-random/rand/blob/42247d8616b0f8fb62d8a5a6e8dd8f6a17e1b3eb/src/rngs/adapter/reseeding.rs#L287) The rand
usage is not relevant here, but there may be other cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had thought that posix_spawnnp
didn't run atfork
handlers, but it looks like I was mistaken.
This will require some further thought.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason that I used __clone3
is that glibc's clone()
implementation requires the caller to allocate a stack for the new process (even though the kernel allows passing a null stack pointer to the raw clone
syscall). We could allocate a stack ourself, but I think it would be better to avoid that if at all possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, the posix_spawn
man page states:
According to POSIX, it is unspecified whether fork handlers
established with pthread_atfork(3) are called when posix_spawn() is
invoked. Since glibc 2.24, the fork handlers are not executed in any
case. On older implementations, fork handlers are called only if the
child is created using fork(2).
so I think this should be fine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is some extensive discussion going on in glibc upstream about handling and wrapping clone3. See:
On Tue, Sep 29, 2020 at 10:59:21AM -0700, Aaron Hill wrote:
@Aaron1011 commented on this pull request.
> + pidfd: &mut pidfd as *mut libc::pid_t as u64,
+ child_tid: 0,
+ parent_tid: 0,
+ exit_signal: libc::SIGCHLD as u64,
+ stack: 0,
+ stack_size: 0,
+ tls: 0,
+ set_tid: 0,
+ set_tid_size: 0,
+ cgroup: 0
+ };
+
+ let args_ptr = &mut args as *mut clone_args;
+ let args_size = crate::mem::size_of::<clone_args>();
+
+ let res = cvt(unsafe { clone3(args_ptr, args_size) });
The reason that I used `__clone3` is that glibc's `clone()` implementation requires the caller to allocate a stack for the new process (even though the kernel allows passing a null stack pointer to the raw `clone` syscall). We could allocate a stack ourself, but I think it would be better to avoid that if at all possible.
There's an ongoing discussion in the glibc community (which included a
session in Linux Plumbers Conference this year) about handling clone3
and having it integrate better with glibc, including signal handling,
atfork, and similar.
|
@@ -267,7 +345,6 @@ impl Command { | |||
#[cfg(any( | |||
target_os = "macos", | |||
target_os = "freebsd", | |||
all(target_os = "linux", target_env = "gnu") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we continue using posix_spawn on GNU/Linux, while adding pidfd to a list of exceptions? It should end up using CLONE_VFORK & CLONE_VM, and looking at various benchmarks it can make substantial performance difference for some use-cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The clone3 implementation could use CLONE_VFORK to achieve the same.
For reference, Go seems to pull it off while only having to jump though one hoop to deal with the shared memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using posix_spawn
would require us to create the pidfd in the child process, and then use a Unix domain socket to transfer it to the parent. Using clone3
with the proper flags seems much simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When pidfd is not used, so there should be no impact on the implementation. See #77455 for an additional motivation.
☔ The latest upstream changes (presumably #77462) made this pull request unmergeable. Please resolve the merge conflicts. Note that reviewers usually do not review pull requests until merge conflicts are resolved! Once you resolve the conflicts, you should change the labels applied by bors to indicate that your PR is ready for review. Post this as a comment to change the labels:
|
Background: Over the last year, pidfd support was added to the Linux kernel. This allows interacting with other processes. In particular, this allows waiting on a child process with a timeout in a race-free way, bypassing all of the awful signal-handler tricks that are usually required. Pidfds can be obtained for a child process (as well as any other process) via the `pidfd_open` syscall. Unfortunately, this requires several conditions to hold in order to be race-free (i.e. the pid is not reused). Per `man pidfd_open`: ``` · the disposition of SIGCHLD has not been explicitly set to SIG_IGN (see sigaction(2)); · the SA_NOCLDWAIT flag was not specified while establishing a han‐ dler for SIGCHLD or while setting the disposition of that signal to SIG_DFL (see sigaction(2)); and · the zombie process was not reaped elsewhere in the program (e.g., either by an asynchronously executed signal handler or by wait(2) or similar in another thread). If any of these conditions does not hold, then the child process (along with a PID file descriptor that refers to it) should instead be created using clone(2) with the CLONE_PIDFD flag. ``` Sadly, these conditions are impossible to guarantee once any libraries are used. For example, C code runnng in a different thread could call `wait()`, which is impossible to detect from Rust code trying to open a pidfd. While pid reuse issues should (hopefully) be rare in practice, we can do better. By passing the `CLONE_PIDFD` flag to `clone()` or `clone3()`, we can obtain a pidfd for the child process in a guaranteed race-free manner. This PR: This PR adds Linux-specific process extension methods to allow obtaining pidfds for processes spawned via the standard `Command` API. Other than being made available to user code, the standard library does not make use of these pidfds in any way. In particular, the implementation of `Child::wait` is completely unchanged. Two Linux-specific helper methods are added: `CommandExt::create_pidfd` and `ChildExt::pidfd`. These methods are intended to serve as a building block for libraries to build higher-level abstractions - in particular, waiting on a process with a timeout. I've included a basic test, which verifies that pidfds are created iff the `create_pidfd` method is used. This test is somewhat special - it should always succeed on systems with the `clone3` system call available, and always fail on systems without `clone3` available. I'm not sure how to best ensure this programatically. This PR relies on the newer `clone3` system call to pass the `CLONE_FD`, rather than the older `clone` system call. `clone3` was added to Linux in the same release as pidfds, so this shouldn't unnecessarily limit the kernel versions that this code supports. Unresolved questions: * What should the name of the feature gate be for these newly added methods? * Should the `pidfd` method distinguish between an error occurring and `create_pidfd` not being called?
ad0da42
to
b3dfcc5
Compare
/// Os-specific extensions to [`process::Child`] | ||
/// | ||
/// [`process::Child`]: crate::process::Child | ||
pub trait ChildExt { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please make this a sealed trait (like https://doc.rust-lang.org/std/slice/trait.SliceIndex.html) so it cannot be implemented outside of the standard library and we are free to add methods.
/// Os-specific extensions to [`process::Command`] | ||
/// | ||
/// [`process::Command`]: crate::process::Command | ||
pub trait CommandExt { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please make this a sealed trait.
@@ -0,0 +1,47 @@ | |||
//! Linux-specific extensions to primitives in the `std::process` module. | |||
|
|||
#![unstable(feature = "linux_pidfd", issue = "none")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Open a tracking issue and update this attribute before landing.
Putting a review comment as a reminder; doesn't need to be done now if the PR is not in shape to land yet.
The public API of Reassigning to r? @joshtriplett to help review the implementation. |
Co-authored-by: bjorn3 <[email protected]>
if HAS_CLONE3.load(Ordering::Relaxed) { | ||
let mut flags = 0; | ||
if self.make_pidfd { | ||
flags |= CLONE_PIDFD; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the error path for "the caller asked for a pidfd but the kernel doesn't support clone3"?
(In theory we could have a fallback that tries pidfd_open
, but that's less safe for a variety of reasons, so let's just not.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the Some(libc::ENOSYS) =>
branch below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did see that; I just missed that you'd initialized pidfd
to -1 at the top, so that it's safe to always return it. That seems fine.
OK, this LGTM. r=me when you're ready.
@Aaron1011 Can you please make the sealed-trait changes @dtolnay requested, as well as opening the tracking issue? |
@joshtriplett Sorry for the delay - I'll do that today or tomorrow. |
On Fri, Oct 23, 2020 at 02:17:41PM -0700, Aaron Hill wrote:
@joshtriplett Sorry for the delay - I'll do that today or tomorrow.
Not rushing you, just wanted to make sure the next steps were all in one
place at the bottom of the issue.
|
Additional open issues:
|
Triage: What's the current status of this? |
@Aaron1011 Ping from traige. I'm closing this due to inactivity. Feel free to reopen or create a new PR when you have time to work on this again, thanks! |
Add Linux-specific pidfd process extensions (take 2) Continuation of rust-lang#77168. I addressed the following concerns from the original PR: - make `CommandExt` and `ChildExt` sealed traits - wrap file descriptors in `PidFd` struct representing ownership over the fd - add `take_pidfd` to take the fd out of `Child` - close fd when dropped Tracking Issue: rust-lang#82971
Add Linux-specific pidfd process extensions (take 2) Continuation of rust-lang#77168. I addressed the following concerns from the original PR: - make `CommandExt` and `ChildExt` sealed traits - wrap file descriptors in `PidFd` struct representing ownership over the fd - add `take_pidfd` to take the fd out of `Child` - close fd when dropped Tracking Issue: rust-lang#82971
Add Linux-specific pidfd process extensions (take 2) Continuation of rust-lang#77168. I addressed the following concerns from the original PR: - make `CommandExt` and `ChildExt` sealed traits - wrap file descriptors in `PidFd` struct representing ownership over the fd - add `take_pidfd` to take the fd out of `Child` - close fd when dropped Tracking Issue: rust-lang#82971
Add Linux-specific pidfd process extensions (take 2) Continuation of rust-lang#77168. I addressed the following concerns from the original PR: - make `CommandExt` and `ChildExt` sealed traits - wrap file descriptors in `PidFd` struct representing ownership over the fd - add `take_pidfd` to take the fd out of `Child` - close fd when dropped Tracking Issue: rust-lang#82971
Add Linux-specific pidfd process extensions (take 2) Continuation of rust-lang#77168. I addressed the following concerns from the original PR: - make `CommandExt` and `ChildExt` sealed traits - wrap file descriptors in `PidFd` struct representing ownership over the fd - add `take_pidfd` to take the fd out of `Child` - close fd when dropped Tracking Issue: rust-lang#82971
Add Linux-specific pidfd process extensions (take 2) Continuation of rust-lang#77168. I addressed the following concerns from the original PR: - make `CommandExt` and `ChildExt` sealed traits - wrap file descriptors in `PidFd` struct representing ownership over the fd - add `take_pidfd` to take the fd out of `Child` - close fd when dropped Tracking Issue: rust-lang#82971
Background:
Over the last year, pidfd support was added to the Linux kernel. This
allows interacting with other processes. In particular, this allows
waiting on a child process with a timeout in a race-free way, bypassing
all of the awful signal-handler tricks that are usually required.
Pidfds can be obtained for a child process (as well as any other
process) via the
pidfd_open
syscall. Unfortunately, this requiresseveral conditions to hold in order to be race-free (i.e. the pid is not
reused).
Per
man pidfd_open
:Sadly, these conditions are impossible to guarantee once any libraries
are used. For example, C code runnng in a different thread could call
wait()
, which is impossible to detect from Rust code trying to open apidfd.
While pid reuse issues should (hopefully) be rare in practice, we can do
better. By passing the
CLONE_PIDFD
flag toclone()
orclone3()
, wecan obtain a pidfd for the child process in a guaranteed race-free
manner.
This PR:
This PR adds Linux-specific process extension methods to allow obtaining
pidfds for processes spawned via the standard
Command
API. Other thanbeing made available to user code, the standard library does not make
use of these pidfds in any way. In particular, the implementation of
Child::wait
is completely unchanged.Two Linux-specific helper methods are added:
CommandExt::create_pidfd
and
ChildExt::pidfd
. These methods are intended to serve as a buildingblock for libraries to build higher-level abstractions - in particular,
waiting on a process with a timeout.
I've included a basic test, which verifies that pidfds are created iff
the
create_pidfd
method is used. This test is somewhat special - itshould always succeed on systems with the
clone3
system callavailable, and always fail on systems without
clone3
available. I'mnot sure how to best ensure this programatically.
This PR relies on the newer
clone3
system call to pass theCLONE_FD
,rather than the older
clone
system call.clone3
was added to Linuxin the same release as pidfds, so this shouldn't unnecessarily limit the
kernel versions that this code supports.
Unresolved questions:
methods?
pidfd
method distinguish between an error occurringand
create_pidfd
not being called?