Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

PVF worker: Mitigate process leaks #7296

Open
wants to merge 3 commits into
base: master
Choose a base branch
from
Open
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
34 changes: 26 additions & 8 deletions node/core/pvf/common/src/worker.rs
Original file line number Diff line number Diff line change
Expand Up @@ -82,6 +82,15 @@ pub fn worker_event_loop<F, Fut>(
F: FnMut(UnixStream) -> Fut,
Fut: futures::Future<Output = io::Result<Never>>,
{
// Use `PR_SET_PDEATHSIG` to ensure that the child is sent a kill signal when the parent dies.
//
// NOTE: This technically has a race as the parent may have already died. In that case we will
// fail to read the socket later and just shutdown.
#[cfg(linux)]
if libc::prctl(libc::PR_SET_PDEATHSIG, libc::SIGKILL, 0, 0, 0) != 0 {
return
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning: the "parent" in this case is considered to be the
              thread that created this process.  In other words, the
              signal will be sent when that thread terminates (via, for
              example, [pthread_exit(3)](https://man7.org/linux/man-pages/man3/pthread_exit.3.html)), rather than after all of the
              threads in the parent process terminate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AKA, this isn't doing what you think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, hmm, I think since threads have their own pid this still addresses the race identified in this PR, I think.

// NOTE: On non-Linux this has a race condition between getting the pid and sending the
// signal -- the parent may have died and another process been assigned the same pid. On
// Linux this is not a problem -- we use `PR_SET_PDEATHSIG` to ensure that the child is sent
// a kill signal when the parent dies.
// [...]
let ppid = libc::getppid();
if ppid > 1 {
    libc::kill(ppid, libc::SIGTERM);
}

So the issue moves up a level, i.e. it becomes about avoiding leaked threads in the parent as opposed to leaked/orphaned child processes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure which race you now mean, the one in the comment above or that we don't kill workers on node exit?

Copy link
Contributor Author

@mrcnski mrcnski May 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean the race between getting the pid and sending the kill signal is addressed even with the caveat you posted from man.

And as you pointed out threads will exit when the process does, so the caveat doesn't actually change anything. So I think PR_SET_PDEATHSIG works.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still somewhat equivalent, pvf-future is spawned on blocking pool, it can only be terminated along with a thread from the pool.

Why do we spawn this on a blocking pool?

But yeah it would need to be documented that this is a hard requirement now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @slumber I'm not really following - maybe my brain is running slow, can you explain more? I'm also not sure we should document implementation details too much, PR_SET_PDEATHSIG should work regardless of how the host is implemented.

I also removed some of the comments for now. Maybe documenting such rare corner cases was overkill, it just leads to noise.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we spawn this on a blocking pool?

I honestly don't know.

can you explain more?

Basti quoted documentation mentioning that your prctl call will terminate worker once parent thread terminates, not the node itself. Suppose pvf-future spawns worker and moves to a different tokio thread (tokio workers steal work). This thread terminating would mean that a worker would get killed, and, as Basti mentioned, you're not doing what you wanted to.

Then I noted that pvf future in fact is spawned on a blocking pool. Essentially, tokio::spawn_blocking(block_on(pvf_host)). Given this, your PR works. This thread terminating is equivalent of terminating pvf host, which is sufficient for worker termination. If we replace spawn_blocking with spawn, your PR doesn't work anymore, this should be documented.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, you are correct. I'm not sure how I feel about relying on it being a blocking pool, since that fact can change.

Does work-stealing mean that the thread that spawned work can actually die while the work is still active? That would surprise me since it seems unsound (precisely for cases like PR_SET_PDEATHSIG), but TBH I don't know how tokio's scheduler works under the hood.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Work stealing means each worker thread has its own queue and they steal tasks from each other, "work stealing" is a term used by golang scheduler and tokio, you can read about it.

The main point is that task can move to another thread. For example, pvf future is on thread A and blocked, thread A steals another future and panics.


let worker_pid = std::process::id();
gum::debug!(target: LOG_TARGET, %worker_pid, "starting pvf worker ({})", debug_id);

Expand All @@ -93,7 +102,7 @@ pub fn worker_event_loop<F, Fut>(
%worker_pid,
"Node and worker version mismatch, node needs restarting, forcing shutdown",
);
kill_parent_node_in_emergency();
kill_parent_node_in_emergency(worker_pid);
mrcnski marked this conversation as resolved.
Show resolved Hide resolved
let err: io::Result<Never> =
Err(io::Error::new(io::ErrorKind::Unsupported, "Version mismatch"));
gum::debug!(target: LOG_TARGET, %worker_pid, "quitting pvf worker({}): {:?}", debug_id, err);
Expand Down Expand Up @@ -174,20 +183,29 @@ pub fn stringify_panic_payload(payload: Box<dyn Any + Send + 'static>) -> String
}
}

/// In case of node and worker version mismatch (as a result of in-place upgrade), send `SIGTERM`
/// to the node to tear it down and prevent it from raising disputes on valid candidates. Node
/// restart should be handled by the node owner. As node exits, unix sockets opened to workers
/// get closed by the OS and other workers receive error on socket read and also exit. Preparation
/// jobs are written to the temporary files that are renamed to real artifacts on the node side, so
/// no leftover artifacts are possible.
fn kill_parent_node_in_emergency() {
/// In case of node and worker version mismatch (as a result of in-place upgrade or
/// incorrectly-built binaries), send `SIGTERM` to the node to tear it down and prevent it from
/// raising disputes on valid candidates. Node restart should be handled by the node owner. As node
/// exits, unix sockets opened to workers get closed by the OS and other workers receive error on
/// socket read and also exit. Preparation jobs are written to the temporary files that are renamed
/// to real artifacts on the node side, so no leftover artifacts are possible.
fn kill_parent_node_in_emergency(worker_pid: u32) {
unsafe {
// NOTE: On non-Linux this has a race condition between getting the pid and sending the
// signal -- the parent may have died and another process been assigned the same pid. On
// Linux this is not a problem -- we use `PR_SET_PDEATHSIG` to ensure that the child is sent
// a kill signal when the parent dies.
//
// SAFETY: `getpid()` never fails but may return "no-parent" (0) or "parent-init" (1) in
// some corner cases, which is checked. `kill()` never fails.
let ppid = libc::getppid();
if ppid > 1 {
libc::kill(ppid, libc::SIGTERM);
} else {
gum::error!(target: LOG_TARGET, %worker_pid, "unexpected ppid {}", ppid);
}

libc::kill(worker_pid as i32, libc::SIGKILL);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then why do it? Not saying we should not, but there should be some reason.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea was just to be explicit, and so we don't rely on coincidental architectural details that may change.

But, maybe we should instead have it be explicit that the worker dies when it can't read from the socket. The parent may also die for some reason other than us killing it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say, if we already die when read fails, let's leave it at that, but document that error handling accordingly.

}
}

Expand Down