Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PVF worker: look into removing unneeded dependencies #649

Closed
mrcnski opened this issue Apr 23, 2023 · 5 comments · Fixed by #2106
Closed

PVF worker: look into removing unneeded dependencies #649

mrcnski opened this issue Apr 23, 2023 · 5 comments · Fixed by #2106
Assignees

Comments

@mrcnski
Copy link
Contributor

mrcnski commented Apr 23, 2023

ISSUE

Overview

As part of #882, we want to block unneeded syscalls in the PVF workers. After #650 we will have separate worker binaries, whose object files we can analyze for included syscalls. However, right now a lot of unnecessary syscalls are found present in the object files.

This issue is about removing some big dependencies1 and seeing if it has an impact on syscalls or binary size, or if Rust is smart enough that it's already optimizing as much as it can. (Will make sure to try this in combination with enabling LTO.)

Footnotes

  1. i.e. we can remove the dependency on polkadot-node-core-pvf by moving that into e.g. host and common objects into common, so we have the following structure: pvf/host, pvf/worker, pvf/common. Then see how that affects binary size.

@mrcnski
Copy link
Contributor Author

mrcnski commented May 1, 2023

Auditing the dependencies, it looks like we can almost get rid of tokio, which would be a big win here, except that tracing-gum uses tokio heavily.

Since tracing-gum itself is a heavy dependency it would be nice to replace it, without actually removing logging altogether. Maybe we can send all logs back to the host over the socket, and have the host log them?1

And/or we can have simple text logs specific to the workers. cc @sandreim

Footnotes

  1. It wouldn't work for the threads (which don't do any logging right now anyway, and probably shouldn't), since we want to sandbox them and block socket calls.

mrcnski referenced this issue in paritytech/polkadot May 1, 2023
1. We were using `rayon` to spawn a superfluous thread to do execution, so it was removed.

2. We were using `rayon` to set a threadpool-specific thread stack size, and AFAIK we couldn't do that with `tokio` (it's possible [per-runtime](https://docs.rs/tokio/latest/tokio/runtime/struct.Builder.html#method.thread_stack_size) but not per-thread). Since we want to remove `tokio` from the workers [anyway](https://github.com/paritytech/polkadot/issues/7117), I changed it to spawn threads with the `std::thread` API instead of `tokio`.[^1]

[^1]: NOTE: This PR does not totally remove the `tokio` dependency just yet.

3. Since `std::thread` API is not async, we could no longer `select!` on the threads as futures, so the `select!` was changed to a naive loop.

4. The order of thread selection was flipped to make (3) sound (see note in code).

I left some TODO's related to panics which I'm going to address soon as part of #7045.
@sandreim
Copy link
Contributor

sandreim commented May 1, 2023

How much of a win is getting rid of tracing dependency in this case ? What is the cost of a replacement ? I agree that tracing-gum is not really needed here.

@mrcnski
Copy link
Contributor Author

mrcnski commented May 1, 2023

Don't have exact numbers, but tracing-gum seems to pull in quite a lot of dependencies when I run cargo tree -e normal, including networking (?). Getting rid of it and tokio would probably reduce the file size significantly and reduce the syscalls found in the binary. I plan to get some hard numbers once I get around to this issue.

What is the cost of a replacement ?

If we send logs over a socket, then the cost is just another socket I suppose. Alternatively we could reuse the same socket and use a prefix to detect if the message from the worker is a log or a response.

@slumber
Copy link
Contributor

slumber commented May 2, 2023

How much of a win is getting rid of tracing dependency in this case ? What is the cost of a replacement ? I agree that tracing-gum is not really needed here.

tracing-gum depends on jaeger, and it's quite heavy https://github.com/paritytech/polkadot/blob/master/node/jaeger/Cargo.toml

@bkchr
Copy link
Member

bkchr commented May 2, 2023

The dependency on jaeger is just one function, that could just be copied.

paritytech-processbot bot referenced this issue in paritytech/polkadot May 16, 2023
* PVF: Remove `rayon` and some uses of `tokio`

1. We were using `rayon` to spawn a superfluous thread to do execution, so it was removed.

2. We were using `rayon` to set a threadpool-specific thread stack size, and AFAIK we couldn't do that with `tokio` (it's possible [per-runtime](https://docs.rs/tokio/latest/tokio/runtime/struct.Builder.html#method.thread_stack_size) but not per-thread). Since we want to remove `tokio` from the workers [anyway](https://github.com/paritytech/polkadot/issues/7117), I changed it to spawn threads with the `std::thread` API instead of `tokio`.[^1]

[^1]: NOTE: This PR does not totally remove the `tokio` dependency just yet.

3. Since `std::thread` API is not async, we could no longer `select!` on the threads as futures, so the `select!` was changed to a naive loop.

4. The order of thread selection was flipped to make (3) sound (see note in code).

I left some TODO's related to panics which I'm going to address soon as part of #7045.

* PVF: Vote invalid on panics in execution thread (after a retry)

Also make sure we kill the worker process on panic errors and internal errors to
potentially clear any error states independent of the candidate.

* Address a couple of TODOs

Addresses a couple of follow-up TODOs from
#7153.

* Add some documentation to implementer's guide

* Fix compile error

* Fix compile errors

* Fix compile error

* Update roadmap/implementers-guide/src/node/utility/candidate-validation.md

Co-authored-by: Andrei Sandu <[email protected]>

* Address comments + couple other changes (see message)

- Measure the CPU time in the prepare thread, so the observed time is not
  affected by any delays in joining on the thread.

- Measure the full CPU time in the execute thread.

* Implement proper thread synchronization

Use condvars i.e. `Arc::new((Mutex::new(true), Condvar::new()))` as per the std
docs.

Considered also using a condvar to signal the CPU thread to end, in place of an
mpsc channel. This was not done because `Condvar::wait_timeout_while` is
documented as being imprecise, and `mpsc::Receiver::recv_timeout` is not
documented as such. Also, we would need a separate condvar, to avoid this case:
the worker thread finishes its job, notifies the condvar, the CPU thread returns
first, and we join on it and not the worker thread. So it was simpler to leave
this part as is.

* Catch panics in threads so we always notify condvar

* Use `WaitOutcome` enum instead of bool condition variable

* Fix retry timeouts to depend on exec timeout kind

* Address review comments

* Make the API for condvars in workers nicer

* Add a doc

* Use condvar for memory stats thread

* Small refactor

* Enumerate internal validation errors in an enum

* Fix comment

* Add a log

* Fix test

* Update variant naming

* Address a missed TODO

---------

Co-authored-by: Andrei Sandu <[email protected]>
@Sophia-Gold Sophia-Gold transferred this issue from paritytech/polkadot Aug 24, 2023
mrcnski added a commit that referenced this issue Oct 24, 2023
Starting the tokio runtime was calling `socketpair` and triggering the new
seccomp filter. Removed tokio since we wanted to do it soon anyway as part of
#649.
@mrcnski mrcnski moved this to In Progress in parachains team board Oct 31, 2023
@github-project-automation github-project-automation bot moved this from In Progress to Deployed on Kusama in parachains team board Nov 1, 2023
@github-project-automation github-project-automation bot moved this from In Progress to Done in PVF Security Hardening Nov 1, 2023
@mrcnski mrcnski moved this from Deployed on Kusama to Deployed on Polkadot in parachains team board Nov 7, 2023
@the-right-joyce the-right-joyce moved this from Deployed on Polkadot to Completed in parachains team board Nov 29, 2023
bkchr pushed a commit that referenced this issue Apr 10, 2024
* do not refetch completion data instantly

* clippy
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Completed
Development

Successfully merging a pull request may close this issue.

5 participants