Support running persistent workers remotely #10091

tsiq-charliem · 2019-10-23T17:05:23Z

Description of the problem / feature request:

Looking for support for running persistent workers on remote hosts.

Feature requests: what underlying problem are you trying to solve with this feature?

Currently, actions can be run with the remote strategy or the worker strategy, but we'd like a way to get the benefits of a persistent worker on a remote build. Without this, our local builds with persistent workers outperform remote builds.

What operating system are you running Bazel on?

Ubuntu 16.04

What's the output of `bazel info release`?

1.0.0

irengrig · 2019-10-28T09:15:24Z

/cc @buchgr

buchgr · 2019-10-28T09:38:39Z

@tsiq-charliem there's the dynamic spawn scheduler (https://blog.bazel.build/2019/02/01/dynamic-spawn-scheduler.html), which will get you the best of both.

We currently have no plans to work on worker support for remote execution.

meisterT · 2020-08-04T09:52:47Z

Closing as we don't plan to add persistent worker support for remote execution.

ulfjack · 2020-08-12T11:50:32Z

I have a patch for this. It's pretty small, so it may be acceptable to check it in?

meisterT · 2020-08-12T11:52:29Z

Interesting, what does that look like?

ulfjack · 2020-08-12T11:59:54Z

Patch is here: https://github.com/ulfjack/bazel/tree/remote-persistent-worker

meisterT · 2020-08-12T12:15:46Z

cc @coeuvre @larsrc-google

Add a new --experimental_remote_mark_tool_inputs flag, which makes Bazel tag tool inputs when executing actions remotely, and also adds a tools input key to the platform proto sent as part of the remote execution request. This allows a remote execution system to implement persistent workers, i.e., to keep worker processes around and reuse them for subsequent actions. In a trivial example, this improves build performance by ~3x. We use "persistentWorkerKey" for the platform property, with the value being a hash of the tool inputs, and "bazel_tool_input" as the node property name, with an empty string as value (this is just a boolean tag). Implements bazelbuild#10091. Change-Id: Iccb36081fee399855be7c487c2d4091cb36f8df3
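To make the commit message above concrete, here is a minimal sketch of how a client could derive the two annotations it describes. Only the property names "persistentWorkerKey" and "bazel_tool_input" come from the commit message. The function names, the dict-based request shape, and the choice to hash both paths and content digests with SHA-256 are assumptions for illustration; the thread does not settle what exactly goes into the hash.

```python
import hashlib


def persistent_worker_key(tool_inputs):
    """Hash the tool inputs into a single hex key.

    Assumption: we hash sorted (path, digest) pairs with SHA-256.
    The actual inputs and hash function used by Bazel may differ.
    """
    h = hashlib.sha256()
    for path, digest in sorted(tool_inputs.items()):
        h.update(path.encode("utf-8"))
        h.update(b"\0")
        h.update(digest.encode("utf-8"))
        h.update(b"\0")
    return h.hexdigest()


def annotate_request(all_inputs, tool_paths):
    """Return (platform_properties, node_properties) for a hypothetical request.

    all_inputs maps input path -> content digest; tool_paths is the set of
    paths that belong to the worker tool.
    """
    tool_inputs = {p: d for p, d in all_inputs.items() if p in tool_paths}
    platform = {"persistentWorkerKey": persistent_worker_key(tool_inputs)}
    # The node property is a boolean tag, so its value is the empty string.
    node_props = {p: {"bazel_tool_input": ""} for p in tool_inputs}
    return platform, node_props
```

With this shape, two actions that share the same tool inputs but have different source inputs produce the same key, which is what lets a server route them to the same warm worker process.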

ulfjack · 2020-10-15T21:47:25Z

Support for remote persistent workers is one of our most requested features and we've seen significant performance improvements in some real-world scenarios with proprietary codebases. I've rebased my change to HEAD, but I still need to add some tests.

bergsieker · 2020-10-16T19:57:39Z

Your patch seems reasonable as far as adding the tool signature to the Platform. As far as I read it, it's just adding the hash of the tool paths, not trying to get a stronger signature like the digest of the binaries, right? It's possible that splitting into multiple cache keys, one for each referenced tool, might be desirable because it gives the server more scheduling flexibility in terms of prioritizing one tool over another, but I'm not sure how frequent multi-tool actions are so it may not make much difference in practice.

I'm a little skeptical of adding built-in support for this in Bazel without understanding what a workable server implementation looks like. When we've noodled around on this in the past, it's been hard to come up with something that was safe, flexible enough to handle varying workloads and multiple tools, and that provided reasonable affordances for debugging. Does this ultimately break down to per-worker pool targeting, combined with some server functionality to keep the tools up, allow for resets, etc.?

ulfjack · 2020-10-17T21:36:24Z

Bazel already supports workers with a single worker 'tool' with a specific API (actually, there are two APIs - the vanilla API and the multiplex API). This PR only annotates the remote execution requests with just enough information to be able to implement the same API remotely. Note that it is safe for a server to ignore this information, and just continue as usual. Also note that this is behind an experimental flag.

There are any number of ways to implement this on the server-side. Per-worker pool is one way, although that doesn't seem very appealing to me. Generally speaking, we have found it straightforward to keep track of the most recent 'persistent worker key' for each worker and assign actions to a matching worker if possible.
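As a rough illustration of the "most recent key per worker" approach, the sketch below prefers an idle worker whose last key matches the incoming action and otherwise falls back to any idle worker. The `Worker` class and the `pick_worker` and `assign` functions are hypothetical names, not part of any real remote execution server, and a real scheduler would also have to handle worker process lifecycle and failures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Worker:
    name: str
    last_key: Optional[str] = None  # most recent persistent worker key seen
    busy: bool = False


def pick_worker(workers, key):
    """Pick an idle worker, preferring one whose last key matches."""
    idle = [w for w in workers if not w.busy]
    if not idle:
        return None
    for w in idle:
        if key is not None and w.last_key == key:
            return w
    # No match: prefer a worker without a warm process, so that warm
    # workers for other keys are not evicted unnecessarily.
    for w in idle:
        if w.last_key is None:
            return w
    return idle[0]


def assign(workers, key):
    """Assign an action; returns (worker_name, reused) or None if all are busy."""
    w = pick_worker(workers, key)
    if w is None:
        return None
    reused = key is not None and w.last_key == key
    w.last_key = key
    w.busy = True
    return w.name, reused
```

A server that ignores the key entirely still behaves correctly under this scheme; the key only changes which idle worker is chosen.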

It may be necessary to overprovision worker resources to give the scheduler sufficient leeway in assigning actions to workers. Certainly, a first-come-first-served scheduler will struggle if there is queueing, as it won't be able to make meaningful decisions. However, a scheduler could also delay actions (say, for a few hundred ms) or reorder the first few queue entries to generate more options. In the delay case, there is a chance that a better-matching worker instance becomes available during that time. In the reordering case, there is a chance that a better-matching action is near the front of the queue (though this probably requires a safeguard to prevent actions from being skipped indefinitely).
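The reordering idea can be sketched as follows: when a worker becomes free, look at the first few queue entries for one with a matching key, but stop skipping the head once it has been passed over too often. The function name, the `window` and `max_skips` parameters, and the dict-based queue entries are all assumptions made for this example; the delay variant is not shown.

```python
from collections import deque


def pick_action(queue, worker_key, window=3, max_skips=2):
    """Pop the next action for a free worker with the given key.

    queue is a deque of dicts with at least 'key' and 'skips' fields.
    Only the first `window` entries are considered for reordering.
    """
    if not queue:
        return None
    head = queue[0]
    # Safeguard: once the head has been skipped max_skips times,
    # it is served next regardless of whether it matches.
    if head["skips"] < max_skips:
        for i in range(min(window, len(queue))):
            if queue[i]["key"] == worker_key:
                # Every entry we jump over records one more skip.
                for j in range(i):
                    queue[j]["skips"] += 1
                action = queue[i]
                del queue[i]
                return action
    return queue.popleft()
```

Because only the head's skip count gates reordering, each entry can be delayed a bounded number of times before it reaches the front and is served.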

On the positive side, we've seen performance improvements even if we can only find matching workers for a small percentage of actions. There's basically no downside to providing the extra information - the performance is virtually identical to the non-persistent-worker case if we can't ever schedule an action to a matching instance. We have also seen cases where moving to remote builds without persistent workers is a significant performance regression compared to local builds because the action graph is not sufficiently wide (and given the inherent overhead of remote execution), and local builds already use persistent workers.

Finally, people seem to be happy to enable this without particular regard to safety or security given the significant benefits we're seeing on the performance side. Given that people are happy to use remote caching (which has strictly worse safety and security), I find this entirely unsurprising. Debugging hasn't been an issue for us so far, maybe because persistent workers are already widely used for local execution.

larsrc-google · 2020-11-17T17:09:09Z

Ulf, could you turn the above into a little design doc and attach it to a PR for this change?

larsrc-google · 2021-02-11T10:32:48Z

@EricBurnett

Ulf, are you far enough along with this that you could do a design doc? Eric is concerned that getting remote workers to be safe and correct is not that easy, but it would be a great feature.

EricBurnett · 2021-02-11T21:01:21Z

+1 for a design doc. Ideally with some discussion on whether this is needed in all cases, or only to reduce latency on user-driven incremental builds - I'd be much less concerned if e.g. caching was disabled for remote-worker actions and there was no cross-user sharing of workers, as that'd significantly reduce the blast radius of issues while potentially keeping all the interesting benefits?

> Given that people are happy to use remote caching (which has strictly worse safety and security), I find this entirely unsurprising.

FWIW most groups I've worked with that enabled remote caching have poisoned their cache at least once, so I'd still suggest having a response plan for dealing with that :).

ulfjack · 2021-03-06T00:27:42Z

I finally wrote a design doc: bazelbuild/proposals#219

wiwa · 2022-01-26T23:22:00Z

Would it be helpful for remexec backends to implement their ends of remote persistent workers if they could depend on //src/main/java/com/google/devtools/build/lib/worker? i.e. should worker be visible through @bazel_tools?

wiwa · 2022-04-20T07:51:45Z

@ulfjack While playing around with your patch, I noted that the change doesn't fully specify the initial --persistent_worker command, meaning that each remote worker would have to know this ahead of time. I was able to get Bazel to spit out this information from the WorkerKey you exposed in a small extra change here: wiwa@d9c7ae9

Unfortunately, I had to stuff it in the Platform's additionalProperties field. What do you think about this addition? Is there a better way to specify "extra metadata" on remote execution requests? I see that your (accepted) proposal has additional data which Bazel sends. In my change, I was thinking that ToolSignature could be a subset of WorkerKey. Instead, now I'm thinking that ToolSignature would be... exactly WorkerKey, or at least, the info that we're missing from it (for example, we already have the info of the input files from the file marker).

Splitting it out of the main worker package makes re-using the code for other implementations for dispatching requests to workers (e.g., for remote persistent workers, bazelbuild#10091) easier.

ittaiz · 2022-07-11T05:11:40Z

What's the status of this?
The design doc was merged a year ago. Did Ulf's patch make it in somehow?

meisterT · 2022-07-11T07:08:10Z

> Did Ulf's patch make it in somehow?

Not yet, but we started looking into this again just last week. We have to play around with the patch to see if it needs any change.

cc @sadaf-matinkhoo @larsrc-google

Add a new `--experimental_remote_mark_tool_inputs` flag, which makes Bazel tag tool inputs when executing actions remotely, and also adds a tools input key to the platform proto sent as part of the remote execution request. This allows a remote execution system to implement persistent workers, i.e., to keep worker processes around and reuse them for subsequent actions. In a trivial example, this improves build performance by ~3x. We use "persistentWorkerKey" for the platform property, with the value being a hash of the tool inputs, and "bazel_tool_input" as the node property name, with an empty string as value—this is just a boolean tag. Fixes bazelbuild#10091. Co-authored-by: Ulf Adams <[email protected]>

Add a new --experimental_remote_mark_tool_inputs flag, which makes Bazel tag tool inputs when executing actions remotely, and also adds a tools input key to the platform proto sent as part of the remote execution request. This allows a remote execution system to implement persistent workers, i.e., to keep worker processes around and reuse them for subsequent actions. In a trivial example, this improves build performance by ~3x. We use "persistentWorkerKey" for the platform property, with the value being a hash of the tool inputs, and "bazel_tool_input" as the node property name, with an empty string as value (this is just a boolean tag). Implements bazelbuild#10091. Change-Id: Iccb36081fee399855be7c487c2d4091cb36f8df3 (cherry picked from commit 526fb58)

Add a new `--experimental_remote_mark_tool_inputs` flag, which makes Bazel tag tool inputs when executing actions remotely, and also adds a tools input key to the platform proto sent as part of the remote execution request. This allows a remote execution system to implement persistent workers, i.e., to keep worker processes around and reuse them for subsequent actions. In a trivial example, this improves build performance by ~3x. We use "persistentWorkerKey" for the platform property, with the value being a hash of the tool inputs, and "bazel_tool_input" as the node property name, with an empty string as value—this is just a boolean tag. Fixes bazelbuild#10091. Co-authored-by: Ulf Adams <[email protected]> Closes bazelbuild#16362. PiperOrigin-RevId: 482433908 Change-Id: I2a80834731fd0169c08d4beea348f61a323ca028

irengrig added team-Remote-Exec Issues and PRs for the Execution (Remote) team untriaged labels Oct 28, 2019

buchgr added P3 We're not considering working on this, but happy to review a PR. (No assignee) and removed untriaged labels Oct 28, 2019

meisterT closed this as completed Aug 4, 2020

meisterT reopened this Aug 12, 2020

coeuvre added the type: feature request label Dec 9, 2020

Yannic mentioned this issue May 1, 2022

[worker] Move WorkerProtocolImpl into its own package #15381

Closed

benjaminp mentioned this issue Sep 30, 2022

Experimentally support remote persistent workers. #16362

Closed

wiwa mentioned this issue Oct 14, 2022

Introduce Persistent Workers buildfarm/buildfarm#1195

Merged

copybara-service bot closed this as completed in 72b481a Oct 20, 2022
