Implement a lease service to support remote cache eviction #16660

Closed

@coeuvre coeuvre commented Nov 4, 2022

Design doc

Fixes #10880.

Working towards #8250.

@coeuvre coeuvre requested a review from alexjski November 8, 2022 09:55
@coeuvre
Member Author

coeuvre commented Nov 8, 2022

@alexjski: this is not ready yet, but I want to get some input from you about the changes to FilesystemValueChecker.

There are two ways to check the TTL of remote metadata:

  1. We call an API provided by the remote service when checking, and cache the query result.
  2. We agree on a constant TTL with the service ahead of time (e.g. with a flag such as --remote-cache-ttl=3600s). When we get the metadata from the remote service, we assume the service will keep the file alive for at least 3600s.

This PR goes with option 2. We save the TTL into the remote metadata and check whether it is still alive before we use it.
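
A minimal sketch of option 2, assuming a hypothetical `RemoteMetadataSketch` class (it is not Bazel's actual metadata class, and the method names are made up for illustration). The expiration is computed once from the agreed TTL when the metadata is received, and checked before the metadata is trusted:

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical stand-in for remote file metadata; not Bazel's real class. */
final class RemoteMetadataSketch {
  private final String digest;
  private final long expireAtEpochMilli;

  private RemoteMetadataSketch(String digest, long expireAtEpochMilli) {
    this.digest = digest;
    this.expireAtEpochMilli = expireAtEpochMilli;
  }

  /** Records the expiration at the moment the metadata is received from the remote service. */
  static RemoteMetadataSketch received(String digest, Instant receivedAt, Duration ttl) {
    return new RemoteMetadataSketch(digest, receivedAt.plus(ttl).toEpochMilli());
  }

  /** Returns true if the blob is still assumed to exist remotely at the given time. */
  boolean isAlive(Instant now) {
    return now.toEpochMilli() < expireAtEpochMilli;
  }

  String digest() {
    return digest;
  }
}
```

Under this sketch a TTL of 0 means the metadata is never considered alive, so every incremental build would have to consult the remote cache again.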

@coeuvre
Member Author

coeuvre commented Nov 8, 2022

I learnt that some servers would like to set the TTL to 0 due to storage constraints. In this case, for an incremental build, this change makes Bazel ignore cached remote metadata if the files didn't get downloaded during the previous build, and check the remote cache again.

@coeuvre coeuvre force-pushed the remote-metadata-ttl branch from 548098e to b541120 Compare November 9, 2022 14:30
Comment on lines +44 to +52
defaultValue = "2h",
documentationCategory = OptionDocumentationCategory.REMOTE,
effectTags = {OptionEffectTag.EXECUTION},
converter = RemoteDurationConverter.class,
help = "The guaranteed minimal age of blobs in the remote cache after their digests are "
+ "recently referenced e.g. by an ActionResult. Bazel does several optimizations based on "
+ "the blobs' age e.g. doesn't repeatedly call GetActionResult in an incremental build. "
+ "The value should be set slightly less than the real age since there is a gap between "
+ "when the server returns the digests and when Bazel receives them."
Contributor

I think this should say what setting this to 0 does. Also, I think the default should be 0.

Member Author

I will add more details, including the behaviour when this is set to 0, after I nail down all the implementation details.

I agree 2h is probably not a good default, but 0 isn't either. Probably we should default to null, which means Bazel assumes the blobs will never expire (just like what it does today).

Contributor

I disagree that null is a good default. It would mean the reported bug isn't fixed for people unless they set a value here. I think 0 is the best default: bazel will recheck on each build. If you want to optimize it, because you know your remote cache can keep things around for a certain time, you can increase it.

Having 0 as a default is a nice middle ground: the default situation means people no longer run into the bug, which makes BwtB much more usable, while allowing remote cache maintainers to advise on a value to set for their users.

Member Author

Setting this to 0 will result in extremely bad incremental build performance: Bazel has to invalidate all in-memory caches and call GetActionResult for every action.

Contributor

Once per build, right?

If it can't be 0, then a low non-zero value is better than null. In that case I'm supportive of a couple hours. Maybe have it default to whatever the value of https://bazel.build/reference/command-line-reference#flag--max_idle_secs is?

Member Author

> Once per build, right?

Yes, once per action per build.

> If it can't be 0, then a low non-zero value is better than null. In that case I'm supportive of a couple hours. Maybe have it default to whatever the value of https://bazel.build/reference/command-line-reference#flag--max_idle_secs is?

Not sure. I will find a good default value once this PR is finalized.

Contributor

@alexjski alexjski left a comment

Is there a general concern that for slow builds, a file can expire in the middle of it? We do check the files at the beginning, but that doesn't seem to be updating the leases. We can somehow pad that by setting the option pretty low, but overall it is unsound as long as nothing centrally manages lease renewals for those.

@@ -591,12 +602,13 @@ public boolean equals(Object o) {
     return Arrays.equals(digest, that.digest)
         && size == that.size
         && locationIndex == that.locationIndex
-        && Objects.equals(actionId, that.actionId);
+        && Objects.equals(actionId, that.actionId)
+        && expiredAtEpochMilli == that.expiredAtEpochMilli;
Contributor

  1. Does that increase the size of this object? If so, it may be a good time to split actionId and this timestamp to a new subclass of its own.
  2. Can an action rerun renew this? If so, will that not negatively affect downstream action through no skyframe pruning?

Member Author

> 1. Does that increase the size of this object? If so, it may be a good time to split actionId and this timestamp to a new subclass of its own.

We plan to remove actionId. I can do it in a separate PR. Does that sound good to you? I am open to creating a subclass too.

> 2. Can an action rerun renew this? If so, will that not negatively affect downstream action through no skyframe pruning?

Yes. Imagine the metadata has expired in an incremental build: if the file doesn't exist locally, skyframe will invalidate the action node and re-evaluate it. The action cache will be skipped because we also check the expiration there. So the generating action will be re-executed, hitting the remote cache and injecting remote metadata with a new expiration.

Contributor

> We plan to remove actionId. I can do it in a separate PR. Does that sound good to you? I am open to creating a subclass too.

My take is that I would strongly recommend running one of the large builds upon import and checking memory impact.

> So the generating action will be re-executed, hitting the remote cache and injecting remote metadata with a new expiration.

Let me get back to my question. After such an execution, with the same results, we will not evaluate the ActionExecutionValue as equal for downstream actions due to your newly added condition. Therefore, change pruning will not kick in, even though it could. You will get an action cache hit for downstream actions, but it should be unnecessary to check that at all for those, had change pruning worked.

Member Author

That's true. Maybe exclude expiration from equals?
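
One way to read this suggestion, as a minimal sketch: the hypothetical `RemoteValueSketch` class below borrows field names from the diff above (omitting `actionId`, which is planned for removal), but the class itself is illustrative and not Bazel's. `equals` and `hashCode` compare only content identity, so a re-executed action that yields the same digest with a newer expiration still compares equal and change pruning can apply:

```java
import java.util.Arrays;

/** Hypothetical value class; the expiration is bookkeeping, not content identity. */
final class RemoteValueSketch {
  private final byte[] digest;
  private final long size;
  private final int locationIndex;
  private final long expiredAtEpochMilli;

  RemoteValueSketch(byte[] digest, long size, int locationIndex, long expiredAtEpochMilli) {
    this.digest = digest.clone();
    this.size = size;
    this.locationIndex = locationIndex;
    this.expiredAtEpochMilli = expiredAtEpochMilli;
  }

  long expiredAtEpochMilli() {
    return expiredAtEpochMilli;
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) {
      return true;
    }
    if (!(o instanceof RemoteValueSketch)) {
      return false;
    }
    RemoteValueSketch that = (RemoteValueSketch) o;
    // expiredAtEpochMilli is deliberately left out of the comparison.
    return Arrays.equals(digest, that.digest)
        && size == that.size
        && locationIndex == that.locationIndex;
  }

  @Override
  public int hashCode() {
    // Must stay consistent with equals, so the expiration is excluded here too.
    return 31 * (31 * Arrays.hashCode(digest) + Long.hashCode(size)) + locationIndex;
  }
}
```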

* Returns {@code true} if the file is remote and is available remotely at time {@code epochMilli}
* which is the milliseconds since epoch.
*/
public boolean isRemotelyAvailable(long epochMilli) {
Contributor

Do we ever call that on a FileArtifactValue for which we don't know it is a RemoteFileArtifactValue?

Member Author

No. I can move this into RemoteFileArtifactValue.

@@ -453,7 +462,8 @@ private boolean artifactIsDirtyWithDirectSystemCalls(
       boolean trustRemoteValue =
           fileMetadata.getType() == FileStateType.NONEXISTENT
               && lastKnownData.isRemote()
-              && trustRemoteArtifacts;
+              && trustRemoteArtifacts
+              && lastKnownData.isRemotelyAvailable(epochMilli);
Contributor

As is, it looks like, in the internal case, this will mark 100% of remote values as requiring a rebuild, which would be unacceptable (unless we use Long.MAX_VALUE there). I would recommend running some tests against that change.

Member Author

@coeuvre coeuvre Nov 10, 2022

Yes, I am aware of this and will update the internal code during import to make sure they behave the same as today.
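
A minimal sketch of the check discussed in this thread, assuming a hypothetical helper `DirtinessSketch.trustRemoteValue` that mirrors the condition in the diff above, and assuming `Long.MAX_VALUE` serves as the "never expires" sentinel the reviewer mentions. None of these names are Bazel's actual API:

```java
/** Hypothetical helper mirroring the trust decision in the diff; not Bazel's code. */
final class DirtinessSketch {
  /** Assumed sentinel for metadata that should be treated as never expiring. */
  static final long NEVER_EXPIRES = Long.MAX_VALUE;

  private DirtinessSketch() {}

  /** Returns true if the blob is still assumed to exist remotely at the given time. */
  static boolean isRemotelyAvailable(long expireAtEpochMilli, long nowEpochMilli) {
    return nowEpochMilli < expireAtEpochMilli;
  }

  /**
   * A locally missing output is trusted only if it was last known to be remote, remote
   * artifacts are trusted at all, and the lease has not expired yet.
   */
  static boolean trustRemoteValue(
      boolean existsLocally,
      boolean lastKnownIsRemote,
      boolean trustRemoteArtifacts,
      long expireAtEpochMilli,
      long nowEpochMilli) {
    return !existsLocally
        && lastKnownIsRemote
        && trustRemoteArtifacts
        && isRemotelyAvailable(expireAtEpochMilli, nowEpochMilli);
  }
}
```

With the sentinel, the added condition is always true for any realistic clock value, which keeps the pre-existing behaviour for callers that do not track expiration.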

@brentleyjones
Contributor

During the REAPI meeting we all agreed that it's assumed that build artifacts will continue to exist for the duration of a build. It's on the remote cache to ensure that is the case.

@coeuvre
Member Author

coeuvre commented Nov 10, 2022

> Is there a general concern that for slow builds, a file can expire in the middle of it?

This is not a concern for this PR, at least. As Brentley mentioned above, most remote caches will ensure the files are available during the build. Bazel will complain and crash if the remote cache fails to do that. Action rewinding will improve that, but in the worst case it will result in an endless loop (or reach the max rewinding time and crash).

> We do check the files at the beginning, but that doesn't seem to be updating the leases.

We renew the lease by invalidating the skyframe node, bypassing the action cache and re-executing the action. Within the execution, we either check the remote cache (and get a cache hit if the files are still available) or re-execute the action remotely. In both cases, we inject the metadata with a new expiration.

@alexjski
Contributor

> > Is there a general concern that for slow builds, a file can expire in the middle of it?
>
> This is not a concern for this PR, at least. As Brentley mentioned above, most remote caches will ensure the files are available during the build. Bazel will complain and crash if the remote cache fails to do that. Action rewinding will improve that, but in the worst case it will result in an endless loop (or reach the max rewinding time and crash).

Can you explain the principle behind what guarantees that the files will remain for the duration of the build?

> most remote caches will ensure the files are available during the build

Is there code which tracks all remote metadata and proactively renews the leases with RBE in Bazel? Can you point me to that?

> > We do check the files at the beginning, but that doesn't seem to be updating the leases.
>
> We renew the lease by invalidating the skyframe node, bypassing the action cache and re-executing the action. Within the execution, we either check the remote cache (and get a cache hit if the files are still available) or re-execute the action remotely. In both cases, we inject the metadata with a new expiration.

I think you looked at the opposite case I was thinking about. If the file still has TTL, do we renew the lease for it? Is the assumption that Bazel side TTL is so much lower that the build is guaranteed to finish before that?

@@ -441,7 +449,8 @@ private boolean artifactIsDirtyWithDirectSystemCalls(
       ImmutableSet<PathFragment> knownModifiedOutputFiles,
       boolean trustRemoteArtifacts,
       Map.Entry<? extends Artifact, FileArtifactValue> entry,
-      ModifiedOutputsReceiver modifiedOutputsReceiver) {
+      ModifiedOutputsReceiver modifiedOutputsReceiver,
+      long epochMilli) {
Contributor

What is the role of FilesystemValueChecker in this design? Is it merely an optimization to prevent spurious rewinding or is that necessary for correctness?

Member Author

it is an optimization to prevent rewinding.

Contributor

If that's solely an optimization, then I care significantly less about whether that seems unsound. One concern here is that e.g. outputs checking can be disabled for independent reasons (--experimental_check_output_files). Future features (e.g. imagine we added watchfs for outputs) could also break that in a way which is difficult to connect--that would be quite a leap for whoever implements such a feature to connect it with this behavior for remote execution.

@coeuvre
Member Author

coeuvre commented Nov 11, 2022

I think you are right, we can't avoid having a lease service to track all the remote metadata. I will update the code.

@alexjski
Contributor

> I think you are right, we can't avoid having a lease service to track all the remote metadata. I will update the code.

Good question if all this is doing is preventing rewinding. But yeah, if you had a background thread updating those, then the FileSystemValueChecker is not a big deal. In fact, all you would care about is for action cache not to introduce a stale file.

This background refresh may be a little problematic given it may leak memory (old outputs). You may consider some tricks with weak references to alleviate that. Anyway, that smells of complexity, so please use your judgement (if an occasional rewind works, maybe that's a good enough start?).
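
The weak-reference idea above could look roughly like the following. This is a sketch under stated assumptions, not the PR's implementation: `LeaseTrackerSketch`, `Output`, and the `renewRemotely` callback are hypothetical names, and the actual remote call is left abstract. The tracker holds only weak references, so outputs dropped from the build graph can be garbage collected and are pruned on the next refresh pass:

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical lease tracker that does not keep old outputs alive. */
final class LeaseTrackerSketch {
  /** Hypothetical metadata holder; only the digest matters here. */
  static final class Output {
    final String digest;

    Output(String digest) {
      this.digest = digest;
    }
  }

  private final List<WeakReference<Output>> tracked = new ArrayList<>();

  synchronized void track(Output output) {
    tracked.add(new WeakReference<>(output));
  }

  /**
   * One refresh pass, intended to be called periodically from a background thread.
   * Returns the number of outputs whose leases were passed to the callback.
   */
  synchronized int extendLeases(Consumer<List<String>> renewRemotely) {
    List<String> live = new ArrayList<>();
    for (Iterator<WeakReference<Output>> it = tracked.iterator(); it.hasNext(); ) {
      Output output = it.next().get();
      if (output == null) {
        it.remove(); // Nothing references this output any more; stop tracking it.
      } else {
        live.add(output.digest);
      }
    }
    if (!live.isEmpty()) {
      renewRemotely.accept(live);
    }
    return live.size();
  }
}
```

A caller would typically schedule `extendLeases` at an interval shorter than the TTL; whether an occasional rewind is cheaper than this bookkeeping is the judgement call raised above.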

@coeuvre coeuvre changed the title [WIP] Add TTL to remote metadata [WIP] Implement a lease service for tracking remote metadata Nov 23, 2022
@coeuvre coeuvre changed the title [WIP] Implement a lease service for tracking remote metadata Implement a lease service to support remote cache eviction Jan 23, 2023
@coeuvre coeuvre force-pushed the remote-metadata-ttl branch from 070f32d to b58f1b3 Compare January 27, 2023 14:14
copybara-service bot pushed a commit that referenced this pull request Feb 14, 2023
…g an invocation

Part of #16660.

Closes #17358.

PiperOrigin-RevId: 509494072
Change-Id: Id6944da5d9a556dc9154fcb702948586b474875e
hvadehra pushed a commit that referenced this pull request Feb 14, 2023
coeuvre added a commit to coeuvre/bazel that referenced this pull request Feb 15, 2023
coeuvre added a commit to coeuvre/bazel that referenced this pull request Feb 16, 2023
keertk added a commit that referenced this pull request Feb 19, 2023
…g an invocation (#17496)

Co-authored-by: kshyanashree
Co-authored-by: keertk
copybara-service bot pushed a commit that referenced this pull request Feb 20, 2023
Currently, when building without the bytes, if Bazel fails to download blobs from the CAS when fetching them as inputs to local actions, it fails the build with a message like `... --remote_download_outputs=minimal does not work if your remote cache evicts files during builds.`, and this message keeps showing up until a manual `bazel clean`.

This PR fixes that by cleaning up stale state in skyframe and the action cache upon remote cache eviction so that a following build can continue without `bazel shutdown` or `bazel clean`.

Fixes #17366.
Part of #16660.

Closes #17462.

PiperOrigin-RevId: 510952745
Change-Id: I4fc59a21195565c68375a19ead76738d2208c4ac
@coeuvre
Member Author

coeuvre commented Mar 8, 2023

I have updated the design doc with implementation details and the progress for 6.1. Please have a look if you are interested.

copybara-service bot pushed a commit that referenced this pull request Mar 14, 2023
With TTL-based discarding and the upcoming lease extension, the remote cache eviction error won't happen if the remote cache can guarantee the TTL. However, if it does happen, it usually means the remote cache is under high load and could possibly evict more blobs that Bazel wouldn't be aware of. Following builds could still fail with the same error (caused by different blobs).

This PR removes all remote metadata when the remote cache eviction error happens (which should be rare with the help of TTL-based discarding and lease extension) to make sure the next incremental build can succeed.

Part of #16660.

Closes #17747.

PiperOrigin-RevId: 516519657
Change-Id: Ia99770b9d314ca62801b73dc96d09ed8ac2233f6
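
A minimal sketch of the behaviour described in this commit message, assuming a hypothetical in-memory `MetadataStoreSketch` keyed by output path (Bazel's real state lives in skyframe and the action cache, which this does not model). On an eviction error, every remote entry is dropped rather than only the one whose blob was reported missing:

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical metadata store; stands in for the state Bazel keeps between builds. */
final class MetadataStoreSketch {
  /** Hypothetical entry: a digest plus whether the file exists only remotely. */
  static final class Entry {
    final String digest;
    final boolean remote;

    Entry(String digest, boolean remote) {
      this.digest = digest;
      this.remote = remote;
    }
  }

  private final Map<String, Entry> byPath = new HashMap<>();

  void put(String path, Entry entry) {
    byPath.put(path, entry);
  }

  boolean contains(String path) {
    return byPath.containsKey(path);
  }

  /**
   * Called when a remote cache eviction error is observed. Other blobs may have been
   * evicted as well, so all remote entries are discarded. Returns how many were removed.
   */
  int onRemoteCacheEviction() {
    int before = byPath.size();
    byPath.values().removeIf(entry -> entry.remote);
    return before - byPath.size();
  }
}
```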
ShreeM01 pushed a commit to ShreeM01/bazel that referenced this pull request Mar 14, 2023
ShreeM01 added a commit that referenced this pull request Mar 15, 2023
#17770)

Co-authored-by: Chi Wang
copybara-service bot pushed a commit that referenced this pull request Mar 30, 2023
With #17358, Bazel will exit with code 39 if the remote cache evicts blobs during the build. With #17462 and #17747, Bazel is able to continue the build without `bazel clean` or `bazel shutdown`.

However, even with #17639 and the following changes to extend the lease, the remote cache can still evict blobs in some rare cases.

Based on the above changes, this PR makes Bazel retry the invocation (also referred to as **build rewinding**) if the previous invocation encountered the remote cache eviction error and `--experimental_remote_cache_eviction_retries` is set.

```
$ bazel build --experimental_remote_cache_eviction_retries=5 ...
INFO: Invocation ID: b7348bfa-9446-4c72-a888-0a0ad012f225
Loading:
Loading:
Loading: 0 packages loaded
Analyzing: target //a:bar (0 packages loaded, 0 targets configured)
INFO: Analyzed target //a:bar (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
[0 / 2] [Prepa] BazelWorkspaceStatusAction stable-status.txt
ERROR: .../workspace/a/BUILD:8:8: Executing genrule //a:bar failed: Failed to fetch blobs because they do not exist remotely: Missing digest: b5bb9d8014a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c/4
Target //a:bar failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 0.447s, Critical Path: 0.05s
INFO: 2 processes: 2 internal.
ERROR: Build did NOT complete successfully
Found remote cache eviction error, retrying the build...
INFO: Invocation ID: 983f60dc-8bb9-4b82-aa33-a378469ce140
Loading:
Loading:
Loading: 0 packages loaded
Analyzing: target //a:bar (0 packages loaded, 0 targets configured)
INFO: Analyzed target //a:bar (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
[0 / 2] [Prepa] BazelWorkspaceStatusAction stable-status.txt
Target //a:bar up-to-date:
  bazel-bin/a/bar.out
INFO: Elapsed time: 0.866s, Critical Path: 0.35s
INFO: 3 processes: 1 internal, 1 processwrapper-sandbox, 1 remote.
INFO: Build completed successfully, 3 total actions
$
```

Part of #16660.

Closes #17711.

PiperOrigin-RevId: 520610524
Change-Id: I20d43d1968767a03250b9c8f8a6dda4e056d4f52
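
The retry behaviour shown in the log above can be sketched as a simple loop. This is an illustration only: `EvictionRetrySketch` and the `IntSupplier` standing in for one Bazel invocation are hypothetical, and treating the flag value as an upper bound on the number of retries is an assumption. Exit code 39 is the one the commit message associates with remote cache eviction:

```java
import java.util.function.IntSupplier;

/** Hypothetical wrapper that re-runs an invocation after a remote cache eviction error. */
final class EvictionRetrySketch {
  /** Exit code that the commit message associates with remote cache eviction. */
  static final int REMOTE_CACHE_EVICTED = 39;

  private EvictionRetrySketch() {}

  /**
   * Runs the invocation once, then retries up to maxRetries times while it keeps
   * failing with the eviction exit code. Returns the exit code of the last run.
   */
  static int runWithRetries(IntSupplier invocation, int maxRetries) {
    int exitCode = invocation.getAsInt();
    for (int attempt = 0; attempt < maxRetries && exitCode == REMOTE_CACHE_EVICTED; attempt++) {
      System.out.println("Found remote cache eviction error, retrying the build...");
      exitCode = invocation.getAsInt();
    }
    return exitCode;
  }
}
```

Any other exit code, success included, ends the loop immediately, so unrelated build failures are not retried.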
ShreeM01 pushed a commit to ShreeM01/bazel that referenced this pull request Mar 30, 2023
ShreeM01 pushed a commit to ShreeM01/bazel that referenced this pull request Mar 31, 2023
coeuvre added a commit to coeuvre/bazel that referenced this pull request Apr 21, 2023
coeuvre added a commit to coeuvre/bazel that referenced this pull request Apr 21, 2023
coeuvre added a commit to coeuvre/bazel that referenced this pull request Apr 21, 2023
With bazelbuild#17358, Bazel will exit with code 39 if remote cache evicts blobs during the build. With bazelbuild#17462 and bazelbuild#17747, Bazel is able to continue the build without bazel clean or bazel shutdown.

However, even with bazelbuild#17639 and following changes to extend the lease, remote cache can still evict blobs in some rare cases.

Based on above changes, this PR makes bazel retry the invocation if it encountered the remote cache eviction error during previous invocation if `--experimental_remote_cache_eviction_retries` is set, or **build rewinding**.

```
$ bazel build --experimental_remote_cache_eviction_retries=5 ...
INFO: Invocation ID: b7348bfa-9446-4c72-a888-0a0ad012f225
Loading:
Loading:
Loading: 0 packages loaded
Analyzing: target //a:bar (0 packages loaded, 0 targets configured)
INFO: Analyzed target //a:bar (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
[0 / 2] [Prepa] BazelWorkspaceStatusAction stable-status.txt
ERROR: .../workspace/a/BUILD:8:8: Executing genrule //a:bar failed: Failed to fetch blobs because they do not exist remotely: Missing digest: b5bb9d8014a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c/4
Target //a:bar failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 0.447s, Critical Path: 0.05s
INFO: 2 processes: 2 internal.
ERROR: Build did NOT complete successfully
Found remote cache eviction error, retrying the build...
INFO: Invocation ID: 983f60dc-8bb9-4b82-aa33-a378469ce140
Loading:
Loading:
Loading: 0 packages loaded
Analyzing: target //a:bar (0 packages loaded, 0 targets configured)
INFO: Analyzed target //a:bar (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
[0 / 2] [Prepa] BazelWorkspaceStatusAction stable-status.txt
Target //a:bar up-to-date:
  bazel-bin/a/bar.out
INFO: Elapsed time: 0.866s, Critical Path: 0.35s
INFO: 3 processes: 1 internal, 1 processwrapper-sandbox, 1 remote.
INFO: Build completed successfully, 3 total actions
$
```
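
The retry behaviour shown in the log above can be sketched roughly as follows. This is a minimal illustration, not Bazel's actual implementation: `EvictionRetrySketch`, `runWithRetries`, and the `IntSupplier` standing in for one invocation are hypothetical names. Only exit code 39 and the idea of a retry limit come from the description above.

```java
import java.util.function.IntSupplier;

/** Hypothetical sketch of "retry the invocation on remote cache eviction". */
final class EvictionRetrySketch {
  /** Exit code the description above associates with remote cache eviction. */
  static final int REMOTE_CACHE_EVICTED_EXIT_CODE = 39;

  /**
   * Runs the invocation once, then re-runs it while it keeps failing with the
   * eviction exit code and the retry budget is not exhausted.
   * Returns the exit code of the last attempt.
   */
  static int runWithRetries(IntSupplier invocation, int maxRetries) {
    int exitCode = invocation.getAsInt();
    int retries = 0;
    while (exitCode == REMOTE_CACHE_EVICTED_EXIT_CODE && retries < maxRetries) {
      retries++;
      exitCode = invocation.getAsInt();
    }
    return exitCode;
  }
}
```

With a budget of 5, an invocation that fails once with code 39 and then succeeds is run twice in total, which matches the two invocation IDs in the log.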

Part of bazelbuild#16660.

Closes bazelbuild#17711.

PiperOrigin-RevId: 520610524
Change-Id: I20d43d1968767a03250b9c8f8a6dda4e056d4f52
keertk pushed a commit that referenced this pull request Apr 21, 2023
…ror (#18171)

fweikert pushed a commit to fweikert/bazel that referenced this pull request May 25, 2023
Previously, we handled remote cache eviction only when downloading inputs for local actions. However, when Bazel needs to re-execute a remote action, it may detect that some inputs are missing from the remote CAS. In this case, Bazel tries to upload the inputs by reading them from the local filesystem. If the inputs were generated remotely, never downloaded, and then evicted remotely, the upload fails with FileNotFoundException.

This CL changes the code to handle the above case correctly by reading through ActionFS when uploading inputs and propagating CacheNotFoundException.
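
A rough sketch of the idea, under loud assumptions: the two maps below stand in for the local filesystem and the remote CAS, and `BlobEvictedException` is a hypothetical stand-in for the `CacheNotFoundException` mentioned above. None of these names are Bazel's real classes. The point is only that a read which falls through to the remote copy can report "evicted" instead of "file not found".

```java
import java.io.IOException;
import java.util.Map;

/** Hypothetical model of reading an input "through" a remote-aware filesystem. */
final class UploadReadSketch {
  /** Stand-in for the cache-miss exception; not Bazel's actual class. */
  static final class BlobEvictedException extends IOException {
    BlobEvictedException(String path) {
      super("blob for " + path + " is missing from the remote cache");
    }
  }

  /**
   * Returns the content to upload for the given path: the local file if it
   * exists, otherwise the remote copy. If neither exists, the input was
   * generated remotely and has since been evicted, so a cache-miss error is
   * raised rather than a plain file-not-found error.
   */
  static byte[] readForUpload(
      String path, Map<String, byte[]> localFiles, Map<String, byte[]> remoteBlobs)
      throws BlobEvictedException {
    byte[] local = localFiles.get(path);
    if (local != null) {
      return local;
    }
    byte[] remote = remoteBlobs.get(path);
    if (remote != null) {
      return remote;
    }
    throw new BlobEvictedException(path);
  }
}
```

Surfacing a dedicated exception type lets callers treat the failure as a cache eviction and apply the same recovery path used for download failures.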

Related to bazelbuild#16660.

PiperOrigin-RevId: 512568547
Change-Id: I3a28cadbb6285fa3727e1603f37abf8843c093c9
fweikert pushed a commit to fweikert/bazel that referenced this pull request May 25, 2023
When building without the bytes, Bazel stores a `RemoteFileArtifactValue`, which represents a file that is stored remotely, in skyframe (in memory) and in the local action cache. Bazel assumes that the remote file will never expire, which is wrong. In practice, remote caches often evict files due to space constraints, and when that happens, builds can fail.

This PR introduces the flag `--experimental_remote_cache_ttl`, which tells Bazel the minimum time the remote cache will keep a file after returning a reference to it. Bazel calculates the expiration time of the file and stores it in the `RemoteFileArtifactValue`. In an incremental build, Bazel discards the `RemoteFileArtifactValue` and reruns the generating actions if it finds that the `RemoteFileArtifactValue` has expired. The new field `expireAtEpochMilli` replaces `actionId` (deleted by f62a8b9), so there shouldn't be a memory regression.

There are two places where Bazel checks the TTL:
1. If skyframe has in-memory state from previous builds (e.g. incremental builds), the `SkyValue`s are marked as dirty if the `RemoteFileArtifactValue` has expired.
2. When checking the local action cache, if the `RemoteFileArtifactValue` has expired, the cache entry is ignored.

In both cases, the generating actions can then be re-executed.
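
The expiry check itself is simple arithmetic. The sketch below is illustrative only: the class and method names are hypothetical, and whether the real comparison is strict or inclusive at the exact expiry instant is an assumption. Only the `expireAtEpochMilli` field name and the TTL concept come from the description above.

```java
import java.time.Duration;

/** Hypothetical sketch of TTL bookkeeping for remote file metadata. */
final class RemoteMetadataTtlSketch {
  /** Expiry timestamp recorded when the remote cache returns a reference. */
  static long expireAtEpochMilli(long nowEpochMilli, Duration ttl) {
    return nowEpochMilli + ttl.toMillis();
  }

  /** True while the remote file may still be assumed to exist (assumed strict). */
  static boolean isAlive(long expireAtEpochMilli, long nowEpochMilli) {
    return nowEpochMilli < expireAtEpochMilli;
  }

  /**
   * Both check sites described above reduce to this: an expired entry is
   * discarded, so the generating action has to run again.
   */
  static boolean mustRerunGeneratingAction(long expireAtEpochMilli, long nowEpochMilli) {
    return !isAlive(expireAtEpochMilli, nowEpochMilli);
  }
}
```

Storing an absolute timestamp rather than a duration means the check needs only the current time, so it works the same way for in-memory skyframe values and for entries read back from the action cache.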

Part of bazelbuild#16660.

Closes bazelbuild#17639.

RELNOTES: Add flag `--experimental_remote_cache_ttl` and set the default value to 3 hours.
PiperOrigin-RevId: 513819724
Change-Id: I9c9813621d04d5b1b94312be39384962feae2f7b
fweikert pushed a commit to fweikert/bazel that referenced this pull request May 25, 2023
With TTL-based discarding and the upcoming lease extension, remote cache eviction errors won't happen if the remote cache can guarantee the TTL. However, if one does happen, it usually means the remote cache is under high load and may have evicted more blobs that Bazel isn't aware of. Subsequent builds could still fail with the same error (caused by different blobs).

This PR removes all remote metadata when a remote cache eviction error happens (which should be rare thanks to TTL-based discarding and lease extension) to make sure the next incremental build can succeed.
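
As a rough illustration of "remove all remote metadata", assuming a simple in-memory map from output path to metadata (the `Metadata` class and `dropAllRemoteMetadata` are hypothetical, not Bazel's data structures):

```java
import java.util.Map;

/** Hypothetical sketch of invalidating every remote-only entry after an eviction error. */
final class RemoteMetadataInvalidationSketch {
  /** Minimal metadata record: a digest plus whether the file only exists remotely. */
  static final class Metadata {
    final String digest;
    final boolean remoteOnly;

    Metadata(String digest, boolean remoteOnly) {
      this.digest = digest;
      this.remoteOnly = remoteOnly;
    }
  }

  /**
   * Drops every entry whose file exists only in the remote cache, not just the
   * one that triggered the error, and returns how many entries were removed.
   */
  static int dropAllRemoteMetadata(Map<String, Metadata> cache) {
    int before = cache.size();
    cache.values().removeIf(m -> m.remoteOnly);
    return before - cache.size();
  }
}
```

Dropping everything is deliberately coarse: once one blob is known to be evicted, the remaining remote-only entries can no longer be trusted either.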

Part of bazelbuild#16660.

Closes bazelbuild#17747.

PiperOrigin-RevId: 516519657
Change-Id: Ia99770b9d314ca62801b73dc96d09ed8ac2233f6
fweikert pushed a commit to fweikert/bazel that referenced this pull request May 25, 2023
@coeuvre
Member Author

coeuvre commented Jun 6, 2023

This PR has been merged in the form of several small CLs. Closing.

@brentleyjones
Contributor

@coeuvre Does that mean last_green has the lease service, and we can test it there?

copybara-service bot pushed a commit that referenced this pull request Jun 28, 2023
Add flag `--experimental_remote_cache_lease_extension`. When it is set, Bazel creates a background thread that periodically sends `FindMissingBlobs` requests to the CAS during the build.

1. All outputs that were not downloaded are within the scope of lease extension. The outputs are obtained from skyframe by traversing the action graph.
2. Lease extension starts after any action has been built and ends when the execution phase ends. The frequency is related to `--experimental_remote_cache_ttl`.
3. Lease extensions are performed per action, rather than by collecting all outputs and issuing one giant `FindMissingBlobs` request.
    - Collecting all outputs might raise the memory high-water mark and cause an OOM.
    - Sending one `FindMissingBlobs` request per action may increase network round-trip overhead, but the cost should be acceptable given that lease extension runs in the background and is not wall-time critical.
4. For an incremental build, the same applies: lease extension starts after any action has been executed.
    - We don't want lease extension to block action execution or affect build performance.
    - Since we have TTL-based cache discarding, any expired blobs will be discarded.
    - Leases of blobs that are not downloaded but are still used by this build (because they are referenced by skyframe) will be extended as normal.
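
Point 3 above (one request per action instead of one giant request) can be sketched as follows. This is an assumption-laden illustration: `LeaseExtensionSketch`, `extendLeases`, and the `Function` standing in for a `FindMissingBlobs` call are hypothetical, and the background scheduling is left out so the batching logic stays visible.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical sketch of one lease-extension pass, batched per action. */
final class LeaseExtensionSketch {
  /**
   * Issues one FindMissingBlobs-style query per action for the digests of its
   * outputs that were not downloaded. Querying is what extends the lease on
   * the server side; the digests reported missing are collected and returned.
   * Memory use stays bounded by the largest single action.
   */
  static List<String> extendLeases(
      Map<String, List<String>> undownloadedOutputsByAction,
      Function<List<String>, Set<String>> findMissingBlobs) {
    List<String> missing = new ArrayList<>();
    for (List<String> digests : undownloadedOutputsByAction.values()) {
      if (digests.isEmpty()) {
        continue; // nothing to extend for this action
      }
      missing.addAll(findMissingBlobs.apply(digests));
    }
    return missing;
  }
}
```

A caller could run such a pass from a periodic background task at an interval shorter than the TTL, so that leases are refreshed before they lapse.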

Part of #16660.

Closes #17944.

PiperOrigin-RevId: 544032753
Change-Id: Iafe8b96c48abbb2e67302cd7a2f06f97ab43f825
@dws
Contributor

dws commented May 30, 2024

See #22592 (--experimental_remote_cache_ttl=0s breaks --remote_download_outputs=toplevel (Build without the Bytes))

@coeuvre coeuvre deleted the remote-metadata-ttl branch June 3, 2024 09:03
Development

Successfully merging this pull request may close these issues.

Builds without the Bytes fails on missing AC result
4 participants