Implement a lease service to support remote cache eviction #16660
@alexjski: this is not ready yet, but I want to get some input from you about the changes to There are two ways to check the TTL for a remote metadata:
This PR goes with option 2. We save the TTL into the remote metadata and check whether it is still alive before we use it.
I learnt that some servers would like to set the TTL to
defaultValue = "2h",
documentationCategory = OptionDocumentationCategory.REMOTE,
effectTags = {OptionEffectTag.EXECUTION},
converter = RemoteDurationConverter.class,
help = "The guaranteed minimal age of blobs in the remote cache after their digests are "
    + "recently referenced e.g. by an ActionResult. Bazel does several optimizations based on "
    + "the blobs' age e.g. doesn't repeatedly call GetActionResult in an incremental build. "
    + "The value should be set slightly less than the real age since there is a gap between "
    + "when the server returns the digests and when Bazel receives them."
I think this should say what setting this to `0` does. Also, I think the default should be `0`.
I will add more details, including the behaviour of setting this to `0`, after I nail down all the implementation details.
I agree `2h` is probably not a good default, but `0` isn't either. Probably we should default to `null`, which means Bazel assumes the blobs will never expire (just like what it does today).
I disagree that null is a good default. It would mean the reported bug isn't fixed for people unless they set a value here. I think 0 is the best default: bazel will recheck on each build. If you want to optimize it, because you know your remote cache can keep things around for a certain time, you can increase it.
Having 0 as a default is a nice middle ground: the default situation means people no longer run into the bug, which makes BwtB much more usable, while allowing remote cache maintainers to advise on a value to set for their users.
Setting it to `0` will result in extremely bad incremental build performance: Bazel has to invalidate all in-memory caches and call `GetActionResult` on every action.
Once per build, right?
If it can't be 0, then a low non-zero value is better than null. In that case I'm supportive of a couple hours. Maybe have it default to whatever the value of https://bazel.build/reference/command-line-reference#flag--max_idle_secs is?
> Once per build, right?

Yes, once per action per build.

> If it can't be 0, then a low non-zero value is better than null. In that case I'm supportive of a couple hours. Maybe have it default to whatever the value of https://bazel.build/reference/command-line-reference#flag--max_idle_secs is?

Not sure. I will find a good default value once this PR is finalized.
Is there a general concern that for slow builds, a file can expire in the middle of it? We do check the files at the beginning, but that doesn't seem to be updating the leases. We can somehow pad that by setting the option pretty low, but overall it is unsound as long as nothing centrally manages lease renewals for those.
@@ -591,12 +602,13 @@ public boolean equals(Object o) {
   return Arrays.equals(digest, that.digest)
       && size == that.size
       && locationIndex == that.locationIndex
-      && Objects.equals(actionId, that.actionId);
+      && Objects.equals(actionId, that.actionId)
+      && expiredAtEpochMilli == that.expiredAtEpochMilli;
- Does that increase the size of this object? If so, it may be a good time to split actionId and this timestamp to a new subclass of its own.
- Can an action rerun renew this? If so, will that not negatively affect downstream action through no skyframe pruning?
> Does that increase the size of this object? If so, it may be a good time to split actionId and this timestamp to a new subclass of its own.

We plan to remove `actionId`. I can do it in a separate PR. Does that sound good to you? I am open to creating a subclass too.

> Can an action rerun renew this? If so, will that not negatively affect downstream action through no skyframe pruning?

Yes. Imagine the metadata expired in an incremental build: if the file doesn't exist locally, skyframe will invalidate the action node and re-evaluate it. The action cache will be skipped because we also check the expiration there. So the generating action will be re-executed, hitting the remote cache and injecting remote metadata with a new expiration.
> We plan to remove `actionId`. I can do it in a separate PR. Does that sound good to you? I am open to create a subclass too.

My take is that I would strongly recommend running one of the large builds upon import and checking memory impact.

> So the generating action will be re-executed, hitting the remote cache and inject a remote metadata with new expiration.

Let me get back to my question. After such an execution, with the same results, we will not evaluate the `ActionExecutionValue` as equal for downstream actions due to your newly added condition. Therefore, change pruning will not kick in, even though it could. You will get an action cache hit for downstream actions, but checking that would be unnecessary for them had change pruning worked.
That's true. Maybe exclude `expiration` from `equals`?
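The suggestion above could be sketched roughly as follows. This is a minimal illustration only: `RemoteMetadataSketch` and `isAliveAt` are hypothetical names, not Bazel's actual `RemoteFileArtifactValue` API, and treating the lease as alive strictly before the expiration instant is an assumption. The expiration is kept as a field but left out of `equals` and `hashCode`, so a re-executed action that produces the same digest still compares equal and change pruning can apply to downstream actions.

```java
import java.util.Arrays;

// Hypothetical, simplified stand-in for remote file metadata; not Bazel's actual class.
final class RemoteMetadataSketch {
  private final byte[] digest;
  private final long size;
  private final int locationIndex;
  private final long expireAtEpochMilli;

  RemoteMetadataSketch(byte[] digest, long size, int locationIndex, long expireAtEpochMilli) {
    this.digest = digest.clone();
    this.size = size;
    this.locationIndex = locationIndex;
    this.expireAtEpochMilli = expireAtEpochMilli;
  }

  // Assumption: the lease counts as alive strictly before the expiration instant.
  boolean isAliveAt(long epochMilli) {
    return epochMilli < expireAtEpochMilli;
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) {
      return true;
    }
    if (!(o instanceof RemoteMetadataSketch)) {
      return false;
    }
    RemoteMetadataSketch that = (RemoteMetadataSketch) o;
    // expireAtEpochMilli is deliberately not compared, so renewing a lease
    // does not make otherwise identical metadata look changed.
    return Arrays.equals(digest, that.digest)
        && size == that.size
        && locationIndex == that.locationIndex;
  }

  @Override
  public int hashCode() {
    // Must stay consistent with equals, so the expiration is excluded here too.
    return 31 * (31 * Arrays.hashCode(digest) + Long.hashCode(size)) + locationIndex;
  }
}
```

With this shape, two values that differ only in their expiration are equal, while the expiration can still be consulted separately when deciding whether to trust the metadata.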
 * Returns {@code true} if the file is remote and is available remotely at time {@code epochMilli}
 * which is the milliseconds since epoch.
 */
public boolean isRemotelyAvailable(long epochMilli) {
Do we ever call that on a FileArtifactValue for which we don't know it is a RemoteFileArtifactValue?
No. I can move this into `RemoteFileArtifactValue`.
@@ -453,7 +462,8 @@ private boolean artifactIsDirtyWithDirectSystemCalls(
   boolean trustRemoteValue =
       fileMetadata.getType() == FileStateType.NONEXISTENT
           && lastKnownData.isRemote()
-          && trustRemoteArtifacts;
+          && trustRemoteArtifacts
+          && lastKnownData.isRemotelyAvailable(epochMilli);
As is, that looks like in the internal case, that will mark 100% of remote values as requiring rebuild, which would be unacceptable (unless we use Long.MAX_VALUE there). I would recommend running some tests against that change.
Yes, I am aware of this and will update the internal code during import to make sure they behave the same as today.
During the REAPI meeting we all agreed that it's assumed that build artifacts will continue to exist for the duration of a build. It's on the remote cache to ensure that is the case.
This is not a concern for this PR at least. Like Brentley mentioned above, most remote caches will ensure the files are available during the build. Bazel will complain and crash if the remote cache fails to do that. Action rewinding will improve that, but in the worst case it will result in an endless loop (or reach the max rewinding time and crash).
We renew the lease by invalidating the skyframe node, bypassing the action cache and re-executing the action. Within the execution, we either check the remote cache (and get a cache hit if the files are still available) or re-execute the action remotely. In both cases, we inject the metadata with a new expiration.
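The renewal flow described in the last comment can be illustrated with a small sketch. Everything here is hypothetical: `RenewOnExpirySketch` and `checkOutput` are invented names, and the "re-execution" is reduced to recording a new expiration. It assumes a fixed TTL and that an output without metadata is treated the same as an expired one.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of "renew by re-execution"; not Bazel's actual implementation.
final class RenewOnExpirySketch {
  private final Map<String, Long> expireAtByOutput = new HashMap<>();
  private final long ttlMillis;
  private int reexecutions = 0;

  RenewOnExpirySketch(long ttlMillis) {
    this.ttlMillis = ttlMillis;
  }

  /**
   * Returns true if the stored remote metadata was still alive and reused,
   * false if the generating action had to be run again.
   */
  boolean checkOutput(String output, long nowEpochMilli) {
    Long expireAt = expireAtByOutput.get(output);
    if (expireAt != null && nowEpochMilli < expireAt) {
      // Lease still alive: the metadata is trusted and left untouched.
      return true;
    }
    // Expired or unknown: stands in for invalidating the node, skipping the
    // action cache, re-executing, and injecting metadata with a new expiration.
    reexecutions++;
    expireAtByOutput.put(output, nowEpochMilli + ttlMillis);
    return false;
  }

  int reexecutions() {
    return reexecutions;
  }
}
```

Note that in this sketch a lease that is still alive is never extended on reuse; the expiration only moves forward when the generating action runs again.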
Can you explain the principle behind what guarantees that the files will remain for the duration of the build?
Is there code which tracks all remote metadata and proactively renews the leases with RBE in Bazel? Can you point me to that?
I think you looked at the opposite case I was thinking about. If the file still has TTL, do we renew the lease for it? Is the assumption that Bazel side TTL is so much lower that the build is guaranteed to finish before that?
@@ -441,7 +449,8 @@ private boolean artifactIsDirtyWithDirectSystemCalls(
     ImmutableSet<PathFragment> knownModifiedOutputFiles,
     boolean trustRemoteArtifacts,
     Map.Entry<? extends Artifact, FileArtifactValue> entry,
-    ModifiedOutputsReceiver modifiedOutputsReceiver) {
+    ModifiedOutputsReceiver modifiedOutputsReceiver,
+    long epochMilli) {
What is the role of `FilesystemValueChecker` in this design? Is it merely an optimization to prevent spurious rewinding or is that necessary for correctness?
it is an optimization to prevent rewinding.
If that's solely an optimization, then I care significantly less about whether that seems unsound. One concern here is that e.g. outputs checking can be disabled for independent reasons (`--experimental_check_output_files`). Future features (e.g. imagine we added watchfs for outputs) could also break that in a way which is difficult to connect: it would be quite a leap for whoever implements such a feature to connect it with this behavior for remote execution.
I think you are right, we can't avoid having a lease service to track all the remote metadata. I will update the code.
Good question if all this is doing is preventing rewinding. But yeah, if you had a background thread updating those, then the This background refresh may be a little problematic given it may leak memory (old outputs). You may consider some tricks with weak references to alleviate that. Anyway, that smells of complexity, so please use your judgement (if an occasional rewind works, maybe that's a good enough start?).
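A background refresh that tracks metadata through weak references, as floated in the comment above, might look roughly like this. It is a sketch under stated assumptions, not the lease service this PR ended up with: `LeaseSketch`, `LeaseExtenderSketch`, `track` and `extendOnce` are hypothetical names, and the call that would ask the remote cache to refresh a blob is left as a comment.

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical lease record; stands in for remote metadata with an expiration.
final class LeaseSketch {
  volatile long expireAtEpochMilli;

  LeaseSketch(long expireAtEpochMilli) {
    this.expireAtEpochMilli = expireAtEpochMilli;
  }
}

// Hypothetical extender that only weakly references the leases it tracks,
// so metadata of old outputs can still be garbage collected.
final class LeaseExtenderSketch {
  private final List<WeakReference<LeaseSketch>> tracked = new ArrayList<>();
  private final long ttlMillis;
  private final long renewBeforeMillis;

  LeaseExtenderSketch(long ttlMillis, long renewBeforeMillis) {
    this.ttlMillis = ttlMillis;
    this.renewBeforeMillis = renewBeforeMillis;
  }

  synchronized void track(LeaseSketch lease) {
    tracked.add(new WeakReference<>(lease));
  }

  /** Runs one refresh pass and returns how many leases were extended. */
  synchronized int extendOnce(long nowEpochMilli) {
    int extended = 0;
    Iterator<WeakReference<LeaseSketch>> it = tracked.iterator();
    while (it.hasNext()) {
      LeaseSketch lease = it.next().get();
      if (lease == null) {
        // The metadata was garbage collected; stop tracking it.
        it.remove();
        continue;
      }
      if (lease.expireAtEpochMilli - nowEpochMilli <= renewBeforeMillis) {
        // A real implementation would ask the remote cache to refresh the blob here
        // and only extend the lease if that call succeeds.
        lease.expireAtEpochMilli = nowEpochMilli + ttlMillis;
        extended++;
      }
    }
    return extended;
  }
}
```

A pass like `extendOnce` could be driven periodically, for example from a `java.util.concurrent.ScheduledExecutorService`; whether that complexity is worth it compared with an occasional rewind is the judgement call raised in the comment.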
…g an invocation Part of bazelbuild#16660. Closes bazelbuild#17358. PiperOrigin-RevId: 509494072 Change-Id: Id6944da5d9a556dc9154fcb702948586b474875e
Currently, when building without the bytes, if Bazel fails to download blobs from the CAS when fetching them as inputs to local actions, it fails the build with a message like `... --remote_download_outputs=minimal does not work if your remote cache evicts files during builds.`, and this message keeps showing up until a manual `bazel clean`. This PR fixes that by cleaning up stale state in skyframe and the action cache upon remote cache eviction so that a following build can continue without `bazel shutdown` or `bazel clean`. Fixes #17366. Part of #16660. Closes #17462. PiperOrigin-RevId: 510952745 Change-Id: I4fc59a21195565c68375a19ead76738d2208c4ac
I have updated the design doc with implementation details and the progress for 6.1. Please have a look if you are interested.
With TTL-based discarding and the upcoming lease extension, the remote cache eviction error won't happen if the remote cache can guarantee the TTL. However, if it happens, it usually means the remote cache is under high load and it could possibly evict more blobs that Bazel wouldn't be aware of. Following builds could still fail with the same error (caused by different blobs). This PR changes Bazel to remove all remote metadata when the remote cache eviction error happens (which should be rare with the help of TTL-based discarding and lease extension) to make sure the next incremental build can succeed. Part of #16660. Closes #17747. PiperOrigin-RevId: 516519657 Change-Id: Ia99770b9d314ca62801b73dc96d09ed8ac2233f6
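The cleanup described in this commit message can be pictured with a minimal sketch. The class and method names (`OutputMetadataStoreSketch`, `dropAllRemoteMetadata`) are hypothetical, and the in-memory map is only a stand-in for the skyframe and action cache state that the real change touches.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the state Bazel keeps about outputs; not the real data structure.
final class OutputMetadataStoreSketch {
  // true: the output is known only through remote metadata; false: it exists locally.
  private final Map<String, Boolean> remoteOnlyByPath = new HashMap<>();

  void put(String path, boolean remoteOnly) {
    remoteOnlyByPath.put(path, remoteOnly);
  }

  boolean contains(String path) {
    return remoteOnlyByPath.containsKey(path);
  }

  /** Drops every remote-only entry after an eviction error and returns how many were removed. */
  int dropAllRemoteMetadata() {
    int before = remoteOnlyByPath.size();
    // One missing blob suggests others may be gone too, so no remote entry is trusted anymore.
    remoteOnlyByPath.values().removeIf(remoteOnly -> remoteOnly);
    return before - remoteOnlyByPath.size();
  }
}
```

Dropping everything is coarser than removing only the blob that was reported missing, which matches the reasoning above that an eviction error hints at further evictions Bazel has not yet observed.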
With #17358, Bazel will exit with code 39 if remote cache evicts blobs during the build. With #17462 and #17747, Bazel is able to continue the build without bazel clean or bazel shutdown. However, even with #17639 and following changes to extend the lease, remote cache can still evict blobs in some rare cases.

Based on above changes, this PR makes bazel retry the invocation if it encountered the remote cache eviction error during previous invocation if `--experimental_remote_cache_eviction_retries` is set, or **build rewinding**.

```
$ bazel build --experimental_remote_cache_eviction_retries=5 ...
INFO: Invocation ID: b7348bfa-9446-4c72-a888-0a0ad012f225
Loading:
Loading:
Loading: 0 packages loaded
Analyzing: target //a:bar (0 packages loaded, 0 targets configured)
INFO: Analyzed target //a:bar (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
[0 / 2] [Prepa] BazelWorkspaceStatusAction stable-status.txt
ERROR: .../workspace/a/BUILD:8:8: Executing genrule //a:bar failed: Failed to fetch blobs because they do not exist remotely: Missing digest: b5bb9d8014a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c/4
Target //a:bar failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 0.447s, Critical Path: 0.05s
INFO: 2 processes: 2 internal.
ERROR: Build did NOT complete successfully
Found remote cache eviction error, retrying the build...
INFO: Invocation ID: 983f60dc-8bb9-4b82-aa33-a378469ce140
Loading:
Loading:
Loading: 0 packages loaded
Analyzing: target //a:bar (0 packages loaded, 0 targets configured)
INFO: Analyzed target //a:bar (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
[0 / 2] [Prepa] BazelWorkspaceStatusAction stable-status.txt
Target //a:bar up-to-date:
  bazel-bin/a/bar.out
INFO: Elapsed time: 0.866s, Critical Path: 0.35s
INFO: 3 processes: 1 internal, 1 processwrapper-sandbox, 1 remote.
INFO: Build completed successfully, 3 total actions
$
```

Part of #16660.
Closes #17711.

PiperOrigin-RevId: 520610524
Change-Id: I20d43d1968767a03250b9c8f8a6dda4e056d4f52
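The retry behaviour shown in the log can be sketched as a bounded loop around an invocation. This is only an illustration of the idea: `EvictionRetrySketch` and `runWithRetries` are hypothetical names, and the `IntSupplier` stands in for running one invocation and returning its exit code. The exit code 39 is taken from the description above.

```java
import java.util.function.IntSupplier;

// Hypothetical sketch of retrying an invocation on remote cache eviction; not Bazel's actual code.
final class EvictionRetrySketch {
  // Exit code reported when the remote cache evicted blobs during the build (per #17358).
  static final int REMOTE_CACHE_EVICTED_EXIT_CODE = 39;

  /**
   * Runs the invocation and reruns it while it fails with the eviction exit code,
   * up to maxRetries additional attempts. Returns the last exit code.
   */
  static int runWithRetries(IntSupplier invocation, int maxRetries) {
    int retries = 0;
    while (true) {
      int exitCode = invocation.getAsInt();
      if (exitCode != REMOTE_CACHE_EVICTED_EXIT_CODE || retries >= maxRetries) {
        return exitCode;
      }
      retries++;
    }
  }
}
```

Only the eviction exit code triggers another attempt; any other failure is returned immediately, and the retry budget bounds how long a persistently evicting cache can keep the loop going.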
…ror (#18171)

With #17358, Bazel exits with code 39 if the remote cache evicts blobs during the build. With #17462 and #17747, Bazel is able to continue the build without `bazel clean` or `bazel shutdown`. However, even with #17639 and the following changes to extend the lease, the remote cache can still evict blobs in some rare cases.

Based on the above changes, this PR makes Bazel retry the invocation if it encountered the remote cache eviction error during the previous invocation and `--experimental_remote_cache_eviction_retries` is set, or **build rewinding**.

```
$ bazel build --experimental_remote_cache_eviction_retries=5 ...
INFO: Invocation ID: b7348bfa-9446-4c72-a888-0a0ad012f225
Loading:
Loading:
Loading: 0 packages loaded
Analyzing: target //a:bar (0 packages loaded, 0 targets configured)
INFO: Analyzed target //a:bar (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
[0 / 2] [Prepa] BazelWorkspaceStatusAction stable-status.txt
ERROR: .../workspace/a/BUILD:8:8: Executing genrule //a:bar failed: Failed to fetch blobs because they do not exist remotely: Missing digest: b5bb9d8014a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c/4
Target //a:bar failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 0.447s, Critical Path: 0.05s
INFO: 2 processes: 2 internal.
ERROR: Build did NOT complete successfully
Found remote cache eviction error, retrying the build...
INFO: Invocation ID: 983f60dc-8bb9-4b82-aa33-a378469ce140
Loading:
Loading:
Loading: 0 packages loaded
Analyzing: target //a:bar (0 packages loaded, 0 targets configured)
INFO: Analyzed target //a:bar (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
[0 / 2] [Prepa] BazelWorkspaceStatusAction stable-status.txt
Target //a:bar up-to-date:
  bazel-bin/a/bar.out
INFO: Elapsed time: 0.866s, Critical Path: 0.35s
INFO: 3 processes: 1 internal, 1 processwrapper-sandbox, 1 remote.
INFO: Build completed successfully, 3 total actions
$
```

Part of #16660.

Closes #17711.

PiperOrigin-RevId: 520610524
Change-Id: I20d43d1968767a03250b9c8f8a6dda4e056d4f52
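The retry behaviour described in this commit message can be pictured as a small loop around the invocation. The following is only a rough sketch, not Bazel's implementation: the class `EvictionRetrySketch`, the method `runWithRetries`, and the modelling of an invocation as an `IntSupplier` returning an exit code are all assumptions made for illustration. Only exit code 39 and the log message come from the text above.

```java
import java.util.function.IntSupplier;

/** Hypothetical illustration of "retry the invocation on eviction"; not Bazel code. */
final class EvictionRetrySketch {
  /** Exit code used when the remote cache evicted blobs (39, per the message above). */
  static final int REMOTE_CACHE_EVICTED = 39;

  /**
   * Runs the invocation once, then re-runs it while it keeps failing with the
   * eviction exit code and retries remain. Returns the exit code of the last attempt.
   */
  static int runWithRetries(IntSupplier invocation, int maxRetries) {
    int exitCode = invocation.getAsInt();
    for (int attempt = 0;
        attempt < maxRetries && exitCode == REMOTE_CACHE_EVICTED;
        attempt++) {
      System.out.println("Found remote cache eviction error, retrying the build...");
      exitCode = invocation.getAsInt();
    }
    return exitCode;
  }
}
```

With `maxRetries` set to 0 the loop body never runs, so in this sketch the invocation is attempted exactly once.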
Previously, we handled remote cache eviction when downloading inputs for local actions. However, it's possible that when Bazel needs to re-execute a remote action, it detects that some inputs are missing from the remote CAS. In this case, Bazel tries to upload the inputs by reading them from the local filesystem. Since the inputs were generated remotely, never downloaded, and then evicted from the remote cache, the upload fails with `FileNotFoundException`.

This CL changes the code to handle the above case correctly by reading through ActionFS when uploading inputs and propagating `CacheNotFoundException`.

Related to bazelbuild#16660.

PiperOrigin-RevId: 512568547
Change-Id: I3a28cadbb6285fa3727e1603f37abf8843c093c9
When building without the bytes, Bazel stores `RemoteFileArtifactValue`, which represents a file that is stored remotely, in skyframe (in memory) and in the local action cache. Bazel assumes that the remote file never expires, which is wrong. In practice, remote caches often evict files due to space constraints, and when that happens, builds can fail.

This PR introduces the flag `--experimental_remote_cache_ttl`, which tells Bazel the minimum time the remote cache will keep a file after returning a reference to it to Bazel. Bazel calculates the TTL of the file and stores it in the `RemoteFileArtifactValue`. In an incremental build, Bazel discards the `RemoteFileArtifactValue` and reruns the generating actions if it finds that the `RemoteFileArtifactValue` has expired.

The new field `expireAtEpochMilli` replaces `actionId` (deleted by f62a8b9), so there shouldn't be a memory regression.

There are two places where Bazel checks the TTL:

1. If skyframe has in-memory state from previous builds (e.g. incremental builds), the `SkyValue`s are marked as dirty if the `RemoteFileArtifactValue` has expired.
2. When checking the local action cache, if the `RemoteFileArtifactValue` has expired, the cache entry is ignored so that the generating actions can be re-executed.

Part of bazelbuild#16660.

Closes bazelbuild#17639.

RELNOTES: Add flag `--experimental_remote_cache_ttl` and set the default value to 3 hours.

PiperOrigin-RevId: 513819724
Change-Id: I9c9813621d04d5b1b94312be39384962feae2f7b
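As a rough illustration of the TTL bookkeeping described above, the sketch below stores an absolute expiry timestamp and compares it with the current time. It is not the actual `RemoteFileArtifactValue`: the class `RemoteFileMetadataSketch` and the methods `create` and `isAlive` are hypothetical names, and only the field name `expireAtEpochMilli` is taken from the commit message.

```java
import java.time.Duration;

/** Hypothetical stand-in for remote file metadata; not Bazel's actual class. */
final class RemoteFileMetadataSketch {
  /** Absolute time (epoch millis) after which the remote blob may be gone. */
  private final long expireAtEpochMilli;

  private RemoteFileMetadataSketch(long expireAtEpochMilli) {
    this.expireAtEpochMilli = expireAtEpochMilli;
  }

  /** Derives the expiry from the time the reference was received plus the configured TTL. */
  static RemoteFileMetadataSketch create(long receivedAtEpochMilli, Duration ttl) {
    return new RemoteFileMetadataSketch(receivedAtEpochMilli + ttl.toMillis());
  }

  /** Returns true while the remote cache is still assumed to hold the blob. */
  boolean isAlive(long nowEpochMilli) {
    return nowEpochMilli < expireAtEpochMilli;
  }
}
```

Under this sketch, both checks listed above reduce to calling `isAlive` with the current time and treating a `false` result as "discard the metadata and re-execute the generating action".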
With TTL-based discarding and the upcoming lease extension, the remote cache eviction error won't happen if the remote cache can guarantee the TTL. However, if it does happen, it usually means the remote cache is under high load and could evict more blobs that Bazel isn't aware of. Subsequent builds could still fail with the same error (caused by different blobs).

This PR removes all remote metadata when the remote cache eviction error happens (which should be rare with the help of TTL-based discarding and lease extension) to make sure the next incremental build can succeed.

Part of bazelbuild#16660.

Closes bazelbuild#17747.

PiperOrigin-RevId: 516519657
Change-Id: Ia99770b9d314ca62801b73dc96d09ed8ac2233f6
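The idea of dropping all remote metadata after an eviction error, rather than only the entry for the blob that was reported missing, could look roughly like the sketch below. The types here (`MetadataStoreSketch`, `Kind`, `onRemoteCacheEviction`) are invented for illustration and do not correspond to Bazel's skyframe or action cache classes.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical metadata store keyed by output path; not Bazel's data structure. */
final class MetadataStoreSketch {
  /** Whether the file contents are available locally or only referenced remotely. */
  enum Kind { LOCAL, REMOTE }

  private final Map<String, Kind> entries = new HashMap<>();

  void put(String path, Kind kind) {
    entries.put(path, kind);
  }

  boolean contains(String path) {
    return entries.containsKey(path);
  }

  /**
   * Drops every remote entry, since other blobs may have been evicted as well.
   * Returns the number of entries removed.
   */
  int onRemoteCacheEviction() {
    int before = entries.size();
    entries.values().removeIf(kind -> kind == Kind.REMOTE);
    return before - entries.size();
  }
}
```

Local entries survive in this sketch, so only actions whose outputs were never downloaded would need to be re-executed on the next build.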
This PR has been merged in the form of small CLs. Closing. |
@coeuvre Does that mean |
Add flag `--experimental_remote_cache_lease_extension`. When it is set, Bazel creates a background thread that periodically sends `FindMissingBlobs` requests to the CAS during the build.

1. All outputs that were not downloaded are within the scope of lease extension. The outputs are acquired from skyframe by traversing the action graph.
2. Lease extension starts after any action has been built and ends when the execution phase ends. The frequency is related to `--experimental_remote_cache_ttl`.
3. Lease extensions are performed on a per-action basis, not by collecting all outputs and issuing one giant `FindMissingBlobs`.
   - Collecting all outputs might raise the memory watermark and cause OOM.
   - Sending one `FindMissingBlobs` request per action may increase the overhead of network round trips, but the cost should be saturated given that lease extension happens in the background and is not wall-time critical.
4. For an incremental build, the same applies: lease extension starts after any action has been executed.
   - We don't want lease extension to block action execution or affect build performance.
   - Since we have TTL-based cache discarding, any expired blobs will be discarded.
   - Leases of blobs that are not downloaded but are still used by this build (because they are referenced by skyframe) are extended as normal.

Part of #16660.

Closes #17944.

PiperOrigin-RevId: 544032753
Change-Id: Iafe8b96c48abbb2e67302cd7a2f06f97ab43f825
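The per-action lease extension described above can be sketched with a scheduled background task. This is a simplified illustration under several assumptions, not Bazel's implementation: `LeaseExtensionSketch` and its methods are hypothetical, the CAS call is modelled as a plain `Function` from a set of digest strings to the subset that is missing, and the period is passed in explicitly because the commit message only says the frequency is related to `--experimental_remote_cache_ttl`.

```java
import java.time.Duration;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical sketch of background lease extension; not Bazel code. */
final class LeaseExtensionSketch {
  /** Stand-in for a FindMissingBlobs call: takes digests, returns the missing ones. */
  private final Function<Set<String>, Set<String>> findMissingBlobs;

  /** Single daemon thread, so lease extension never keeps the process alive. */
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor(
          runnable -> {
            Thread thread = new Thread(runnable, "lease-extension-sketch");
            thread.setDaemon(true);
            return thread;
          });

  LeaseExtensionSketch(Function<Set<String>, Set<String>> findMissingBlobs) {
    this.findMissingBlobs = findMissingBlobs;
  }

  /**
   * Performs one pass: one request per action rather than one giant request,
   * which bounds how many digests are held in memory at a time.
   * Returns the number of requests sent.
   */
  int extendOnce(List<Set<String>> outputDigestsPerAction) {
    int requests = 0;
    for (Set<String> digests : outputDigestsPerAction) {
      if (digests.isEmpty()) {
        continue; // nothing to extend for this action
      }
      findMissingBlobs.apply(digests);
      requests++;
    }
    return requests;
  }

  /** Starts periodic extension in the background; the period is an assumed parameter. */
  void start(Supplier<List<Set<String>>> outputs, Duration period) {
    scheduler.scheduleWithFixedDelay(
        () -> extendOnce(outputs.get()),
        period.toMillis(),
        period.toMillis(),
        TimeUnit.MILLISECONDS);
  }

  /** Stops extension, e.g. when the execution phase ends. */
  void stop() {
    scheduler.shutdownNow();
  }
}
```

Using `scheduleWithFixedDelay` rather than a fixed rate means a slow pass delays the next one instead of letting passes pile up, which matches the stated goal that lease extension should not affect build performance.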
See #22592 (--experimental_remote_cache_ttl=0s breaks --remote_download_outputs=toplevel (Build without the Bytes)) |
Design doc
Fixes #10880.
Working towards #8250.