Slow action cache checks - please revert 1e17348da7e45c00cb474390a3b8ed3103b6b5cf #19924
Comments
This was running on an arm64 instance, inside a container restricted to 4 cores. I tried to make sure that it would get local action cache hits only, but there are a few tests that don't pass on arm64 at that commit (I had to use an old commit because the repo no longer builds with a bazel binary this old).
This is similar to my experience with #17120. I hope we can sort some of these out and put protections in place to avoid unnecessary performance regressions in the future.
@ulfjack for the one benchmark you showed, what is the action cache size on disk? Is this on public code, so I can repro? What is the number of jobs (compared with the number of cores, which seems to be 4)? @werkt in what environment did you experience the slowdown? In general, from what I have seen, in most cases it has been a no-op performance-wise, and the larger the action cache and the higher the number of jobs, the more of an improvement it is.
That is from our internal repo, I'm afraid. I can try to repro with Bazel source, but I'm out for a week. This ran with I suspect that it's hashing output files while holding the lock, which could be I/O bound.
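To make that suspicion concrete, here is a minimal Java sketch (illustrative only, not Bazel's actual code; the class name, method names, and permit count are hypothetical) contrasting hashing an output file while holding a throttle permit with hashing it before acquiring the permit:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.concurrent.Semaphore;

// Hypothetical illustration of the suspected problem: if the digest is
// computed while the semaphore is held, slow I/O blocks every other thread
// waiting for a permit; computing it first keeps the critical section short.
final class CacheCheckSketch {
  private static final Semaphore THROTTLE = new Semaphore(4); // e.g. one permit per core

  // Suspected slow variant: I/O-bound hashing inside the critical section.
  static byte[] hashWhileHoldingPermit(Path output) throws Exception {
    THROTTLE.acquire();
    try {
      return sha256(output); // other threads queue up behind this read
    } finally {
      THROTTLE.release();
    }
  }

  // Alternative: hash first, then take the permit only for the cheap lookup.
  static byte[] hashBeforeAcquiring(Path output) throws Exception {
    byte[] digest = sha256(output); // done without holding a permit
    THROTTLE.acquire();
    try {
      return digest; // in real code: consult the action cache here
    } finally {
      THROTTLE.release();
    }
  }

  private static byte[] sha256(Path file) throws IOException, NoSuchAlgorithmException {
    MessageDigest md = MessageDigest.getInstance("SHA-256");
    try (InputStream in = Files.newInputStream(file)) {
      byte[] buf = new byte[8192];
      int n;
      while ((n = in.read(buf)) != -1) {
        md.update(buf, 0, n);
      }
    }
    return md.digest();
  }
}
```

If hashing really does happen under the throttle, an I/O-bound read on a 4-permit semaphore would serialize much of the cache-check phase, which would match the profiles below.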
(I also tried to repro with x64, but the setup I was using was too flaky. I think the next time I will just run bazel build/test twice in a row with a bazel shutdown in between.)
My issue (which is tangential to @ulfjack's issue here) was with the N-processor thread calls. Decreasing --jobs leaves the 16 and 48 meters intact, only changing the remaining buildRemoteActions waits. remoteActionBuildingSemaphore is being used to regulate CPU and RAM pressure; merkle or generalized RAM estimation should be used to regulate the latter, distinct from the former, in low-overhead merkle tree situations (bazel-stress uses minimal inputs and only measures action throughput, so it is a pathological representation of the lowest possible memory overhead).
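As an illustration of the separation being suggested here (purely a sketch; the class, numbers, and method names are hypothetical and do not reflect how remoteActionBuildingSemaphore actually works), CPU pressure could be regulated with a per-core permit count while RAM pressure is regulated by permits sized from a merkle-tree or general memory estimate, so that low-overhead merkle trees barely consume the memory budget:

```java
import java.util.concurrent.Semaphore;

// Illustrative only: two independent throttles, one sized by cores for CPU
// pressure and one sized in megabytes for estimated memory pressure.
final class SplitThrottleSketch {
  private static final int BUDGET_MB = 4096; // assumed 4 GiB memory budget

  private final Semaphore cpuPermits =
      new Semaphore(Runtime.getRuntime().availableProcessors());
  private final Semaphore memoryBudgetMb = new Semaphore(BUDGET_MB);

  <T> T runThrottled(int estimatedMerkleTreeMb, ThrowingSupplier<T> action) throws Exception {
    // Acquire memory permits proportional to the estimated merkle tree size,
    // so actions with tiny inputs barely touch the budget and are only
    // limited by the CPU throttle.
    int mb = Math.min(estimatedMerkleTreeMb, BUDGET_MB);
    memoryBudgetMb.acquire(mb);
    try {
      cpuPermits.acquire();
      try {
        return action.get();
      } finally {
        cpuPermits.release();
      }
    } finally {
      memoryBudgetMb.release(mb);
    }
  }

  interface ThrowingSupplier<T> {
    T get() throws Exception;
  }
}
```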
Ulf, is this on Apple Silicon or Linux Arm64? Did you have any luck reproducing this on a publicly available example?
This reverts commit 1e17348. This was requested in bazelbuild#19924.
I have created a PR to revert said commit, as we are close to creating the final RC for Bazel 7 and it seems this investigation will take a bit more time. I have tried a bit more and have not seen any slowdown, so answering the questions above would help us understand why this is happening.
This reverts commit 1e17348. This was requested in bazelbuild#19924.
Closes bazelbuild#20162.
PiperOrigin-RevId: 581897901
Change-Id: Ifea2330c45c97db4454ffdcc31b7b7af640cd659
I ran this on both Linux x64 and arm64, but didn't get a clean sample (due to how I set it up) on x64.
I unsuccessfully tried to repro on x64 yesterday.
We've just encountered a build where with
How large was the local action cache in this case? Can you share the (perhaps redacted) blaze trace? If you have ways to repro this, please let me know. I have not seen any case myself where the current flag setting was slower, and lots where it was faster, so I am wondering what's different.
The one profile that I saw did have 2000 jobs and only 3 cores, which is suspicious, so I tried to reproduce locally with a large artificial build (all cached) and was not able to. If anyone has a public repro (even an artificial one) of the slowdown you are seeing here, I would like to see it. In all cases where I tested, the semaphore was wall-time neutral or positive. While looking into this, I saw that changing jobs from 2000 to 50 did speed up the build significantly. I assume you have a high number of jobs because of remote caching and execution. In general, I hope that @coeuvre's work on the threadpool overhaul and async execution will remove the need to tune jobs.
I'm currently trying to update our codebase to 7.0.0, which has the flag again. Unfortunately, I'm seeing a bunch of failures, which I haven't tracked down yet.
I managed to upgrade to 7.0.0.
Description of the bug:
@meisterT enabled the action cache throttle in 1e17348, but the description doesn't have any benchmark results or any other data supporting the claim that nobody would want this to be disabled. I had to go back to a fairly old commit in our own repo, but it looks like it has a significant impact on build times for us:
with the throttle enabled: 1m 31s
with the throttle disabled: 1m 19s
I wasn't able to get a cleaner signal, but we can clearly see the "acquiring semaphore" pieces in the profile:
throttle enabled:
![with-throttling](https://private-user-images.githubusercontent.com/7355745/277474990-6740168b-ec24-43ad-9503-1b6856e9e14a.png)
throttle disabled:
![without-throttling](https://private-user-images.githubusercontent.com/7355745/277475013-e0c05587-be52-418d-87a3-e27878f5ab6b.png)
We can see that the action cache checks with throttling take until ~60s, while the action cache checks without throttling take until ~40s.
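For anyone who wants to quantify this on their own build, the sketch below is a rough way to total up the "acquiring semaphore" time in a profile. It is not a Bazel tool; it assumes an uncompressed Chrome-trace-format JSON profile (e.g. one written with `--profile=profile.json`) with flat event objects in which `"name"` appears before `"dur"`, and it naively pattern-matches those objects instead of using a real JSON parser:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Rough sketch: sum the "dur" (microseconds) of trace events whose name
// mentions "acquiring semaphore" in an uncompressed JSON trace profile.
// A real tool should use a proper JSON parser; this regex only works for
// small, flat event objects with "name" preceding "dur".
public final class SemaphoreWaitTotal {
  private static final Pattern EVENT = Pattern.compile(
      "\\{[^{}]*\"name\"\\s*:\\s*\"[^\"]*acquiring semaphore[^\"]*\"[^{}]*\"dur\"\\s*:\\s*(\\d+)[^{}]*\\}");

  public static void main(String[] args) throws IOException {
    String json = Files.readString(Path.of(args[0]));
    Matcher m = EVENT.matcher(json);
    long totalMicros = 0;
    long count = 0;
    while (m.find()) {
      totalMicros += Long.parseLong(m.group(1));
      count++;
    }
    System.out.printf("%d events, %.1f s total semaphore wait%n", count, totalMicros / 1_000_000.0);
  }
}
```

With Java 11+ this can be run directly as `java SemaphoreWaitTotal.java profile.json`; a large total here relative to overall wall time would support the picture in the screenshots above.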
Which category does this issue belong to?
Performance
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
No response
Which operating system are you running Bazel on?
Linux
What is the output of `bazel info release`?
7.0.0-pre.20230530.3
If `bazel info release` returns `development version` or `(@non-git)`, tell us how you built Bazel.
No response
What's the output of `git remote get-url origin; git rev-parse master; git rev-parse HEAD`?
No response
Is this a regression? If yes, please try to identify the Bazel commit where the bug was introduced.
It looks like commit 1e17348 removed the flag that could be used to work around the issue.
Have you found anything relevant by searching the web?
No response
Any other information, logs, or outputs that you want to share?
No response