Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bazel 6.0.0rc4 breaks Bazel RBE tests #16933

Closed
meteorcloudy opened this issue Dec 6, 2022 · 19 comments
Closed

Bazel 6.0.0rc4 breaks Bazel RBE tests #16933

meteorcloudy opened this issue Dec 6, 2022 · 19 comments
Labels
P1 I'll work on this now. (Assignee required) type: bug

Comments

@meteorcloudy
Copy link
Member

meteorcloudy commented Dec 6, 2022

This was first detected in the downstream test for Bazel 6.0.0rc4: https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/2761#0184d3b5-3cf0-410a-984d-2fdfdf78a49b

And verified in https://buildkite.com/bazel/bazel-bazel/builds/21471#0184e876-6406-4909-9a81-e8dc957b471f

Integration tests with test args are failing remotely with test logs look like:

exec ${PAGER:-/usr/bin/less} "$0" || exit 1
Executing tests from //src/test/shell/bazel:bazel_example_test
-----------------------------------------------------------------------------
WARNING: Passing test names in arguments (--test_arg) is deprecated, please use --test_filter='/usr/lib/jvm/java-11-openjdk-amd64' instead.
WARNING: Arguments do not specify tests!
INFO[bazel_example_test 2022-12-06 17:28:07 (+0000)] bazel binary is at /b/f/w/bazel-out/k8-fastbuild/bin/src/test/shell/bazel/bazel_example_test.runfiles/io_bazel/src/test/shell/bin
INFO[bazel_example_test 2022-12-06 17:28:07 (+0000)] setting up client in /b/f/w/_tmp/55e5a95f98bdfb926d4a13d2054933c5/workspace
@meteorcloudy meteorcloudy added type: bug P1 I'll work on this now. (Assignee required) labels Dec 6, 2022
@meteorcloudy meteorcloudy added this to the 6.0.0 release blockers milestone Dec 6, 2022
@meteorcloudy
Copy link
Member Author

meteorcloudy commented Dec 6, 2022

This issue also happens with Bazel@HEAD: https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/2766#0184e535-8df3-4114-9f65-8777f5bc7fe4, but the Bazel Auto Sheriff points to a commit that's not in Bazel 6.0.0rc4: https://buildkite.com/bazel/bazel-auto-sheriff-face-with-cowboy-hat/builds/1118

Update: culprit finder failed due to not being able to download the Bazel binary.

:bazel: Test with Bazel built at fb2324620a4adee19bb4e8a030b6fba58f672dc0
:googlecloud: Downloading Bazel built at fb2324620a4adee19bb4e8a030b6fba58f672dc0
 
gsutil cp gs://bazel-builds/artifacts/centos7/fb2324620a4adee19bb4e8a030b6fba58f672dc0/bazel /tmp/tmp09usce0c/bazel
CommandException: No URLs matched: gs://bazel-builds/artifacts/centos7/fb2324620a4adee19bb4e8a030b6fba58f672dc0/bazel
Failed to download Bazel binary at fb2324620a4adee19bb4e8a030b6fba58f672dc0, error message:
Command '['gsutil', 'cp', 'gs://bazel-builds/artifacts/centos7/fb2324620a4adee19bb4e8a030b6fba58f672dc0/bazel', '/tmp/tmp09usce0c/bazel']' returned non-zero exit status 1.

We should fix culprit finder

@meteorcloudy
Copy link
Member Author

/cc @coeuvre Since this only happens with RBE enabled.

rbe_ubuntu1804:

@meteorcloudy
Copy link
Member Author

meteorcloudy commented Dec 6, 2022

@coeuvre Is #16749 the possible culprit? probably not since the original change was submitted in Oct, we have only been seen the failure with Bazel@HEAD in Nov.

@meteorcloudy
Copy link
Member Author

@meteorcloudy
Copy link
Member Author

meteorcloudy commented Dec 6, 2022

Bisecting the rc4 branch at https://buildkite.com/bazel/culprit-finder/builds/3876 and it points to 0b64525

meteorcloudy added a commit to bazelbuild/continuous-integration that referenced this issue Dec 6, 2022
… commit

Related bazelbuild/bazel#16933 (comment)

- The "Publish Bazel binaries" pipeline sometimes fails to publish Bazel
binaries for a commit, in this case, the Culprit Finder should
immediately fail instead consider the commit as a bad Bazel commit.

- Also removed some unused parameters in the function.
@meteorcloudy
Copy link
Member Author

Bisecting again on master branch after fixing the CI issue: https://buildkite.com/bazel/culprit-finder/builds/3877

@meteorcloudy
Copy link
Member Author

meteorcloudy commented Dec 6, 2022

While waiting for the bisect result, I can also reproduce at 839ce7f (original commit of the 6.0.0 cherrypick 0b64525)

USE_BAZEL_VERSION=839ce7f5c40240d8b6f49c416c3769e226f43fee bazel test //src/test/shell/bazel:bazel_example_test --config=ubuntu1804_java11 --remote_executor=grpcs://remotebuildexecution.googleapis.com

But not at it's parent commit.

/cc @fmeum @Wyverald

@fmeum
Copy link
Collaborator

fmeum commented Dec 6, 2022

@meteorcloudy Could you run these tests with --test_env=RUNFILES_LIB_DEBUG=1? That should provide us with very helpful information.

@meteorcloudy
Copy link
Member Author

@fmeum Here is the test log:

exec ${PAGER:-/usr/bin/less} "$0" || exit 1
Executing tests from //src/test/shell/bazel:bazel_example_test
-----------------------------------------------------------------------------
INFO[runfiles.bash]: rlocation(io_bazel/src/test/shell/integration_test_setup.sh): start
INFO[runfiles.bash]: rlocation(io_bazel/src/test/shell/integration_test_setup.sh): not using repository mapping () since it does not exist
INFO[runfiles.bash]: rlocation(io_bazel/src/test/shell/integration_test_setup.sh): found under RUNFILES_DIR (/b/f/w/bazel-out/k8-fastbuild/bin/src/test/shell/bazel/bazel_example_test.runfiles), return
INFO[runfiles.bash]: rlocation(io_bazel/src/test/shell/unittest.bash): start
INFO[runfiles.bash]: rlocation(io_bazel/src/test/shell/unittest.bash): not using repository mapping () since it does not exist
INFO[runfiles.bash]: rlocation(io_bazel/src/test/shell/unittest.bash): found under RUNFILES_DIR (/b/f/w/bazel-out/k8-fastbuild/bin/src/test/shell/bazel/bazel_example_test.runfiles), return
WARNING: Passing test names in arguments (--test_arg) is deprecated, please use --test_filter='/usr/lib/jvm/java-11-openjdk-amd64' instead.
WARNING: Arguments do not specify tests!
INFO[runfiles.bash]: rlocation(io_bazel/src/test/shell/testenv.sh): start
INFO[runfiles.bash]: rlocation(io_bazel/src/test/shell/testenv.sh): not using repository mapping () since it does not exist
INFO[runfiles.bash]: rlocation(io_bazel/src/test/shell/testenv.sh): found under RUNFILES_DIR (/b/f/w/bazel-out/k8-fastbuild/bin/src/test/shell/bazel/bazel_example_test.runfiles), return
INFO[runfiles.bash]: rlocation(io_bazel/src/bazel): start
INFO[runfiles.bash]: rlocation(io_bazel/src/bazel): not using repository mapping () since it does not exist
INFO[runfiles.bash]: rlocation(io_bazel/src/bazel): found under RUNFILES_DIR (/b/f/w/bazel-out/k8-fastbuild/bin/src/test/shell/bazel/bazel_example_test.runfiles), return
INFO[runfiles.bash]: rlocation(io_bazel/src/test/shell/bin/bazel): start
INFO[runfiles.bash]: rlocation(io_bazel/src/test/shell/bin/bazel): not using repository mapping () since it does not exist
INFO[runfiles.bash]: rlocation(io_bazel/src/test/shell/bin/bazel): found under RUNFILES_DIR (/b/f/w/bazel-out/k8-fastbuild/bin/src/test/shell/bazel/bazel_example_test.runfiles), return
INFO[runfiles.bash]: rlocation(io_bazel/tools/BUILD): start
INFO[runfiles.bash]: rlocation(io_bazel/tools/BUILD): not using repository mapping () since it does not exist
INFO[runfiles.bash]: rlocation(io_bazel/tools/BUILD): found under RUNFILES_DIR (/b/f/w/bazel-out/k8-fastbuild/bin/src/test/shell/bazel/bazel_example_test.runfiles), return
INFO[runfiles.bash]: rlocation(io_bazel/src/test/shell/bazel/testing_server.py): start
INFO[runfiles.bash]: rlocation(io_bazel/src/test/shell/bazel/testing_server.py): not using repository mapping () since it does not exist
INFO[runfiles.bash]: rlocation(io_bazel/src/test/shell/bazel/testing_server.py): found under RUNFILES_DIR (/b/f/w/bazel-out/k8-fastbuild/bin/src/test/shell/bazel/bazel_example_test.runfiles), return
INFO[bazel_example_test 2022-12-07 09:02:51 (+0000)] bazel binary is at /b/f/w/bazel-out/k8-fastbuild/bin/src/test/shell/bazel/bazel_example_test.runfiles/io_bazel/src/test/shell/bin
INFO[bazel_example_test 2022-12-07 09:02:51 (+0000)] setting up client in /b/f/w/_tmp/bfbf694ed20a8c587876e28f8f9820b1/workspace
INFO[runfiles.bash]: rlocation(/usr/lib/jvm/java-11-openjdk-amd64/bin/java): start
INFO[runfiles.bash]: rlocation(/usr/lib/jvm/java-11-openjdk-amd64/bin/java): absolute path, return
INFO[runfiles.bash]: rlocation(/usr/lib/jvm/java-11-openjdk-amd64/bin/java): not using repository mapping () since it does not exist
ERROR[runfiles.bash]: cannot look up runfile "/usr/lib/jvm/java-11-openjdk-amd64/bin/java"  (RUNFILES_DIR="/b/f/w/bazel-out/k8-fastbuild/bin/src/test/shell/bazel/bazel_example_test.runfiles", RUNFILES_MANIFEST_FILE="")

@fmeum
Copy link
Collaborator

fmeum commented Dec 7, 2022

@coeuvre I suspected a different issue, so I would like to confirm that isn't a problem either: The Bash runfiles library currently relies on absolute (!) paths of generated files containing an execroot segment. Is that guaranteed to be the case with remote execution, just like it is with local execution?

@fmeum
Copy link
Collaborator

fmeum commented Dec 7, 2022

The logic for absolute paths passed to rlocation broke. I am investigating why the existing test didn't catch that.

@coeuvre
Copy link
Member

coeuvre commented Dec 7, 2022

@coeuvre I suspected a different issue, so I would like to confirm that isn't a problem either: The Bash runfiles library currently relies on absolute (!) paths of generated files containing an execroot segment. Is that guaranteed to be the case with remote execution, just like it is with local execution?

My gut feeling is that execroot is not included in the path when actions inspect the path inside remote worker but I may be wrong. We use execroot as input root when uploading inputs for actions. Remote worker is free to move the input root to any local directories so a file /.../execroot/bazel-out/a in bazel host is probably become /..../bazel-out/a in remote worker.

@fmeum
Copy link
Collaborator

fmeum commented Dec 7, 2022

@coeuvre Thanks, I will remove that assumption in a second PR.

@meteorcloudy The Bash runfiles test didn't cover the exit codes of rlocation, only whatever it printed, and thus missed some corner cases. I have reworked the tests to check the exit codes and am now fixing these cases.

@meteorcloudy
Copy link
Member Author

@fmeum @coeuvre Thanks!!

@fmeum
Copy link
Collaborator

fmeum commented Dec 7, 2022

I submitted two PRs with fixes, #16945 and #16946.

@fmeum
Copy link
Collaborator

fmeum commented Dec 7, 2022

I think I know how to fix the RBE test failures, but now that I studied this more closely, there is some underlying inconsistent behavior that makes the library difficult to test and potentially also hard to use: $(rlocation does/not/exist) has exit code 0 if and only if the runfiles manifest exists. This means that the behavior differs between local and remote execution on Windows, which is probably why we only noticed this in downstream tests.

@laszlocsomor Sorry for the sideline mention, but can you recall how you intended the exit code of Bash $(rlocation ...) to behave?

@Wyverald
Copy link
Member

Wyverald commented Dec 7, 2022

$(rlocationpath does/not/exist) has exit code 0 if and only if the runfiles manifest exists.

Did you mean $(rlocation does/not/exist)?

@fmeum
Copy link
Collaborator

fmeum commented Dec 7, 2022

@Wyverald Yes, sorry, fixed.

@fmeum
Copy link
Collaborator

fmeum commented Dec 7, 2022

Fixing this inconsistency isn't required to get the tests to pass, but if we want to do it, 6.0.0 is probably the right time. Since $(failing_command) doesn't usually fail the outer shell script, having $(rlocation does/not/exist) exit with exit code 1 in all cases doesn't seem to be that breaking.

@Wyverald @meteorcloudy What do you think?

copybara-service bot pushed a commit that referenced this issue Dec 8, 2022
During remote execution, absolute paths may not contain `execroot` segments.

Work towards #16933

Closes #16946.

PiperOrigin-RevId: 493876135
Change-Id: I5ceb8c6edbe0e58aae2066efe1230e622eabf9bf
Wyverald pushed a commit that referenced this issue Dec 8, 2022
During remote execution, absolute paths may not contain `execroot` segments.

Work towards #16933

Closes #16946.

PiperOrigin-RevId: 493876135
Change-Id: I5ceb8c6edbe0e58aae2066efe1230e622eabf9bf
fmeum added a commit to fmeum/bazel that referenced this issue Dec 8, 2022
Due to tests not actually verifying the exit code of rlocation calls,
a few special cases were not handled correctly. In particular, absolute
paths still went through regular rlocation lookup and could fail there
even after the absolute path had been printed to stdout.

Also adds a missing newline in the (very rare) case that a manifest
entry is found, but doesn't point to an existing file (e.g. if it is an
unresolved symlink), and `rlocation` is not used with command
substitution, but some other mechanism that doesn't strip trailing
newlines.

The tests now assert the expected exit code (== 0 or != 0).

Fixes bazelbuild#16933
fmeum added a commit to fmeum/bazel that referenced this issue Dec 8, 2022
Due to tests not actually verifying the exit code of rlocation calls,
a few special cases were not handled correctly. In particular, absolute
paths still went through regular rlocation lookup and could fail there
even after the absolute path had been printed to stdout.

Also adds a missing newline in the (very rare) case that a manifest
entry is found, but doesn't point to an existing file (e.g. if it is an
unresolved symlink), and `rlocation` is not used with command
substitution, but some other mechanism that doesn't strip trailing
newlines.

The tests now assert the expected exit code (== 0 or != 0).

Fixes bazelbuild#16933
Wyverald added a commit that referenced this issue Dec 8, 2022
During remote execution, absolute paths may not contain `execroot` segments.

Work towards #16933

Closes #16946.

PiperOrigin-RevId: 493876135
Change-Id: I5ceb8c6edbe0e58aae2066efe1230e622eabf9bf

Co-authored-by: Fabian Meumertzheim <[email protected]>
meteorcloudy added a commit to bazelbuild/continuous-integration that referenced this issue Dec 9, 2022
… commit (#1507)

Related
bazelbuild/bazel#16933 (comment)

BuildKiteException is always caught at

https://github.com/bazelbuild/continuous-integration/blob/4592271b2fdcf26f865d9c0082e1e53add7ea9a2/buildkite/bazelci.py#L3592-L3594

- The "Publish Bazel binaries" pipeline sometimes fails to publish Bazel
binaries for a commit, in this case, the Culprit Finder should
immediately fail instead consider the commit as a bad Bazel commit due
to failing build.

- Also removed some unused parameters in the function.
fmeum added a commit to fmeum/bazel that referenced this issue Dec 12, 2022
Due to tests not actually verifying the exit code of rlocation calls, a few special cases were not handled correctly. In particular, absolute paths still went through regular rlocation lookup and could fail there even after the absolute path had been printed to stdout.

Also adds a missing newline in the (very rare) case that a manifest entry is found, but doesn't point to an existing file (e.g. if it is an unresolved symlink), and `rlocation` is not used with command substitution, but some other mechanism that doesn't strip trailing newlines.

The tests now assert the expected exit code (== 0 or != 0).

Fixes bazelbuild#16933

Closes bazelbuild#16945.

PiperOrigin-RevId: 494609104
Change-Id: I30333219176a61bda51f08c2c6bc927ce653d681
meteorcloudy pushed a commit that referenced this issue Dec 12, 2022
Due to tests not actually verifying the exit code of rlocation calls, a few special cases were not handled correctly. In particular, absolute paths still went through regular rlocation lookup and could fail there even after the absolute path had been printed to stdout.

Also adds a missing newline in the (very rare) case that a manifest entry is found, but doesn't point to an existing file (e.g. if it is an unresolved symlink), and `rlocation` is not used with command substitution, but some other mechanism that doesn't strip trailing newlines.

The tests now assert the expected exit code (== 0 or != 0).

Fixes #16933

Closes #16945.

PiperOrigin-RevId: 494609104
Change-Id: I30333219176a61bda51f08c2c6bc927ce653d681
Pavank1992 pushed a commit to iancha1992/bazel that referenced this issue Jun 21, 2023
Due to tests not actually verifying the exit code of rlocation calls,
a few special cases were not handled correctly. In particular, absolute
paths still went through regular rlocation lookup and could fail there
even after the absolute path had been printed to stdout.

Also adds a missing newline in the (very rare) case that a manifest
entry is found, but doesn't point to an existing file (e.g. if it is an
unresolved symlink), and `rlocation` is not used with command
substitution, but some other mechanism that doesn't strip trailing
newlines.

The tests now assert the expected exit code (== 0 or != 0).

Fixes bazelbuild#16933
Pavank1992 pushed a commit to iancha1992/bazel that referenced this issue Jun 21, 2023
Due to tests not actually verifying the exit code of rlocation calls,
a few special cases were not handled correctly. In particular, absolute
paths still went through regular rlocation lookup and could fail there
even after the absolute path had been printed to stdout.

Also adds a missing newline in the (very rare) case that a manifest
entry is found, but doesn't point to an existing file (e.g. if it is an
unresolved symlink), and `rlocation` is not used with command
substitution, but some other mechanism that doesn't strip trailing
newlines.

The tests now assert the expected exit code (== 0 or != 0).

Fixes bazelbuild#16933
fmeum pushed a commit to fmeum/continuous-integration that referenced this issue Dec 10, 2023
… commit (bazelbuild#1507)

Related
bazelbuild/bazel#16933 (comment)

BuildKiteException is always caught at

https://github.com/bazelbuild/continuous-integration/blob/4592271b2fdcf26f865d9c0082e1e53add7ea9a2/buildkite/bazelci.py#L3592-L3594

- The "Publish Bazel binaries" pipeline sometimes fails to publish Bazel
binaries for a commit, in this case, the Culprit Finder should
immediately fail instead consider the commit as a bad Bazel commit due
to failing build.

- Also removed some unused parameters in the function.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P1 I'll work on this now. (Assignee required) type: bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants