-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Build and test rippled in Github Actions, plus some related fixes and improvements #3851
Conversation
Some data for comparison: The Github Actions all ran simultaneously. This list indicates how long each of them took.
All of them succeeded. I expect they'll run faster in the future as caches are seeded. Travis took a little longer to get started, and when it did, it still took longer to run all the jobs.
And to top it off, two of the tasks failed:
I've restarted those to make sure there is not a real problem, but I doubt it. I can't make any promises that the Github jobs will never have a spurious failure, or start this quickly when more of us start using it, but these numbers look pretty good. |
NIce PR @ximinez ! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've reviewed all of the first 8 commits, and much of the last one, but wanted to submit my comments now to get the conversation going.
Can I ask for an English-prose description of the architecture of the workflows? What is the design behind them? What is the separation of responsibilities, and the graph of dependencies? Why is the Travis configuration left in (fixed, even)?
export CXX=${CC/gcc/g++} | ||
export CXX=${CXX/clang/clang++} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to look up this substitution form. Correct me if I'm wrong:
113: If $CC
matches "gcc", replace the first "gcc" with "g++"
114: If $CXX
matches "clang", replace the first "clang" with "clang++"
If $CXX
enters these lines as "clang++", won't it exit as "clang++++"? Did you mean to write ${CC/...
on line 114?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If
$CXX enters these lines as "clang++", won't it exit as "clang++++"? Did you mean to write $ {CC/... on line 114?
No, because CXX
is initialized on line 113, regardless of the value of $CC
.
CC=gcc
- 113:
CXX=g++
(substitution) - 114:
CXX=g++
(no substitution)
- 113:
CC=clang
- 113:
CXX=clang
(no substitution) - 114:
CXX=clang++
(substitution)
- 113:
This is another sort of artifact of specifying the matrix
. It's combinatoric, so it's non-trivial to set two variables together. If I did something like
CC: [ "gcc", "clang" ]
CXX: [ "g++", "clang++" ]
Then I'd get 4 jobs
- gcc/g++
- gcc/clang++
- clang/g++
- clang/clang++
And it would take extra work to exclude the redundant ones. This let's me do what I want in one place with one variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's non-trivial to set two variables together
This isn't true, and knowing the technique may open new ways to think about how to structure these actions (I'm looking at the relationship between image
and CMAKE_ARGS
too). Here's an example, and it's run. The key part of the matrix is this:
matrix:
compiler:
- cc: gcc
cxx: g++
- cc: clang
cxx: clang++
The key insight is that matrix
is a dictionary of keys to arrays of values. Those values can be compound values like more dictionaries. Coupling related variables in a dictionary is a way to get them all in the matrix expression without generating excess or invalid configurations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact you used the technique here. I guess this is actually a different facility of the matrix
, with special keys include
and exclude
, but perhaps it helps explain why the technique I described works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had assumed that include
and exclude
were special cases. So you've taught me another new thing. Thanks! I went through and restructured the jobs and the action as much as possible to take advantage of compound values, including getting rid of the logic based on image
.
echo NUM_PROCESSORS is ${NUM_PROCESSORS} | ||
if [ "${{ inputs.image }}" == "rippled-ci-builder:2944b78d22db" ] | ||
then | ||
export CMAKE_EXTRA_ARGS+=" -DBoost_NO_BOOST_CMAKE=ON" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like it should be a parameter of this action, for when the image inevitably changes, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to only need to check it in one (or two) place(s), instead of every time the image is specified. Unfortunately, the workflow specification doesn't really have any kind of conditional operations, so I can't write (for example):
- name: Build and test
uses: ./.github/actions/build-action
with:
no_boost_cmake: ${{ matrix.image == "rippled-ci-builder:2944b78d22db" ? "ON" : "OFF" }}
Requiring it to be specified in the same place as the image can get complicated, especially in the matrix.
Plus, I figure when we inevitably get a new image, I can update the check as appropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned above, I updated the matrix to use composite values and no longer require checking image
.
# Override environment variables that don't work smoothly | ||
# using env: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this mean exactly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Environment variables can be specified using env:
at the workflow, job, and step level. I can't remember if particular comment/construction because the inputs
object was not available at that higher job level, or if defining it there would set some of those variables to be empty (instead of leaving them unset), and causing undesirable side effects, but I'm pretty sure it was one of those two.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried following some of these variables. It seems NINJA_BUILD
is only used in bin/ci/ubuntu/build-and-test.sh
, and can be set to an empty value, except that this use probably needs to be corrected to ${NINJA_BUILD:-false}
:
I believe all the *SAN_OPTIONS
variables can be empty. I tested UBSAN_OPTIONS
on a build with -Dsan=undefined
and it worked.
I'd like to shrink this section as small as possible. Programs should be robust in the presence of empty variable values.
Going down this path brought my attention to a bigger issue. We've got files under bin/ci/ubuntu
that we expect to run on Windows. It was likely written first with Ubuntu in mind, and likely not robustly enough to handle empty strings for some environment variables. I've already tried once to fix this problem, when I wrote the pipelines for GitLab. Can we build on that effort and:
- redesign the scripts under
bin/ci/ubuntu
to handle empty environment variables correctly - remove any unused scripts (I haven't yet checked whether each is used)
- document them (I wrote a README for my scripts, and left helpful comments at the top of each documenting their interfaces), and
- remove "ubuntu" from their path if they are cross-platform?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a lot here, and I haven't addressed all of it, but one simple thing I did was to change the == false
comparisons to != true
, which will make them more robust in the face of empty values and such.
2b9f304
to
199f6a2
Compare
@thejohnfreeman @manojsdoshi @legleux @xzhaous I just pushed a bunch of commits that fix more issues, and address most, if not all, of the feedback so far. To make reviewing a tiny bit easier, I didn't rewrite any of the commit history you may have already seen, and I tried to organize the commits logically. If this PR gets approved, I'll squash them down into the single "Build and test rippled in Github Actions" commit, other than the unit test fixes. Highlights:
To answer one lingering question: I left the Travis configuration in for now, so we can run them side-by-side for a while until we're confident that the real-world performance and reliability of Github Actions is at least as good. If you've been holding off reviewing this because I told you I was working on stuff, I am done, and this is ready to review again. For now. 🤞 |
.github/README.md
Outdated
[Github Actions caches](https://docs.github.com/en/actions/guides/caching-dependencies-to-speed-up-workflows#usage-limits-and-eviction-policy) | ||
are immutable once written, and use explicit key names for access. Caches can | ||
be shared across workflows, and can match partial names by prefix. They can | ||
also access caches created by base and default branches, but not across forks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doc says:
A workflow can access and restore a cache created in the current branch, the base branch (including base branches of forked repositories)
I'm not sure which is true, or if I'm interpreting it correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated this to be clearer based on my understanding of the rules.
.github/README.md
Outdated
configuration, the rippled source code, and the workflow config itself. | ||
|
||
To pull it all together with an example, the base job in | ||
`linux-clang8-debug.yml` might have a cache key that looks like (hashes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the following example, I think this meant linux-clang8-release.yml
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, good catch. Fixed.
.github/workflows/linux-general.yml
Outdated
for CC in ${{ matrix.make_ccs }} | ||
do | ||
for BUILD_TYPE in ${{ matrix.make_types }} | ||
do | ||
export CC BUILD_TYPE | ||
# Set CXX to the C++ complier corresponding to CC | ||
# It's a little hacky, but keeps this job's matrix | ||
# a little simpler | ||
export CXX=${CC/gcc/g++} | ||
export CXX=${CXX/clang/clang++} | ||
|
||
${CC} --version | ||
${CXX} --version | ||
dir="build/$( basename ${CXX} ).${BUILD_TYPE}.makefile" | ||
mkdir "-pv" "${dir}" | ||
pushd "${dir}" | ||
set "-x" | ||
# Verbose or not, it's good to see the right params | ||
# are being used | ||
cmake -DCMAKE_BUILD_TYPE=${BUILD_TYPE} \ | ||
${CMAKE_EXTRA_ARGS} ../.. | ||
set +x | ||
popd | ||
done | ||
done | ||
|
||
# Builds that use Ninja | ||
export CMAKE_EXTRA_ARGS+=" -G Ninja" | ||
for CC in ${{ matrix.ninja_ccs }} | ||
do | ||
for BUILD_TYPE in ${{ matrix.ninja_types }} | ||
do | ||
export CC BUILD_TYPE | ||
# Set CXX to the C++ complier corresponding to CC | ||
# It's a little hacky, but keeps this job's matrix | ||
# a little simpler | ||
export CXX=${CC/gcc/g++} | ||
export CXX=${CXX/clang/clang++} | ||
|
||
${CC} --version | ||
${CXX} --version | ||
dir="build/$( basename ${CXX} ).${BUILD_TYPE}" | ||
mkdir "-pv" "${dir}" | ||
pushd "${dir}" | ||
set "-x" | ||
# Verbose or not, it's good to see the right params | ||
# are being used | ||
cmake -DCMAKE_BUILD_TYPE=${BUILD_TYPE} \ | ||
${CMAKE_EXTRA_ARGS} ../.. | ||
set +x | ||
popd | ||
done | ||
done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This bundles the generated build files for a bunch of different builds into a single cache. Could it be better to separate them, so that the cache is smaller, they can be built in parallel, etc.?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I separate them, I may as well get rid of this step altogether (I'm not against that) because the base job for each config will download the dependencies itself.
The reason for putting everything into a single cache is that, with the change to store all the source code in a top-level folder, the size of that source code dwarfs the size of the individual config folders.
The job outputs some of the cache folder sizes, for example at https://github.com/ripple/rippled/runs/3186329095?check_suite_focus=true#step:7:1054
200M /__w/rippled/rippled/_cache/nih_c
4.6M /__w/rippled/rippled/_cache/nih_c/ninja
1.2M /__w/rippled/rippled/_cache/nih_c/ninja/GNU_9.2.1
1.2M /__w/rippled/rippled/_cache/nih_c/ninja/Clang_9.0.1
1.2M /__w/rippled/rippled/_cache/nih_c/ninja/GNU_8.3.0
1.2M /__w/rippled/rippled/_cache/nih_c/ninja/Clang_8.0.1
676K /__w/rippled/rippled/_cache/nih_c/unix_makefiles
672K /__w/rippled/rippled/_cache/nih_c/unix_makefiles/GNU_8.3.0
It skips over the src
folder (derp), but you can see the total size is 200M, while the total of the job-specific folders are about 5M. The rest is src
.
Now, all that said, I realize that this src
folder ends up getting duplicated across all of the jobs anyway, because they all use this cache as a "seed" for their own cache. Thinking about it now, I think I could make it significantly better:
- Download
src
into a separate cache, and have the other jobs use that as an immutable dependency. This will be similar to the Windows jobs, which build vcpkg and boost dependencies once, and share them among all the other jobs. Assuming it works, this will take less space, and a little less time, because I'd only need to callcmake
once. - Get rid of this cache step entirely and let the jobs download their own dependencies.
I'll play around with this a little bit and see what I can figure out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one month later.gif
It took a while due to other priorities, and the fact that test iteration can take a while on CI, but I've reorganized the caches sort of as described in option 1 for the Linux and Windows jobs. With a little experimentation, I modified the "seed cache" jobs to build the dependencies once per docker image, which seems to be the only way to force them to download themselves and their dependencies. It also preserves the "stamp" folders, which prevents the jobs from re-downloading. The stamp folders are compatible (enough) that they can be used by the different compilers. The build output is not, so I just throw it away. Not perfect, but a significant improvement. I haven't updated the MacOS jobs yet, because I wanted to get this pushed now that it's stablized.
One issue causing delays is that I was having consistent failures in the Windows jobs that all seemed to come back to the Check
tests, but I can't find anything wrong with those tests. I'm pretty sure it's resource related, but rather than spend another month tracking it down remotely, I decided that this problem is out of scope for this PR, and punted, adding an option to make those tests (or any other tests that have similar problems down the road) manual at build time, then have CI run them manually. That's the [FOLD] Make the check tests manual on Github CI Windows
commit.
Rebased on to 1.8.0-b4. No other changes. I see a couple of jobs have failed. I'll look into that. |
* upstream/develop: Add a unit test for invalid memos (XRPLF#4287)
* upstream/develop: Make NodeToShardRPC a manual test (XRPLF#4379) Update build instructions (XRPLF#4376)
* upstream/develop: Update BUILD.md (XRPLF#4383)
@ximinez given conan, what is the best path forward (or not) for this PR? |
I've changed this PR to be a draft, indicating that it's not ready for prime time. I would like to integrate it with Conan to build the dependencies, but use the much larger configuration matrix in these workflow files to get more test coverage, but it's not a high priority. (I don't necessarily have to do it if someone else wants to pick up the baton.) |
* upstream/develop: `XRPFees`: Fee setting and handling improvements (XRPLF#4247)
* upstream/develop: Revise CONTRIBUTING (XRPLF#4382)
* upstream/develop: Update dependency: grpc (XRPLF#4407) Introduce min/max observers for Number Optimize uint128_t division by 10 within Number.cpp Replace Number division algorithm Include rounding mode in XRPAmount to STAmount conversion. Remove undefined behavior * Taking the negative of a signed negative is UB, but taking the negative of an unsigned is not. Silence warnings Introduce rounding modes for Number: Use Number for IOUAmount and STAmount arithmetic Add tests Add implicit conversion from STAmount to Number Add clip Add conversions between Number, XRPAmount and int64_t AMM Add Number class and associated algorithms
* upstream/develop: Update documented pathfinding configuration defaults: (XRPLF#4409)
* upstream/develop: Rename to fixNonFungibleTokensV1_2 and some cosmetic changes (XRPLF#4419) Only account specified as destination can settle through brokerage: (XRPLF#4399) Prevent brokered sale of NFToken to owner: (XRPLF#4403) Fix 3 issues around NFToken offer acceptance (XRPLF#4380) Allow NFT to be burned when number of offers is greater than 500 (XRPLF#4346) Add fixUnburnableNFToken feature (XRPLF#4391) Change default vote on fixUniversalNumber from yes to no (XRPLF#4414)
* upstream/develop: Resolve a couple of Github Action CI annoyances: (XRPLF#4413)
* upstream/develop: README: Add a few source code starting points (XRPLF#4421)
* upstream/develop: Fix Conan version constraint in workflows (XRPLF#4430) Refactor getTrustedForLedger() (XRPLF#4424) README: Update "Build from source" section (XRPLF#4426)
* upstream/develop: Undo API changes introduced in XRPFees: (XRPLF#4429) Remove recipe for RocksDB and add recipe for Snappy (XRPLF#4431)
* upstream/develop: Set version to 1.10.0-rc4 Rename 'NFT' to 'NFToken' in DisallowIncoming flags (XRPLF#4442) Update Docker.md (XRPLF#4432) Update package building scripts and images to use Conan (XRPLF#4435) Disable duplicate detector: (XRPLF#4438)
* upstream/develop: Set version to 1.10.0
* upstream/develop: docs: security bug bounty acknowledgements (XRPLF#4460)
* upstream/develop: Rectify the import paths of boost/iterator: (XRPLF#4293) Allow port numbers be be specified with a colon: (XRPLF#4328) Use <=> operator for base_uint, Issue, and Book: (XRPLF#4411) Reporting Mode: Do not attempt to acquire missing data from peer network (XRPLF#4458)
* upstream/develop: docs: update protocol README (4457) `fixNFTokenRemint`: prevent NFT re-mint: (4406) fix: support RPC markers for any ledger object: (4361) refactor: optimize NodeStore object conversion: (4353) fix(ValidatorSite): handle rare null pointer dereference in timeout: (4420) fix(gateway_balances): handle overflow exception: (4355) Set version to 1.10.1-b1
* upstream/develop: feat: mark 4 amendments as obsolete: (4291) fix: `Expected` to return a value: (4401) build: add interface library libxrpl: (4449) docs: update build instructions: (4381) Set version to 1.10.1 Set version to 1.10.1-rc1 Build packages with Ubuntu 18.04
* upstream/develop: Fix unit test app.LedgerData (4484) Fix ledger_data to return an empty list: (4398) Add account flags to account_info response: (4459) Add logging for exceptions: (4400) Update example [validator_list_sites] (4448) Add link to BUILD.md: (4450) Update README.md (4463) Refactor fee initialization and configuration: (4319)
* upstream/develop: Add install instructions for package managers: (4472) Fix the fix for std::result_of (4496) Prevent replay attacks with NetworkID field: (4370) Avoid using std::shared_ptr when not necessary: (4218) Optimize `SHAMapItem` and leverage new slab allocator: (4218) Introduce support for a slabbed allocator: (4218) Add jss fields used by Clio `nft_info`: (4320) Add NFTokenPages to account_objects RPC: (4352)
We'll close this PR but the branch will remain, and @thejohnfreeman will look at it to copy over the useful parts. See #4371 (comment) |
This change makes progress on the plan in #4371. It does not replicate the full [matrix] implemented in #3851, but it does replicate the 1.ii section of the Linux matrix. It leverages "heavy" self-hosted runners, and demonstrates a repeatable pattern for future matrices. [matrix]: https://github.com/XRPLF/rippled/blob/d794a0f3f161bb30c74881172fc38f763d7d46e8/.github/README.md#continuous-integration
High Level Overview of Change
The primary focus of this PR is building and testing rippled using Github Actions. Along the way, some other issues came up that either needed to be fixed or were convenient improvements. The changes are divided into several different commits, which can (and probably should) be reviewed independently - they probably could have been independent PRs, but I didn't want to get hung up on the dependencies.
These commits should NOT be squashed down to a single commit for merging, unless there is a consensus otherwise.
Note that I restructured the directories for external dependencies (
.nih_c
). Until this is merged, if building in an existing repo, you will almost certainly need to move or delete that folder first, then move or delete again when switching to another branch.Context of Change
Travis CI can be slow and unreliable. Github Actions offers some improvement in those two areas. It's still not as fast as I'd like, but that can potentially be addressed in the future with hosted runners.
The new platform also offers different ways to organize the jobs, and manage caches and artifacts. For example,
Type of Change
Test Plan
There isn't much to test other than verifying that the CI jobs succeed. Unit tests were added, and some changes were made to non-test code, but they shouldn't affect any run-time functionality.
Future Tasks
As mentioned above, I would eventually like to get these jobs running on better hardware, either some kind of paid-tier resource via Github, or self-hosted runners.