Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Build and test rippled in Github Actions, plus some related fixes and improvements #3851

Closed
wants to merge 65 commits into from

Conversation

ximinez
Copy link
Collaborator

@ximinez ximinez commented May 19, 2021

High Level Overview of Change

The primary focus of this PR is building and testing rippled using Github Actions. Along the way, some other issues came up that either needed to be fixed or were convenient improvements. The changes are divided into several different commits, which can (and probably should) be reviewed independently - they probably could have been independent PRs, but I didn't want to get hung up on the dependencies.

These commits should NOT be squashed down to a single commit for merging, unless there is a consensus otherwise.

Note that I restructured the directories for external dependencies (.nih_c). Until this is merged, if building in an existing repo, you will almost certainly need to move or delete that folder first, then move or delete again when switching to another branch.

Context of Change

Travis CI can be slow and unreliable. Github Actions offers some improvement in those two areas. It's still not as fast as I'd like, but that can potentially be addressed in the future with hosted runners.

The new platform also offers different ways to organize the jobs, and manage caches and artifacts. For example,

  • The tasks are organized across 7 different mostly-independent workflow files. If a failure occurs in one job, that workflow can be re-run without worrying about dependencies on others.
  • Every job is organized into steps, which can be expanded and collapsed on github.com, reducing the need to scroll or search through multi-thousand line log files.
  • Caches are accessed using keys, which are computed based on the CMake config and rippled source code, and are immutable once written. This should mitigate or prevent "corrupt" cache issues, and reduce or eliminate the need to manually delete caches for PRs and branches.
  • Logs and executables are provided as artifacts. This allows developers to get more information more easily, and debug the executable built by the job in the event of CI-only problems. It also speeds up test-only steps by reusing executables instead of rebuilding them.

Type of Change

  • [x ] Refactor (non-breaking change that only restructures code)
  • [x ] Tests (You added tests for code that already exists, or your new feature included in this PR)
  • [x ] Release

Test Plan

There isn't much to test other than verifying that the CI jobs succeed. Unit tests were added, and some changes were made to non-test code, but they shouldn't affect any run-time functionality.

Future Tasks

As mentioned above, I would eventually like to get these jobs running on better hardware, either some kind of paid-tier resource via Github, or self-hosted runners.

@ximinez ximinez added Build System CI Continuous Integration Functionality Dependencies Issues associated with 3rd party dependencies (RocksDB, SQLite, etc) labels May 19, 2021
@ximinez
Copy link
Collaborator Author

ximinez commented May 24, 2021

Some data for comparison:

The Github Actions all ran simultaneously. This list indicates how long each of them took.

  • Linux, gcc-8 release unity: 55m 49s
  • Linux, clang-8 debug unity: 55m 9s
  • Linux, clang-8 release unity: 1h 24m 1s
  • Linux, gcc-8 debug unity: 1h 46m 12s
  • MacOS: 1h 51m 45s
  • Linux: 2h 1m 24s
  • Windows: 3h 0m 54s

All of them succeeded. I expect they'll run faster in the future as caches are seeded.

Travis took a little longer to get started, and when it did, it still took longer to run all the jobs.

  • Ran for 6 hrs 38 min 59 sec
  • Total time 22 hrs 9 min 15 sec

And to top it off, two of the tasks failed:

  • ipv6 (macos)
  • windows, visual studio, debug

I've restarted those to make sure there is not a real problem, but I doubt it.

I can't make any promises that the Github jobs will never have a spurious failure, or start this quickly when more of us start using it, but these numbers look pretty good.

@carlhua
Copy link
Contributor

carlhua commented May 24, 2021

NIce PR @ximinez !

Copy link
Collaborator

@thejohnfreeman thejohnfreeman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reviewed all of the first 8 commits, and much of the last one, but wanted to submit my comments now to get the conversation going.

Can I ask for an English-prose description of the architecture of the workflows? What is the design behind them? What is the separation of responsibilities, and the graph of dependencies? Why is the Travis configuration left in (fixed, even)?

.github/actions/build-action/action.yml Outdated Show resolved Hide resolved
.github/actions/build-action/action.yml Outdated Show resolved Hide resolved
Comment on lines 113 to 112
export CXX=${CC/gcc/g++}
export CXX=${CXX/clang/clang++}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to look up this substitution form. Correct me if I'm wrong:

113: If $CC matches "gcc", replace the first "gcc" with "g++"
114: If $CXX matches "clang", replace the first "clang" with "clang++"

If $CXX enters these lines as "clang++", won't it exit as "clang++++"? Did you mean to write ${CC/... on line 114?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If $CXX enters these lines as "clang++", won't it exit as "clang++++"? Did you mean to write ${CC/... on line 114?

No, because CXX is initialized on line 113, regardless of the value of $CC.

  • CC=gcc
    • 113: CXX=g++ (substitution)
    • 114: CXX=g++ (no substitution)
  • CC=clang
    • 113: CXX=clang (no substitution)
    • 114: CXX=clang++ (substitution)

This is another sort of artifact of specifying the matrix. It's combinatoric, so it's non-trivial to set two variables together. If I did something like

CC: [ "gcc", "clang" ]
CXX: [ "g++", "clang++" ]

Then I'd get 4 jobs

  • gcc/g++
  • gcc/clang++
  • clang/g++
  • clang/clang++
    And it would take extra work to exclude the redundant ones. This let's me do what I want in one place with one variable.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's non-trivial to set two variables together

This isn't true, and knowing the technique may open new ways to think about how to structure these actions (I'm looking at the relationship between image and CMAKE_ARGS too). Here's an example, and it's run. The key part of the matrix is this:

matrix:
  compiler:
    - cc: gcc
      cxx: g++
    - cc: clang
      cxx: clang++

The key insight is that matrix is a dictionary of keys to arrays of values. Those values can be compound values like more dictionaries. Coupling related variables in a dictionary is a way to get them all in the matrix expression without generating excess or invalid configurations.

Copy link
Collaborator

@thejohnfreeman thejohnfreeman Jun 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact you used the technique here. I guess this is actually a different facility of the matrix, with special keys include and exclude, but perhaps it helps explain why the technique I described works.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had assumed that include and exclude were special cases. So you've taught me another new thing. Thanks! I went through and restructured the jobs and the action as much as possible to take advantage of compound values, including getting rid of the logic based on image.

echo NUM_PROCESSORS is ${NUM_PROCESSORS}
if [ "${{ inputs.image }}" == "rippled-ci-builder:2944b78d22db" ]
then
export CMAKE_EXTRA_ARGS+=" -DBoost_NO_BOOST_CMAKE=ON"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like it should be a parameter of this action, for when the image inevitably changes, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to only need to check it in one (or two) place(s), instead of every time the image is specified. Unfortunately, the workflow specification doesn't really have any kind of conditional operations, so I can't write (for example):

      - name: Build and test
        uses: ./.github/actions/build-action
        with:
          no_boost_cmake: ${{ matrix.image == "rippled-ci-builder:2944b78d22db" ? "ON" : "OFF" }}

Requiring it to be specified in the same place as the image can get complicated, especially in the matrix.

Plus, I figure when we inevitably get a new image, I can update the check as appropriate.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned above, I updated the matrix to use composite values and no longer require checking image.

Comment on lines 126 to 127
# Override environment variables that don't work smoothly
# using env:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this mean exactly?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Environment variables can be specified using env: at the workflow, job, and step level. I can't remember if particular comment/construction because the inputs object was not available at that higher job level, or if defining it there would set some of those variables to be empty (instead of leaving them unset), and causing undesirable side effects, but I'm pretty sure it was one of those two.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried following some of these variables. It seems NINJA_BUILD is only used in bin/ci/ubuntu/build-and-test.sh, and can be set to an empty value, except that this use probably needs to be corrected to ${NINJA_BUILD:-false}:

https://github.com/ripple/rippled/blob/10e2c6d66c9ce7e530269e15684ca3fbbb88fc42/bin/ci/ubuntu/build-and-test.sh#L67

I believe all the *SAN_OPTIONS variables can be empty. I tested UBSAN_OPTIONS on a build with -Dsan=undefined and it worked.

I'd like to shrink this section as small as possible. Programs should be robust in the presence of empty variable values.

Going down this path brought my attention to a bigger issue. We've got files under bin/ci/ubuntu that we expect to run on Windows. It was likely written first with Ubuntu in mind, and likely not robustly enough to handle empty strings for some environment variables. I've already tried once to fix this problem, when I wrote the pipelines for GitLab. Can we build on that effort and:

  • redesign the scripts under bin/ci/ubuntu to handle empty environment variables correctly
  • remove any unused scripts (I haven't yet checked whether each is used)
  • document them (I wrote a README for my scripts, and left helpful comments at the top of each documenting their interfaces), and
  • remove "ubuntu" from their path if they are cross-platform?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a lot here, and I haven't addressed all of it, but one simple thing I did was to change the == false comparisons to != true, which will make them more robust in the face of empty values and such.

.travis.yml Outdated Show resolved Hide resolved
Builds/CMake/RippledNIH.cmake Outdated Show resolved Hide resolved
Builds/CMake/RippledSanity.cmake Outdated Show resolved Hide resolved
src/test/jtx/impl/Env.cpp Show resolved Hide resolved
src/test/json/json_value_test.cpp Outdated Show resolved Hide resolved
@ximinez ximinez force-pushed the github-ci branch 7 times, most recently from 2b9f304 to 199f6a2 Compare June 4, 2021 14:45
@manojsdoshi manojsdoshi requested review from legleux and xzhaous July 1, 2021 18:05
@ximinez
Copy link
Collaborator Author

ximinez commented Jul 15, 2021

@thejohnfreeman @manojsdoshi @legleux @xzhaous

I just pushed a bunch of commits that fix more issues, and address most, if not all, of the feedback so far. To make reviewing a tiny bit easier, I didn't rewrite any of the commit history you may have already seen, and I tried to organize the commits logically. If this PR gets approved, I'll squash them down into the single "Build and test rippled in Github Actions" commit, other than the unit test fixes.

Highlights:

  • The build-action has no parameters
  • I think I've resolved occasional build failures due to OOM. For now.
  • I think I've resolved occasional test failures by restricting the number of unittest processes in jobs that tended to fail.
  • Job names are much cleaner and clearer.
  • Caches are better organized, especially in Mac and Windows builds that have relatively static dependencies.
  • Documentation

To answer one lingering question: I left the Travis configuration in for now, so we can run them side-by-side for a while until we're confident that the real-world performance and reliability of Github Actions is at least as good.

If you've been holding off reviewing this because I told you I was working on stuff, I am done, and this is ready to review again. For now. 🤞

[Github Actions caches](https://docs.github.com/en/actions/guides/caching-dependencies-to-speed-up-workflows#usage-limits-and-eviction-policy)
are immutable once written, and use explicit key names for access. Caches can
be shared across workflows, and can match partial names by prefix. They can
also access caches created by base and default branches, but not across forks
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doc says:

A workflow can access and restore a cache created in the current branch, the base branch (including base branches of forked repositories)

I'm not sure which is true, or if I'm interpreting it correctly.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated this to be clearer based on my understanding of the rules.

configuration, the rippled source code, and the workflow config itself.

To pull it all together with an example, the base job in
`linux-clang8-debug.yml` might have a cache key that looks like (hashes
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the following example, I think this meant linux-clang8-release.yml.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good catch. Fixed.

Comment on lines 116 to 168
for CC in ${{ matrix.make_ccs }}
do
for BUILD_TYPE in ${{ matrix.make_types }}
do
export CC BUILD_TYPE
# Set CXX to the C++ complier corresponding to CC
# It's a little hacky, but keeps this job's matrix
# a little simpler
export CXX=${CC/gcc/g++}
export CXX=${CXX/clang/clang++}

${CC} --version
${CXX} --version
dir="build/$( basename ${CXX} ).${BUILD_TYPE}.makefile"
mkdir "-pv" "${dir}"
pushd "${dir}"
set "-x"
# Verbose or not, it's good to see the right params
# are being used
cmake -DCMAKE_BUILD_TYPE=${BUILD_TYPE} \
${CMAKE_EXTRA_ARGS} ../..
set +x
popd
done
done

# Builds that use Ninja
export CMAKE_EXTRA_ARGS+=" -G Ninja"
for CC in ${{ matrix.ninja_ccs }}
do
for BUILD_TYPE in ${{ matrix.ninja_types }}
do
export CC BUILD_TYPE
# Set CXX to the C++ complier corresponding to CC
# It's a little hacky, but keeps this job's matrix
# a little simpler
export CXX=${CC/gcc/g++}
export CXX=${CXX/clang/clang++}

${CC} --version
${CXX} --version
dir="build/$( basename ${CXX} ).${BUILD_TYPE}"
mkdir "-pv" "${dir}"
pushd "${dir}"
set "-x"
# Verbose or not, it's good to see the right params
# are being used
cmake -DCMAKE_BUILD_TYPE=${BUILD_TYPE} \
${CMAKE_EXTRA_ARGS} ../..
set +x
popd
done
done
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This bundles the generated build files for a bunch of different builds into a single cache. Could it be better to separate them, so that the cache is smaller, they can be built in parallel, etc.?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I separate them, I may as well get rid of this step altogether (I'm not against that) because the base job for each config will download the dependencies itself.

The reason for putting everything into a single cache is that, with the change to store all the source code in a top-level folder, the size of that source code dwarfs the size of the individual config folders.

The job outputs some of the cache folder sizes, for example at https://github.com/ripple/rippled/runs/3186329095?check_suite_focus=true#step:7:1054

200M	/__w/rippled/rippled/_cache/nih_c
4.6M	/__w/rippled/rippled/_cache/nih_c/ninja
1.2M	/__w/rippled/rippled/_cache/nih_c/ninja/GNU_9.2.1
1.2M	/__w/rippled/rippled/_cache/nih_c/ninja/Clang_9.0.1
1.2M	/__w/rippled/rippled/_cache/nih_c/ninja/GNU_8.3.0
1.2M	/__w/rippled/rippled/_cache/nih_c/ninja/Clang_8.0.1
676K	/__w/rippled/rippled/_cache/nih_c/unix_makefiles
672K	/__w/rippled/rippled/_cache/nih_c/unix_makefiles/GNU_8.3.0

It skips over the src folder (derp), but you can see the total size is 200M, while the total of the job-specific folders are about 5M. The rest is src.

Now, all that said, I realize that this src folder ends up getting duplicated across all of the jobs anyway, because they all use this cache as a "seed" for their own cache. Thinking about it now, I think I could make it significantly better:

  1. Download src into a separate cache, and have the other jobs use that as an immutable dependency. This will be similar to the Windows jobs, which build vcpkg and boost dependencies once, and share them among all the other jobs. Assuming it works, this will take less space, and a little less time, because I'd only need to call cmake once.
  2. Get rid of this cache step entirely and let the jobs download their own dependencies.

I'll play around with this a little bit and see what I can figure out.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one month later.gif

It took a while due to other priorities, and the fact that test iteration can take a while on CI, but I've reorganized the caches sort of as described in option 1 for the Linux and Windows jobs. With a little experimentation, I modified the "seed cache" jobs to build the dependencies once per docker image, which seems to be the only way to force them to download themselves and their dependencies. It also preserves the "stamp" folders, which prevents the jobs from re-downloading. The stamp folders are compatible (enough) that they can be used by the different compilers. The build output is not, so I just throw it away. Not perfect, but a significant improvement. I haven't updated the MacOS jobs yet, because I wanted to get this pushed now that it's stablized.

One issue causing delays is that I was having consistent failures in the Windows jobs that all seemed to come back to the Check tests, but I can't find anything wrong with those tests. I'm pretty sure it's resource related, but rather than spend another month tracking it down remotely, I decided that this problem is out of scope for this PR, and punted, adding an option to make those tests (or any other tests that have similar problems down the road) manual at build time, then have CI run them manually. That's the [FOLD] Make the check tests manual on Github CI Windows commit.

@ximinez
Copy link
Collaborator Author

ximinez commented Jul 29, 2021

Rebased on to 1.8.0-b4. No other changes. I see a couple of jobs have failed. I'll look into that.

@intelliot
Copy link
Collaborator

@ximinez given conan, what is the best path forward (or not) for this PR?

@ximinez
Copy link
Collaborator Author

ximinez commented Jan 18, 2023

I've changed this PR to be a draft, indicating that it's not ready for prime time. I would like to integrate it with Conan to build the dependencies, but use the much larger configuration matrix in these workflow files to get more test coverage, but it's not a high priority. (I don't necessarily have to do it if someone else wants to pick up the baton.)

* upstream/develop:
  `XRPFees`: Fee setting and handling improvements (XRPLF#4247)
* upstream/develop:
  Update dependency: grpc (XRPLF#4407)
  Introduce min/max observers for Number
  Optimize uint128_t division by 10 within Number.cpp
  Replace Number division algorithm
  Include rounding mode in XRPAmount to STAmount conversion.
  Remove undefined behavior * Taking the negative of a signed negative is UB, but   taking the negative of an unsigned is not.
  Silence warnings
  Introduce rounding modes for Number:
  Use Number for IOUAmount and STAmount arithmetic
  Add tests
  Add implicit conversion from STAmount to Number
  Add clip
  Add conversions between Number, XRPAmount and int64_t
  AMM Add Number class and associated algorithms
* upstream/develop:
  Update documented pathfinding configuration defaults: (XRPLF#4409)
* upstream/develop:
  Rename to fixNonFungibleTokensV1_2 and some cosmetic changes (XRPLF#4419)
  Only account specified as destination can settle through brokerage: (XRPLF#4399)
  Prevent brokered sale of NFToken to owner: (XRPLF#4403)
  Fix 3 issues around NFToken offer acceptance (XRPLF#4380)
  Allow NFT to be burned when number of offers is greater than 500 (XRPLF#4346)
  Add fixUnburnableNFToken feature (XRPLF#4391)
  Change default vote on fixUniversalNumber from yes to no (XRPLF#4414)
* upstream/develop:
  Resolve a couple of Github Action CI annoyances: (XRPLF#4413)
* upstream/develop:
  README: Add a few source code starting points (XRPLF#4421)
* upstream/develop:
  Fix Conan version constraint in workflows (XRPLF#4430)
  Refactor getTrustedForLedger() (XRPLF#4424)
  README: Update "Build from source" section (XRPLF#4426)
* upstream/develop:
  Undo API changes introduced in XRPFees: (XRPLF#4429)
  Remove recipe for RocksDB and add recipe for Snappy (XRPLF#4431)
* upstream/develop:
  Set version to 1.10.0-rc4
  Rename 'NFT' to 'NFToken' in DisallowIncoming flags (XRPLF#4442)
  Update Docker.md (XRPLF#4432)
  Update package building scripts and images to use Conan (XRPLF#4435)
  Disable duplicate detector: (XRPLF#4438)
ximinez added 7 commits March 14, 2023 14:25
* upstream/develop:
  Set version to 1.10.0
* upstream/develop:
  docs: security bug bounty acknowledgements (XRPLF#4460)
* upstream/develop:
  Rectify the import paths of boost/iterator: (XRPLF#4293)
  Allow port numbers be be specified with a colon: (XRPLF#4328)
  Use <=> operator for base_uint, Issue, and Book: (XRPLF#4411)
  Reporting Mode: Do not attempt to acquire missing data from peer network (XRPLF#4458)
* upstream/develop:
  docs: update protocol README (4457)
  `fixNFTokenRemint`: prevent NFT re-mint: (4406)
  fix: support RPC markers for any ledger object: (4361)
  refactor: optimize NodeStore object conversion: (4353)
  fix(ValidatorSite): handle rare null pointer dereference in timeout: (4420)
  fix(gateway_balances): handle overflow exception: (4355)
  Set version to 1.10.1-b1
* upstream/develop:
  feat: mark 4 amendments as obsolete: (4291)
  fix: `Expected` to return a value: (4401)
  build: add interface library libxrpl: (4449)
  docs: update build instructions: (4381)
  Set version to 1.10.1
  Set version to 1.10.1-rc1
  Build packages with Ubuntu 18.04
* upstream/develop:
  Fix unit test app.LedgerData (4484)
  Fix ledger_data to return an empty list: (4398)
  Add account flags to account_info response: (4459)
  Add logging for exceptions: (4400)
  Update example [validator_list_sites] (4448)
  Add link to BUILD.md: (4450)
  Update README.md (4463)
  Refactor fee initialization and configuration: (4319)
* upstream/develop:
  Add install instructions for package managers: (4472)
  Fix the fix for std::result_of (4496)
  Prevent replay attacks with NetworkID field: (4370)
  Avoid using std::shared_ptr when not necessary: (4218)
  Optimize `SHAMapItem` and leverage new slab allocator: (4218)
  Introduce support for a slabbed allocator: (4218)
  Add jss fields used by Clio `nft_info`: (4320)
  Add NFTokenPages to account_objects RPC: (4352)
@intelliot
Copy link
Collaborator

  • goals and matrix here may be useful
  • there have been so many changes that it may be better to start over and just take what's useful from here

We'll close this PR but the branch will remain, and @thejohnfreeman will look at it to copy over the useful parts.

See #4371 (comment)

@intelliot intelliot closed this Apr 19, 2023
intelliot pushed a commit that referenced this pull request Apr 24, 2023
This change makes progress on the plan in #4371. It does not replicate
the full [matrix] implemented in #3851, but it does replicate the 1.ii
section of the Linux matrix. It leverages "heavy" self-hosted runners,
and demonstrates a repeatable pattern for future matrices.

[matrix]: https://github.com/XRPLF/rippled/blob/d794a0f3f161bb30c74881172fc38f763d7d46e8/.github/README.md#continuous-integration
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Build System CI Continuous Integration Functionality Dependencies Issues associated with 3rd party dependencies (RocksDB, SQLite, etc)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants