Make CI workflow resilient to multiple runs on the same sha or tag #3635

Closed · siggy opened this issue Oct 24, 2019 · 15 comments
siggy (Member) commented Oct 24, 2019

Background

PR #3602 guaranteed that all jobs in a CI workflow received the same copy of the repo. This made our CI workflow resilient to code changes occurring during the workflow run, both in master and the PR itself.

Problem

Our workflow uses the sha of the repo to name global objects that are shared between jobs within a workflow. When two workflow runs occur on the same sha or tag, those runs collide, specifically in the names of the kind clusters and git repo artifacts they create.

Collision examples

Proposal

Assign unique names to kind clusters and git repo artifacts, based on the current workflow run, in addition to the output of bin/root-tag.

For example, instead of:

TAG="$(CI_FORCE_CLEAN=1 bin/root-tag)"
export KIND_CLUSTER=github-$TAG-${{ matrix.integration_test }}

do something like:

TAG="$(CI_FORCE_CLEAN=1 bin/root-tag)"
export KIND_CLUSTER=github-$TAG-${{ runner.tracking_id }}-${{ matrix.integration_test }}

(I'm not sure if runner.tracking_id exists or is too long, but we need something unique like this, or something in the {{job}} context.) More info at https://help.github.com/en/github/automating-your-workflow-with-github-actions/contexts-and-expression-syntax-for-github-actions.
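A quick way to see what those contexts actually contain is to dump them with the toJson expression function; a minimal sketch (the step name and layout are illustrative):

```yaml
    steps:
      - name: Dump contexts
        env:
          GITHUB_CONTEXT: ${{ toJson(github) }}
          JOB_CONTEXT: ${{ toJson(job) }}
          RUNNER_CONTEXT: ${{ toJson(runner) }}
        run: |
          # Print each context as JSON to see which fields could serve as a unique run id
          echo "$GITHUB_CONTEXT"
          echo "$JOB_CONTEXT"
          echo "$RUNNER_CONTEXT"
```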

Example code sites requiring update

kind cluster name

TAG="$(CI_FORCE_CLEAN=1 bin/root-tag)"
export KIND_CLUSTER=github-$TAG-${{ matrix.integration_test }}

git Repo artifact

name: linkerd2-src-${{ github.sha }}
path: linkerd2.${{ github.sha }}.tar.gz
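With a unique suffix this could become something like the sketch below (the upload-artifact wrapper is assumed, and github.run_id stands in for whatever per-run identifier we end up with):

```yaml
      - name: Upload the source artifact
        uses: actions/upload-artifact@v1
        with:
          # github.run_id is only an illustration; any value unique to the workflow run works
          name: linkerd2-src-${{ github.sha }}-${{ github.run_id }}
          path: linkerd2.${{ github.sha }}.tar.gz
```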

kubeconfig file

Note this filename is based on the kind cluster name, so fixing kind cluster naming should fix this automatically:

scp github@$DOCKER_ADDRESS:/tmp/kind-config-$KIND_CLUSTER $HOME/.kube

alpeb (Member) commented Oct 24, 2019

I'm wondering when two runs on the same sha can occur. Is it only when an empty commit is pushed to a branch to trigger the build while that branch still has a build running?

ihcsim (Contributor) commented Oct 24, 2019

> I'm wondering when two runs on the same sha can occur.

Like a new tag and the latest master HEAD? Or do we handle that differently now in workflow.yaml?

siggy changed the title from "Make CI workflow resilient to multiple runs on the same sha" to "Make CI workflow resilient to multiple runs on the same sha or tag" Oct 24, 2019
siggy (Member, Author) commented Oct 24, 2019

@alpeb @ihcsim Good points. I've updated the issue title and description to better describe the situation. We don't exactly rely on the sha, but rather on the output of bin/root-tag, which is either a sha, or a tag (if one exists).

The situations where this can occur include:

  • master is updated to a commit, then soon after a tag is set to that same commit
  • the repo is tagged at a commit, then soon after the same tag is updated to a different commit
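For reference, the tag-or-sha fallback works roughly like the sketch below (not the actual bin/root-tag, just the behavior described above):

```bash
# Sketch of the tag-or-sha fallback, not the real bin/root-tag.
# If HEAD sits exactly on a tag, use the tag name; otherwise fall back to the sha.
if TAG=$(git describe --tags --exact-match 2>/dev/null); then
  echo "$TAG"
else
  git rev-parse --short HEAD
fi
```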

alpeb self-assigned this Oct 25, 2019
alpeb (Member) commented Oct 25, 2019

Turns out the job and runner contexts are pretty limited:

job:
{
  "status": "Success"
}

runner:
{
  "os": "Linux",
  "tool_cache": "/opt/hostedtoolcache",
  "temp": "/home/runner/work/_temp",
  "workspace": "/home/runner/work/linkerd2"
}

and I couldn't find anything useful in the github context either.

I'm gonna try generating a random number in bash and storing it in the workflow's temp dir. Will have to check that dir is not shared across runs.
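Another option that doesn't rely on shared disk would be to generate the value once and hand it to the other jobs through job outputs; a rough sketch using the current jobs.<id>.outputs syntax (not something we've tried in this workflow):

```yaml
jobs:
  generate_id:
    runs-on: ubuntu-latest
    outputs:
      uid: ${{ steps.gen.outputs.uid }}
    steps:
      - id: gen
        # Generate a random-ish value once and expose it as a job output
        run: echo "uid=$(date +%s)-$RANDOM" >> "$GITHUB_OUTPUT"

  kind_integration:
    needs: generate_id
    runs-on: ubuntu-latest
    steps:
      # Every downstream job sees the same value via the needs context
      - run: echo "KIND_CLUSTER=github-${{ needs.generate_id.outputs.uid }}"
```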

siggy (Member, Author) commented Oct 25, 2019

@alpeb I could be wrong, but I wonder if the GitHub Actions UI is obfuscating additional fields from the runner and job contexts? I ask because I can see env vars like RUNNER_TRACKING_ID that are presumably mapped from a context?

https://github.com/linkerd/linkerd2/runs/273480582#step:2:50

alpeb (Member) commented Oct 25, 2019

I tested this:

    - name: Dump runner.tracking_id
      env:
        RUNNER_ID: ${{ runner.tracking_id }}
      run: echo "$RUNNER_ID"
    - name: Check if runner.tracking_id is empty
      env:
        RUNNER_ID: ${{ runner.tracking_id == ''}}
      run: echo "$RUNNER_ID"
    - name: Dump runner.tracking.id
      env:
        RUNNER_ID: ${{ runner.tracking.id }}
      run: echo "$RUNNER_ID"
    - name: Check if runner.tracking.id is empty
      env:
        RUNNER_ID: ${{ runner.tracking.id == ''}}
      run: echo "$RUNNER_ID"

which showed runner.tracking_id and runner.tracking.id don't exist.
I'll give RUNNER_TRACKING_ID a try.

alpeb (Member) commented Oct 25, 2019

Turns out not all environment variables are carried over when using the container directive. For example, here I'm dumping the env vars in the go_unit_tests step, and RUNNER_TRACKING_ID and friends are missing:
https://github.com/alpeb/linkerd2/runs/275474709#step:3:25
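To reproduce, the difference shows up with a job along these lines (a sketch; the image and the dump step are just for illustration):

```yaml
  go_unit_tests:
    runs-on: ubuntu-latest
    container:
      image: golang:1.13
    steps:
      - name: Dump env
        # Host-provisioned variables such as RUNNER_TRACKING_ID don't appear here,
        # unlike in jobs that run directly on the runner VM.
        run: env | sort
```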

alpeb (Member) commented Oct 25, 2019

> I'm gonna try generating a random number in bash and storing it in the workflow's temp dir. Will have to check that dir is not shared across runs.

Answering myself :-P I had forgotten that each job uses a brand-new environment, so the temp dir starts anew for each job.

OTOH, I stumbled upon this, which describes a way to retrieve the "check suite ID" through GitHub's API and jq. That's only relevant for a single run, but we can still use the full API response to see info about all the runs for a given SHA. I'm not sure yet how to select the current run from that list. If that ends up not being possible, we could at the very least check whether there's already a run with status running, and fail the build, instructing the user to first cancel the other build that is running for this SHA...
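A sketch of that fallback check using the check-suites endpoint and jq (counting the current run as one of the in-progress suites, so anything above one means a concurrent run; the error message is illustrative):

```bash
# Count check suites for this sha that are still running (sketch only).
SHA=$(git rev-parse HEAD)
RUNNING=$(curl -s -H "Accept: application/vnd.github.antiope-preview+json" \
  "https://api.github.com/repos/linkerd/linkerd2/commits/$SHA/check-suites" |
  jq '[.check_suites[] | select(.status == "in_progress")] | length')

if [ "$RUNNING" -gt 1 ]; then
  echo "Another build is already running for $SHA; cancel it before re-running this one." >&2
  exit 1
fi
```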

siggy (Member, Author) commented Oct 29, 2019

I just noticed that INVOCATION_ID and RUNNER_TRACKING_ID change between jobs within a single workflow, so those are not helpful.

siggy (Member, Author) commented Oct 29, 2019

@alpeb Thanks for all the digging. That check suite approach is interesting, though I think it grabs the latest check_suite item, which could change between workflow jobs.

I tested with:

curl -s -H "Accept: application/vnd.github.antiope-preview+json" "https://api.github.com/repos/linkerd/linkerd2/commits/4ea87eae0dfbbad33698155a75b1842573020321/check-suites"

I'm reluctant to just fail the build. Let's hold off on doing this until there's a way to uniquely identify a workflow run (or it becomes painful enough to warrant failing the build).

alpeb (Member) commented Oct 29, 2019

> I think it grabs the latest check_suite item, which could change between workflow jobs.

I don't think I understand your objection. From what I've tested, that API returns an array where each element corresponds to a GitHub Actions build run that has been triggered for the supplied sha. Each element describes the status (running, completed), the conclusion (success, failure), and (from the looks of it) an id that is unique to that build run.

I'm ok with holding off on this one, just wanted to clarify we're on the same page.

stale bot commented Jan 27, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Jan 27, 2020
alpeb added a commit that referenced this issue Jan 30, 2020
Fixes #3911

Refactors the `cloud_integration` test to run in separate GKE clusters
that are created and torn down on the fly.
It leverages a new "gcloud" github action that is also used to set up
gcloud in other build steps (`docker_deploy` and `chart_deploy`).

The action also generates unique names for those clusters, based on the
git commit SHA and `run_id`, a recently introduced variable that is
unique per CI run and available to all the jobs.
This fixes part of #3635 in that CI runs on the same SHA don't interfere
with one another (in the `cloud_integration` test; still to do for
`kind_integration`).

The "gcloud" GH action is supported by `.github/actions/gcloud/index.js`
that has a couple of dependencies. To avoid having to commit
`node_modules`, after every change to that file one must run
```bash
# only needed the first time
npm i -g @zeit/ncc

cd .github/actions/gcloud
ncc build index.js
```
which generates the self-contained file
`.github/actions/gcloud/dist/index.js`.
(This last part might get easier in the future after other refactorings
outside this PR).
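The naming scheme that commit describes, applied to the kind_integration case, would look roughly like the sketch below (the step itself is illustrative; github.run_id is the per-run variable the commit refers to):

```yaml
      - name: Run integration test against a uniquely named cluster
        run: |
          TAG="$(CI_FORCE_CLEAN=1 bin/root-tag)"
          # github.run_id is unique per workflow run and shared by all of its jobs,
          # so concurrent runs on the same sha or tag get distinct cluster names.
          export KIND_CLUSTER=github-$TAG-${{ github.run_id }}-${{ matrix.integration_test }}
          echo "Using cluster $KIND_CLUSTER"
```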
admc pushed a commit that referenced this issue Feb 7, 2020
* Allow CI to run concurrent builds in master


* Run integration tests for forked repos

Signed-off-by: Kevin Leimkuhler <[email protected]>

* Address reviews

Signed-off-by: Kevin Leimkuhler <[email protected]>

* Address more reviews

Signed-off-by: Kevin Leimkuhler <[email protected]>

* Move some conditionals to jobs

Signed-off-by: Kevin Leimkuhler <[email protected]>

* Change job name

Signed-off-by: Kevin Leimkuhler <[email protected]>

* Move more conditionals to job level

Signed-off-by: Kevin Leimkuhler <[email protected]>

* Added more flags to 'gcloud container clusters create' and consolidated
'create' and 'destroy' into ' action'

* Run kind cleanup only for non-forked PRs

Signed-off-by: Kevin Leimkuhler <[email protected]>

* Got rid of cloud_cleanup by using a post hook in the gcloud action

* Removed cluster naming responsibility from the gcloud action

* Consolidate .gitignore statements

* Removed bin/_gcp.sh

* Change name of Kind int. test job

Signed-off-by: Kevin Leimkuhler <[email protected]>

* Ensure `kind_cleanup` still runs on cancelled host CI runs

Signed-off-by: Kevin Leimkuhler <[email protected]>

* Add reviews

Signed-off-by: Kevin Leimkuhler <[email protected]>

* Update workflow comment

Signed-off-by: Kevin Leimkuhler <[email protected]>

* Split index.js into setup.js and destroy.js

* trigger build

* Moved the gcloud action into its own repo

* Full version for the gcloud GH action

* Rebase back to master

Signed-off-by: Kevin Leimkuhler <[email protected]>

* Remove additional changes

Signed-off-by: Kevin Leimkuhler <[email protected]>

* Remove additional changes

Signed-off-by: Kevin Leimkuhler <[email protected]>

* Trigger CI

Signed-off-by: Kevin Leimkuhler <[email protected]>

Co-authored-by: Alejandro Pedraza <[email protected]>
alpeb added a commit that referenced this issue Feb 7, 2020
* Allow CI to run concurrent builds in master

The "gcloud" GH action is hosted under its own repo in https://github.com/linkerd/linkerd2-action-gcloud
stale bot closed this as completed Feb 10, 2020
alpeb (Member) commented Feb 10, 2020

This was partially addressed in #4001, and will be finished in #4033

alpeb reopened this Feb 10, 2020
stale bot removed the wontfix label Feb 10, 2020
stale bot commented May 11, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label May 11, 2020
alpeb (Member) commented May 11, 2020

Fixed in #4033!

alpeb closed this as completed May 11, 2020
github-actions bot locked as resolved and limited conversation to collaborators Jul 17, 2021