Make CI workflow resilient to multiple runs on the same sha or tag #3635

Closed · siggy opened this issue Oct 24, 2019 · 15 comments
siggy (Member) commented Oct 24, 2019

Background

PR #3602 guaranteed that all jobs in a CI workflow received the same copy of the repo. This made our CI workflow resilient to code changes occurring during the workflow run, both in master and the PR itself.

Problem

Our workflow uses the sha of the repo to name global objects that are shared between jobs within a workflow. When two workflow runs occur on the same sha or tag, those runs collide, specifically in the names of the kind clusters and git repo artifacts they create.

Collision examples

Proposal

Assign unique names to kind clusters and git repo artifacts, based on the current workflow run, in addition to the output of bin/root-tag.

For example, instead of:

TAG="$(CI_FORCE_CLEAN=1 bin/root-tag)"
export KIND_CLUSTER=github-$TAG-${{ matrix.integration_test }}

do something like:

TAG="$(CI_FORCE_CLEAN=1 bin/root-tag)"
export KIND_CLUSTER=github-$TAG-${{ runner.tracking_id }}-${{ matrix.integration_test }}

(I'm not sure if runner.tracking_id exists or is too long, but we need something unique like this, or something in the {{job}} context.) More info at https://help.github.com/en/github/automating-your-workflow-with-github-actions/contexts-and-expression-syntax-for-github-actions.
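A quick way to see what those contexts actually contain is to dump them with the toJson expression function; a minimal sketch (the step name and layout are illustrative):

```yaml
    steps:
      - name: Dump contexts
        env:
          GITHUB_CONTEXT: ${{ toJson(github) }}
          JOB_CONTEXT: ${{ toJson(job) }}
          RUNNER_CONTEXT: ${{ toJson(runner) }}
        run: |
          # Print each context as JSON to see which fields could serve as a unique run id
          echo "$GITHUB_CONTEXT"
          echo "$JOB_CONTEXT"
          echo "$RUNNER_CONTEXT"
```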

Example code sites requiring update

kind cluster name

TAG="$(CI_FORCE_CLEAN=1 bin/root-tag)"
export KIND_CLUSTER=github-$TAG-${{ matrix.integration_test }}

git Repo artifact

name: linkerd2-src-${{ github.sha }}
path: linkerd2.${{ github.sha }}.tar.gz
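With a unique suffix this could become something like the sketch below (the upload-artifact wrapper is assumed, and github.run_id stands in for whatever per-run identifier we end up with):

```yaml
      - name: Upload the source artifact
        uses: actions/upload-artifact@v1
        with:
          # github.run_id is only an illustration; any value unique to the workflow run works
          name: linkerd2-src-${{ github.sha }}-${{ github.run_id }}
          path: linkerd2.${{ github.sha }}.tar.gz
```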

kubeconfig file

Note this filename is based on the kind cluster name, so fixing kind cluster naming should fix this automatically:

scp github@$DOCKER_ADDRESS:/tmp/kind-config-$KIND_CLUSTER $HOME/.kube

alpeb (Member) commented Oct 24, 2019

I'm wondering when two runs on the same sha can occur. Is it only when an empty commit is pushed to a branch to trigger the build while that branch still has a build running?

ihcsim (Contributor) commented Oct 24, 2019

> I'm wondering when two runs on the same sha can occur.

Like a new tag and the latest master HEAD? Or do we handle that differently now in workflow.yaml?

siggy changed the title from "Make CI workflow resilient to multiple runs on the same sha" to "Make CI workflow resilient to multiple runs on the same sha or tag" Oct 24, 2019
siggy (Member, Author) commented Oct 24, 2019

@alpeb @ihcsim Good points. I've updated the issue title and description to better describe the situation. We don't exactly rely on the sha, but rather on the output of bin/root-tag, which is either a sha, or a tag (if one exists).

The situations where this can occur include:

  • master is updated to a commit, then soon after a tag is set to that same commit
  • the repo is tagged at a commit, then soon after the same tag is updated to a different commit
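For reference, the tag-or-sha fallback works roughly like the sketch below (not the actual bin/root-tag, just the behavior described above):

```bash
# Sketch of the tag-or-sha fallback, not the real bin/root-tag.
# If HEAD sits exactly on a tag, use the tag name; otherwise fall back to the sha.
if TAG=$(git describe --tags --exact-match 2>/dev/null); then
  echo "$TAG"
else
  git rev-parse --short HEAD
fi
```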

alpeb self-assigned this Oct 25, 2019
alpeb (Member) commented Oct 25, 2019

Turns out the job and runner contexts are pretty limited:

job:
{
  "status": "Success"
}

runner:
{
  "os": "Linux",
  "tool_cache": "/opt/hostedtoolcache",
  "temp": "/home/runner/work/_temp",
  "workspace": "/home/runner/work/linkerd2"
}

and I couldn't find anything useful in the github context either.

I'm gonna try generating a random number in bash and storing it in the workflow's temp dir. Will have to check that dir is not shared across runs.
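Another option that doesn't rely on shared disk would be to generate the value once and hand it to the other jobs through job outputs; a rough sketch using the current jobs.<id>.outputs syntax (not something we've tried in this workflow):

```yaml
jobs:
  generate_id:
    runs-on: ubuntu-latest
    outputs:
      uid: ${{ steps.gen.outputs.uid }}
    steps:
      - id: gen
        # Generate a random-ish value once and expose it as a job output
        run: echo "uid=$(date +%s)-$RANDOM" >> "$GITHUB_OUTPUT"

  kind_integration:
    needs: generate_id
    runs-on: ubuntu-latest
    steps:
      # Every downstream job sees the same value via the needs context
      - run: echo "KIND_CLUSTER=github-${{ needs.generate_id.outputs.uid }}"
```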

siggy (Member, Author) commented Oct 25, 2019

@alpeb I could be wrong, but I wonder if the GitHub Actions UI is obfuscating additional fields from the runner and job contexts? I ask because I can see env vars like RUNNER_TRACKING_ID that are presumably mapped from a context?

https://github.com/linkerd/linkerd2/runs/273480582#step:2:50

alpeb (Member) commented Oct 25, 2019

I tested this:

    - name: Dump runner.tracking_id
      env:
        RUNNER_ID: ${{ runner.tracking_id }}
      run: echo "$RUNNER_ID"
    - name: Check if runner.tracking_id is empty
      env:
        RUNNER_ID: ${{ runner.tracking_id == ''}}
      run: echo "$RUNNER_ID"
    - name: Dump runner.tracking.id
      env:
        RUNNER_ID: ${{ runner.tracking.id }}
      run: echo "$RUNNER_ID"
    - name: Check if runner.tracking.id is empty
      env:
        RUNNER_ID: ${{ runner.tracking.id == ''}}
      run: echo "$RUNNER_ID"

which showed runner.tracking_id and runner.tracking.id don't exist.
I'll give RUNNER_TRACKING_ID a try.

alpeb (Member) commented Oct 25, 2019

Turns out not all environment variables are carried over when using the container directive. For example, here I'm dumping the env vars in the go_unit_tests step, and RUNNER_TRACKING_ID and friends are missing:
https://github.com/alpeb/linkerd2/runs/275474709#step:3:25
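To reproduce, the difference shows up with a job along these lines (a sketch; the image and the dump step are just for illustration):

```yaml
  go_unit_tests:
    runs-on: ubuntu-latest
    container:
      image: golang:1.13
    steps:
      - name: Dump env
        # Host-provisioned variables such as RUNNER_TRACKING_ID don't appear here,
        # unlike in jobs that run directly on the runner VM.
        run: env | sort
```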

alpeb (Member) commented Oct 25, 2019

> I'm gonna try generating a random number in bash and storing it in the workflow's temp dir. Will have to check that dir is not shared across runs.

Answering myself :-P I had forgotten that each job uses a brand-new environment, so the temp dir starts anew for each job.

OTOH, I stumbled upon this, which describes a way to retrieve the "check suite ID" through GitHub's API and jq. That's only relevant for a single run, but we can still use the full API response to see info about all the runs for a given SHA. I'm not sure yet how to select the current run from that list. If that ends up not being possible, we could at the very least check whether there's already a run with status running, and fail the build, instructing the user to first cancel the other build that is running for this SHA...
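A sketch of that fallback check using the check-suites endpoint and jq (counting the current run as one of the in-progress suites, so anything above one means a concurrent run; the error message is illustrative):

```bash
# Count check suites for this sha that are still running (sketch only).
SHA=$(git rev-parse HEAD)
RUNNING=$(curl -s -H "Accept: application/vnd.github.antiope-preview+json" \
  "https://api.github.com/repos/linkerd/linkerd2/commits/$SHA/check-suites" |
  jq '[.check_suites[] | select(.status == "in_progress")] | length')

if [ "$RUNNING" -gt 1 ]; then
  echo "Another build is already running for $SHA; cancel it before re-running this one." >&2
  exit 1
fi
```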

siggy (Member, Author) commented Oct 29, 2019

I just noticed that INVOCATION_ID and RUNNER_TRACKING_ID change between jobs within a single workflow, so those are not helpful.

siggy (Member, Author) commented Oct 29, 2019

@alpeb Thanks for all the digging. That check suite approach is interesting, though I think it grabs the latest check_suite item, which could change between workflow jobs.

I tested with:

curl -s -H "Accept: application/vnd.github.antiope-preview+json" "https://api.github.com/repos/linkerd/linkerd2/commits/4ea87eae0dfbbad33698155a75b1842573020321/check-suites"

I'm reluctant to just fail the build. Let's hold off on doing this until there's a way to uniquely identify a workflow run (or it becomes painful enough to warrant failing the build).

alpeb (Member) commented Oct 29, 2019

> I think it grabs the latest check_suite item, which could change between workflow jobs.

I don't think I understand your objection. From what I've tested, that API returns an array where each element corresponds to a GitHub Actions build run that has been triggered for the supplied sha. Each element describes the status (running, completed), the conclusion (success, failure), and (from the looks of it) an id that is unique to that build run.

I'm ok with holding off on this one, just wanted to clarify we're on the same page.

stale bot commented Jan 27, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Jan 27, 2020
alpeb added a commit that referenced this issue Jan 30, 2020
Fixes #3911

Refactors the `cloud_integration` test to run in separate GKE clusters
that are created and torn down on the fly.
It leverages a new "gcloud" github action that is also used to set up
gcloud in other build steps (`docker_deploy` and `chart_deploy`).

The action also generates unique names for those clusters, based on the
git commit SHA and `run_id`, a recently introduced variable that is
unique per CI run and available to all the jobs.
This fixes part of #3635 in that CI runs on the same SHA don't interfere
with one another (in the `cloud_integration` test; still to do for
`kind_integration`).

The "gcloud" GH action is supported by `.github/actions/gcloud/index.js`
that has a couple of dependencies. To avoid having to commit
`node_modules`, after every change to that file one must run
```bash
# only needed the first time
npm i -g @zeit/ncc

cd .github/actions/gcloud
ncc build index.js
```
which generates the self-contained file
`.github/actions/gcloud/dist/index.js`.
(This last part might get easier in the future after other refactorings
outside this PR).
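The naming scheme that commit describes, applied to the kind_integration case, would look roughly like the sketch below (the step itself is illustrative; github.run_id is the per-run variable the commit refers to):

```yaml
      - name: Run integration test against a uniquely named cluster
        run: |
          TAG="$(CI_FORCE_CLEAN=1 bin/root-tag)"
          # github.run_id is unique per workflow run and shared by all of its jobs,
          # so concurrent runs on the same sha or tag get distinct cluster names.
          export KIND_CLUSTER=github-$TAG-${{ github.run_id }}-${{ matrix.integration_test }}
          echo "Using cluster $KIND_CLUSTER"
```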
admc pushed a commit that referenced this issue Feb 7, 2020
* Allow CI to run concurrent builds in master


* Run integration tests for forked repos

Signed-off-by: Kevin Leimkuhler <[email protected]>

* Address reviews

Signed-off-by: Kevin Leimkuhler <[email protected]>

* Address more reviews

Signed-off-by: Kevin Leimkuhler <[email protected]>

* Move some conditionals to jobs

Signed-off-by: Kevin Leimkuhler <[email protected]>

* Change job name

Signed-off-by: Kevin Leimkuhler <[email protected]>

* Move more conditionals to job level

Signed-off-by: Kevin Leimkuhler <[email protected]>

* Added more flags to 'gcloud container clusters create' and consolidated
'create' and 'destroy' into ' action'

* Run kind cleanup only for non-forked PRs

Signed-off-by: Kevin Leimkuhler <[email protected]>

* Got rid of cloud_cleanup by using a post hook in the gcloud action

* Removed cluster naming responsibility from the gcloud action

* Consolidate .gitignore statements

* Removed bin/_gcp.sh

* Change name of Kind int. test job

Signed-off-by: Kevin Leimkuhler <[email protected]>

* Ensure `kind_cleanup` still runs on cancelled host CI runs

Signed-off-by: Kevin Leimkuhler <[email protected]>

* Add reviews

Signed-off-by: Kevin Leimkuhler <[email protected]>

* Update workflow comment

Signed-off-by: Kevin Leimkuhler <[email protected]>

* Split index.js into setup.js and destroy.js

* trigger build

* Moved the gcloud action into its own repo

* Full version for the gcloud GH action

* Rebase back to master

Signed-off-by: Kevin Leimkuhler <[email protected]>

* Remove additional changes

Signed-off-by: Kevin Leimkuhler <[email protected]>

* Remove additional changes

Signed-off-by: Kevin Leimkuhler <[email protected]>

* Trigger CI

Signed-off-by: Kevin Leimkuhler <[email protected]>

Co-authored-by: Alejandro Pedraza <[email protected]>
alpeb added a commit that referenced this issue Feb 7, 2020
* Allow CI to run concurrent builds in master

The "gcloud" GH action is hosted under its own repo in https://github.com/linkerd/linkerd2-action-gcloud
stale bot closed this as completed Feb 10, 2020
alpeb (Member) commented Feb 10, 2020

This was partially addressed in #4001, and will be finished in #4033

alpeb reopened this Feb 10, 2020
stale bot removed the wontfix label Feb 10, 2020
stale bot commented May 11, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label May 11, 2020
alpeb (Member) commented May 11, 2020

Fixed in #4033!

alpeb closed this as completed May 11, 2020
github-actions bot locked as resolved and limited conversation to collaborators Jul 17, 2021