
feat: automate performance benchmarking #2

Closed
wants to merge 15 commits into from

Conversation

galargh commented May 25, 2023

This PR updates infrastructure setup and creates a GitHub Action workflow for performance testing (libp2p#183).

Not covered (yet)

It doesn't cover the following items from libp2p#183:

> Trigger the automation via a GitHub comment.

In this iteration, the automation can be triggered by clicking Run workflow in https://github.com/libp2p/test-plans/actions/workflows/perf.yml.

> Push perf/runner/benchmark-results.json to the pull request.

Right now, the results are saved to workflow artifacts on GitHub and to an S3 bucket. Pushing them to the branch the workflow runs on is no problem either; I simply missed this item.

Implementation

Infrastructure

I split the infra into long-lived and short-lived parts. This makes starting/stopping instances in CI easier. I updated the README to reflect the changes I introduced.

Long-lived

Everything that's defined in Terraform. The new things I added there:

  • I replaced the aws_instance resources with an aws_launch_template
  • I added a configuration for an S3 bucket (used for storing result JSONs)
  • I added a configuration for a bot user (its credentials are used in CI)
  • I added a configuration for a Lambda which periodically checks for running instances (it runs every hour and deletes perf EC2 instances that have been running for more than 50 minutes)

This part of infra is supposed to incur only negligible cost.
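As a rough illustration of the launch-template piece, the resource might look something like the sketch below; the name, AMI, and instance type are placeholders, not the actual values from this PR.

```hcl
# Hypothetical sketch of the long-lived launch template; image_id and
# instance_type are placeholders, not the configuration used here.
resource "aws_launch_template" "perf" {
  name          = "perf-node"
  image_id      = "ami-0123456789abcdef0" # placeholder AMI
  instance_type = "m5.xlarge"             # placeholder type

  key_name = aws_key_pair.perf.key_name

  tag_specifications {
    resource_type = "instance"
    tags = {
      Name = "perf-node"
    }
  }
}
```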

Short-lived

It's only the EC2 instances; everything else can be long-lived. That way we don't have to wait for the rest of the infrastructure to come up on every run, nor worry about deleting it every time.

I added a Makefile which takes care of managing short-lived infra.
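A minimal sketch of what such a Makefile could look like, assuming the long-lived launch template already exists; the target names, tag values, and region are hypothetical, not the actual Makefile from this PR.

```makefile
# Hypothetical sketch: manage only the short-lived EC2 instances.
REGION ?= us-west-2

.PHONY: up down

up: # launch perf instances from the long-lived launch template
	aws ec2 run-instances --region $(REGION) \
	  --launch-template LaunchTemplateName=perf-node \
	  --count 2

down: # terminate all running perf instances
	aws ec2 terminate-instances --region $(REGION) --instance-ids \
	  $$(aws ec2 describe-instances --region $(REGION) \
	    --filters "Name=tag:Name,Values=perf-node" "Name=instance-state-name,Values=running" \
	    --query "Reservations[].Instances[].InstanceId" --output text)
```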

GHA

The workflow is defined in perf.yml. For now, I configured it so that it can be triggered on demand or called from another workflow.

To run the workflow in a repository, the long-lived infrastructure has to be set up first. I described exactly how to do that in the workflow file - https://github.com/galorgh/test-plans/blob/f5a25258021b36a1851f1c9719b54395a9733ba2/.github/workflows/perf.yml#L3-L12.

Testing

For testing, I set up the long-lived infrastructure in my AWS account. I'd suggest setting it up in libp2p's AWS account.

Other

You'll be able to trigger the workflow only after it is merged to the default branch.

mxinden (Owner) commented May 26, 2023

@galargh to make collaboration easier, I moved to an upstream branch. Would you mind re-opening this pull request against libp2p#184, more specifically https://github.com/libp2p/test-plans/tree/perf?

mxinden (Owner) commented May 26, 2023

Thank you @galargh for this work!

> I split the infra into long-lived and short-lived parts.

Where would the .tfstate file of the long-lived terraform-provisioned infrastructure live?

> Right now the results are saved to workflow artifacts on GitHub and S3 bucket.

I don't think this is necessary. For the sake of simplicity, I suggest only pushing the results to the branch of the pull request. That gives us a single source of truth, fully version-controlled.

> You'll be able to trigger the workflow only after it is merged to the default branch.

It is necessary to trigger the workflow before merging into the default branch (i.e. master). Otherwise, how would one test a release candidate? Are there any larger blockers to triggering it before the merge?
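For reference, one way to support both manual runs and runs from not-yet-merged branches is to combine workflow_dispatch with workflow_call. This is a sketch, not the exact trigger configuration in this PR; the aws-bucket input is an assumption based on the workflow excerpts below.

```yaml
# Hypothetical trigger section. workflow_dispatch allows manual runs against
# any branch once the workflow file exists on the default branch, and
# workflow_call lets another workflow (e.g. a PR-triggered one) invoke it.
on:
  workflow_dispatch:
    inputs:
      aws-bucket:
        required: false
        type: string
  workflow_call:
    inputs:
      aws-bucket:
        required: false
        type: string
```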

> I successfully ran the GHA workflow in my org - https://github.com/galorgh/test-plans/actions/runs/5080167881/attempts/1#summary-13756441047

😎


# How to configure a repository for running this workflow:
# 1. Run 'make ssh-keygen' in 'perf' to generate a new SSH key pair named 'user' in 'perf/terraform/region/files'
# 2. Export AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY for the account of your choice
mxinden (Owner):

Suggested change
# 2. Export AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY for the account of your choice
# 2. Configure your AWS credentials, e.g. by exporting AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY or writing `~/.aws/credentials` for the account of your choice

Either should still work, right?

galargh (Author):

Yes, of course. I'll link to https://registry.terraform.io/providers/hashicorp/aws/latest/docs#authentication-and-configuration instead, which lists all the available options.

Comment on lines +62 to +65
- name: Configure SSH
  uses: webfactory/ssh-agent@d4b9b8ff72958532804b70bbe600ad43b36d5f2e # v0.8.0
  with:
    ssh-private-key: ${{ secrets.PERF_SSH_PRIVATE_KEY }}
mxinden (Owner):

I guess with the move to AWS launch templates there is no easy way for ephemeral SSH keys? Ephemeral keys would eliminate the need to manage long lived credentials. Please ignore in case ephemeral keys would add more complexity.

galargh (Author):

We could keep generating keys on the fly, but we'd still have to manage the long-lived AWS credentials. And if the SSH keys are ephemeral, those AWS credentials have to be allowed to create AWS key pairs.

I'll add instructions on what it'd take to switch between the two next to the aws_key_pair resource and we can decide then. I don't feel strongly either way.

Comment on lines +95 to +108
- name: Archive results
  uses: actions/upload-artifact@v2
  with:
    name: results
    path: perf/runner/benchmark-results.json
- id: s3
  name: Upload results
  env:
    AWS_BUCKET: ${{ inputs.aws-bucket || vars.PERF_AWS_BUCKET }}
    AWS_BUCKET_PATH: ${{ github.repository }}/${{ github.run_id }}/${{ github.run_attempt }}/benchmark-results.json
  run: |
    aws s3 cp benchmark-results.json s3://$AWS_BUCKET/$AWS_BUCKET_PATH --acl public-read --region us-west-2
    echo "url=https://$AWS_BUCKET.s3.amazonaws.com/$AWS_BUCKET_PATH" >> $GITHUB_OUTPUT
  working-directory: perf/runner
mxinden (Owner):

I don't think these steps are necessary. See comment above.

galargh (Author):

Sure! I'll replace it with a push.
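A sketch of such a replacement step, committing the results file to the branch the workflow checked out; the step name, bot identity, and commit message are placeholders, not the eventual implementation.

```yaml
# Hypothetical workflow step: commit benchmark-results.json back to the
# checked-out branch instead of uploading it to S3.
- name: Push results to branch
  run: |
    git config user.name "github-actions[bot]"
    git config user.email "github-actions[bot]@users.noreply.github.com"
    git add perf/runner/benchmark-results.json
    git commit -m "chore: add benchmark results"
    git push
```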

Comment on lines +24 to +26
# https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/serverless-sam-cli-install.html
scale-down:
	sam local invoke ScaleDown --template terraform/common/files/scale_down.yml --event terraform/common/files/scale_down.json
mxinden (Owner):

Where is this Make target called?

galargh (Author):

I used it for testing the Lambda code. I'll move this next to where all the other Lambda files are and add a clearer description.
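For context, the scale-down logic being exercised with sam local invoke could be sketched roughly as below. The function names, tag filter, and 50-minute threshold follow the PR description, but the code itself is an assumption, not the actual Lambda source.

```python
# Hypothetical sketch of the scale-down Lambda: terminate perf EC2 instances
# that have been running for more than 50 minutes.
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(minutes=50)


def instances_to_terminate(instances, now=None):
    """Return IDs of running instances whose LaunchTime exceeds MAX_AGE."""
    now = now or datetime.now(timezone.utc)
    return [
        i["InstanceId"]
        for i in instances
        if i["State"]["Name"] == "running" and now - i["LaunchTime"] > MAX_AGE
    ]


def handler(event, context):
    import boto3  # assumed available in the Lambda runtime

    ec2 = boto3.client("ec2")
    reservations = ec2.describe_instances(
        Filters=[{"Name": "tag:Name", "Values": ["perf-*"]}]  # hypothetical tag
    )["Reservations"]
    instances = [i for r in reservations for i in r["Instances"]]
    ids = instances_to_terminate(instances)
    if ids:
        ec2.terminate_instances(InstanceIds=ids)
    return {"terminated": ids}
```

Keeping the age check in a pure function makes it easy to unit test without touching AWS.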

@@ -10,21 +10,51 @@ Benchmark results can be visualized with https://observablehq.com/@mxinden-works

## Provision infrastructure

1. `cd terraform`
2. Save your public SSH key as the file `./user.pub`.
### Bootstrap
mxinden (Owner):

Suggested change
### Bootstrap
### Bootstrap (long-lived resources)

export AWS_SECRET_ACCESS_KEY=$(cat perf_accessKeys.csv | tail -n 1 | cut -d, -f2)

### Nodes
mxinden (Owner):

Suggested change
### Nodes
### Start nodes (short-lived resources)

mxinden (Owner) commented May 26, 2023

Concerns splitting into long-lived and short-lived

I am not sure the split into short-lived and long-lived resources introduced here is a good idea.

  • It adds complexity. I.e. one now has to reason about two sets of resources.
  • It adds an additional step to testing against a personal AWS account. (Spin up both long-lived and short-lived resources.)
  • Testing against a personal AWS account doesn't require the AWS Lambda.
  • It speeds things up, as everything but the machines no longer need to be spun up per run. Though I don't think the reduced time-to-run matters much for us. As far as I can tell, spinning up the machines is the dominant time factor anyways.
  • As far as I can tell, it introduces the need for a long lived SSH keypair through the AWS launch template.
  • It requires us to persist the .tfstate file for the long-lived resources somewhere. (Though arguably we have to do so for the Lambda anyways.)

Alternative approach

Have two terraform projects.

  1. Again, long-lived, deploying the Lambda only.
  2. Short-lived deploying all other resources, introducing the machines themselves.

Benefits:

  • Allows easy testing against personal AWS account. One does not have to bother with the terraform project launching the Lambda. A single terraform apply is enough.
  • No long-lived SSH credentials.
  • Only necessary to persist the .tfstate file of the terraform project deploying the Lambda.
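Under this alternative, the layout might look roughly like the sketch below; the directory names are hypothetical.

```
perf/terraform/
├── lambda/   # long-lived: only the scale-down Lambda; its .tfstate is persisted
└── perf/     # short-lived: key pair, instances, everything else; state is throwaway
```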

@galargh let me know what you think. I am in no way an expert. Ultimate goal is to keep things simple. In case the above just adds complexity, please ignore.

galargh (Author) commented May 26, 2023

> @galargh to make collaboration easier, I moved to an upstream branch. Would you mind re-opening this pull request against libp2p#184, more specifically https://github.com/libp2p/test-plans/tree/perf?

Sure! I'll reopen the PR against that branch and I'll address all the non-inline comments there. Closing this.
