# Concourse

This documentation is for running and maintaining Concourse. If you want to deploy a change to production, see the guidance on merging and deploying.

All our apps are deployed, and our smoke tests are run, through our own self-managed Concourse instance.

This can be found at https://concourse.notify.tools/ (requires VPN to access)

## User authentication

Authentication is handled via GitHub. See this terraform file for which users get which permissions. Note that this is for the "Notify" team, which controls our pipelines for deploying Notify. There is also the "main" team, which controls changes to the Concourse infrastructure itself. Access for the main team is controlled in this terraform file.

## Making changes to Notify's concourse pipelines

### Using fly

You can use the fly CLI to see and modify pipelines for the Notify team.

```sh
brew install fly

fly login -c https://concourse.notify.tools/ -n notify -t notify
```
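Once you're logged in, you can inspect and update pipelines with the standard fly commands. A minimal sketch follows; the pipeline name and file below are placeholders rather than real Notify pipelines:

```sh
# list the pipelines visible to the notify team
fly -t notify pipelines

# download the current configuration of a pipeline (placeholder name)
fly -t notify get-pipeline -p example-pipeline > example-pipeline.yml

# check a modified config before pushing it
fly -t notify validate-pipeline -c example-pipeline.yml

# push the updated configuration back to Concourse
fly -t notify set-pipeline -p example-pipeline -c example-pipeline.yml
```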

### Working with secrets

When Concourse needs access to secrets, it gets them in one of two ways:

1. Concourse will access our credentials repo and retrieve secrets from it. This is generally used as part of a pipeline task.

2. Concourse will access secrets that we have stored in AWS SSM. This is generally used as part of resource configuration, because we are unable to get secrets from our credentials repo whilst not in a task.

Secrets can then be referenced in resources by using the ((double-bracket syntax)).


To put secrets from our credentials repo into AWS SSM for use outside of tasks, we have a [concourse-secrets pipeline](https://concourse.notify.tools/teams/notify/pipelines/concourse-secrets). This is configured in https://github.com/alphagov/notifications-aws/blob/master/concourse/concourse-secrets-pipeline.yml.j2.

Some secrets are separately put into AWS SSM as part of the creation of Concourse, for example names of S3 buckets that are created for pipelines to put files into. Secrets created in this way start with `readonly`.
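If you need to check what a pipeline will actually pick up from AWS SSM, you can inspect the parameters directly with the AWS CLI. This is a sketch only: it assumes Concourse's default SSM path convention of `/concourse/<team>/<secret>`, and the parameter name shown is a placeholder, so confirm the real names in the account first.

```sh
# list parameters under the notify team's Concourse path (assumed convention)
aws ssm describe-parameters \
  --parameter-filters "Key=Name,Option=BeginsWith,Values=/concourse/notify/"

# read a single (placeholder) secret, decrypting it
aws ssm get-parameter \
  --name "/concourse/notify/example-secret" \
  --with-decryption \
  --query "Parameter.Value" \
  --output text
```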

## Monitoring our concourse instance

You can view metrics on our Concourse CPU usage, worker count, etc. at https://grafana.monitoring.concourse.notify.tools/. Sign in with your GitHub account.

## Making changes to our concourse instance

Our concourse instance is defined in two terraform repositories. They're split for legacy reasons.
Once changes are merged to either of these repos, you will need to trigger the deploy from concourse via the ["deploy" pipeline](https://concourse.notify.tools/teams/main/pipelines/deploy). This will take 20 mins or so and may interrupt running jobs as the worker instances rotate, but is otherwise zero-downtime.
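If you prefer the command line to the web UI, the deploy job can also be triggered with fly. This is a sketch, assuming a fly target logged in to the "main" team; the job name is a placeholder, so list the jobs first:

```sh
# list the jobs in the main team's deploy pipeline
fly -t main jobs -p deploy

# trigger the relevant job and follow its output (job name is a placeholder)
fly -t main trigger-job -j deploy/<job-name> --watch
```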

Concourse runs within the `notify-deploy` AWS environment, and senior developers can assume the role using the gds CLI.
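For example, to run a one-off command against the account rather than opening the console, something like the following should work. This is a sketch; the exact role name you can assume depends on your access level:

```sh
# open the AWS console for the deploy account (as referenced elsewhere on this page)
gds aws notify-deploy-admin -l

# or run a single command with temporary credentials for that role
gds aws notify-deploy-admin -- aws sts get-caller-identity
```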

### Updating the concourse version

Concourse will update itself to the latest version if you unpin the resource here: https://concourse.notify.tools/teams/main/pipelines/deploy/resources/concourse-release (Only notify admins can view and edit this pin)
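The pin can also be managed with fly rather than the web UI, assuming you are an admin with a fly target logged in to the "main" team. A sketch; the version key/value in the last command is illustrative and depends on the resource type:

```sh
# see which versions of the concourse release the resource has found
fly -t main resource-versions -r deploy/concourse-release

# unpin so the pipeline picks up the latest version
fly -t main unpin-resource -r deploy/concourse-release

# re-pin once the upgrade has gone through (version key/value is illustrative)
fly -t main pin-resource -r deploy/concourse-release -v tag:v7.8.3
```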

### [notifications-concourse-deployment](https://github.com/alphagov/notifications-concourse-deployment)

This repo defines some of the variables that you might expect to change, such as the definition of the info pipeline, how many AWS instances concourse has (and of what instance type), which github users have permission to view/edit the pipelines, the GDS IP addresses to allow access from and other similar variables.

This repo also contains instructions for how we created the concourse from scratch and thoughts from Reliability Engineering on how to manage it.

### [notifications-concourse](https://github.com/alphagov/notifications-concourse)

This repo contains terraform that defines how concourse is hosted and how it interacts with itself e.g. ec2 instances, security groups, route53 DNS records, IAM roles, etc.

## Troubleshooting

### If a concourse deploy gets stuck

When applying terraform changes, concourse sometimes gets into a race condition, e.g.

```
no workers satisfying: resource type 'git', version: '2.3'
```


We think this is because all the existing workers have been killed as part of the deployment. It's worth waiting a few minutes to see if new workers become available - try manually starting a new run of the job.

Otherwise, rotating the EC2 workers may have failed. Devs can log in to the AWS console (`gds aws notify-deploy-admin -l`) and manually start an instance refresh on the autoscaling groups.
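If you'd rather stay on the command line than use the console, the instance refresh can be started with the AWS CLI. A sketch, assuming credentials come from the gds CLI; the autoscaling group name is a placeholder you look up first:

```sh
# find the autoscaling groups in the account
gds aws notify-deploy-admin -- aws autoscaling describe-auto-scaling-groups \
  --query "AutoScalingGroups[].AutoScalingGroupName"

# start an instance refresh on the relevant (placeholder) worker group
gds aws notify-deploy-admin -- aws autoscaling start-instance-refresh \
  --auto-scaling-group-name <worker-asg-name>

# check how the refresh is going
gds aws notify-deploy-admin -- aws autoscaling describe-instance-refreshes \
  --auto-scaling-group-name <worker-asg-name>
```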

If this becomes a more common issue, GOV.UK Pay have implemented some changes to make the pipeline more robust that we might want to look into:

* [pull/48](https://github.com/alphagov/pay-concourse/pull/48), [pull/49](https://github.com/alphagov/pay-concourse/pull/49)
* [pull/50](https://github.com/alphagov/pay-concourse/pull/50), [pull/51](https://github.com/alphagov/pay-concourse/pull/51), [pull/52](https://github.com/alphagov/pay-concourse/pull/52)
* [pull/53](https://github.com/alphagov/pay-concourse/pull/53)

### If a worker gets stuck

You can restart all the notify workers here:

> https://concourse.notify.tools/teams/notify/pipelines/info/jobs/start-worker-refresh/

That job requires a notify worker to function - if it doesn't work, you can restart from the "main" pipeline:

> https://concourse.notify.tools/teams/main/pipelines/roll-instances/jobs/roll-notify-concourse-workers/

If _that_ doesn't work, then devs can log into AWS from the VPN (`gds aws notify-deploy-admin -l`) and manually initiate an instance refresh for the worker instances in the EC2 autoscaling groups.
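You can also look at worker state directly with fly. Note that `fly prune-worker` only removes a stalled worker from Concourse's database; the EC2 instance itself still needs to be replaced via one of the refresh routes above. The worker name below is a placeholder:

```sh
# list workers and their state (running / stalled / landing)
fly -t notify workers

# remove a stalled worker so Concourse stops scheduling work on it
fly -t notify prune-worker -w <stalled-worker-name>
```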