Concourse
This documentation is for running and maintaining Concourse. If you want to deploy a change to production, see the guidance on merging and deploying.
All our apps are deployed, and our smoke tests run, through our own self-managed Concourse instance.
This can be found at https://concourse.notify.tools/ (requires VPN to access).
Authentication is handled via GitHub. There are a few teams within Concourse:
- "Notify"
- Pipelines for deploying Notify across all environments. (Also updating those pipelines themselves)
- managed in https://github.com/alphagov/notifications-concourse-deployment/blob/main/terraform/deployments/concourse/team-notify.tf
- "Main"
- Pipelines for deploying Concourse itself (eg to change the boxes that concourse runs on, change the list of users who are allowed to access concourse)
- managed via the `main_team_github_users` (superadmins of the entire Concourse) and `main_team_pipeline_operator_github_users` vars in https://github.com/alphagov/notifications-concourse-deployment/blob/main/terraform/deployments/concourse/site.tf
- "Dev-[a-d]"
- Pipelines for deploying Notify to dev environments (to test changes to infra, etc, without impacting the main pipelines)
- managed via https://github.com/alphagov/notifications-concourse-deployment/blob/main/terraform/deployments/concourse/team-dev.tf
There are a few user classes. They are strictly ordered, so each class can also do everything the classes below it can (for example, members can also operate and view pipelines):
- owner - can update user auth, create teams, and do anything else
- member - can update pipelines etc. via the `fly` CLI
- pipeline_operator - can trigger pipelines, pause pipelines, pin resources, and see progress, but can't update pipelines via the `fly` CLI
- viewer - can only view pipeline progress
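The strict ordering can be pictured as a simple rank comparison. The sketch below is purely illustrative: `role_rank` and `role_allows` are names made up for this example and are not part of Concourse or `fly`, which enforce these roles themselves.

```shell
# Rank each Concourse user class; a higher rank includes everything below it.
role_rank() {
  case "$1" in
    viewer)            echo 1 ;;
    pipeline_operator) echo 2 ;;
    member)            echo 3 ;;
    owner)             echo 4 ;;
    *)                 echo 0 ;;  # unknown class: no access
  esac
}

# role_allows HELD REQUIRED
# Succeeds if the HELD class is at least as powerful as the REQUIRED class.
role_allows() {
  held=$(role_rank "$1")
  required=$(role_rank "$2")
  [ "$required" -gt 0 ] && [ "$held" -ge "$required" ]
}

role_allows member pipeline_operator && echo "member can trigger pipelines"
role_allows viewer member || echo "viewer cannot update pipelines"
```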
You can use the `fly` CLI to see and modify pipelines for the Notify team.
brew install fly
fly login -c https://concourse.notify.tools/ -n notify -t notify
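Once logged in, a handful of `fly` subcommands cover most viewing and editing. The wrappers below are a sketch rather than an established team script: the function names are hypothetical, and the `FLY` override is an addition that lets you preview a command (for example with `FLY=echo`) instead of running it.

```shell
# Path to the fly binary; override (e.g. FLY=echo) to preview commands.
FLY="${FLY:-fly}"
# The target name saved by `fly login ... -t notify`.
TARGET="notify"

# List every pipeline visible to the logged-in team.
list_pipelines() { "$FLY" -t "$TARGET" pipelines; }

# Print the current configuration of one pipeline.
show_pipeline() { "$FLY" -t "$TARGET" get-pipeline -p "$1"; }

# Upload a new configuration file for a pipeline.
update_pipeline() { "$FLY" -t "$TARGET" set-pipeline -p "$1" -c "$2"; }
```

For example, `show_pipeline example-pipeline` prints a pipeline's configuration, where `example-pipeline` is a placeholder name. `update_pipeline` needs at least the member user class, because it changes the pipeline itself.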
When Concourse needs access to secrets, it gets them in one of two ways:
- Concourse will access our credentials repo and retrieve secrets from it. This is generally used as part of a pipeline task.
- Concourse will access secrets that we have stored in AWS SSM. This is generally used as part of resource configuration, because we are unable to get secrets from our credentials repo outside of a task.
Secrets can then be referenced in resources by using the `((double-bracket syntax))`.
To put secrets from our credentials repo into AWS SSM for use outside of tasks, we have a concourse-secrets pipeline. This is configured in https://github.com/alphagov/notifications-aws/blob/master/concourse/concourse-secrets-pipeline.yml.j2.
Some secrets are separately put into AWS SSM as part of the creation of Concourse, for example names of S3 buckets that are created for pipelines to put files into. Secrets created in this way start with `readonly`.
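When a `((var))` fails to resolve, it can help to know which SSM parameter names Concourse would look for. The sketch below assumes this instance uses Concourse's default SSM path templates (pipeline-scoped first, then team-scoped), which this page does not confirm. The function name and the example arguments are hypothetical.

```shell
# ssm_lookup_paths TEAM PIPELINE SECRET
# Print the SSM parameter names tried for ((SECRET)), most specific first.
# Assumes Concourse's default templates:
#   /concourse/TEAM/PIPELINE/SECRET, then /concourse/TEAM/SECRET
ssm_lookup_paths() {
  team="$1"
  pipeline="$2"
  secret="$3"
  echo "/concourse/${team}/${pipeline}/${secret}"
  echo "/concourse/${team}/${secret}"
}

ssm_lookup_paths notify example-pipeline example-secret
```

Each printed name can then be checked with `aws ssm get-parameter --name <path> --with-decryption`, given credentials for the account that Concourse runs in.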
You can view metrics for our Concourse instance (CPU usage, worker count, etc.) at https://grafana.monitoring.concourse.notify.tools/. Sign in via your GitHub account.
Our concourse instance is defined in two terraform repositories. They're split for legacy reasons. Once changes are merged to either of these repos, you will need to trigger the deploy from concourse via the "deploy" pipeline. This will take 20 mins or so and may interrupt running jobs as the worker instances rotate, but is otherwise zero-downtime.
Concourse runs within the `notify-deploy` AWS environment, and the role can be assumed using the gds cli by senior developers.
Concourse will update itself to the latest version if you unpin the resource here: https://concourse.notify.tools/teams/main/pipelines/deploy/resources/concourse-release (Only notify admins can view and edit this pin)
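The same unpin can be done from the command line. This is a sketch that assumes you have permission to log in to the "main" team. The target name `main` and the function names are hypothetical; the pipeline and resource names are taken from the URL above.

```shell
# Path to the fly binary; override (e.g. FLY=echo) to preview commands.
FLY="${FLY:-fly}"
# Hypothetical fly target name for the "main" team.
MAIN_TARGET="${MAIN_TARGET:-main}"

# Log in to the "main" team and save it under MAIN_TARGET.
login_main() {
  "$FLY" login -c https://concourse.notify.tools/ -n main -t "$MAIN_TARGET"
}

# Remove the version pin on the concourse-release resource in the deploy pipeline.
unpin_concourse_release() {
  "$FLY" -t "$MAIN_TARGET" unpin-resource -r deploy/concourse-release
}
```

Run `login_main` once, then `unpin_concourse_release`.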
This repo defines some of the variables that you might expect to change, such as the definition of the info pipeline, how many AWS instances concourse has (and of what instance type), which github users have permission to view/edit the pipelines, the GDS IP addresses to allow access from and other similar variables.
This repo also contains instructions for how we created the concourse from scratch and thoughts from Reliability Engineering on how to manage it.
This repo contains terraform that defines how concourse is hosted and how it interacts with itself e.g. ec2 instances, security groups, route53 DNS records, IAM roles, etc.
When applying terraform changes, Concourse sometimes gets into a race condition, for example:
no workers satisfying: resource type 'git', version: '2.3'
We think this is because all the existing workers have been killed as part of the deployment. It's worth waiting a few minutes to see if new workers become available - try manually starting a new run of the job.
Otherwise, rotating the EC2 workers may have failed. Devs can log in to the AWS console (`gds aws notify-deploy-admin -l`) and manually start an instance refresh on the autoscaling groups.
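The "wait a few minutes, then try again" advice can be scripted as a small polling loop. This is a sketch: `retry` and `workers_running` are hypothetical helpers, and the check assumes that `fly workers` shows the word `running` in its state column for healthy workers.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND...
# Re-run COMMAND until it succeeds, giving up after ATTEMPTS tries.
retry() {
  attempts="$1"
  delay="$2"
  shift 2
  n=1
  while ! "$@"; do
    [ "$n" -ge "$attempts" ] && return 1
    n=$((n + 1))
    sleep "$delay"
  done
}

# Succeeds once at least one worker reports the "running" state.
workers_running() {
  fly -t notify workers | grep -q running
}
```

For example, `retry 10 30 workers_running` polls every 30 seconds for up to about five minutes. Once it succeeds, start a new run of the failed job from the web UI or with `fly trigger-job`.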
If this becomes a more common issue, GOV.UK Pay have implemented some changes to make the pipeline more robust that we might want to look into:
You can restart all the notify workers here:
https://concourse.notify.tools/teams/notify/pipelines/info/jobs/start-worker-refresh/
That job requires a notify worker to function - if it doesn't work, you can restart from the "main" pipeline:
If that doesn't work, then devs can log into AWS from the VPN (`gds aws notify-deploy-admin -l`) and manually initiate an instance refresh for the worker instances in the EC2 autoscaling groups.
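For those who prefer the command line, the first and last of these steps might look like the sketch below. It assumes `fly` is logged in to the `notify` target and that the AWS CLI already has credentials for the `notify-deploy-admin` role; how you obtain those is not covered here. The function names are hypothetical, and the autoscaling group names have to be looked up because this page does not list them.

```shell
# Paths to the CLIs; override (e.g. FLY=echo AWS=echo) to preview commands.
FLY="${FLY:-fly}"
AWS="${AWS:-aws}"

# First option: trigger the worker refresh job and watch its output.
refresh_via_concourse() {
  "$FLY" -t notify trigger-job -j info/start-worker-refresh -w
}

# Fallback: list autoscaling group names so you can pick out the worker groups.
list_asgs() {
  "$AWS" autoscaling describe-auto-scaling-groups \
    --query 'AutoScalingGroups[].AutoScalingGroupName' --output text
}

# Fallback: start an instance refresh on one named autoscaling group.
refresh_asg() {
  "$AWS" autoscaling start-instance-refresh --auto-scaling-group-name "$1"
}
```

Try `refresh_via_concourse` first. Fall back to `list_asgs` and `refresh_asg <group-name>` only if no notify worker is available to run the job.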