
Support Runbook


Table of contents

Updating services/orgs

Letters

Requests for stats

Infra related stuff (AWS, Redis, etc)

Document download

Other


Handle any security vulnerabilities and dependency updates

[this is not a daily support task yet, we need to clean up and prepare for it, so we start with a clean slate]

When you are on support, spend some time making sure that our dependencies are secure and up to date:

  1. If there is a Zendesk ticket about security vulnerabilities, action it like you would any other ticket - try to merge in patches that make those dependencies secure again, and if you are too busy, ask other team members for help.
  2. Look at the list of open PRs across our repos and process any open dependency update PRs. Again, if you are too busy to do it on your own, ask for help, don’t let it slide.

How to process a dependency update PR:

  1. Skim through the changelog for the dependency to see if there are any things that could obviously affect us
  2. Lock down any sub-dependencies
  • Python: Check out the branch, run make freeze-requirements, commit the resulting requirements.txt (see the sketch after this list), and fix any errors returned by the Concourse build.
  • Node: the Dependabot commit already includes all the changes to package.json and package-lock.json needed to update sub-dependencies.
  • Terraform: Dependabot likely does this for us, as with Node.
  3. If tests are passing and there are no obvious issues, merge and deploy the PR.
  4. If you find issues, try to fix them if you have time; if you are busy, either ask the team for help or write up and prioritise a Pivotal ticket for that update.
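
For the Python case above, a minimal shell sketch of the flow (the branch name is a placeholder, and this assumes the repo's freeze-requirements make target described above):

# check out the Dependabot branch and re-pin sub-dependencies
git fetch origin
git checkout <dependabot-branch-name>
make freeze-requirements

# commit the regenerated requirements.txt and push; Concourse will then re-run the build
git add requirements.txt
git commit -m "Freeze requirements"
git push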

Changing service rate limits

Based on decisions made in a team discussion.

Any request to increase a rate limit should be posted in Slack to allow the team to comment. You don’t have to wait for comments from the team or further information from the service if the request is about email or SMS and:

  • The messages are time-critical e.g. MFA, Covid results.
  • The service has started getting 429 responses and can’t handle them (yet).
  • The service wants to increase their daily limit to 5M or below.

If the requested increase is not specific (e.g. if the service is having problems and just needs some more headroom) or the reasoning for a specific increase is unclear, suggest a low initial increase of 50% on the default limit e.g. 3000/min => 4500/min.

After taking any immediate action (see above) you should ask the service how long they need the new limits and why. Make the increase temporary if possible. You only have to wait for comments from the team if the increase is for longer than a month.

All requests should be recorded in a Zendesk ticket, even if they come from elsewhere. To make the increase temporary, set “Type” to “Task” and “Due date” to when it should expire, then document the previous limits in an internal note, and submit as “Pending”.

It’s worth checking we’re still within our capacity:

  • Email: check daily emails are still well below our daily SES capacity (30M).
  • SMS: check daily texts are still well below our daily SMS capacity (~10M).

Rerun a task in production

gds aws notify-preview-admin -- ecs-exec.sh
flask shell
# then run the appropriate task, for example
from app.celery.letters_pdf_tasks import collate_letter_pdfs_to_be_sent
collate_letter_pdfs_to_be_sent.apply_async(queue='periodic-tasks')

Deal with services with high failure rates or sending sms to tv numbers

Relates to an automated ticket in Zendesk.

This means the service is wasting a lot of money sending SMS notifications that aren't being received. Even though the notifications aren't delivered (permanent-failure), we still get billed for them.

Possible reasons for the failures include:

  • The service is using a live API key and sending to TV numbers. TV numbers are fake and in the range 07700 900000 to 900999. The service should use a test API key, or use the specific numbers we give out for smoke testing.

  • The notifications are incorrectly being marked as spam and blocked at the provider/receiver end.

  • The service has out-of-date phone records for their users. In this case we should check they're aware of the failures, and use another contact method to update their records.

You should do a little investigation to see if there's a pattern in the failures. In all cases, you should email the service from your normal work email to explain the problem and suggest a next step. Send the email to at least two of the admins for the service in question.
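
To look for a pattern, a hedged query sketch (the service id and time window are placeholders; it only uses columns already referenced elsewhere in this runbook):

SELECT date(created_at) AS day, template_id, notification_status, count(*)
  FROM notifications
 WHERE service_id = '<service_id>'
   AND notification_type = 'sms'
   AND key_type = 'normal'
   AND created_at >= date(now()) - interval '7 days'
 GROUP BY 1, 2, 3
 ORDER BY 1, 4 DESC;

A high proportion of permanent-failure rows against a single template is a good hint about which sends to ask the service about.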

Deal with letter manually printed by postal provider

Sometimes we get an email from our postal provider to say they had to manually print a letter. This can happen when, for example, the letter includes a font smaller than 6pt. If they have manually printed the letter, we will not get a delivery receipt for it.

Typically they attach the letter to the email, which means we can find the notification id from the db.

Select * from notifications where reference=?

Then update the letter to delivered.

UPDATE notifications
SET notification_status = 'delivered', updated_at=now()
where id = ?

Deal with Letters still in sending

We have an alert that runs at 17:00 and creates a Zendesk ticket if there are letters from several days ago that are still in sending.

The ticket CCs a Google Group which will automatically notify DVLA about this. Any public replies we make on the ticket will also go to the group. Any replies from DVLA will appear as internal notes on the ticket and the replier will be added to the CC list.

  1. Check if the receipts have arrived since the ticket was created. If they have, go to the last step

    SELECT count(id),  date(sent_at) sent_at, notification_status
       FROM notifications
    WHERE notification_type = 'letter'
        AND key_type='normal'
        AND sent_at >= date(now()) - interval '7 days'
    GROUP BY date(sent_at), notification_status
    ORDER BY 2;
  2. Check if all the tasks to process files have been completed.

    Search in Kibana for "rs.txt", and make sure that there are equal numbers of each of the following 5 log lines:

    Received file from DVLA: root/dispatch/<filename>-rs.txt
    update_letter_notifications_statuses has started for filename root/dispatch/<filename>-rs.txt
    update_letter_notifications_statuses has finished for filename root/dispatch/<filename>-rs.txt
    record_daily_sorted_counts has started for filename root/dispatch/<filename>-rs.txt
    record_daily_sorted_counts has finished for filename root/dispatch/<filename>-rs.txt
    

    One way to tell is that if the number of log lines doesn't divide by 5, then we are missing something!

    If we are missing a log line, you'll need to re-run those tasks.

    from app.celery.tasks import (
        record_daily_sorted_counts,
        update_letter_notifications_statuses,
    )
    from app.config import QueueNames
    
    for filename in [
        "root/dispatch/Notify-12341234123412341234-rs.txt", ......
    ]:
        # these tasks are idempotent so can be run twice
        record_daily_sorted_counts.apply_async([filename], queue=QueueNames.NOTIFY)
        update_letter_notifications_statuses.apply_async([filename], queue=QueueNames.NOTIFY)

    Check that everything has been updated to sending successfully and then go to the last step.

  3. If the receipts have not arrived, send a public reply to the ticket asking DVLA for an update on the missing receipts.

    If DVLA are still having trouble identifying the letters that are still in sending then you could attach a CSV with a list of affected letters:

    SELECT service_id, postage, template_id, array_agg(reference) as references
    FROM notifications
    WHERE
      notification_type = 'letter' AND
      notification_status = 'sending' AND
      key_type='normal' AND
      date(sent_at) = '<date>'
    GROUP BY service_id, postage, template_id
  4. Once DVLA have confirmed that they have sent missing delivery receipts, double-check it in the DB using the query above.

  5. Make a public reply in the Zendesk ticket (so DVLA know) and submit it as solved.

Deal with Letters still in created

We have an alert that runs at 07:00 every morning, Monday to Friday. There can be more than one reason that the letters did not get picked up in the nightly job to send to the postal provider. In most cases I recommend re-running the collate-letter-pdfs-to-be-sent task. This will look for letters due to be sent and is safe to run the next morning.

It's worth checking the content of the letter and the service that sent it, to help decide.

You should consider re-running the task if:

  • it's Friday, so the delay users experience will be more than 1 day
  • the content of the letter sent is of high importance (ie. information the user needs to know as soon as possible) or time-based to a near date (ie. about an appointment in the next week)
  • you view the ticket early in the morning (DVLA can sometimes still print the letters early enough along with the previous night's batch)
  • any of the letters have first-class postage

It might not be worth running the task if:

  • the intent of the letter wouldn't be affected by a day's delay (ie. copies of documents sent for legal purposes or marketing letters)
  • the impact to the service is low due to number of letters affected (ie. it's a large-volume service sending a non-critical letter and only a few were still in created)

You can use this query to find all the notifications that are still in created but should be in sending:

SELECT count(*)
 FROM notifications
WHERE created_at <= '<SOME DATE YYYY-MM-DD> 1<6/7>:30'
  AND notification_status = 'created'
  AND notification_type = 'letter';

To send the letters:

gds aws notify-prod-admin -- ecs-exec.sh
flask shell
from datetime import datetime
from app.celery.letters_pdf_tasks import collate_letter_pdfs_to_be_sent
from notifications_utils.timezones import convert_bst_to_utc

# year/month/day should be the day from which letters are still sending
# if you're here from a morning support ticket, then it's likely yesterday.
print_run_deadline_utc = convert_bst_to_utc(datetime(<year>, <month>, <day>, 17, 30, 0)).isoformat()
collate_letter_pdfs_to_be_sent.apply_async(kwargs={"print_run_deadline_utc_str": print_run_deadline_utc}, queue="periodic-tasks")

Deal with letters not having been sent to DVLA

The letter life cycle is complicated and has many moving parts. When there is an issue with sending the files to DVLA there are a few things to consider and the actions you take will depend on what state everything is in. If you've been alerted that DVLA has not received all the files:

  1. Assess what happened:
  • Did the letter PDF get virus scanned/sanitised/generated in time? If the letter was created before 5:30 but its updated_at was after 5:50, that could be the issue (see the query sketch at the end of this section).
  • Did the task try to send that letter via the HTTPS API but crash?
  2. Decide whether to resend letters or wait until the next day’s collate task. Questions to ask:
  • Did 1st class letters get sent before the app failed?
  • What time of day is it? It is usually worth resending in the morning, but maybe not in the afternoon. Late in the afternoon it could even interfere with the regular task at 5:30PM.
  3. See if letters were sent over successfully. You can do this by checking whether those notifications changed status from created to sending. You can use this query:
SELECT count(id),  date(created_at) sent_at, notification_status
  FROM notifications
 WHERE notification_type = 'letter'
   AND key_type='normal'
GROUP BY date(created_at), notification_status
ORDER BY 2;
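
For the first step, a hedged query sketch for spotting letters that may have been sanitised too late for the print run (the date is a placeholder, and the times may need adjusting for BST/UTC and for where the letters actually got stuck):

SELECT id, service_id, created_at, updated_at
  FROM notifications
 WHERE notification_type = 'letter'
   AND key_type = 'normal'
   AND notification_status = 'created'
   AND created_at <= '<YYYY-MM-DD> 17:30'
 ORDER BY updated_at DESC;

Letters whose updated_at is well after the cut-off probably finished sanitisation too late to make that night's run.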

Updating letters or other notifications to either technical-failure or permanent-failure

We no longer manually set letters to technical-failure.

Any letter that cannot be printed should be set to permanent-failure.

  1. Check for a delivery receipt for the letter

Letters that fail to print will still be overridden as delivered:

We tend to send all print requests back in the response if they make it through to print. We do not currently have it set up to mark the print as rejected in the response file.

Only time we don't respond within a response file is if the original data fails in Composition (pre-print).

If the letter is still in sending, check with DVLA if they would send a delivery receipt for it or not. If not, proceed to update letter status. If we are still waiting for a delivery receipt, create a Zendesk ticket / reminder:

  • Set yourself as the requester.
  • Assign "3rd Line--Notify Support".
  • Set the "Form" to "Notify support".
  • Add a tag of "govuk_notify_support" (so it shows up in our filters).
  • Set “Type” to “Task”.
  • Set “Due date” to the next day.
  • Set the title to "Update letter to failed".
  • Add all the details to the description.
  • Submit the ticket as "Pending".
  2. Now update the letter to permanent-failure
UPDATE notifications
   SET notification_status = 'permanent-failure', updated_at=now()
 WHERE id = 'id-of-letter';
  3. Rebuild ft_billing data for that service and day

Run flask command rebuild-ft-billing-for-day --service-id <SERVICE_ID> --day YYYY-MM-DD. This will delete that service's ft_billing data for that day and recalculate it in full.

Cancelling letters for a service

If a service sent a batch of letters by CSV, they should be able to cancel that job themselves. But if they sent the letters by API, there is no easy way to cancel a large number of letters in the Notify interface, except one by one. In that case, we can help the service user cancel the letters by running the following queries:

  1. Check that you have the scope right, based on the number of letters the user told you to cancel:
SELECT count(*) FROM notifications
WHERE notification_status = 'created'
AND service_id = '04cca2c6-3352-4da9-8c64-8782d7e47bf6'
AND notification_type = 'letter'
AND created_at BETWEEN '<start datetime>' AND '<end datetime>';

Only letters in created status can be cancelled, and only before the 5:30PM cut-off (unless they were created after the cut-off).

  2. If the scope is right, cancel the letters (you will need to make a security notification for this query):
UPDATE notifications
SET notification_status = 'cancelled', updated_at = now()
WHERE notification_status = 'created'
AND service_id = '04cca2c6-3352-4da9-8c64-8782d7e47bf6'
AND notification_type = 'letter'
AND created_at BETWEEN '<start datetime>' AND '<end datetime>';

Log into the DB as the admin/root postgres user

This is not our standard way to access our database and should only be used for special debugging when you need to see specific queries being run by one of our applications.

Run:

gds aws notify-prod-admin -- db-connect.sh --role admin notifydb -- psql

Add new inbound SMS numbers

We may run out of inbound numbers from time to time. If we do, we'll need to provision more from MMG. Email their support email address requesting additional numbers. They'll send back a CSV containing new numbers.

First, you'll need to go into the CSV and make sure all numbers are written 07xxxxxxxxx rather than 447xxxxxxxxx or any other variant.
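
A minimal Python sketch for that clean-up (the filenames are placeholders; it just rewrites 447… and +447… numbers as 07…):

import csv

with open("mmg_numbers.csv") as f_in, open("new_inbound_numbers.csv", "w", newline="") as f_out:
    writer = csv.writer(f_out)
    for row in csv.reader(f_in):
        cleaned = []
        for value in row:
            number = value.strip().replace(" ", "")
            if number.startswith("+447"):
                number = "0" + number[3:]
            elif number.startswith("447"):
                number = "0" + number[2:]
            cleaned.append(number)
        writer.writerow(cleaned)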

To upload them, ssh on to an ECS instance and run the following commands. (Note: the app you ssh on to is not important as long as it is an API app, but reporting is a good choice since it doesn't do any work during the day.)

gds aws notify-prod-admin -- ecs-exec.sh

# paste numbers into a local file
vim ./new_inbound_numbers.csv

flask command insert-inbound-numbers -f ./new_inbound_numbers.csv

You should then test the numbers are good by sending a message to them using the CSV (you'll need to add a phone_number header row) and this template:

https://www.notifications.service.gov.uk/services/d6aa2c68-a2d9-4437-ab19-3ae8eb202553/templates/39d2f2d6-90eb-40b1-893a-5557868af95f.

You should expect to see delivery receipts for the messages sent. The messages won't go anywhere as they won't be assigned to services yet, but there should be logs in kibana reading Inbound number "{}" from mmg not associated with a service

Update provider priority

To update the resting priority points for the SMS providers, update the configuration and redeploy.

How it works

When we send an SMS we choose which provider to use via a weighted random selection, based on the priority set for each provider. Each provider has a resting priority set in the config file at app start-up. There are a couple of ways the priority levels can change:

If the provider returns a 500 when we post to it, we reduce its priority level. We can also manually change the priority levels from the platform admin provider page.

Every five minutes a task resets the priorities to their resting levels if they have been changed, either because of an exception from a provider or by a platform admin user.
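
A minimal sketch of how the selection works (illustrative only — the provider names and numbers here are made up, not the real configuration or implementation):

import random

# resting priorities loaded from config at start-up; higher number = more traffic
priorities = {"mmg": 60, "firetext": 40}

def choose_provider():
    # weighted random selection based on the current priority levels
    providers = list(priorities)
    weights = [priorities[p] for p in providers]
    return random.choices(providers, weights=weights, k=1)[0]

# a 500 from a provider knocks its priority down...
priorities["mmg"] = max(priorities["mmg"] - 10, 0)
# ...and the five-minute task resets everything back to the resting levels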

SSH into one of our ECS instances

Follow installation instructions as found here: https://github.com/alphagov/notifications-aws/blob/main/scripts/ecs-exec/README.md

gds aws notify-prod-admin -- ecs-exec.sh

Note that you'll need to run a cyber security notification process if you're logging on to production

Query Redis

We don't offer a direct connection to Redis at the moment. Instead you need to SSH into the admin app and query it using the redis Python package.

gds aws notify-prod-admin -- ecs-exec.sh
flask shell

You can then use any of the redis commands listed at https://redis-py.readthedocs.io/en/latest/commands.html. For example:

from app.extensions import redis_client
redis_client.redis_store.keys("*")
redis_client.redis_store.get("your-key-name")
redis_client.redis_store.delete("your-key-name")

Deal with Redis outages

When Redis is down, you do not need to do anything immediately, because Notify doesn't need Redis to run. Avoid processing support tickets while Redis is down, as this will increase the impact when it comes back.

Data in Redis may be stale when it comes back. To assess the impact, look for logs of the form "Redis error performing delete on". A deletion happens when the original record is being changed; the cache is then repopulated with a "set" the next time the record is queried. For each failed deletion, you will need to figure out where the data for that key is used.

This runbook was written following a 5 second outage.
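
For context, a minimal cache-aside sketch of the pattern described above (the key name and helper functions are hypothetical, not the real Notify code):

import json

def get_service(service_id, db, redis_client):
    # read path: try the cache first, fall back to the database and repopulate
    key = f"service-{service_id}"  # hypothetical key name
    cached = redis_client.get(key)
    if cached:
        return json.loads(cached)
    service = db.fetch_service(service_id)
    redis_client.set(key, json.dumps(service))
    return service

def update_service(service_id, changes, db, redis_client):
    # write path: change the record, then delete the cache entry
    db.update_service(service_id, changes)
    # if Redis is down this delete fails, leaving a stale entry behind
    redis_client.delete(f"service-{service_id}")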

Manage SES email sender static IPs

We send from a pool of static IPs in SES. This is so that we can control our email sending more tightly. Each IP address can send up to x emails per second (how much?), so we have 25 IP addresses so that we can support a burst volume of up to x concurrent email sends.

If we need to increase this amount, we have to create a manual support ticket with AWS support. IP addresses take a while (how long?) to "warm up" (they ramp up sending emails gradually so that mailboxes don't view their sending patterns as suspicious), so we should try to do this in advance if possible.

The IP addresses can be viewed here:

https://eu-west-1.console.aws.amazon.com/ses/home?region=eu-west-1#/dedicated-ips?tabId=standard-pools

If the IP addresses are amended (either new ones added or old ones removed) we should contact Egress, who maintain CJSM (Criminal Justice Secure Mail). Ask in the team for the email address of who to contact for that.

Deal with AWS Shield events (DDoS mitigation)

We have configured AWS Shield Advanced on our public-facing Cloudfront distributions, and enabled automatic application layer DDoS mitigation. This means that when AWS detects something that looks like an application-layer DDoS attack (for example, a request flood), a WAF rule will automatically be added to the ACL for that Cloudfront distribution to try to stop the attack. When this happens, a CloudWatch alarm will be triggered, which in turn will raise an incident on PagerDuty.

If this happens, the first thing you'll want to do is figure out if this is a genuine DDoS attack, as this alarm can sometimes be triggered by large volumes of requests from a single one of our users (potentially due to a service team testing their integration with our API). You might do this by checking if the traffic originates from one or a small number of IP addresses, and querying the logs for requests from those addresses to see if they all relate to the same service.

If it is a genuine DDoS attack, and the automatic mitigation is working, then you don't need to take any further action other than monitoring the situation.

If it's not a genuine DDoS attack, and you want to allow the traffic, you will need to disable automatic DDoS mitigation for that Cloudfront distribution (there doesn't appear to be a way to disable the automatic mitigation for a specific event).

Fixing a service that can't validate reply-to email address

Sometimes a service has trouble getting their reply-to email address to validate because of delays to delivery receipts. As a platform admin you can add the reply-to email address for the service yourself; the validation step will then be skipped.

Deleting a service

We allow users to delete a trial service themselves. The main concern with deleting a live service is billing:

  • It's OK to delete a live service as long as they are within their free allowance. Make sure the person making the request is an admin for it.

  • Otherwise, advise the service team to just stop using it. They should keep at least two admin team members in case we need to contact them.

Deleting a service doesn't actually affect billing: deleting a service will rename it, archive its templates and prevent the team members from being able to access it. However, if there's a problem with the bill for a service, we want the service to be "intact" in case we need to refer to it.

Unarchiving a service

In rare cases, we may want to unarchive a service.

It's normally easier if the user can just create a new service. But they may need old template content, etc. The script is a little nuanced to ensure that templates that were archived during the process are restored, but not those that were manually deleted previously.

You'll need to run a cyber notification, ecs-exec on to a notify-api instance, run the following script from a flask shell, and then clear the users, services and templates cache via the platform admin

import datetime
from app import db
from app.models import Service

services_to_restore = [
    '12341234-1234-1234-1234-123412341234',
    ...
]
# the date of notify's bulk archiving process
date_of_archive = datetime.date(2023, 5, 16)

# ===================================================

for service_id in services_to_restore:
    service = Service.query.get(service_id)
    print(f'Restoring {service_id=} {service.name=}')
    service.active = True
    service.name = service.name.replace(f'_archived_{date_of_archive}_', '')
    service.normalised_service_name = service.normalised_service_name.replace(f'_archived_{date_of_archive}_', '')
    for api_key in service.api_keys:
        if api_key.expiry_date.date() == date_of_archive:
            print(f'Unexpiring {api_key.id=}')
            api_key.expiry_date = None
    for template in service.templates:
        if template.archived and template.updated_at.date() == date_of_archive:
            print(f'Unarchiving {template.id=}')
            template.archived = False
    print('===')
db.session.commit()
print('dont forget to clear the cache!')

Provider 5xx vs 2xx stats

We need to fetch quarterly stats on our SMS providers' HTTP status responses.

  • Log in to production AWS
  • Navigate to Cloudwatch > Logs > Insights
  • Click on the queries tab on the right
  • Choose "SMS provider request success and failure".
  • Set the date range to the desired range and click run.

Note: after April 2024, we can remove the old log groups from the search query

Note: Cloudwatch Insights will time out after 15 minutes so may not be able to retrieve a whole quarter's worth of data. If you want exact figures you'll need to run the query month by month (or some other smaller length of time) and then sum them.

You'll then need to work out a success percentage for each provider: (num successes / (num successes + num failures)) * 100.
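
For example, summing month-by-month Insights results and turning them into a percentage (made-up numbers):

# (successes, failures) pairs copied out of CloudWatch Insights, one per month
monthly = [(1_200_000, 350), (1_150_000, 410), (1_300_000, 290)]

successes = sum(s for s, _ in monthly)
failures = sum(f for _, f in monthly)
print(f"{successes / (successes + failures) * 100:.3f}% successful")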

Requests for historical delivery records

Sometimes we get asked for records of notifications after they've been archived and are no longer visible in the service dashboard. This could be in response to a legal challenge about whether we sent a notification, for example.

WARNING: the notification_history table can be slow to query due to its size, so use as many WHERE clauses as possible.

\x # turn on extended display to format output

select
  notification_history.id as notification_id,
  client_reference,
  service_id,
  services.name as service_name,
  notification_status,
  notification_history.created_at as created_at,
  sent_at,
  updated_at as delivery_time
from notification_history
inner join services on service_id = services.id
where
  service_id = '<service_id>' and
  notification_type = 'email' and
  client_reference in ('<ref_1>');

Performance/Availability Stats Summary

Every ~fortnight Caley needs single number summaries for two values:

Platform Availability

For this we need to get the percentage of requests successfully processed.

You'll need to log in to production AWS (read-only access, no security notification needed) (gds aws notify-prod -l).

Head to the saved queries in Athena https://eu-west-1.console.aws.amazon.com/athena/home?region=eu-west-1#/query-editor/saved-queries

Every session, you'll need to click the View settings button in the "Before you run your first query" blue box. Then click manage on the next page and copy the following s3 URL into the box and hit save.

s3://aws-athena-query-results-888450439860-eu-west-1

Then click on the cloudfront_notify_api_availability_stats_for_fy_to_date query, double check the date in the query matches the start of this financial year, and click run to get the percentage of successful HTTP requests to Notify API. Make sure to select the correct database from the left hand menu dropdown.

Processing Performance

For this we need to retrieve the percentage of messages sent within 10 seconds of the request.

Log in to the production notify-db and run:

SELECT
sum(messages_within_10_secs)::float / sum(messages_total)::float * 100.0
FROM ft_processing_time 
WHERE bst_date >= '<some date>';

DWP Monthly Reports

Once per month, we have a meeting with DWP. The product team need two reports for their slide deck. Both of these reports are focused on the DWP Notifications Product - PROD service.

Delivery report

This is an adaptation of a more generic query to give us the time buckets for how long it took to deliver notifications.

WITH notif_timings AS (
    SELECT
        EXTRACT(EPOCH FROM n.updated_at) - EXTRACT(EPOCH FROM n.created_at) AS "duration"
    FROM notification_history AS n
    WHERE
        service_id = '0cc696c6-b792-409d-99e9-64232f461b0f'
        AND key_type = 'normal'
        AND notification_type = 'sms'
        AND n.created_at >= date_trunc('month', date_trunc('month', NOW())::date - interval '1 day')::date  /* First day of last month */
        AND n.created_at < date_trunc('month', NOW())::date  /* First day of this month */
        AND notification_status = 'delivered'
)
SELECT
    COUNT(*) FILTER(WHERE duration < 60) AS "< 60 seconds (#)",
    ROUND(((COUNT(*) FILTER(WHERE duration < 60)) / COUNT(*)::float * 100)::numeric, 2) AS "< 60 seconds (%)",
    COUNT(*) FILTER(WHERE duration >= 60 AND duration < 300) AS "1 - 5 minutes (#)",
    ROUND(((COUNT(*) FILTER(WHERE duration >= 60 AND duration < 300)) / COUNT(*)::float * 100)::numeric, 2) AS "1 - 5 minutes (%)",
    COUNT(*) FILTER(WHERE duration >= 300 AND duration < 600) AS "5 - 10 minutes (#)",
    ROUND(((COUNT(*) FILTER(WHERE duration >= 300 AND duration < 600)) / COUNT(*)::float * 100)::numeric, 2) AS "5 - 10 minutes (%)",
    COUNT(*) FILTER(WHERE duration >= 600 AND duration < 1800) AS "10 - 30 minutes (#)",
    ROUND(((COUNT(*) FILTER(WHERE duration >= 600 AND duration < 1800)) / COUNT(*)::float * 100)::numeric, 2) AS "10 - 30 minutes (%)",
    COUNT(*) FILTER(WHERE duration >= 1800) AS "30 minutes plus (#)",
    ROUND(((COUNT(*) FILTER(WHERE duration >= 1800)) / COUNT(*)::float * 100)::numeric, 2) AS "30 minutes plus (%)"
FROM notif_timings;

Notification Status Summary

Here's a list of the notification statuses, their counts sent in the previous calendar month, and the percentage that each status represents

SELECT 
sum(notification_count) as "Count Sent",
sum(case when (notification_status = 'delivered') then notification_count else 0 end) as "Count delivered",
sum(case when (notification_status = 'technical-failure') then notification_count else 0 end) as "Count technical-failure",
sum(case when (notification_status = 'temporary-failure') then notification_count else 0 end) as "Count temporary-failure",
sum(case when (notification_status = 'permanent-failure') then notification_count else 0 end) as "Count permanent-failure",
sum(case when (notification_status = 'sent') then notification_count else 0 end) as "Count sent internationally"
FROM ft_notification_status
WHERE service_id = '0cc696c6-b792-409d-99e9-64232f461b0f'
AND bst_date >= date_trunc('month', date_trunc('month', NOW())::date - interval '1 day')::date  /* First day of last month */
AND bst_date < date_trunc('month', NOW())::date  /* First day of this month */
UNION
SELECT 
100 as "Count Sent",
round(sum(case when (notification_status = 'delivered') then notification_count else 0 end)::numeric / sum(notification_count)::numeric * 100.0, 1) as "Count delivered",
round(sum(case when (notification_status = 'technical-failure') then notification_count else 0 end)::numeric / sum(notification_count)::numeric * 100.0, 1) as "Count technical-failure",
round(sum(case when (notification_status = 'temporary-failure') then notification_count else 0 end)::numeric / sum(notification_count)::numeric * 100.0, 1) as "Count temporary-failure",
round(sum(case when (notification_status = 'permanent-failure') then notification_count else 0 end)::numeric / sum(notification_count)::numeric * 100.0, 1) as "Count permanent-failure",
round(sum(case when (notification_status = 'sent') then notification_count else 0 end)::numeric / sum(notification_count)::numeric * 100.0, 1) as "Count sent internationally"
FROM ft_notification_status
WHERE service_id = '0cc696c6-b792-409d-99e9-64232f461b0f'
AND bst_date >= date_trunc('month', date_trunc('month', NOW())::date - interval '1 day')::date  /* First day of last month */
AND bst_date < date_trunc('month', NOW())::date  /* First day of this month */
ORDER BY 1 desc;

Request to remove ability to receive inbound text messages for a service

When a service no longer wishes to receive text messages we need to run some updates in the database and determine whether it's okay to reuse the inbound number for another service. The only time it's okay to release the inbound number is if the service is in trial mode and the inbound number has never been used. You can check inbound_sms and inbound_sms_history to see if any messages for the service have been received.
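
A quick check, assuming the usual service_id column on both tables:

SELECT count(*) FROM inbound_sms WHERE service_id = '<service_id>';
SELECT count(*) FROM inbound_sms_history WHERE service_id = '<service_id>';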

The service will need at least one other text message sender set up for us to turn off receiving text messages. If there is not a suitable alternative already in place, add a text message sender called 'GOVUK', which is the standard default.

Let the user know of the impact

Ask the user again to confirm that they actually want to do it and highlight the impact that it will have with something like:

Just to highlight that turning off receiving text messages for your service will mean that any of your users who send a text message to that number will not get an error and may not realise that you have not received it.

Please could you confirm that you'd like us to go ahead with turning off receiving text messages for your service?

Query to remove the ability to receive text messages

It's always good to take a copy of the data before running update/delete statements, so I have included the selects here as well

select * from service_sms_senders where service_id= 'service_id';
select * from inbound_numbers where service_id = 'service_id';

select * from service_permissions where service_id= 'service_id';

delete from service_permissions where service_id = 'service_id' and permission = 'inbound_sms';

delete from service_sms_senders where service_id = 'service_id' and inbound_number_id is not null;

-- run one of the two following queries
-- if we wish to release the number (if it has never been used)
update inbound_numbers set service_id = null, updated_at = now()
where service_id = 'service_id' and active;

-- if we wish to archive the number instead (if it has been used before so we dont want new services receiving the prev service's messages)
update inbound_numbers set service_id = null, updated_at = now(), active = false
where service_id = 'service_id' and active;

You will need to clear the service cache in order to see the permission update on the front end.

MMG and Firetext Staging Limits

We have SMS limits in place in Staging, currently set to 5000 SMS messages per month for each provider. This limit is in place to prevent accidentally spending large amounts of money while performing load tests.

If we do hit this limit we should get the following errors:

  • MMG: Error 208
  • Firetext: 7: Insufficient credit

Unfortunately, at the time of writing there is no way to increase these limits via the website; a support request must be made.

Rotate all yearly credentials

TBC but we should put in place a list of credentials that need to be rotated every year and instructions here for how to do it.

Upload returned letters

On the first of every month we get a list of letter file names that have been returned. We need to upload that list using the returned letters page. The list is usually in .xlsx format, so you will need to convert it to .csv before cutting and pasting it into the textbox.
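
A minimal sketch for the conversion, assuming you have pandas and openpyxl available locally (the filenames are placeholders):

import pandas as pd

# read the returned-letters spreadsheet and write it out as a CSV
pd.read_excel("returned_letters.xlsx").to_csv("returned_letters.csv", index=False)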

The API will process the references in batches of 5,000. You can check Kibana to see if all the tasks processed as expected.

Search events table

If a user questions why something happened (for instance, why they are no longer on a service), you can search the events table using a query like the one below.

select data, data ->> 'user_id' from events where event_type = 'remove_user_from_service' and data ->> 'user_id' = '7e230e3d-0f42-4594-9fd2-98cecd978058' order by created_at desc;

The data field in the events table is JSON; you can search it for a specific attribute like user_id using the syntax above.

What events do we have?

Other than looking in the code for the events we save, you can query for distinct event_types:

select distinct event_type from events;
          event_type           
-------------------------------
 sucessful_login_remember_me
 sucessful_login
 add_user_to_service
 update_user_mobile_number
 update_user_email
 remove_user_from_service
 change_broadcast_account_type
 archive_user
(8 rows)

It's also worth remembering that the services/{id}/history page gives us a lot of information about what changes have happened.

Update agreement for organisation

If you need to update who signed an agreement for an organisation, you will have to run an update query on the organisation table.

The query will look something like:

UPDATE organisation 
SET
agreement_signed_at = now(),
agreement_signed_version = '5.2',
agreement_signed_on_behalf_of_email_address = '[email protected]',
agreement_signed_on_behalf_of_name = 'Name Surname',
updated_at = now()
WHERE id = 'that-organisation-uuid';

Things to note:

  • agreement_signed_by_id can be left blank - it is only needed when the agreement is being accepted by someone using the admin app
  • check which agreement version this is for

Retry: inbound_sms/callback has retried the max number of times for service

We often get these logs in Kibana.

It's OK to ignore them as ultimately it's the service's responsibility to monitor their own systems (but we know many services don't do this). If you have time you can contact the services affected and ask them to look into it.

Note: we recognise that these logs create noise that makes other, genuine errors harder to see. In future we would like to iterate how we handle callback and inbound SMS failures.

Fixing known functional tests failures

When functional tests fail, they take a screenshot of the last page they saw (the one that caused the failure) and upload it to S3 in a put functional-tests-screenshots step, where you can also get a URL to it.

The email has not been validated recently

If the test failed because the test user logged in with an email that has not been recently validated, the solution is to bump that user's email_access_validated_at field to a more recent date, or to the future. For example:

UPDATE users 
    SET email_access_validated_at = '2025-05-05 12:10:00' 
    WHERE email_address = '[email protected]'

429s on document-download-api

If we get 429s from the document-download-api it indicates that someone is trying to confirm their email address too many times - probably more than any normal user would. This might suggest ongoing malicious behaviour and should be investigated. Check Kibana to see which URL is being rate-limited and investigate the user-agent/other indicators of where the traffic is coming from. Consider IP blocking that user temporarily, or removing the document if there are indications it's been compromised.

If the traffic looks legitimate, discuss with devs/product whether to remove this alert/support entry or increase the rate limit.

Blocking access to document-download files

If a service needs to block access to a file they've sent out to users, we can add a blocked S3 Object Tag to the file with a value of true.

If you only have a URL for the affected files, you will need to turn the base64-encoded-uuids back into real UUIDs in order to find the object in S3. The URL for a file on document-download-frontend looks like this:

https://documents.service.gov.uk/d/<SERVICE_UUID_AS_BASE64>/<DOCUMENT_UUID_AS_BASE64>?key=ENCRYPTION_KEY

To turn the base64 representation back into the UUID so that you can look the object up in S3/build the ARN to block, do the following in a Python shell:

from notifications_utils.base64_uuid import base64_to_uuid

print('service uuid:', base64_to_uuid('SERVICE_UUID_AS_BASE64'))
print('document uuid:', base64_to_uuid('DOCUMENT_UUID_AS_BASE64'))

If there are only a few files, log into the AWS console with gds aws notify-prod-admin -l (remember to log a cyber security notification), navigate to the S3 bucket and find the object, scroll down to tags and click 'Edit', then add a new tag with key blocked and value true. Save.
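
If there are a lot of files, a hedged AWS CLI sketch instead (the bucket name and object key are placeholders; note that put-object-tagging replaces any existing tags on the object, so check them first with get-object-tagging):

aws s3api get-object-tagging --bucket <documents-bucket> --key <object-key>
aws s3api put-object-tagging --bucket <documents-bucket> --key <object-key> \
  --tagging 'TagSet=[{Key=blocked,Value=true}]'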

The file will be immediately unavailable for download. The document-download-frontend pages may take up to 30 minutes to reflect the file is actually no longer available for download due to caching.

Uploading new crown and non-crown agreements

You will need admin access to AWS to upload new copies of the agreements, so make sure you have someone else available for the two-eyes process for production access.

Get the new copies of the documents and check they are:

  • the right versions
  • dated correctly

For confidence, you might want to upload them to preview first (so the notify.works-mou bucket) and check https://www.notify.works/agreement/crown.pdf and https://www.notify.works/agreement/non-crown.pdf.

When you're ready:

  1. sign into the production AWS console
  2. upload both copies of the agreement to the notifications.service.gov.uk-mou bucket
  3. check https://www.notifications.service.gov.uk/agreement/crown.pdf and https://www.notifications.service.gov.uk/agreement/non-crown.pdf to make sure they download the expected agreements

Get data for an email that contains a bullet point list of services

Sometimes you want to send an email that says

You are part of the following services:

* service 1
* service 2
* service 3

where the number of services may change for every user.

The best way to do this is to create a template like

You are part of the following services:

((service_name))

and provide it a CSV of data in the following format

email_address, service_name, service_name, service_name
[email protected], service 1, service 2, service 3
[email protected], service 4,,

Getting data from our database into that format is a two-step process.

Firstly, get your data so that you have a new row for every service that a user belongs to, for example

select users.email_address, services.name
from users
join user_to_service on users.id = user_to_service.user_id
join services on user_to_service.service_id = services.id
where services.active = True
and users.state = 'active'
order by 1, 2;

returns something like

email_address, name
[email protected], service1 
[email protected], service2
[email protected], service3
[email protected], service 4  

Secondly, put the data into Google Sheets. You can then transpose the data in a new sheet using:

  • Set cell B1 to =UNIQUE((rawdata!A2:A1000))
  • Set cell B2 to =TRANSPOSE(QUERY(rawdata!$A$2:$B$1000, "select B where A='"&A2&"'"))

This will put it into the final format for you to upload to Notify.

Rotating local/tools DVLA API credentials

We have a set of credentials for the DVLA API for letter printing that we use for our local accounts. These are made up of an API key and a password. The password needs to be rotated every 90 days and the API key every year.

We will get an email from the DVLA API when the password is 2 weeks away from expiring. When that happens, a developer needs to open a local notify API shell and run:

from app import dvla_client
dvla_client.rotate_password()
dvla_client.rotate_api_key()

These tasks are automatically scheduled and run for our deployed environments, so those are self-healing.

Slow SMS delivery

If you have got here from a Sentry alert flagging slow SMS delivery (over 10% of text messages sent in the last 15 minutes have taken over 5 minutes to deliver), then take a look at grafana: https://grafana-notify.cloudapps.digital/d/icjsQ-MWk2/notify-summary?orgId=1

View the Slow SMS delivery graphs and see if the issue is persisting, and/or seems to be getting worse. Monitor and take action as needed. This may mean raising issues with one or both of our text message providers for them to investigate, or flagging on our public statuspage so that our users can be proactively made aware of the issue.

Look at senderID history

Attach to the database in question and run the following query

select * from service_sms_senders where service_id='$SERVICE_ID';