- Status: accepted
- Deciders: Eimi, James, Alli, Susie
- Date: 2019-11-05
This is related to a previous ADR: see the modular architecture ADR.
Technical Story: Evaluating Firebase for our backend, replacing most of the AWS infrastructure except for the PSTT client (queue and instance to talk to PSTT).
We have quite a few components to build in the backend to get DPE to a stable point. Although we initially decided to develop in AWS, working with AWS has been quite challenging, with a steep learning curve. In a single project cycle, we reached a point where the infrastructure for most microservices was set up. Unfortunately, resources were then removed from the project and priorities shifted to UX research, so back-end development halted. Picking these development tasks back up from half-baked microservices is difficult: we now have to relearn what we developed and complete the integration.

In order to simplify back-end development, we are looking at Firebase (Google Cloud Platform). Since Firebase is a platform that targets development for startups and R&D, it provides many features out of the box, namely the realtime database and authentication. This could speed up developing the backend while keeping possibilities open for offline, mobile, and multi-user solutions. We must consider both technical (security, development, transfer, upfront learning) and non-technical (legal) costs.
Since we've designed the React components to be reusable, there is a question of whether AWS or GCP makes transfer easier. Previous projects have all been rebuilt from scratch, including the infrastructure. Realistically, does it matter which cloud provider we go with?
Name | Technology | Description |
---|---|---|
Firebase | GCP Firebase hosting, main component | The hosted solution |
Firestore | GCP Firestore, database | The database with additional capabilities (realtime) |
Audio Converter Function | GCP Function, serverless | The serverless service that strips audio from the original AV content |
AWS Uploader Function | GCP Function, serverless, KMS | The serverless service that uploads a file to S3 |
STT Client | AWS EC2, PSTT | SQS consumer that consumes messages to send to Platform STT (third party) and forwards the status-update messages from PSTT's SNS topics |
STT Client Queue | AWS SQS | An SQS queue that takes new Job messages to send to the PSTT Client |
Notification Function | GCP Function, serverless | A serverless service that takes Notification messages off the Pub/Sub service to update Firestore |
STT Topic | AWS SNS | A topic for anything to subscribe to for notifications; this helps with collecting all the notifications. The messages are created by the STT Client |
- Audio Converter Function is listening for Storage changes (see the trigger sketch after this list)
- AWS Proxy Function is listening for Storage changes
- STT Client is listening for messages (job creation) from a queue
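For illustration, the Storage-change trigger could be wired up roughly as below. This is a minimal sketch, assuming the Node.js `firebase-functions` SDK; the `uploads/` prefix and the `stripAudio` helper are hypothetical names, not from our actual setup.

```js
const functions = require("firebase-functions");

// Hypothetical helper that does the actual audio stripping (see the
// Audio Converter component below for a conversion sketch).
const { stripAudio } = require("./audio");

// Fires whenever an object upload completes in the default Storage bucket.
exports.audioConverter = functions.storage.object().onFinalize(async (object) => {
  // Only react to original AV uploads, not our own converted output.
  if (!object.name.startsWith("uploads/")) return null;

  // Strip the audio track and write the result back to Storage,
  // which in turn triggers the AWS Proxy Function.
  return stripAudio(object.bucket, object.name);
});
```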
- User uploads AV file to Storage from Firebase
- Audio Converter Function:
- detects file upload
- strips audio from file
- uploads file to Storage
- AWS Proxy Function:
- detects file upload
- retrieves Key from GCP KMS
- assumes the AWS IAM role
- uploads file to S3
- an S3 event triggers a message to be sent to the Queue
- STT Client:
  - creates a new Platform STT job:
    - uploads the file to Platform STT's S3 bucket
    - sends a message with the file loaded to Platform STT
- Forwards notifications from Platform STT's SNS to STT Notifications SNS
- AWS Notification Subscription Function (sketched below):
- updates Firestore
- Firestore listener updates Firebase
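The tail end of this flow (notification message → Firestore → UI) could look roughly like the sketch below, assuming the notifications are relayed into a GCP Pub/Sub topic as the component table describes; the `stt-notifications` topic, `transcriptions` collection, and message fields are hypothetical names.

```js
const functions = require("firebase-functions");
const admin = require("firebase-admin");
admin.initializeApp();

// Triggered for every message published on the (hypothetical) topic.
exports.notificationFunction = functions.pubsub
  .topic("stt-notifications")
  .onPublish(async (message) => {
    const { jobId, status } = message.json; // parsed JSON payload

    // Update the job document; the Client's Firestore listener picks
    // this up in realtime, so no extra push to the UI is needed.
    return admin.firestore().collection("transcriptions").doc(jobId).update({ status });
  });
```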
- Security (Authentication, PSTT integration)
- Legal
- Transfer
- Technical feasibility of DPE only
- Technical feasibility of CI, reproducibility
- Cost (time, money, engineering, opportunity)
- AWS only
- Firebase and AWS
We will go with Firebase and AWS combined, as the PSTT integration is feasible even with cross-account complications. Features such as authentication, user-specific data retrieval, and integration with the database without migrations are already completed. Firebase provides abstractions around security and integrations with the database, as well as Functions. To save time on the project with features such as the realtime database and easy API use, let's progress with Firebase. There are still some things, such as CI and deployment management, that can be worked on, but these are stretch goals for now. All in all, I believe it will save time and simplify the overall architecture and code.
- Faster turnaround for development
- Authentication
- Realtime database beneficial for some of our DPE requirements
- Managed services with many abstractions
- Growing knowledge of GCP
- BBC Login (stretch goal)
- Monitoring provided out of the box (console.log maps to GCP's Logging console)
- Upfront learning cost for developers, and many unknowns until investigated
- Security concerns
- Redesigning architecture
- Redesigning database model
- Transfer may not be possible with the infrastructure (?)
Before discussing the pros and cons of each option, we need a clear idea of each component: how far it is from so-called completion in AWS and GCP, and the pros and cons of each technology. For each component I describe:

- What it needs to do
- What it currently does

Anything marked with * indicates a stretch goal.
- From the UI, users can view and alter (their own*) data
- Initiate transcription process
- The integration with the API works on both AWS and GCP.
- both are deployed as web endpoints.
- both endpoints are secured by some form of authentication.
- Cert-based authentication is set up.
AWS
- Individual user data*
- file upload
- S3 setup and access
GCP
- authentication is already set up for GCP, based on whitelisting
- User-specific data retrieval*
- file upload (there are easy ways to show file-upload progress)
- Storage access (see security rules setup; open by default)
- The potential here is BBC Login integration, which is more user-friendly.*
```js
var storageRef = firebase.storage().ref("folderName/file.jpg");
var fileUpload = document.getElementById("fileUpload");

fileUpload.addEventListener("change", function (evt) {
  var firstFile = evt.target.files[0]; // get the first file selected
  var uploadTask = storageRef.put(firstFile);

  uploadTask.on("state_changed", function progress(snapshot) {
    // progress of upload, e.g. for a progress bar
    console.log(snapshot.bytesTransferred / snapshot.totalBytes);
  });
});
```
What it needs to do
- The API connects to the database to do CRUD operations
- The API invokes the audio conversion
- The API invokes the transcription service
- The API is notified by PSTT Client to update status of transcription
- Secure connection to database, audio conversion, transcription service
AWS and GCP
- API is incomplete
- Can connect to the DB (AWS has been buggy as of late)
AWS
- deployed API
- deployed Postgres DB
- full integration with database
- migrations setup
- local environment setup*
- VM
- environment setup is complete
GCP
- local environment setup*
- Setup with Firebase's emulator tool
- deployed Firestore (realtime DB, NoSQL)
- full integration with database
- no migrations necessary as it is a NoSQL database
- it is also possible to store references to specific data, which simplifies certain aspects of file retrieval
- The setup in the Client is also simplified, as no abstraction of the database connection is needed
- removes the need for a `setTimeout()` polling setup in the Client for data retrieval, as Firestore offers realtime data retrieval (see the sketch below)
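For comparison, this is roughly what the realtime retrieval looks like on the Client. A minimal sketch, assuming the Firestore web SDK; the `transcriptions` collection and the job id are hypothetical, not from our data model.

```js
var db = firebase.firestore();
var jobId = "job-123"; // hypothetical document id

// The callback fires immediately with the current data, then again on
// every change, so no setTimeout() polling loop is needed.
db.collection("transcriptions").doc(jobId)
  .onSnapshot(function (doc) {
    console.log("transcription status:", doc.data().status);
  });
```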
What it needs to do
- Convert input to audio
- Store output in an accessible online location
- Handle long form content*
AWS and GCP
- Deployed
- Uploads audio content to storage
- Integration with PSTT Client
AWS
- Integration with API
- Queue message being sent from API to Audio Converter queue
- environment setup is complete
GCP
- Integration with Client
- This is done but not currently operating due to cost (free tier); a conversion sketch follows at the end of this component
- environment setup is complete (follow the instructions in their docs)
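For reference, the conversion step itself can be done by shelling out to ffmpeg from a Function. A minimal sketch, assuming ffmpeg is available on the Function's runtime image; paths and the output format are illustrative only.

```js
const { spawn } = require("child_process");

// Strip the video track and keep audio only (-vn). In a real Function the
// input would first be downloaded from Storage into /tmp, and the output
// uploaded back afterwards.
function stripAudio(inputPath, outputPath) {
  return new Promise((resolve, reject) => {
    const ffmpeg = spawn("ffmpeg", ["-i", inputPath, "-vn", outputPath]);
    ffmpeg.on("error", reject);
    ffmpeg.on("close", (code) =>
      code === 0 ? resolve(outputPath) : reject(new Error("ffmpeg exited with " + code))
    );
  });
}
```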
What it needs to do
- send content (URL) to PSTT
- send notifications of status change to API
- send output back to users by getting the content from the S3 bucket.
What it currently does
AWS and GCP
- deployed
- logic is functional
- does the above (unknown until further tested)
GCP
- IAM for the AWS/GCP integration (this requires creating two accounts with credentials: one to access the PSTT Client's queue, and another for the PSTT Client to use to update the database with the status and the JSON transcription blob)
- store Key in GCP's KMS
To do this, we:
- create an AWS IAM role as part of the PSTT Client
- store the IAM credentials in GCP's KMS
- the GCP service assumes the IAM role and performs the upload to S3 (see the sketch below)

This is something that Datalab has already done, so we can use their knowledge.
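A minimal sketch of that last step, assuming the Node.js `aws-sdk` (v2) and that the IAM credentials have already been decrypted from GCP's KMS (decryption omitted); the role ARN and bucket name are hypothetical.

```js
const AWS = require("aws-sdk");

async function uploadToS3(fileBuffer, key, roleArn) {
  // Assume the cross-account IAM role created as part of the PSTT Client.
  const sts = new AWS.STS();
  const { Credentials } = await sts
    .assumeRole({ RoleArn: roleArn, RoleSessionName: "gcp-uploader" })
    .promise();

  // Use the temporary credentials to write into the PSTT-side bucket.
  const s3 = new AWS.S3({
    accessKeyId: Credentials.AccessKeyId,
    secretAccessKey: Credentials.SecretAccessKey,
    sessionToken: Credentials.SessionToken,
  });
  await s3.putObject({ Bucket: "pstt-audio-input", Key: key, Body: fileBuffer }).promise();
}
```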
Continuing development in AWS only. This would mean picking up where we left off. We currently have a setup around every single microservice. It is difficult to say how far along we are, as each Jenkins setup and each integration could present challenges.
- Good, because best practices are ensured and infosec approved.
- Good, because all the microservices are there (infrastructurally)
- Bad, because following best practices means jumping through many development steps, which could be time-consuming.
- Bad, because we currently don't have any integration other than UI, DB and API.
- Bad, because the DB and API do connect, but the connection is faulty at the moment.
- Bad, because there is no easy way to set up a local testing environment for the DB.
- Bad, because we need to do migrations, but we have not set that up.
This means we will be doing Firebase and AWS combined; see ![DPE Firebase architecture](./DPE - firebase ver.png). Eimi has spoken to several people to clear up the unknowns around cross-account integrations, deployment pipelines, etc.
- Good, because it simplifies the code (client).
- Good, because authentication and its complexity are handled for us.
- Good, because we have a whitelisting system already in place.
- Good, because it gives you a local test environment.
- Good, because of the realtime database*
- Bad, because of the complexity around cross-account integration
- Bad, because there is still a learning curve, though not as steep as AWS's
- Pros and cons: Lessons from a small Firebase project.
- More on the lifecycle of GCP Functions
- GCP environment setup can be done by following the instructions in their docs