feat: storage backup #417

Merged
merged 20 commits into from
Sep 27, 2021
Conversation

vasco-santos
Contributor

@vasco-santos vasco-santos commented Aug 29, 2021

This PR adds storage backup per #395

In short, each received file is sent to S3 as a safeguard against possible failures, and we will keep it there until we no longer need a backup. This includes the following changes in the codebase:

  • POST /car and POST /upload add the received file(s) to S3, keyed by their content hash, in parallel with adding to Cluster (see the sketch after this list)
  • added a database collection to keep track of existing backups per upload
    • with support for adding a deleted timestamp
    • thought about a relation with Content instead, but if we eventually want to return success when we can add the files to the backup but not to Cluster, we should not have the Content created in the DB
  • createUpload UDF updated to add the received backupKeys to the upload
    • note that we support uploading directories, so this must receive an array
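
For illustration, a rough sketch of the parallel add described above; the helper names (sha256Hex, env.cluster.add) are assumptions, not the actual code in this PR:

import { PutObjectCommand } from '@aws-sdk/client-s3'

// Sketch: back up the CAR bytes to S3 while adding them to IPFS Cluster.
async function addCarWithBackup (env, carBytes) {
  const backupKey = await sha256Hex(carBytes) // content hash used as the S3 key (assumed helper)
  const [clusterRes] = await Promise.all([
    env.cluster.add(carBytes), // add to IPFS Cluster
    env.s3Client.send(new PutObjectCommand({ // backup to S3 in parallel
      Bucket: env.s3BucketName,
      Key: backupKey,
      Body: carBytes
    }))
  ])
  return { cid: clusterRes.cid, backupKeys: [backupKey] } // stored via the createUpload UDF
}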

Assumptions:

  • The implementation currently assumes backup does not need to be configured for development. I find this to be the best alternative, given that creating a personal AWS account takes days to get verified.
  • Given we only use this for PUTs (GETs would only happen in catastrophic events), and putting to a single zone is faster than the parallel ops (cluster.add + cluster.status), we don't currently need to spread this across different zones for faster GETs.
    • Either way, we will store the name of the bucket so that we can later evolve this solution into a map of Cloudflare country regions to S3 regions.
    • We will need to implement a custom solution for this, given there is no such map on npm as far as I can tell.

Possible improvements we can do in follow up PRs:

  • Setup S3 bucket cross replication for other zones
  • Do not fail the request if we can add to the backup but not to Cluster
    • We would need a job to retry adding the errored data to Cluster
  • ...

Production and staging credentials were added to wrangler, and the infrastructure was set up

Closes #395

@vasco-santos vasco-santos force-pushed the feat/storage-backup branch 4 times, most recently from 3e044f4 to 5e81416 on August 30, 2021 10:34
},
"bundlesize": [
  {
    "path": "./dist/main.js",
-   "maxSize": "1 MB",
+   "maxSize": "1.4 MB",
Contributor Author

This makes a huge difference given https://bundlephobia.com/package/@aws-sdk/[email protected]

As this is the worker bundle, it is not problematic, even though it is a lot

Contributor

bundlephobia says it tree-shakes, so it may not be so bad. Is our build process taking advantage of that?

Contributor Author

True, thanks. I had not noticed we didn't have tree-shaking enabled. Just added the optimization to the webpack config
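
For reference, a minimal sketch of the kind of webpack setting meant here (the actual config in the repo may differ):

// webpack.config.js (sketch): production mode plus explicit usedExports
// marks unused exports so unused parts of @aws-sdk/client-s3 can be dropped.
module.exports = {
  mode: 'production',
  optimization: {
    usedExports: true
  }
}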

Contributor

uh oh, 1MB is the limit for script size we can upload as a single cloudflare worker https://developers.cloudflare.com/workers/platform/limits#script-size

Contributor Author

oh, I was not aware of it. Per the docs:

A Workers script can be up to 1MB in size after compression.

our bundlesize check is done pre-compression though. I will look into that too

Contributor Author

OK, making the bundlesize job test the compressed output changed the measured bundle size from 1.06MB to 287.12KB
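
If the check uses the bundlesize package, compression is a per-entry option; a sketch only (the compression value and maxSize shown here are assumptions, not the actual change):

"bundlesize": [
  {
    "path": "./dist/main.js",
    "maxSize": "500 kB",
    "compression": "gzip"
  }
]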

@vasco-santos vasco-santos requested review from alanshaw and olizilla and removed request for alanshaw August 31, 2021 14:08
@vasco-santos vasco-santos force-pushed the feat/storage-backup branch 3 times, most recently from 0b84c98 to 990b5cf on September 2, 2021 15:07
@vasco-santos vasco-santos marked this pull request as ready for review September 2, 2021 16:07
const s3Endpoint = env.S3_BUCKET_ENDPOINT || (typeof S3_BUCKET_ENDPOINT === 'undefined' ? undefined : S3_BUCKET_ENDPOINT)
env.s3Client = new S3Client({
  endpoint: s3Endpoint,
  forcePathStyle: !!s3Endpoint, // Force path if endpoint provided
Contributor Author

@vasco-santos vasco-santos Sep 2, 2021

This is the only way we can enforce a local endpoint URL for testing. Alternatively, given backup is not required, we could remove this and not test with it, as we should be able to trust the @aws-sdk/client-s3 module

Contributor

note to self: this "global or env" pattern is getting out of hand. We could smoosh the globals into the env property in the top-level fetch event handler to simplify this code and anywhere else we need to use an env var.
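
A hypothetical sketch of that idea, gathering the bound globals onto a single env object once in the top-level fetch handler (handleRequest and the set of globals shown are assumptions):

// Sketch: copy worker globals into env once, so downstream code only reads env.
addEventListener('fetch', (event) => {
  const env = {
    S3_BUCKET_ENDPOINT: typeof S3_BUCKET_ENDPOINT === 'undefined' ? undefined : S3_BUCKET_ENDPOINT,
    S3_BUCKET_NAME: typeof S3_BUCKET_NAME === 'undefined' ? undefined : S3_BUCKET_NAME
  }
  event.respondWith(handleRequest(event.request, env))
})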

@vasco-santos vasco-santos requested review from alanshaw and olizilla and removed request for olizilla and alanshaw September 2, 2021 16:19

Contributor

@olizilla olizilla left a comment

This is decent, and we could deploy as is, but I think we should:

  • Rename things to be more specific: Backup -> S3Object (or similar). We also have code to back our pins up to Pinata, so we should be clearer in the code and schema that these new code paths are about staging user uploads on S3 while we wait for the Filecoin deals to become active.

  • Have separate S3 object key prefixes, e.g. car/ and files/ (or buckets), for CAR uploads vs raw file uploads. They will have to be handled separately to recreate a cluster from a bucket, so it would be helpful to be able to query S3 for a list of each type. I am imagining a recovery job that would have to slurp CAR files with either format=car or format=unixfs into a new ipfs-cluster. That could be determined from our db, but it'd be operationally comforting if we could also see the difference in S3.

  • Nice to have: preserve the filenames and paths for raw file uploads. We could name them files/<hash>/<file path> or similar... or we could wait for the CID to come back from cluster and key raw file uploads with a CID prefix, so a recovery script could determine which raw files were uploaded together. Again, this is all available in the db, but it would be rad if we didn't have to trust and rely on the db in a disaster scenario.

"""
Backup bucket name.
"""
name: String!
Contributor

name is ambiguous here; it should be `bucketName`.

Contributor Author

I named these backup and name taking into consideration that while we are using S3 now, it could be any other solution. I would prefer to use agnostic names that are not tied to the service we use. What do you think?

@vasco-santos
Contributor Author

nice to have preserve the filenames and paths for raw file uploads. we could name them files/<hash>/<file path> or similar... or we could wait for the CID to come back from cluster and key raw file uploads with a CID prefix,

👍🏼 this is a good call

I will add this. Waiting on the CID from cluster would mean longer response times for uploads (and more likely timeouts from the worker). I think you meant the upload name here? For /files we could use the file name/path, but for a CAR this would require unixfs exporting. My proposal is to use backup keys as follows:

  • /files/<upload_name>/car/<hash>
  • /files/<upload_name>/file/<hash>
  • /directory/<upload_name>/file/<hash>

Let me know your thoughts

@olizilla
Contributor

My point about preserving file names was motivated by the difference that a CAR of a file tree would preserve all the file names internally, so it would be OK to simply key those by the hash of the CAR, but we would lose all the filenames of an uploaded directory if the files were only keyed by hash. I had not considered the user "upload name" property. I'm kinda OK with that remaining in the db for now.

It would be worth spiking out what a recovery script would look like. At its simplest we'd just want to be able to quickly stand up a new cluster and re-add all the content from S3, producing the same CIDs. Simple enough for the CAR files, but it needs more care for the raw files.

Member

@alanshaw alanshaw left a comment

I don't want to stand in the way of this, but it would be awesome if these went into S3 keyed by root CID, so we could just serve them from a gateway easily. I feel that would be quite difficult to do as this PR stands, but I appreciate that this is not the motivation behind the work and it is just a nice to have.

Maybe partials could go in as directories:

root_cid/sha256(car0)
root_cid/sha256(car1)
root_cid/sha256(car2)

...and then we stream them out by listing the directory and trimming the CAR header from all but the first.

Hard bits:

  • detect partials (maybe everything gets a directory even if not partial)
  • non-deterministic CARs
  • it would be nice not to do a directory listing for non-partials (I'd like to just be able to get root_cid.car from the bucket)

The cool bit here is that if we can add by root CID then we don't need to store the bucket key of each partial CAR along with the upload...
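
A sketch of the header-trimming idea above for streaming partials back out, assuming CARv1 layout (an unsigned-varint length prefix followed by a dag-cbor header, then the blocks); only the first partial keeps its header:

// Byte offset where the CARv1 body starts: varint prefix length + header length.
function carBodyOffset (bytes) {
  let len = 0
  let shift = 0
  let i = 0
  while (true) {
    const b = bytes[i++]
    len |= (b & 0x7f) << shift
    if ((b & 0x80) === 0) break
    shift += 7
  }
  return i + len
}

// Concatenate partial CARs for one root: keep the first whole, strip headers from the rest.
function concatPartialCars (partials) {
  return partials.map((bytes, idx) => idx === 0 ? bytes : bytes.subarray(carBodyOffset(bytes)))
}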

env.s3BucketName = env.S3_BUCKET_NAME || S3_BUCKET_NAME
}
} catch { // not required in dev mode
console.log('no setup for backups')
Member

Is it possible to detect if we're in dev mode? We should throw the error in production.
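
One possible shape for that, assuming an ENV value that distinguishes dev from production (the binding name and value are assumptions):

try {
  // ... S3 client / bucket setup as above ...
} catch (err) {
  if (env.ENV === 'production') throw err // backup config must exist in production
  console.log('no setup for backups') // tolerated in dev mode only
}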

Select('backupData', Var('data')),
Lambda(
['data'],
Create('Backup', {
Member

If a user uploads the same file twice then we'll get multiple backup objects for the same CAR pointing to the same bucket+key?

Contributor Author

We will get a single S3 object stored, considering it is stored in a key-value fashion, but multiple entries in Fauna for the same file.

I thought about adding an index at first, but it would run on every single upload to check if the backup already exists. Using @unique would also make the request fail, which we don't want.

So, given we should just iterate over the list of objects in S3 for the backup, I think the best solution is to simply keep a record of everything as is. With the record, we can easily access specific data to prioritize backups as needed.

What do you think?

Contributor Author

Regarding this, if the same upload is created again, we will not create the Upload entry in Fauna, which means we will not create the backups for it in Fauna either.

Contributor Author

The above is only true because there is a bug 🤦🏼 We are not adding the following chunks to Fauna... Working on a PR

@olizilla
Contributor

olizilla commented Sep 14, 2021

We want to avoid the perf hit of uploading to 2 places in serial, so how about:

  • add to cluster and upload to S3 in parallel, to a temporary staging prefix like /new/<timestamp>/<sha256>
  • when we get a CID back from the cluster add, rename/mv the S3 key to be prefixed with the CID instead? /car/<cid>

The (probably safe) assumption is that it will be quicker to rename a key in S3 than it is to upload the file.

@vasco-santos
Contributor Author

vasco-santos commented Sep 14, 2021

the (probably safe) assumption is that it will be quicker to rename a key in s3 than it is to upload the file.

@olizilla Sadly, the S3 JS SDK does not seem to support rename/move as an atomic operation. We would need to use copyObject followed by deleteObject to achieve this without keeping duplicated data.

We can do the pin and the backup in parallel, but given this is for disaster recovery, I think we are better off just getting the information from the DB to perform the recovery.

@alanshaw I like your suggestion, but as you mention this goes a bit beyond the scope of a backup. If we want this to be more than a backup, and actually be able to use S3 as the backend of a gateway, we likely need to compute the root CID first, i.e. cluster add + backup in serial. When we receive a CAR, we can likely trust the received CAR and use its root for the backup key (if we receive a different root from IPFS Cluster later on, we can revert the backup). But with generic files, we will not have the root CID beforehand.
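
For reference, a sketch of the non-atomic rename via copyObject + deleteObject with @aws-sdk/client-s3 (bucket and key names are illustrative):

import { CopyObjectCommand, DeleteObjectCommand } from '@aws-sdk/client-s3'

// "Rename" an S3 object by copying it to the new key, then deleting the old one.
// Not atomic: a failure between the two calls leaves both keys in place.
async function renameObject (s3, bucket, fromKey, toKey) {
  await s3.send(new CopyObjectCommand({
    Bucket: bucket,
    CopySource: `${bucket}/${fromKey}`, // CopySource includes the source bucket
    Key: toKey
  }))
  await s3.send(new DeleteObjectCommand({ Bucket: bucket, Key: fromKey }))
}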

@olizilla
Contributor

@alanshaw It's unfortunate, but I think we have to avoid treating what goes into S3 as something we could serve back to the world as content-addressed data, without some post-processing.

The problem is these are user uploads from the wild. They could send us a deliberately malformed CAR that contains blocks that are not part of the DAG for the declared root CID. Such a CAR could be added to cluster, and we'd get back the root CID as declared, but it would not be correct to store and use that file as the CAR file for that CID.

@vasco-santos
Contributor Author

Given #480 is on the way to unify both /upload and /car, we will wait for it to be merged before this one.
This allows us to know the root CID in advance and act accordingly to save data into S3 with the following namespace:

/${rootCid}/${userId}/${carHash}

As a next step, we will also look into a script to boot a new IPFS Cluster with all the data stored in S3, as well as S3 data normalization.
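
As a hint of what such a recovery script could do with that namespace, a sketch that lists every backed-up CAR for a given root CID (bucket name, client setup and pagination handling are illustrative):

import { S3Client, ListObjectsV2Command } from '@aws-sdk/client-s3'

// List all backup keys stored under `${rootCid}/` so a recovery job can
// re-add each CAR to a fresh IPFS Cluster.
async function listBackupsForRoot (s3, bucket, rootCid) {
  const keys = []
  let ContinuationToken
  do {
    const res = await s3.send(new ListObjectsV2Command({
      Bucket: bucket,
      Prefix: `${rootCid}/`,
      ContinuationToken
    }))
    keys.push(...(res.Contents || []).map((obj) => obj.Key))
    ContinuationToken = res.NextContinuationToken
  } while (ContinuationToken)
  return keys
}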

olizilla and others added 20 commits September 24, 2021 17:24
Have the API pack uploaded files into a CAR, to make the two paths more similar and ideally simplify the upload-to-S3 path.

There were two ideas here
1. If uploaded files are packed into a CAR by the API before sending to cluster, we then only have to deal with sending CARs to S3, which would be nicer than having to handle both raw files and CARs.
2. We have a lot of different notions of "how big is this thing"... the size of the uploaded CAR, the cumulative size of uploaded files, the sum of the size of each block in the CAR, the size or bytes value returned by cluster, and the size of all the blocks in the dag where we understand the ipld codecs and can follow the links, which could be different to the sum of the size of blocks in a CAR if it has redundant or duplicate blocks. It would be good to simplify that.

The dream was
- To lean on cluster to get the unixFS CumulativeSize when uploading a CAR with files with a unixFS root
- ...and pack raw file uploads into a CAR.

The reality is
- We can't get the unixFS CumulativeSize out of cluster. It might return the FileSize for a CAR with a single file in it, but that will be 0 for Directories.
- We can't pack files into a unixFS-flavoured CAR with ipfs-car in a Cloudflare Worker today, as it fails with an error about `importScripts` not being available in Cloudflare Workers.
License: (Apache-2.0 AND MIT)
Signed-off-by: Oli Evans <[email protected]>
Contributor

@olizilla olizilla left a comment

🚀

Successfully merging this pull request may close these issues.

backups: write CAR uploads to S3