feat: storage backup #417

Merged
merged 20 commits into from
Sep 27, 2021
Conversation

vasco-santos
Contributor

@vasco-santos vasco-santos commented Aug 29, 2021

This PR adds storage backup per #395

In short, each received file is sent to S3 as a safeguard against possible failures, and we will keep it there until we no longer need a backup. This includes the following changes in the codebase:

  • POST /car and POST /upload add the received file(s) to S3, keyed by their content hash, in parallel with adding to Cluster (see the sketch after this list)
  • added a database collection to keep track of existing backups per upload
    • with support for adding a deleted timestamp
    • thought about a relation with Content instead, but if we eventually want to return success when we can add the files to the backup but not to Cluster, we should not have the Content created in the DB
  • createUpload UDF updated to add the received backupKeys to the upload
    • note that we support uploading directories, so this must receive an array
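
For illustration, a rough sketch of the parallel add described above; the helper names (sha256Hex, env.cluster.add) are assumptions, not the actual code in this PR:

import { PutObjectCommand } from '@aws-sdk/client-s3'

// Sketch: back up the CAR bytes to S3 while adding them to IPFS Cluster.
async function addCarWithBackup (env, carBytes) {
  const backupKey = await sha256Hex(carBytes) // content hash used as the S3 key (assumed helper)
  const [clusterRes] = await Promise.all([
    env.cluster.add(carBytes), // add to IPFS Cluster
    env.s3Client.send(new PutObjectCommand({ // backup to S3 in parallel
      Bucket: env.s3BucketName,
      Key: backupKey,
      Body: carBytes
    }))
  ])
  return { cid: clusterRes.cid, backupKeys: [backupKey] } // stored via the createUpload UDF
}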

Assumptions:

  • The implementation currently assumes backup does not need to be configured for development. I find this to be the best alternative, given that creating a personal AWS account takes days to get verified.
  • Given we only use this for PUTs (GETs would only happen in catastrophic events), and putting to a single zone is faster than the parallel ops (cluster.add + cluster.status), we don't currently need to spread this across different zones for faster GETs.
    • Either way, we will store the name of the bucket so that we can later evolve this solution into a map of Cloudflare country regions to S3 regions.
    • We will need to implement a custom solution for this, given there is no such map on npm as far as I can tell.

Possible improvements we can do in follow up PRs:

  • Setup S3 bucket cross replication for other zones
  • Do not fail the request if we can add to the backup but not to Cluster
    • We would need a job to retry adding the errored data to Cluster
  • ...

Production and staging credentials were added to wrangler, and the infrastructure was set up

Closes #395

@vasco-santos vasco-santos force-pushed the feat/storage-backup branch 4 times, most recently from 3e044f4 to 5e81416 on August 30, 2021 10:34
},
"bundlesize": [
  {
    "path": "./dist/main.js",
-   "maxSize": "1 MB",
+   "maxSize": "1.4 MB",
Contributor Author

This makes a huge difference given https://bundlephobia.com/package/@aws-sdk/[email protected]

As this is the worker bundle, it is not problematic, even though it is a lot

Contributor

bundlephobia says it tree-shakes, so it may not be so bad. Is our build process taking advantage of that?

Contributor Author

True, thanks. I had not noticed we didn't have tree-shaking enabled. Just added the optimization to the webpack config
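
For reference, a minimal sketch of the kind of webpack setting meant here (the actual config in the repo may differ):

// webpack.config.js (sketch): production mode plus explicit usedExports
// marks unused exports so unused parts of @aws-sdk/client-s3 can be dropped.
module.exports = {
  mode: 'production',
  optimization: {
    usedExports: true
  }
}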

Contributor

uh oh, 1MB is the limit for script size we can upload as a single cloudflare worker https://developers.cloudflare.com/workers/platform/limits#script-size

Contributor Author

oh, I was not aware of it. Per the docs:

A Workers script can be up to 1MB in size after compression.

our bundlesize check is done pre-compression though. I will look into that too

Contributor Author

OK, making the bundlesize job test the compressed output changed the measured bundle size from 1.06MB to 287.12KB
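
If the check uses the bundlesize package, compression is a per-entry option; a sketch only (the compression value and maxSize shown here are assumptions, not the actual change):

"bundlesize": [
  {
    "path": "./dist/main.js",
    "maxSize": "500 kB",
    "compression": "gzip"
  }
]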

@vasco-santos vasco-santos requested review from alanshaw and olizilla and removed request for alanshaw August 31, 2021 14:08
@vasco-santos vasco-santos force-pushed the feat/storage-backup branch 3 times, most recently from 0b84c98 to 990b5cf on September 2, 2021 15:07
@vasco-santos vasco-santos marked this pull request as ready for review September 2, 2021 16:07
const s3Endpoint = env.S3_BUCKET_ENDPOINT || (typeof S3_BUCKET_ENDPOINT === 'undefined' ? undefined : S3_BUCKET_ENDPOINT)
env.s3Client = new S3Client({
  endpoint: s3Endpoint,
  forcePathStyle: !!s3Endpoint, // Force path if endpoint provided
Contributor Author

@vasco-santos vasco-santos Sep 2, 2021

This is the only way we can enforce a local endpoint URL for testing. Alternatively, given backup is not required, we could remove this and not test with it, as we should be able to trust the @aws-sdk/client-s3 module

Contributor

note to self: this "global or env" pattern is getting out of hand. We could smoosh the globals into the env property in the top-level fetch event handler to simplify this code and anywhere else we need to use an env var.
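
A hypothetical sketch of that idea, gathering the bound globals onto a single env object once in the top-level fetch handler (handleRequest and the set of globals shown are assumptions):

// Sketch: copy worker globals into env once, so downstream code only reads env.
addEventListener('fetch', (event) => {
  const env = {
    S3_BUCKET_ENDPOINT: typeof S3_BUCKET_ENDPOINT === 'undefined' ? undefined : S3_BUCKET_ENDPOINT,
    S3_BUCKET_NAME: typeof S3_BUCKET_NAME === 'undefined' ? undefined : S3_BUCKET_NAME
  }
  event.respondWith(handleRequest(event.request, env))
})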

@vasco-santos vasco-santos requested review from alanshaw and olizilla and removed request for olizilla and alanshaw September 2, 2021 16:19

Contributor

@olizilla olizilla left a comment

This is decent, and we could deploy as is, but I think we should:

  • Rename things to be more specific: Backup -> S3Object (or similar). We also have code to back our pins up to Pinata, so we should be clearer in the code and schema that these new code paths are about staging user uploads on S3 while we wait for the Filecoin deals to become active.

  • Have separate S3 object key prefixes, e.g. car/ and files/ (or buckets), for CAR uploads vs raw file uploads. They will have to be handled separately to recreate a cluster from a bucket, so it would be helpful to be able to query S3 for a list of each type. I am imagining a recovery job that would have to slurp CAR files with either format=car or format=unixfs into a new ipfs-cluster. That could be determined from our db, but it'd be operationally comforting if we could also see the difference in S3.

  • Nice to have: preserve the filenames and paths for raw file uploads. We could name them files/<hash>/<file path> or similar... or we could wait for the CID to come back from cluster and key raw file uploads with a CID prefix, so a recovery script could determine which raw files were uploaded together. Again, this is all available in the db, but it would be rad if we didn't have to trust and rely on the db in a disaster scenario.

"""
Backup bucket name.
"""
name: String!
Contributor

name is ambiguous here; it should be `bucketName`.

Contributor Author

I named these backup and name taking into consideration that while we are using S3 now, it could be any other solution. I would prefer to use agnostic names that are not tied to the service we use. What do you think?

@vasco-santos
Contributor Author

nice to have preserve the filenames and paths for raw file uploads. we could name them files/<hash>/<file path> or similar... or we could wait for the CID to come back from cluster and key raw file uploads with a CID prefix,

👍🏼 this is a good call

I will add this. Waiting on the CID from cluster would mean longer response times for uploads (and more likely timeouts from the worker). I think you meant the upload name here? For /files we could use the file name/path, but for a CAR this would require unixfs exporting. My proposal is to use backup keys as follows:

  • /files/<upload_name>/car/<hash>
  • /files/<upload_name>/file/<hash>
  • /directory/<upload_name>/file/<hash>

Let me know your thoughts

@olizilla
Contributor

My point about preserving file names was motivated by the difference that a CAR of a file tree would preserve all the file names internally, so it would be OK to simply key those by the hash of the CAR, but we would lose all the filenames of an uploaded directory if the files were only keyed by hash. I had not considered the user "upload name" property. I'm kinda OK with that remaining in the db for now.

It would be worth spiking out what a recovery script would look like. At its simplest we'd just want to be able to quickly stand up a new cluster and re-add all the content from S3, producing the same CIDs. Simple enough for the CAR files, but it needs more care for the raw files.

Member

@alanshaw alanshaw left a comment

I don't want to stand in the way of this, but it would be awesome if these went into S3 keyed by root CID, so we could just serve them from a gateway easily. I feel that would be quite difficult to do as this PR stands, but I appreciate that this is not the motivation behind the work and it is just a nice to have.

Maybe partials could go in as directories:

root_cid/sha256(car0)
root_cid/sha256(car1)
root_cid/sha256(car2)

...and then we stream them out by listing the directory and trimming the CAR header from all but the first.

Hard bits:

  • detect partials (maybe everything gets a directory even if not partial)
  • non-deterministic CARs
  • it would be nice not to do a directory listing for non-partials (I'd like to just be able to get root_cid.car from the bucket)

The cool bit here is that if we can add by root CID then we don't need to store the bucket key of each partial CAR along with the upload...
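
A sketch of the header-trimming idea above for streaming partials back out, assuming CARv1 layout (an unsigned-varint length prefix followed by a dag-cbor header, then the blocks); only the first partial keeps its header:

// Byte offset where the CARv1 body starts: varint prefix length + header length.
function carBodyOffset (bytes) {
  let len = 0
  let shift = 0
  let i = 0
  while (true) {
    const b = bytes[i++]
    len |= (b & 0x7f) << shift
    if ((b & 0x80) === 0) break
    shift += 7
  }
  return i + len
}

// Concatenate partial CARs for one root: keep the first whole, strip headers from the rest.
function concatPartialCars (partials) {
  return partials.map((bytes, idx) => idx === 0 ? bytes : bytes.subarray(carBodyOffset(bytes)))
}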

env.s3BucketName = env.S3_BUCKET_NAME || S3_BUCKET_NAME
}
} catch { // not required in dev mode
console.log('no setup for backups')
Member

Is it possible to detect if we're in dev mode? We should throw the error in production.
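
One possible shape for that, assuming an ENV value that distinguishes dev from production (the binding name and value are assumptions):

try {
  // ... S3 client / bucket setup as above ...
} catch (err) {
  if (env.ENV === 'production') throw err // backup config must exist in production
  console.log('no setup for backups') // tolerated in dev mode only
}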

Select('backupData', Var('data')),
Lambda(
['data'],
Create('Backup', {
Member

If a user uploads the same file twice then we'll get multiple backup objects for the same CAR pointing to the same bucket+key?

Contributor Author

We will get a single S3 object stored, considering it is stored in a key-value fashion, but multiple entries in Fauna for the same file.

I thought about adding an index at first, but it would run on every single upload to check if the backup already exists. Using @unique would also make the request fail, which we don't want.

So, given we should just iterate over the list of objects in S3 for the backup, I think the best solution is to simply keep a record of everything as is. With the record, we can easily access specific data to prioritize backups as needed.

What do you think?

Contributor Author

Regarding this, if the same upload is created again, we will not create the Upload entry in Fauna, which means we will not create the backups for it in Fauna either.

Contributor Author

The above is only true because there is a bug 🤦🏼 We are not adding the following chunks to Fauna... Working on a PR

@olizilla
Contributor

olizilla commented Sep 14, 2021

We want to avoid the perf hit of uploading to 2 places in serial, so how about:

  • add to cluster and upload to S3 in parallel, to a temporary staging prefix like /new/<timestamp>/<sha256>
  • when we get a CID back from the cluster add, rename/mv the S3 key to be prefixed with the CID instead? /car/<cid>

The (probably safe) assumption is that it will be quicker to rename a key in S3 than it is to upload the file.

@vasco-santos
Contributor Author

vasco-santos commented Sep 14, 2021

the (probably safe) assumption is that it will be quicker to rename a key in s3 than it is to upload the file.

@olizilla Sadly, the S3 JS SDK does not seem to support rename/move as an atomic operation. We would need to use copyObject followed by deleteObject to achieve this without keeping duplicated data.

We can do the pin and the backup in parallel, but given this is for disaster recovery, I think we are better off just getting the information from the DB to perform the recovery.

@alanshaw I like your suggestion, but as you mention this goes a bit beyond the scope of a backup. If we want this to be more than a backup, and actually be able to use S3 as the backend of a gateway, we likely need to compute the root CID first, i.e. cluster add + backup in serial. When we receive a CAR, we can likely trust the received CAR and use its root for the backup key (if we receive a different root from IPFS Cluster later on, we can revert the backup). But with generic files, we will not have the root CID beforehand.
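
For reference, a sketch of the non-atomic rename via copyObject + deleteObject with @aws-sdk/client-s3 (bucket and key names are illustrative):

import { CopyObjectCommand, DeleteObjectCommand } from '@aws-sdk/client-s3'

// "Rename" an S3 object by copying it to the new key, then deleting the old one.
// Not atomic: a failure between the two calls leaves both keys in place.
async function renameObject (s3, bucket, fromKey, toKey) {
  await s3.send(new CopyObjectCommand({
    Bucket: bucket,
    CopySource: `${bucket}/${fromKey}`, // CopySource includes the source bucket
    Key: toKey
  }))
  await s3.send(new DeleteObjectCommand({ Bucket: bucket, Key: fromKey }))
}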

@olizilla
Contributor

@alanshaw It's unfortunate, but I think we have to avoid treating what goes into S3 as something we could serve back to the world as content-addressed data, without some post-processing.

The problem is these are user uploads from the wild. They could send us a deliberately malformed CAR that contains blocks that are not part of the DAG for the declared root CID. Such a CAR could be added to cluster, and we'd get back the root CID as declared, but it would not be correct to store and use that file as the CAR file for that CID.

@vasco-santos
Contributor Author

Given #480 is on the way to unify both /upload and /car, we will wait for it to be merged before this one.
This allows us to know the root CID in advance and act accordingly to save data into S3 with the following namespace:

/${rootCid}/${userId}/${carHash}

As a next step, we will also look into a script to boot a new IPFS Cluster with all the data stored in S3, as well as S3 data normalization.
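
As a hint of what such a recovery script could do with that namespace, a sketch that lists every backed-up CAR for a given root CID (bucket name, client setup and pagination handling are illustrative):

import { S3Client, ListObjectsV2Command } from '@aws-sdk/client-s3'

// List all backup keys stored under `${rootCid}/` so a recovery job can
// re-add each CAR to a fresh IPFS Cluster.
async function listBackupsForRoot (s3, bucket, rootCid) {
  const keys = []
  let ContinuationToken
  do {
    const res = await s3.send(new ListObjectsV2Command({
      Bucket: bucket,
      Prefix: `${rootCid}/`,
      ContinuationToken
    }))
    keys.push(...(res.Contents || []).map((obj) => obj.Key))
    ContinuationToken = res.NextContinuationToken
  } while (ContinuationToken)
  return keys
}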

olizilla and others added 20 commits September 24, 2021 17:24
Have the API pack uploaded files into a CAR, to make the two paths more similar and ideally simplify the upload-to-S3 path.

There were two ideas here
1. If uploaded files are packed into a CAR by the API before sending to cluster, we then only have to deal with sending CARs to S3, which would be nicer than having to handle both raw files and CARs.
2. We have a lot of different notions of "how big is this thing"... the size of the uploaded CAR, the cumulative size of uploaded files, the sum of the size of each block in the CAR, the size or bytes value returned by cluster, and the size of all the blocks in the dag where we understand the ipld codecs and can follow the links, which could be different to the sum of the size of blocks in a CAR if it has redundant or duplicate blocks. It would be good to simplify that.

The dream was
- To lean on cluster to get the unixFS CumulativeSize when uploading a CAR with files with a unixFS root
- ...and pack raw file uploads into a CAR.

The reality is
- We can't get the unixFS CumulativeSize out of cluster. It might return the FileSize for a CAR with a single file in it, but that will be 0 for Directories.
- We can't pack files into a unixFS-flavoured CAR with ipfs-car in a Cloudflare Worker today, as it fails with an error about `importScripts` not being available in Cloudflare Workers.
License: (Apache-2.0 AND MIT)
Signed-off-by: Oli Evans <[email protected]>
Contributor

@olizilla olizilla left a comment

🚀

Successfully merging this pull request may close these issues.

backups: write CAR uploads to S3