feat: storage backup #417
Conversation
Force-pushed from 3e044f4 to 5e81416 (compare)
packages/api/package.json (outdated)

      },
      "bundlesize": [
        {
          "path": "./dist/main.js",
    -     "maxSize": "1 MB",
    +     "maxSize": "1.4 MB",
This makes a huge difference given https://bundlephobia.com/package/@aws-sdk/[email protected]
As this is the worker, it is not problematic, even though it is a lot
Bundlephobia says it tree-shakes, so it may not be so bad. Is our build process taking advantage of that?
True, thanks. I did not see that we didn't have tree-shaking enabled. Just added the optimization to the webpack config.
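For reference, a minimal sketch of what enabling tree-shaking in a webpack config can look like; these are standard webpack options, not necessarily the exact change made in this repo:

```js
// webpack.config.js (illustrative sketch, not this repo's actual config)
module.exports = {
  mode: 'production', // production mode enables minification and dead-code elimination
  optimization: {
    usedExports: true, // mark exports that are never imported so they can be dropped
    sideEffects: true  // honour "sideEffects" hints in package.json for safe pruning
  }
}
```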
Uh oh, 1MB is the limit for the script size we can upload as a single Cloudflare Worker: https://developers.cloudflare.com/workers/platform/limits#script-size
Oh, I was not aware of it. Per the docs:
> A Workers script can be up to 1MB in size after compression.
Our bundlesize check is done pre-compression though. I will look into that too.
OK, making the bundle size job test the compressed output changed the reported bundle size from 1.06MB to 287.12KB.
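For context, a sketch of how the check could be switched to measure compressed size, assuming bundlesize's `compression` option; the exact change in this PR may differ:

```json
"bundlesize": [
  {
    "path": "./dist/main.js",
    "maxSize": "1 MB",
    "compression": "gzip"
  }
]
```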
Force-pushed from 5e81416 to 8fe63f7 (compare)
Force-pushed from 0b84c98 to 990b5cf (compare)
Force-pushed from 990b5cf to 8645721 (compare)
packages/api/src/env.js
Outdated
const s3Endpoint = env.S3_BUCKET_ENDPOINT || (typeof S3_BUCKET_ENDPOINT === 'undefined' ? undefined : S3_BUCKET_ENDPOINT) | ||
env.s3Client = new S3Client({ | ||
endpoint: s3Endpoint, | ||
forcePathStyle: !!s3Endpoint, // Force path if endpoint provided |
This is the only way we can enforce a local endpoint URL for testing. Alternatively, given backup is not required, we could remove this and not test with it, as we should be able to trust the @aws-sdk/client-s3 module.
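To illustrate the point, a sketch of an S3 client pointed at a local S3-compatible endpoint for tests; the endpoint URL and credentials below are placeholders, not values from this repo:

```js
import { S3Client } from '@aws-sdk/client-s3'

// Path-style addressing is needed for local endpoints (e.g. MinIO), which
// don't resolve virtual-hosted-style bucket subdomains.
const s3 = new S3Client({
  region: 'us-east-1',
  endpoint: 'http://127.0.0.1:9000', // placeholder local endpoint
  forcePathStyle: true,
  credentials: { accessKeyId: 'test', secretAccessKey: 'test' } // placeholder test credentials
})
```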
Note to self: this "global or env" pattern is getting out of hand. We could smoosh the globals into the env property in the top-level fetch event handler to simplify this code and anywhere else we need to use an env var.
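A rough sketch of that idea, assuming a service-worker-format worker where config lives in globals; `handleRequest` is a hypothetical handler name, not this repo's actual entry point:

```js
// Copy the service-worker globals onto a single env object once, at the top
// level, so downstream code only ever reads env.* instead of "global or env".
addEventListener('fetch', event => {
  const env = {
    S3_BUCKET_ENDPOINT: typeof S3_BUCKET_ENDPOINT === 'undefined' ? undefined : S3_BUCKET_ENDPOINT,
    S3_BUCKET_NAME: typeof S3_BUCKET_NAME === 'undefined' ? undefined : S3_BUCKET_NAME
  }
  event.respondWith(handleRequest(event.request, env)) // hypothetical handler
})
```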
Force-pushed from 8645721 to dbe0dec (compare)
This is decent, and we could deploy as is, but I think we should:

1. Rename things to be more specific: `Backup` -> `S3Object` (or similar). We also have code to back our pins up to Pinata, so we should be clearer in the code and schema that these new code paths are about staging user uploads on S3 while we wait for the Filecoin deals to become active.
2. Have separate S3 object key prefixes, e.g. `car/` and `files/` (or buckets), for CAR uploads vs raw file uploads. They will have to be handled separately to recreate a cluster from a bucket, so it would be helpful to be able to query S3 for a list of each type. I am imagining a recovery job that would have to slurp in files with either `format=car` or `format=unixfs` to a new ipfs-cluster. That could be determined from our db, but it'd be operationally comforting if we could also see the difference in S3 (a key-prefix sketch follows this list).
3. Nice to have: preserve the filenames and paths for raw file uploads. We could name them `files/<hash>/<file path>` or similar... or we could wait for the CID to come back from cluster and key raw file uploads with a CID prefix, so a recovery script could determine which raw files were uploaded together. Again, this is all available in the db, but it would be rad if we didn't have to trust and rely on the db in a disaster scenario.
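A tiny sketch of the key-prefix idea from points 2 and 3 above; the prefixes and helper name are illustrative, not part of this PR:

```js
// Hypothetical helper: derive a backup key with a type prefix so CAR uploads
// and raw file uploads can be listed separately in S3.
function backupKey ({ isCar, contentHash, filePath }) {
  return isCar
    ? `car/${contentHash}.car`
    : `files/${contentHash}/${filePath}` // preserves the original file path for raw uploads
}
```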
packages/db/fauna/schema.graphql
Outdated
""" | ||
Backup bucket name. | ||
""" | ||
name: String! |
`name` is ambiguous here, it should be `bucketName`.
I named `backup` and `name` here taking into consideration that we are using S3, but it could be any other solution. I would prefer to use agnostic names that are not tied to the service we use. What do you think?
👍🏼 This is a good call, I will add it. Waiting on the CID from cluster will mean longer response times in uploads (and more likely timeouts from the worker). I think you meant upload name here? For
Let me know your thoughts
My point about preserving file names was motivated by the difference where a CAR of a file tree would preserve all the file names internally, so it would be OK to simply key those by the hash of the CAR, but we would lose all the filenames of an uploaded directory if it were only keyed by hash. I had not considered the user "upload name" property. I'm kinda OK with that remaining in the db for now. It would be worth spiking out what a recovery script would look like. At its simplest we'd just want to be able to quickly stand up a new cluster and re-add all the content from S3, producing the same CIDs. Simple enough for the CAR files, but needs more care for the raw files.
I don't want to stand in the way of this, but it would be awesome if these went into S3 by root CID, so we could just serve them from a gateway easily. I feel that would be quite difficult to do as this PR stands, but I appreciate that this is not the motivation behind the work and is just a nice-to-have.
Maybe partials could go in as directories:
root_cid/sha256(car0)
root_cid/sha256(car1)
root_cid/sha256(car2)
...and then we stream them out by listing the directory and trimming the CAR header from all but the first.
Hard bits:
- detect partials (maybe everything gets a directory even if not partial)
- non-deterministic CARs
- it would be nice not to do a directory listing for non-partials (I'd like to just be able to get `root_cid.car` from the bucket)
The cool bit here is that if we can add by root CID then we don't need to store the bucket key of each partial CAR along with the upload...
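A rough sketch of the "trim the CAR header from all but the first" idea, using @ipld/car to re-emit the blocks of every partial under a single header; how the partial CAR bytes are fetched from S3 is left out, and this ignores the hard bits listed above:

```js
import { CarBlockIterator, CarWriter } from '@ipld/car'
import { CID } from 'multiformats/cid'

// `partials` is an array of Uint8Arrays, one per partial CAR under root_cid/.
async function * combinedCar (rootCid, partials) {
  const { writer, out } = CarWriter.create([CID.parse(rootCid)])
  const pump = (async () => {
    for (const bytes of partials) {
      const blocks = await CarBlockIterator.fromBytes(bytes)
      for await (const block of blocks) {
        await writer.put(block) // each partial's own header is dropped here
      }
    }
    await writer.close()
  })()
  yield * out // one CAR stream: a single header followed by all the blocks
  await pump
}
```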
packages/api/src/env.js (outdated)

      env.s3BucketName = env.S3_BUCKET_NAME || S3_BUCKET_NAME
    }
  } catch { // not required in dev mode
    console.log('no setup for backups')
Is it possible to detect if we're in dev mode? We should throw the error in production.
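One way this could look, assuming the deploy environment is exposed to the worker as an `ENV` var; the variable name and values are assumptions, not something this PR defines:

```js
try {
  env.s3BucketName = env.S3_BUCKET_NAME || S3_BUCKET_NAME
} catch (err) {
  // Assumed: an ENV binding naming the deploy environment ('dev' | 'staging' | 'production').
  if (env.ENV === 'production') {
    throw new Error(`S3 backup is not configured: ${err.message}`)
  }
  console.log('no setup for backups') // tolerated in dev mode only
}
```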
    Select('backupData', Var('data')),
    Lambda(
      ['data'],
      Create('Backup', {
If a user uploads the same file twice then we'll get multiple backup objects for the same CAR pointing to the same bucket+key?
We will get a single S3 object stored, considering it is stored in a key-value fashion, but multiple entries in Fauna for the same file.
I thought about adding an index at first, but it would run on every single upload just to check if it already exists. Using @unique would also make the request fail, which we don't want.
So, given we should just iterate over the list of objects in S3 for backup, I think the best solution is to simply keep a record of everything as is. With the record, we can easily access specific data to prioritize backups as needed.
What do you think?
Regarding this, if the same upload is created again, we will not create the Upload entry in Fauna, which means we will not create the backups for it in Fauna either.
The above is only true because there is a bug 🤦🏼 We are not adding the following chunks to Fauna... Working on a PR
We want to avoid the perf hit of uploading to 2 places in serial, so how about:
The (probably safe) assumption is that it will be quicker to rename a key in S3 than it is to upload the file.
@olizilla Sadly, the S3 JS SDK does not seem to support rename/move as an atomic operation. We would need to use copyObject followed by deleteObject to achieve this without duplicating data. We can make pin and backup in parallel, but given this is for disaster recovery, I think we are better off just getting the information from the DB to perform the recovery. @alanshaw I like your suggestion, but as you mention this might go a bit beyond the scope of the backup. If we want this to be more than a backup, and actually be able to use S3 as a backend for a gateway, we likely need to compute the rootCID first, i.e. cluster add + backup in serial. When we receive a CAR, we can likely trust the received CAR and use its root for the backup (if we receive a different root from IPFS Cluster later on, we can revert the backup). But with generic files, we will not have the rootCID beforehand.
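For reference, a sketch of the copyObject + deleteObject "rename" mentioned above with @aws-sdk/client-s3; note it is two requests, so it is not atomic:

```js
import { S3Client, CopyObjectCommand, DeleteObjectCommand } from '@aws-sdk/client-s3'

// Emulate a rename: S3 has no native move, so copy to the new key then delete the old one.
// Keys containing special characters would need URL-encoding in CopySource.
async function renameObject (s3, bucket, fromKey, toKey) {
  await s3.send(new CopyObjectCommand({
    Bucket: bucket,
    CopySource: `${bucket}/${fromKey}`, // source is "<bucket>/<key>"
    Key: toKey
  }))
  await s3.send(new DeleteObjectCommand({ Bucket: bucket, Key: fromKey }))
}
```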
@alanshaw It's unfortunate, but I think we have to avoid treating what goes into S3 as something we could serve back to the world as content-addressed data without some post-processing. The problem is these are user uploads from the wild. They could send us a deliberately malformed CAR that contained blocks that are not part of the DAG for the declared root CID. Such a CAR could be added to cluster, and we'd get back the root CID as declared, but it would not be correct to store and use that file as the CAR file for that CID.
Given #480 is on the way to unify both
As a next step, we will also look into a script to boot a new IPFS cluster with all the data stored in S3, as well as S3 data normalization.
Have the API pack uploaded files into a CAR, to make the two paths more similar and ideally simplify the upload-to-S3 path. There were two ideas here:

1. If uploaded files are packed into a CAR by the API before sending to cluster, we then only have to deal with sending CARs to S3, which would be nicer than having to handle both raw files and CARs.
2. We have a lot of different notions of "how big is this thing": the size of the uploaded CAR, the cumulative size of uploaded files, the sum of the size of each block in the CAR, the size or bytes value returned by cluster, and the size of all the blocks in the DAG where we understand the IPLD codecs and can follow the links, which could be different from the sum of the size of blocks in a CAR if it has redundant or duplicate blocks. It would be good to simplify that.

The dream was:
- to lean on cluster to get the unixFS CumulativeSize when uploading a CAR with files with a unixFS root
- ...and to pack raw file uploads into a CAR.

The reality is:
- We can't get the unixFS CumulativeSize out of cluster. It might return the FileSize for a CAR with a single file in it, but that will be 0 for directories.
- We can't pack files into a unixFS-flavoured CAR with ipfs-car in a Cloudflare Worker today, as it fails with an error about `importScripts` not being available in Cloudflare Workers.

License: (Apache-2.0 AND MIT) Signed-off-by: Oli Evans <[email protected]>
Co-authored-by: Alan Shaw <[email protected]>
Co-authored-by: Oli Evans <[email protected]>
Force-pushed from c23e734 to 548b2d2 (compare)
🚀
This PR adds storage backup per #395
In short, it sends each received file to S3 to guard against possible failures, and we keep it there until we no longer need a backup. This includes the following changes in the codebase:
- `POST /car`, `POST /upload` add the received file(s) to S3, keyed by content hash (in parallel with adding to cluster); a minimal sketch of this flow is at the end of this description
- `createUpload` UDF updated to add the received backupKeys to the upload

Assumptions:
Possible improvements we can do in follow up PRs:
Production and staging credentials were added to wrangler, and the infrastructure was set up.
Closes #395
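For illustration, a minimal sketch of the parallel add-to-cluster + backup-to-S3 flow described above; `addToCluster` and `backupToS3` are hypothetical helpers standing in for the real cluster client and S3 calls, and the key scheme is illustrative:

```js
import { PutObjectCommand } from '@aws-sdk/client-s3'

// Hypothetical helper: back a received CAR up to S3, keyed by its content hash.
async function backupToS3 (s3, bucketName, contentHash, carBytes) {
  const key = `car/${contentHash}.car` // illustrative key scheme
  await s3.send(new PutObjectCommand({ Bucket: bucketName, Key: key, Body: carBytes }))
  return key
}

// Run the cluster add and the S3 backup in parallel rather than in serial.
async function handleCarUpload ({ cluster, s3, bucketName, contentHash, carBytes }) {
  const [pinResult, backupKey] = await Promise.all([
    addToCluster(cluster, carBytes), // hypothetical cluster client call
    backupToS3(s3, bucketName, contentHash, carBytes)
  ])
  return { pinResult, backupKey } // backupKey is what gets passed to the createUpload UDF
}
```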