Large multipart uploads with aws cli to GCS fail #330

Closed
dantman opened this issue Aug 25, 2020 · 6 comments · Fixed by #333

dantman commented Aug 25, 2020

I was using a docker image (mysql-backup) that uploads to an "S3" backend using the aws cli. Trying to use it with s3proxy connected to GCP as the upload destination fails with the following error:

An error occurred (BadDigest) when calling the CompleteMultipartUpload operation

More details from the log (includes the shell commands executed and the full cli response):

+ AWS_ENDPOINT_OPT='--endpoint-url http://cloud-backups-s3proxy.default.svc.cluster.local'
+ aws --endpoint-url http://cloud-backups-s3proxy.default.svc.cluster.local s3 cp /tmp/backups/db_backup_2020-08-24T20:46:05Z.tgz s3://cloud-backups/db/db_backup_2020-08-24T20:46:05Z.tgz
Complupload failed: tmp/backups/db_backup_2020-08-24T20:46:05Z.tgz to s3://cloud-backups/db/db_backup_2020-08-24T20:46:05Z.tgz An error occurred (BadDigest) when calling the CompleteMultipartUpload operation (reached max retries: 4): Bad Request

I assume a similar error could be triggered by configuring s3proxy to use GCP, using the aws s3 cli to do an upload, and lowering multipart_threshold to force the cli to do multipart uploads for smaller files.

I also presume GCP and Amazon have different interpretations of how hash digests of multipart uploads should work.
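
For reference, if anyone wants to reproduce this without a multi-GB file, something like the following should force the cli into a multipart upload (untested on my end; the 8MB/5MB values and the endpoint are just placeholders):

aws configure set default.s3.multipart_threshold 8MB
aws configure set default.s3.multipart_chunksize 5MB
aws --endpoint-url http://<s3proxy-host> s3 cp ./some-10MB-file s3://cloud-backups/test/some-10MB-file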

gaul (Owner) commented Aug 26, 2020

How large is the object you are uploading and what are the part sizes/number of parts? Also can you share any relevant logs when running S3Proxy with trace-level logging?

dantman (Author) commented Aug 26, 2020

Sorry, the uploaded files come from a db dump that takes ~15 minutes to run and then disappears with the container, so I can't easily get you exact numbers.

I can tell you that the managed SQL DB that is the source of the dump has an SSD that is only 31GB. The information_schema data suggests a database size of about 10GB, and my monitoring says the pod that did the dump used 6.06GiB of disk (for its whole operation of dumping 3 dbs to SQL, archiving those files into a tarball, then uploading that tarball to "s3"). So the file is probably <6GB.

The upload is done by aws s3 cp [file] s3://..., so the number and sizes of parts are handled by aws-cli.

dantman (Author) commented Aug 26, 2020

I can, however, confirm that s3proxy connected to GCP (and GCS's own S3-compatible endpoint) does work with aws s3 cp if the file is small enough that it does not trigger a multipart upload. So this is definitely specific to files large enough to trigger a multipart upload but still within the limits of what you can upload to S3/GCS.

gaul (Owner) commented Aug 26, 2020

AWS CLI defaults to an 8 MB part size, so a 6 GB object would be roughly 750 parts. GCS natively supports only 32 parts. Can you try changing the value of multipart_chunksize to something larger, e.g., 1 GB? This should work around the symptoms.
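
If it helps, that setting can be changed without a cli flag, e.g. via aws configure for the default profile (an edit to ~/.aws/config works as well):

aws configure set default.s3.multipart_chunksize 1GB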

I think S3Proxy could be changed to do something more complicated by using GCS's ability to recursively combine sets of 32 parts, although this would take some effort.
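
To illustrate the shape of that workaround (this is not what S3Proxy does internally, which goes through jclouds; the bucket and part names here are hypothetical), GCS compose can be applied in two levels, e.g. with gsutil:

# first level: each compose call accepts at most 32 sources
# (bash expands part-0{00..31} to 32 object names)
gsutil compose gs://bucket/part-0{00..31} gs://bucket/intermediate-0
gsutil compose gs://bucket/part-0{32..63} gs://bucket/intermediate-1
# second level: combine the intermediates into the final object
gsutil compose gs://bucket/intermediate-0 gs://bucket/intermediate-1 gs://bucket/final

Two levels of 32 allow at most 32 × 32 = 1024 parts.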

gaul changed the title from "Multipart uploads with aws cli to GCP (Google Could Files) fail" to "Large multipart uploads with aws cli to GCS fail" on Aug 26, 2020

dantman (Author) commented Aug 27, 2020

Hmmm, that GCP limitation sounds awful. Why can't everyone just agree on one standardized API for cloud file storage?

I tried changing the multipart_chunksize option to 1GB (which was awkward inside a Docker container, since aws-cli doesn't offer a cli option for it) and confirmed that the config file is actually loaded. However, I still ended up with the error.

I don't know how to debug which chunks aws-cli sent, so I can't verify whether this ended up under the limit (and indicates a different issue) or whether my test was just faulty.
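
If I get another chance to run this, one way to see what the cli actually sent might be the global --debug flag, grepping the output for the part uploads (the exact log lines vary by cli version):

aws --debug --endpoint-url http://<s3proxy-host> s3 cp ./backup.tgz s3://cloud-backups/test/backup.tgz 2>&1 | grep -i UploadPart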

gaul added a commit that referenced this issue Sep 22, 2020
This recursively combines up to 32 sets of 32 parts, allowing 1024
part multipart uploads.  Fixes #330.

gaul (Owner) commented Sep 22, 2020

@dantman Agree that multiple protocols are painful for users. Many of the non-S3 implementations have added either partial or full support for S3, so this situation is improving. While I work for Google, I have no relationship with Google Cloud, so I recommend giving them feedback directly, by Twitter or otherwise. GCS does offer S3-compatible access, but it does not support MPU at all. You might be able to configure your application to not use multipart upload, since the GCS S3-compatible endpoint supports objects greater than 5 GB.

I am confused why changing the chunk size did not work, and you might want to debug this further. I spent a few hours looking into recursively combining objects in S3Proxy to work around this limitation; you can test #333. It needs a little more work before I merge it, but I would appreciate it if you could give feedback.

gaul closed this as completed in #333 on Mar 7, 2021
gaul added a commit that referenced this issue Mar 7, 2021
This recursively combines up to 32 sets of 32 parts, allowing 1024
part multipart uploads.  Fixes #330.