
Implement sliced object download in the Python Client library. #388

Closed
yaseenlotfi opened this issue Mar 4, 2021 · 15 comments

@yaseenlotfi

Is your feature request related to a problem? Please describe.
My use case is to download a single, large blob (~16 GB) into memory in a Python application. This happens as part of a startup process that currently takes about 5 minutes. The command-line utility, gsutil, has a way to enable sliced downloads and takes only about 30 seconds on the same machine and network. I would like to take advantage of this optimization in a Pythonic way.

Describe the solution you'd like
Enable sliced downloads in the Python client library such as:
blob.download_to_filename(..., sliced_downloads=True, max_components=16)

This would match gsutil which copies the blob to the local filesystem. It would be great, however, if the blob could be downloaded into memory like:
blob.download_as_bytes(..., sliced_downloads=True, max_components=16)

Describe alternatives you've considered
Knowing that gsutil can run the download concurrently, I tried using the subprocess module to call it. This doesn't work because, unlike invoking it from the command line, it will not run more than one process. It's also not great to run a shell command from a Python process because it assumes the Cloud SDK is set up.

I've tried using ChunkedDownload in conjunction with multiprocessing, but I have not been able to get it to download chunks in parallel. There is also the additional overhead of dealing with the byte-stream buffer, transport authentication, checksum/data validation, etc., which makes this non-trivial.
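
For reference, here is a rough sketch of the kind of manual sliced download described above, using the client library's existing start/end range parameters with a process pool. The bucket and object names are placeholders, and this is not an official API, just an illustration of the approach:

```python
from concurrent.futures import ProcessPoolExecutor

from google.cloud import storage

BUCKET = "my-bucket"          # placeholder
OBJECT_NAME = "large-object"  # placeholder
SLICES = 16


def download_slice(byte_range):
    start, end = byte_range
    client = storage.Client()  # one client per worker process
    blob = client.bucket(BUCKET).blob(OBJECT_NAME)
    # start/end are inclusive byte offsets, i.e. an HTTP Range request.
    return blob.download_as_bytes(start=start, end=end)


def sliced_download():
    client = storage.Client()
    blob = client.bucket(BUCKET).get_blob(OBJECT_NAME)  # fetches metadata, incl. size
    step = -(-blob.size // SLICES)  # ceiling division
    ranges = [(offset, min(offset + step, blob.size) - 1)
              for offset in range(0, blob.size, step)]
    with ProcessPoolExecutor(max_workers=SLICES) as pool:
        parts = pool.map(download_slice, ranges)
    return b"".join(parts)
```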

Additional context
Since gsutil is itself a Python executable, I would imagine this could be implemented in the client library (ultimately making the same HTTP range requests).

The gsutil command I used on a GCE instance with 16 vCPUs:
gsutil -o 'GSUtil:parallel_thread_count=1' -o 'GSUtil:sliced_object_download_max_components=16' cp gs://bucket/key /path/to/destination

I'm also open to an existing solution I'm not aware of, but documentation on this topic is sparse.

@product-auto-label product-auto-label bot added the api: storage Issues related to the googleapis/python-storage API. label Mar 4, 2021
@andrewsg andrewsg added type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. priority: p2 Moderately-important priority. Fix may not be included in next release. labels Mar 4, 2021
@andrewsg
Contributor

andrewsg commented Mar 4, 2021

Thanks for your detailed request. I'll look into this.

@andrewsg
Contributor

andrewsg commented Mar 5, 2021

I'm surprised a 16-slice download improves your time by 10x. Does it really take ten or more slices to saturate your download bandwidth? Is this perhaps functionally a workaround for some sort of bandwidth limiting in your ingress or Google's egress?

@yaseenlotfi
Author

I'm not sure; I ran it on a GCE instance (e2-highmem-16). How can I specifically test for network saturation?

@andrewsg andrewsg removed the priority: p2 Moderately-important priority. Fix may not be included in next release. label Mar 8, 2021
@andrewsg
Copy link
Contributor

andrewsg commented Mar 8, 2021

Thanks, that should be enough info; if it was on a GCE instance then we can use that for analysis when we're tackling this feature.

@yaseenlotfi
Author

Sounds good. To be clear, the key characteristic here is that when running a sliced download with gsutil configured as described, you can see all 16 cores of the machine used to capacity (visible in htop). This contrasts with just a single process running on a single core when calling the same command from a Python subprocess or when using the client library's blob.download_as_x methods.

@andrewsg
Contributor

andrewsg commented Mar 9, 2021

I see, so it's CPU-bound on your use case. That will be the first thing to look into, then. Thanks.

@tqa236

tqa236 commented Jan 2, 2023

Should this issue be closed by #844 (as mentioned in the description)? That was reverted and added back in #943.

I think if we want GitHub to close two issues, we need to write "fixes #xxx and fixes #yyy".

@andrewsg
Contributor

andrewsg commented Apr 4, 2023

This is solved by #1002.
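
A minimal usage sketch of the transfer_manager-based sliced download that shipped, assuming a recent google-cloud-storage release (parameter defaults may differ by version; bucket, object, and path names are placeholders):

```python
from google.cloud import storage
from google.cloud.storage import transfer_manager

client = storage.Client()
blob = client.bucket("my-bucket").blob("large-object")  # placeholders

# Downloads the object in chunk_size slices using a pool of workers.
transfer_manager.download_chunks_concurrently(
    blob,
    "/path/to/destination",
    chunk_size=32 * 1024 * 1024,  # 32 MiB per slice
    max_workers=16,
)
```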

@andrewsg andrewsg closed this as completed Apr 4, 2023
@gdhananjay

gdhananjay commented Jul 23, 2023

@andrewsg @tqa236
This is not working on Cloud Run, and I am not able to spot the difference. Here are the detailed issue links:
https://stackoverflow.com/questions/76747991/cloud-bucket-blob-download-is-very-slow-in-cloud-run
https://www.googlecloudcommunity.com/gc/Serverless/cloud-bucket-blob-download-is-very-slow-in-cloud-run/m-p/614852/highlight/true#M1926

Your help is really appreciated.

Basically, blob.download_to_filename and transfer_manager.download_chunks_concurrently show no difference on Cloud Run, whereas the concurrent download works well on Cloud Shell.
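
For context, a rough timing comparison along the lines of what is described above might look like this (bucket, object, and local paths are placeholders):

```python
import time

from google.cloud import storage
from google.cloud.storage import transfer_manager

client = storage.Client()
blob = client.bucket("my-bucket").blob("large-object")  # placeholders

# Single-stream download.
t0 = time.monotonic()
blob.download_to_filename("/tmp/single.bin")
single = time.monotonic() - t0

# Sliced download via the transfer manager.
t0 = time.monotonic()
transfer_manager.download_chunks_concurrently(blob, "/tmp/sliced.bin", max_workers=8)
sliced = time.monotonic() - t0

print(f"single-stream: {single:.1f}s, sliced: {sliced:.1f}s")
```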

@andrewsg
Contributor

@gdhananjay I'm sorry, I don't have any insight into Cloud Run in particular and I'll have to recommend you reach out to support for that product.

@andrewsg
Contributor

@gdhananjay If your issue persists and you believe it is a problem with the client library, please feel free to file a new issue here with more details on the observed performance of single-threaded vs. multi-threaded download, and the context of your application. It looked like you mentioned it was Cloud Functions Gen 2, which I thought was separate from Cloud Run; more information will be helpful.

@gdhananjay

gdhananjay commented Jul 25, 2023

@andrewsg

I tried on Cloud Run also.
My basic aim is to achieve higher download speed: a 1 GB file with 4 vCPUs and 8 GB memory, using serverless products.

The first option is the Python client library, since my application is in Python. The download never goes beyond 67 MB/s. I tried many combinations of process counts and chunk sizes using the transfer manager; the stats are attached. My main doubt is that with a 48-worker count it should at least consume more CPU, but it does not seem to be consuming more CPU.

gen 2- stats.csv

Is it possible to verify the download speed and whether all CPU cores are really being used on Cloud Run? The fact is, this works as expected on Cloud Shell.
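
A rough sketch of how the throughput and per-core CPU utilization could be checked during the download; psutil is a third-party dependency assumed here, and the bucket, object, and path names are placeholders:

```python
import os
import time

import psutil  # third-party; pip install psutil

from google.cloud import storage
from google.cloud.storage import transfer_manager

client = storage.Client()
blob = client.bucket("my-bucket").get_blob("large-object")  # placeholders

psutil.cpu_percent(percpu=True)  # prime the CPU counters
t0 = time.monotonic()
transfer_manager.download_chunks_concurrently(
    blob, "/tmp/out.bin", max_workers=os.cpu_count()
)
elapsed = time.monotonic() - t0

size_mb = os.path.getsize("/tmp/out.bin") / 1e6
print(f"throughput: {size_mb / elapsed:.1f} MB/s over {elapsed:.1f}s")
print("per-core CPU % during download:", psutil.cpu_percent(percpu=True))
```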

Could you guide me on how to reach Cloud Run support? I already raised it in the forum:
https://www.googlecloudcommunity.com/gc/Serverless/cloud-bucket-blob-download-is-very-slow-in-cloud-run/m-p/614852/highlight/true#M1926

This is critical for us; if it doesn't work, we will have to try AWS S3 and Lambda to get the required download speed in Python.

@gdhananjay

gdhananjay commented Jul 25, 2023

Also, can we test this under platform support?
Dockerfile.txt
I attached my Dockerfile; please remove the .txt extension. I added it because the Dockerfile format is not supported as an attachment here.

@andrewsg
Contributor

@gdhananjay Okay, please open a separate issue on this GitHub tracker, as I won't get notifications for comments on this closed issue.

When you open that separate issue, please answer this question as well: Are you sure you are not running into the maximum allocated network speed of your Cloud Run or Functions instance? Are there other services that you can access with higher throughput?

@gdhananjay

@andrewsg,
Thank you for your prompt reply.
I filed a more detailed bug at
#1093
