
Implement sliced object download in the Python Client library. #388

Closed
yaseenlotfi opened this issue Mar 4, 2021 · 15 comments

@yaseenlotfi

Is your feature request related to a problem? Please describe.
My use case is to download a single, large blob (~16 GB) into memory in a Python application. This happens as part of a startup process that currently takes about 5 minutes. The command-line utility, gsutil, has a way to enable sliced downloads and takes only about 30 seconds on the same machine and network. I would like to take advantage of this optimization in a Pythonic way.

Describe the solution you'd like
Enable sliced downloads in the Python client library such as:
blob.download_to_filename(..., sliced_downloads=True, max_components=16)

This would match gsutil which copies the blob to the local filesystem. It would be great, however, if the blob could be downloaded into memory like:
blob.download_as_bytes(..., sliced_downloads=True, max_components=16)

Describe alternatives you've considered
Knowing that gsutil can run the download concurrently, I tried using the subprocess module to call it. This doesn't work because, unlike invoking it from the command line, it will not run more than one process. It's also not great to run a shell command from a Python process because it assumes the Cloud SDK is set up.

I've tried using ChunkedDownload in conjunction with multiprocessing, but I have not been able to get it to download chunks in parallel. There is also the additional overhead of dealing with the byte-stream buffer, transport authentication, checksum/data validation, etc., which makes this non-trivial.
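
For reference, here is a rough sketch of the kind of manual sliced download described above, using the client library's existing start/end range parameters with a process pool. The bucket and object names are placeholders, and this is not an official API, just an illustration of the approach:

```python
from concurrent.futures import ProcessPoolExecutor

from google.cloud import storage

BUCKET = "my-bucket"          # placeholder
OBJECT_NAME = "large-object"  # placeholder
SLICES = 16


def download_slice(byte_range):
    start, end = byte_range
    client = storage.Client()  # one client per worker process
    blob = client.bucket(BUCKET).blob(OBJECT_NAME)
    # start/end are inclusive byte offsets, i.e. an HTTP Range request.
    return blob.download_as_bytes(start=start, end=end)


def sliced_download():
    client = storage.Client()
    blob = client.bucket(BUCKET).get_blob(OBJECT_NAME)  # fetches metadata, incl. size
    step = -(-blob.size // SLICES)  # ceiling division
    ranges = [(offset, min(offset + step, blob.size) - 1)
              for offset in range(0, blob.size, step)]
    with ProcessPoolExecutor(max_workers=SLICES) as pool:
        parts = pool.map(download_slice, ranges)
    return b"".join(parts)
```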

Additional context
Since gsutil is itself a Python executable, I would imagine this could be implemented in the client library (ultimately making the same HTTP range requests).

The gsutil command I used on a GCE instance with 16 vCPUs:
gsutil -o 'GSUtil:parallel_thread_count=1' -o 'GSUtil:sliced_object_download_max_components=16' cp gs://bucket/key /path/to/destination

I'm also open to an existing solution I'm not aware of, but documentation on this topic is sparse.

@product-auto-label product-auto-label bot added the api: storage Issues related to the googleapis/python-storage API. label Mar 4, 2021
@andrewsg andrewsg added type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. priority: p2 Moderately-important priority. Fix may not be included in next release. labels Mar 4, 2021
@andrewsg
Contributor

andrewsg commented Mar 4, 2021

Thanks for your detailed request. I'll look into this.

@andrewsg
Contributor

andrewsg commented Mar 5, 2021

I'm surprised a 16-slice download improves your time by 10x. Does it really take ten or more slices to saturate your download bandwidth? Is this perhaps functionally a workaround for some sort of bandwidth limiting in your ingress or Google's egress?

@yaseenlotfi
Author

I'm not sure; I ran it on a GCE instance (e2-highmem-16). How can I specifically test for network saturation?

@andrewsg andrewsg removed the priority: p2 Moderately-important priority. Fix may not be included in next release. label Mar 8, 2021
@andrewsg
Copy link
Contributor

andrewsg commented Mar 8, 2021

Thanks, that should be enough info; if it was on a GCE instance then we can use that for analysis when we're tackling this feature.

@yaseenlotfi
Author

Sounds good. To be clear, the key characteristic here is that when running a sliced download with gsutil configured as described, you can see all 16 cores of the machine used to capacity (visible in htop). This contrasts with just a single process running on a single core when calling the same command from a Python subprocess or when using the client library's blob.download_as_x methods.

@andrewsg
Contributor

andrewsg commented Mar 9, 2021

I see, so it's CPU-bound on your use case. That will be the first thing to look into, then. Thanks.

@tqa236

tqa236 commented Jan 2, 2023

Should this issue be closed by #844 (as mentioned in the description)? That was reverted and added back in #943.

I think if we want GitHub to close two issues, we need to write "fixes #xxx and fixes #yyy".

@andrewsg
Contributor

andrewsg commented Apr 4, 2023

This is solved by #1002.
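
A minimal usage sketch of the transfer_manager-based sliced download that shipped, assuming a recent google-cloud-storage release (parameter defaults may differ by version; bucket, object, and path names are placeholders):

```python
from google.cloud import storage
from google.cloud.storage import transfer_manager

client = storage.Client()
blob = client.bucket("my-bucket").blob("large-object")  # placeholders

# Downloads the object in chunk_size slices using a pool of workers.
transfer_manager.download_chunks_concurrently(
    blob,
    "/path/to/destination",
    chunk_size=32 * 1024 * 1024,  # 32 MiB per slice
    max_workers=16,
)
```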

@andrewsg andrewsg closed this as completed Apr 4, 2023
@gdhananjay

gdhananjay commented Jul 23, 2023

@andrewsg @tqa236
This is not working on Cloud Run, and I am not able to spot the difference. Here are the detailed issue links:
https://stackoverflow.com/questions/76747991/cloud-bucket-blob-download-is-very-slow-in-cloud-run
https://www.googlecloudcommunity.com/gc/Serverless/cloud-bucket-blob-download-is-very-slow-in-cloud-run/m-p/614852/highlight/true#M1926

Your help is really appreciated.

Basically, blob.download_to_filename and transfer_manager.download_chunks_concurrently show no difference on Cloud Run, whereas the concurrent download works well on Cloud Shell.
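
For context, a rough timing comparison along the lines of what is described above might look like this (bucket, object, and local paths are placeholders):

```python
import time

from google.cloud import storage
from google.cloud.storage import transfer_manager

client = storage.Client()
blob = client.bucket("my-bucket").blob("large-object")  # placeholders

# Single-stream download.
t0 = time.monotonic()
blob.download_to_filename("/tmp/single.bin")
single = time.monotonic() - t0

# Sliced download via the transfer manager.
t0 = time.monotonic()
transfer_manager.download_chunks_concurrently(blob, "/tmp/sliced.bin", max_workers=8)
sliced = time.monotonic() - t0

print(f"single-stream: {single:.1f}s, sliced: {sliced:.1f}s")
```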

@andrewsg
Contributor

@gdhananjay I'm sorry, I don't have any insight into Cloud Run in particular and I'll have to recommend you reach out to support for that product.

@andrewsg
Contributor

@gdhananjay If your issue persists and you believe it is a problem with the client library, please feel free to file a new issue here with more details on the observed performance of single-threaded vs. multi-threaded download, and the context of your application. It looked like you mentioned it was Cloud Functions Gen 2, which I thought was separate from Cloud Run; more information will be helpful.

@gdhananjay

gdhananjay commented Jul 25, 2023

@andrewsg

I tried on Cloud Run also.
My basic aim is to achieve higher download speed: a 1 GB file with 4 vCPUs and 8 GB memory, using serverless products.

The first option is the Python client library, since my application is in Python. The download never goes beyond 67 MB/s. I tried many combinations of process counts and chunk sizes using the transfer manager; the stats are attached. My main doubt is that with a 48-worker count it should at least consume more CPU, but it does not seem to be consuming more CPU.

gen 2- stats.csv

Is it possible to verify the download speed and whether all CPU cores are really being used on Cloud Run? The fact is, this works as expected on Cloud Shell.
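
A rough sketch of how the throughput and per-core CPU utilization could be checked during the download; psutil is a third-party dependency assumed here, and the bucket, object, and path names are placeholders:

```python
import os
import time

import psutil  # third-party; pip install psutil

from google.cloud import storage
from google.cloud.storage import transfer_manager

client = storage.Client()
blob = client.bucket("my-bucket").get_blob("large-object")  # placeholders

psutil.cpu_percent(percpu=True)  # prime the CPU counters
t0 = time.monotonic()
transfer_manager.download_chunks_concurrently(
    blob, "/tmp/out.bin", max_workers=os.cpu_count()
)
elapsed = time.monotonic() - t0

size_mb = os.path.getsize("/tmp/out.bin") / 1e6
print(f"throughput: {size_mb / elapsed:.1f} MB/s over {elapsed:.1f}s")
print("per-core CPU % during download:", psutil.cpu_percent(percpu=True))
```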

Could you guide me on how to reach Cloud Run support? I already raised it in the forum:
https://www.googlecloudcommunity.com/gc/Serverless/cloud-bucket-blob-download-is-very-slow-in-cloud-run/m-p/614852/highlight/true#M1926

This is critical for us; if it doesn't work, we will have to try AWS S3 and Lambda to get the required download speed in Python.

@gdhananjay

gdhananjay commented Jul 25, 2023

Also, can we test this under platform support?
Dockerfile.txt
I attached my Dockerfile; please remove the .txt extension. I added it because the Dockerfile format is not supported as an attachment here.

@andrewsg
Contributor

@gdhananjay Okay, please open a separate issue on this GitHub tracker, as I won't get notifications for comments on this closed issue.

When you open that separate issue, please answer this question as well: Are you sure you are not running into the maximum allocated network speed of your Cloud Run or Functions instance? Are there other services that you can access with higher throughput?

@gdhananjay

@andrewsg,
Thank you for your prompt reply.
I filed a more detailed bug at
#1093
