Added support for parallel tile downloads and control of cache #217
…unds2img functions
…hile also caching the downloaded tiles. The solution was to use parallel processes instead of threads.
Just made a small change, as downloads occasionally failed due to a memory error when using threads for the parallel downloads. It was caused by a clash with joblib's memory caching of the _fetch_tile() function, and was fixed by using processes instead of threads for the parallel download. The processes take ~0.5s to spawn, which adds a minor delay before downloading starts, but it is still significantly faster than the for-loop implementation.
I wonder whether any useful default For example, OSM asks you to restrict yourself to two download connections at a time. I've raised TOS-violating defaults before w.r.t. caching in #202, but I'm not sure what the maintainers (@darribas, @martinfleis) think? I'm fine with letting the user be the "violator," but we should probably document that the defaults set in the package may violate some TOS for some providers.
@ljwolf's comments are valid here.
I am fine with allowing parallel downloads, as there are use cases (paid tiles...) where this is perfectly valid and in no conflict with the TOS. I would probably just default to 1, to ensure that a user does not violate anything without knowing it.
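The suggestion above (parallelism as an opt-in, with a default of one connection) can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `download_tiles`, `fake_fetch`, and the example URLs are hypothetical, and the standard-library `ThreadPoolExecutor` is swapped in for joblib so the sketch runs without extra dependencies or network access.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def download_tiles(
    urls: Sequence[str],
    fetch: Callable[[str], bytes],
    n_connections: int = 1,
) -> List[bytes]:
    """Fetch every URL and return the results in the same order as `urls`.

    Hypothetical helper: the name and signature are assumptions made for
    this sketch, not contextily's API.
    """
    if n_connections < 1:
        raise ValueError("n_connections must be at least 1")
    if n_connections == 1:
        # Conservative default: one request at a time, like a plain for-loop.
        return [fetch(url) for url in urls]
    # Opt-in parallelism: at most `n_connections` requests in flight.
    with ThreadPoolExecutor(max_workers=n_connections) as pool:
        # pool.map yields results in input order, not completion order.
        return list(pool.map(fetch, urls))


def fake_fetch(url: str) -> bytes:
    # Stand-in for a real HTTP request so the sketch runs offline.
    return url.encode("ascii")


# Made-up tile URLs, used only to exercise the helper.
urls = [
    f"https://tiles.example/{z}/{x}/{y}.png"
    for z, x, y in [(1, 0, 0), (1, 0, 1), (1, 1, 0)]
]
serial = download_tiles(urls, fake_fetch)
parallel = download_tiles(urls, fake_fetch, n_connections=2)
```

With a default of 1, a user who never touches the parameter gets the sequential behaviour, and anyone raising it has made an explicit choice that they can check against their provider's terms.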
contextily/tile.py
Outdated
@@ -74,6 +75,7 @@ def bounds2raster(
    ll=False,
    wait=0,
    max_retries=2,
    num_parallel_tile_downloads=16,
    num_parallel_tile_downloads=16,
    n_connections=16,
Something shorter like this would be preferable.
… it to default value of 1. Added different n_connections values when testing the bounds2img() function.
Good point. I've just made a new commit where the default value is changed to 1. The I also added tests of |
Just one question on the max connections value, apart from that it is ready to go. Thanks!
contextily/tile.py
Outdated
    tiles = list(mt.tiles(w, s, e, n, [zoom]))
    tile_urls = [provider.build_url(x=tile.x, y=tile.y, z=tile.z) for tile in tiles]
    # download tiles
    max_connections = 32
Any reason for this specific upper limit? I am happy leaving that a responsibility of a user.
…added a parameter to disable caching, which is useful in resource constrained environments when using parallel connections for download.
Yeah, agree that hardcoding In the same commit, I also added a parameter to disable the tile caching, as that makes using parallel connections in resource-constrained environments much faster. I.e., it took a lot of time to spawn the processes in small serverless functions; disabling the cache lets the parallel download be done with threads, which avoids this. I wasn't sure whether this should be added as a parameter though (or if it should be added at all). So, in general I wasn't sure whether these changes were in line with the thoughts behind Contextily 😅 Let me know what you think, and then we might need another round of changes before it's ready 🙂
Or it could be omitted entirely, and the documentation for n_connections could just state that the user should be careful with it.
That would be my preference.
Disabling caching is good, it resolves #202 :)
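The trade-off described in this thread (threads clashing with the joblib memory cache, processes being slow to spawn in small environments) suggests a backend choice along these lines. This is only a sketch of the reasoning as stated in the conversation: `choose_backend` and `use_cache` are hypothetical names, and the returned strings are plain labels, not joblib arguments confirmed by the source.

```python
def choose_backend(n_connections: int, use_cache: bool) -> str:
    """Pick how tiles would be downloaded, following the thread's reasoning.

    Hypothetical helper for illustration; not part of contextily.
    """
    if n_connections <= 1:
        # A single connection needs no worker pool at all.
        return "serial"
    if use_cache:
        # Per the discussion, threads clashed with the memory cache,
        # so cached parallel downloads fall back to processes.
        return "processes"
    # Without the cache, threads avoid the process spawn delay.
    return "threads"


default_choice = choose_backend(n_connections=1, use_cache=True)
cached_parallel = choose_backend(n_connections=8, use_cache=True)
uncached_parallel = choose_backend(n_connections=8, use_cache=False)
```

Keeping the decision in one small function makes the interaction between the two parameters easy to document and to test in isolation.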
Sounds good :) I've just made a new commit based on the comments. The |
Thanks! Looks good to me!
@ljwolf do you think this will be a good solution for caching?
I think so, this is great! I would make the option |
@JacobJeppesen Can you change the keyword per @ljwolf's suggestion? It makes sense.
…bounds2img() function parameters to avoid using double negative
Yeah, I agree that the double negative should be avoided. I've changed it to Let me know if there are any more changes needed :) |
Thanks @JacobJeppesen!
Thanks to you too! :)
This pull request is a fix for #215
I have added support for parallel tile downloads in the bounds2raster and bounds2img functions. It gives quite significant speed improvements with minor changes to the code. I wasn't sure where to put max_num_parallel_tile_downloads = 32, so for now it is on line 227 in tile.py (i.e., inside the bounds2img function code). Let me know if there is anything you would like to have changed.
Thanks for an awesome package btw.! 😃