add gdown as optional requirement for dataset GDrive download #8237
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/vision/8237
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (2 Unrelated Failures)
As of commit 3402822 with merge base 4c0f441:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
BROKEN TRUNK - The following job failed but was present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures
This comment was automatically generated by Dr. CI and updates every 15 minutes.
We can potentially also replace vision/torchvision/datasets/utils.py (lines 93 to 103 in e0fd033)
with the functionality of vision/torchvision/datasets/utils.py (lines 141 to 143 in e0fd033).
Meaning, if we would either need to make Internally this is not really an issue, since we can just use the GDrive download functionality directly. However, |
As a torchvision user and the developer of a library in the PyTorch ecosystem, I fully support this idea. My only question was about security risks (e.g. exploits introduced through this dependency). However, given how many people star and use `gdown`, its current policy regarding PRs, and the simplicity of the library, this is probably not much of a problem.
Thanks Philip
Regarding getting rid of `_get_google_drive_file_id`, I guess it depends on how likely we think it is to break in the future. It sounds like it's not going to break, as I can't imagine Google breaking those existing URLs, so I guess it's fine to keep it.
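For readers without the source open, here is a rough sketch of what extracting a file ID from a GDrive URL can look like. This is not the torchvision implementation: the function name `extract_gdrive_file_id` is hypothetical, and the sketch assumes share links of the form `https://drive.google.com/file/d/<id>/...`.

```python
import re
from typing import Optional
from urllib.parse import urlparse


def extract_gdrive_file_id(url: str) -> Optional[str]:
    """Return the file ID from a GDrive share URL, or None if the URL does not match.

    Hypothetical helper for illustration; assumes the `/file/d/<id>` URL layout.
    """
    parts = urlparse(url)
    # Only accept the two hosts that this sketch assumes serve share links.
    if re.fullmatch(r"(drive|docs)\.google\.com", parts.netloc) is None:
        return None
    # The ID is the path segment that follows `/file/d/`.
    match = re.match(r"/file/d/(?P<id>[^/]+)", parts.path)
    return match.group("id") if match else None
```

Since this only parses the URL and never touches the HTML that Google serves, it keeps working as long as the link format itself stays stable.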
Thanks a lot @pmeier
#8237) Reviewed By: vmoens Differential Revision: D55062803 fbshipit-source-id: bef350179c70043ad71bf57bcc18ac14bdd3b487
Downloading datasets from GDrive has been a problem for as long as I can remember. You can see the latest surge of user reports in #8220 and #8226, but there are many more on the issue tracker. The main problem is that GDrive does not have an API, so one needs to resort to parsing the HTML.
We tried our best in vision/torchvision/datasets/utils.py (line 210 in e0fd033), but this still breaks regularly when Google changes something on their side. Paired with our long release cycles, this is a major source of frustration not only for users but for maintainers as well.
This PR removes our custom handling in favor of `gdown`. The dependency is optional, meaning users only need to install it when they want to download datasets that host files on GDrive. This is similar to other optional dependencies for datasets that we already have, e.g. vision/torchvision/datasets/caltech.py (lines 14 to 16 in e0fd033).
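To illustrate the optional-dependency pattern, here is a minimal sketch. It is not the code from this PR: `require_optional` and `download_from_gdrive` are hypothetical names, and the only `gdown` call assumed is `gdown.download` with its `id`, `output`, and `quiet` parameters.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, purpose: str) -> ModuleType:
    """Import an optional dependency, or fail with an actionable message."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        raise RuntimeError(
            f"{purpose} requires the optional package '{module_name}'. "
            f"Install it with `pip install {module_name}`."
        ) from exc


def download_from_gdrive(file_id: str, output: str) -> str:
    """Download a GDrive file by ID (hypothetical wrapper around gdown)."""
    # The import happens only when a GDrive download is actually requested,
    # so users of other datasets never need gdown installed.
    gdown = require_optional("gdown", "Downloading files from Google Drive")
    return gdown.download(id=file_id, output=output, quiet=False)
```

Deferring the import to call time keeps `import torchvision` working without `gdown`, and users only see the error when they hit a dataset hosted on GDrive.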
Of course, users will still run into issues when something changes on Google's side that `gdown` does not account for. Still, we have two upsides here:
- `gdown` is not bound to our release cycles: users can get fixes between two PyTorch releases by just upgrading `gdown`.