Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add gdown as optional requirement for dataset GDrive download #8237

Merged
merged 5 commits into from
Feb 8, 2024

Conversation

pmeier
Copy link
Collaborator

@pmeier pmeier commented Jan 29, 2024

Downloading datasets from GDrive has been a problem since as long as I remember. You can latest surge of user reports in #8220 and #8226, but there are many more on the issue tracker. The main problem is that GDrive does not have an API and one needs to resort to parsing the HTML.

We tried our best in

def download_file_from_google_drive(

but this still breaks regularly when Google changes something on their side. Paired with our long release cycles, this is major source of frustration not only for users but for maintainers as well.

This PR removes our custom handling in favor of gdown. The dependency is optional, meaning users only need to install it when they want to download datasets that host files on GDrive. This is similar to other optional dependencies for datasets that we already have, e.g.

.. warning::
This class needs `scipy <https://docs.scipy.org/doc/>`_ to load target files from `.mat` format.

Of course users will still run into issues when something changes on Googles side that gdown does not account for. Still, we have two upsides here:

  1. Instead of rolling with our custom solution, there is a third-party tool that we can use and thus release some pressure from the TV maintainers.
  2. Since gdown is not bound to our release cycles, users can get fixes between two PyTorch releases by just upgrading gdown.

Copy link

pytorch-bot bot commented Jan 29, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/vision/8237

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 3402822 with merge base 4c0f441 (image):

FLAKY - The following job failed but was likely due to flakiness present on trunk:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

.github/workflows/tests-schedule.yml Outdated Show resolved Hide resolved
setup.py Show resolved Hide resolved
torchvision/datasets/caltech.py Show resolved Hide resolved
@pmeier
Copy link
Collaborator Author

pmeier commented Jan 29, 2024

We can potentially also replace

def _get_google_drive_file_id(url: str) -> Optional[str]:
parts = urlparse(url)
if re.match(r"(drive|docs)[.]google[.]com", parts.netloc) is None:
return None
match = re.match(r"/file/d/(?P<id>[^/]*)", parts.path)
if match is None:
return None
return match.group("id")

with functionality of gdown. However, this is currently used in download_url

file_id = _get_google_drive_file_id(url)
if file_id is not None:
return download_file_from_google_drive(file_id, root, filename, md5)

Meaning, if we would either need to make gdown a hard dependency (or at least a soft dependency of torchvision.datasets) or we lose proper error messages in case gdown is not available. In the latter case, the best we can do is to not check for GDrive and thus let our regular download process continue. This will then fail with a checksum error, because we downloaded some HTML rather than the file.

Internally this is not really an issue, since we can just use the GDrive download functionality directly. However, torchvision.datasets.utils.download_url is public. Meaning, we cannot just remove the "check for GDrive" functionality without breaking BC.

@o-laurent
Copy link

As a torchvision user and developer of a library within PyTorch's galaxy, I fully support this idea. The only question I had was about the security risks (e.g. exploits through this dependency). However, given the high number of star-ers and users of the gdown, its current policy regarding PRs, and the simplicity of the lib, it might not be that much of a problem.

Copy link
Member

@NicolasHug NicolasHug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Philip

Regarding getting rid of _get_google_drive_file_id, I guess it depends on how likely we think it is to break in the future. Sounds like it's not going to break as I can't imagine google breaking those existing URLs, so I guess it's fine to keep it.

torchvision/datasets/caltech.py Show resolved Hide resolved
Copy link
Member

@NicolasHug NicolasHug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @pmeier

@NicolasHug NicolasHug merged commit 6a936e4 into pytorch:main Feb 8, 2024
80 of 82 checks passed
NicolasHug pushed a commit to NicolasHug/vision that referenced this pull request Feb 8, 2024
@pmeier pmeier mentioned this pull request Feb 14, 2024
facebook-github-bot pushed a commit that referenced this pull request Mar 20, 2024
#8237)

Reviewed By: vmoens

Differential Revision: D55062803

fbshipit-source-id: bef350179c70043ad71bf57bcc18ac14bdd3b487
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants