Multi30K dataset link is broken #1756
Comments
Yup, looks like the server is down :(.
Any solutions?
I have the same problem. Is there any solution?
@chenghan1995 @muskbing unfortunately, we're not responsible for hosting the datasets. I'd recommend waiting for their server to come back up or reaching out directly to the organization that hosts the dataset. In this case this would be the University of Sheffield. cc @parmeet: I wonder if you know of a way to get in contact with the team that hosts this dataset?
I've sent an email to the owner of the dataset (the email address is [email protected]), but there has been no response. I wonder whether anyone has the data files?
Found a local copy of the dataset and uploaded it to GitHub (it's rather small). For now it is available via this link: https://github.com/neychev/small_DL_repo/tree/master/datasets/Multi30k Just in case: all rights belong to the original authors of the dataset; this is only a temporary copy for convenience.
Thanks bro, you're really awesome
Please refer to the next answer for an updated example.
Thank you! This worked for The test file being downloaded by torchtext (and torchdata) are from http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/mmt16_task1_test.tar.gz Does anyone have the P.S. I was able to work around the 'test' issue by making another tar.gz from the contents of |
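For anyone who wants to script the repacking workaround rather than doing it by hand, here is a minimal sketch using only the standard library. The function name `repack_tar_gz` and both path arguments are placeholders of mine; the exact directory and archive names are not given in this thread, so substitute whatever your loader expects.

```python
import tarfile
from pathlib import Path


def repack_tar_gz(src_dir, archive_path):
    """Pack every regular file directly inside src_dir into a new .tar.gz.

    Files are stored under their bare names (no parent directories), which
    is an assumption about the layout the loader expects; adjust if needed.
    """
    src = Path(src_dir)
    archive = Path(archive_path)
    with tarfile.open(archive, "w:gz") as tar:
        for entry in sorted(src.iterdir()):
            # Skip subdirectories and the archive itself if it lives in src_dir.
            if entry.is_file() and entry.resolve() != archive.resolve():
                tar.add(entry, arcname=entry.name)
    return archive
```

Storing the files under flat names keeps the archive independent of where the files were unpacked locally. Whether the loader accepts that layout is something to check against your torchtext version.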
Just wanted to mention another approach to get Multi30k working with the data you are hosting @neychev. Rather than downloading the data directly using
As @rrmina mentioned earlier, this approach still doesn't work with the As a next step, I also plan to update our |
No idea what's exactly wrong with the data, the files above were located in I've tried to simply rename the archive (according to the name in torchtext docs) and files in it and change MD5 to the correct one and it seems to work. Including the approach suggested by @Nayef211, which is way more elegant, the final algorithm should be the following:
Test data has 1000 sentences, which seems correct. |
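Changing the MD5 to the correct one requires the digest of the archive you actually end up using, and the sentence count is easy to double-check as well. A small standard-library sketch for both checks follows; the helper names `md5_of_file` and `count_lines` are hypothetical, and the file paths are whatever you have locally.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def count_lines(path):
    """Count lines in a text file (one sentence per line is assumed)."""
    with open(path, "rb") as fh:
        return sum(1 for _ in fh)
```

Running `md5_of_file` on the renamed archive gives the value to use in place of the old hash, and `count_lines` on an extracted test file should report 1000 if the data matches what is described above.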
Reopening because the servers hosting the dataset seem to be down again. #2194 changes the links to
Temporarily disable the T5 tutorial to fix the issue with the dataset that can't be downloaded because the website is down. More info: pytorch/text#1756
Plus, besides commenting the previous
Thanks for the instructions. I had to manually extract the
A simple general solution was suggested by @Nayef211, a contributor, on 23 June 2022 here: #1756 (comment)
The important point here is that the URL and hash of the wrong mmt16_task1_test.tar.gz should be replaced by the URL and hash of the correct mmt_task1_test2016.tar.gz file. But that was somehow forgotten. I figured out the problem and the solution on my own yesterday, and then I found this suggested bug fix today :-(. @Nayef211 or other contributors, could you implement it?
It wasn't automatically extracted because the
These URLs work again:
If the script still doesn't work, we can copy and paste the URLs into a browser to download the files manually, and save them to dir
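The manual download can also be scripted. Below is a rough sketch with the standard library only; `fetch_to_dir` is a hypothetical helper, and the destination directory is a placeholder because the exact cache directory is not spelled out here.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def fetch_to_dir(url, dest_dir, timeout=30):
    """Download url into dest_dir, keeping the file name from the URL path.

    An existing file with the same name is left untouched and returned as-is.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    target = dest / Path(urlparse(url).path).name
    if target.exists():
        return target
    with urllib.request.urlopen(url, timeout=timeout) as resp, open(target, "wb") as out:
        shutil.copyfileobj(resp, out)
    return target
```

For example, `fetch_to_dir("http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/training.tar.gz", "<your cache dir>")` would fetch the training archive named in the issue description, provided the server is reachable at that moment.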
Files in the
to
I'm new to torch 2.x; it's a strange bug LOL.
The link to the Multi30K dataset at
http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/training.tar.gz
is broken:
text/torchtext/datasets/multi30k.py
Line 16 in 73bf4fa