Allow fine grained download control and make post_fetch_method more general #35
Comments
So are you thinking maybe it would be good to have a […] Something like […]?
I am unsure about a good architecture. From an interface standpoint the following would be nice: […]
On sharing a common flat directory: the easy option would be to try to symlink the child's files in (falling back to a copy on Windows). The problem is Windows. Apparently it has symlinks but no way to make them in Julia/libuv? The hard option would be enhancing the location logic to be more aware of subdataset relationships, a kind of virtual filesystem abstraction.
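The "easy option" of symlinking with a copy fallback could look roughly like the sketch below. `link_or_copy` is a hypothetical helper, not part of DataDeps.jl, and it assumes that `symlink` throws when the link cannot be created (for example, on Windows without the required privilege).

```julia
# Hypothetical helper, not part of DataDeps.jl: expose a child's file in a
# shared flat directory, preferring a symlink and copying if that fails.
function link_or_copy(src::AbstractString, dst::AbstractString)
    try
        symlink(abspath(src), dst)
        return :symlink
    catch
        # Assumption: symlink threw, e.g. missing privilege on Windows
        cp(src, dst; force=true)
        return :copy
    end
end

# Usage with throwaway directories
child = mktempdir()
shared = mktempdir()
write(joinpath(child, "train.mat"), "dummy")
how = link_or_copy(joinpath(child, "train.mat"), joinpath(shared, "train.mat"))
println(how, " -> ", read(joinpath(shared, "train.mat"), String))
```

Either way the file is readable from the shared directory; the copy fallback simply costs extra disk space.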
A key conceptual question, I guess, is: is it a DataDep that depends on another DataDep? But maybe that does not map well to data dependencies, so looking at it as structured components may be more reasonable and useful. The registration block for each component is normally written by the same author as the main one (if there are child blocks at all). DataDeps.jl currently supports arbitrary tree structure for […]. Maybe we can have named structure, using dicts instead of tuples. But I don't know; that gives two types of subfolders: ones that have sub-names attached (and so can be resolved if not found) and ones that do not (because they were generated by the […]). So maybe the logic for when a file is not found, […]. It is complicated, but it does let one download less data...
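To make the "named structure, resolved if not found" idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: `resolve_component`, the `components` dict, the example URLs, and the `fake_fetch` stand-in are hypothetical and do not reflect the DataDeps.jl API.

```julia
# Hypothetical sketch, not the DataDeps.jl API: components are named in a Dict,
# and a missing component is fetched only when it is first requested.
function resolve_component(root::AbstractString, components::Dict{String,String},
                           name::AbstractString, fetcher::Function)
    haskey(components, name) || error("unknown component: $name")
    path = joinpath(root, name)
    if !ispath(path)
        fetcher(components[name], path)  # only this component is fetched
    end
    return path
end

# Stand-in for a real download so the example runs offline
fetched = String[]
fake_fetch(url, path) = (push!(fetched, url); write(path, "data from $url"))

root = mktempdir()
components = Dict("train" => "https://example.org/train.mat",
                  "extra" => "https://example.org/extra.mat")

resolve_component(root, components, "train", fake_fetch)
resolve_component(root, components, "train", fake_fetch)  # already present, no fetch
```

Here the large "extra" component is never fetched because nothing asked for it, which is the "download less data" benefit mentioned above.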
Context
So I have recently added the SVHN dataset (format 2) to MLDatasets. The interesting thing about that dataset is that it is only available as three decently sized `.mat` files (the "extra" file is 1.2 GB, while train and test are only about 200 MB together). Technically neither the sizes nor the fact that they are `.mat` files is a problem, because there is `MAT.jl` to read them. Both properties have a few drawbacks though.
[…] the `SVHN.download` method (which should be able to download all files) would need to display multiple download prompts, which is quite ugly.
Problem Description
Concerning the sub-datadeps: do you have any thoughts here? I do think having some fine-grained control over which files to download on demand is quite nice, as long as it doesn't require a lot of overhead or workarounds. Maybe it's worth considering making something like that a first-class concept.
Concerning the `.mat` files, I think it would be nice if the first thing MLDatasets does after downloading the `.mat` files is to transform them into a more comfortable format. I could do that already with `post_fetch_method`, but the problem is that it wouldn't quite work for my use case. As you know, all the MLDatasets methods allow the user to specify a custom "dir" where the data can be found. The idea here is that a user can pre-download the native files from the website and just tell the package where those native files are. So what I would require for my use case is that the package checks whether the `.mat` files already exist and, if so, only performs the `post_fetch_method` without the download. Is this something you'd consider out of scope for DataDeps.jl?
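The requested behavior (skip the download for files that are already present, but still run the post-processing step) might be sketched as follows. `ensure_processed` and the `fake_download` and `fake_convert` stand-ins are hypothetical names for illustration only; this is not how DataDeps.jl or MLDatasets currently work.

```julia
# Hypothetical sketch, not the DataDeps.jl API: download only the native files
# that are missing, then always run the post-processing step.
function ensure_processed(dir::AbstractString, files::Vector{String};
                          downloader::Function, post_fetch::Function)
    for f in files
        path = joinpath(dir, f)
        isfile(path) || downloader(f, path)  # skipped for pre-downloaded files
    end
    return post_fetch(dir)
end

# A user-supplied directory that already contains one of the native files
dir = mktempdir()
write(joinpath(dir, "train.mat"), "user supplied")

# Stand-ins so the example runs offline
downloaded = String[]
fake_download(name, path) = (push!(downloaded, name); write(path, "fetched"))
fake_convert(d) = sort(readdir(d))  # a real version would convert the .mat files

result = ensure_processed(dir, ["train.mat", "test.mat"];
                          downloader=fake_download, post_fetch=fake_convert)
```

A real implementation would presumably also skip the conversion when the converted files already exist; that check is left out to keep the sketch short.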