
Allow fine grained download control and make post_fetch_method more general #35

Evizero opened this issue Mar 5, 2018 · 6 comments

@Evizero
Collaborator

Evizero commented Mar 5, 2018

Context

So I have recently added the SVHN dataset (format 2) to MLDatasets. The interesting thing about that dataset is that it is only available as three decently sized .mat files (the "extra" file is 1.2 GB, while train and test together are only about 200 MB). Technically, neither the sizes nor the .mat format is a problem, because MAT.jl can read them. Both properties have a few drawbacks, though.

  1. Many people may not need the huge "extra" file, so it would be nice if I could split the download into a "sub-datadep". I saw the MNIST example in your tests, but the issue there is that my SVHN.download method (which should be able to download all files) would need to display multiple download prompts, which is quite ugly.
  2. Reading .mat files is much slower than reading a simple binary format like the one MNIST or CIFAR uses.
  3. It's not possible to read just specific observations; one must always load a full split (train, test, or extra).

Problem Description

Concerning the sub-datadeps: do you have any thoughts here? I do think having some fine-grained control over which files to download on demand would be quite nice, provided it doesn't require a lot of overhead or workarounds. Maybe it's worth considering making something like that a first-class concept.

Concerning the .mat files, I think it would be nice if the first thing MLDatasets does after downloading them is to transform them into a more comfortable format. I could already do that with post_fetch_method, but the problem is that it wouldn't quite work for my use case. As you know, all the MLDatasets methods allow the user to specify a custom "dir" where the data can be found. The idea is that a user can pre-download the native files from the website and just tell the package where those files are. So what I would need for my use case is for the package to check whether the .mat files already exist, and if so, only perform the post_fetch_method without the download. Is this something you'd consider out of scope for DataDeps.jl?
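To make the requested behaviour concrete, here is a minimal sketch (hypothetical helper names, not the DataDeps.jl API): download only when the native file is missing from the user-supplied directory, but always run the post-fetch conversion step.

```julia
# Hypothetical sketch of the desired resolution logic: if the native file
# (e.g. a pre-downloaded .mat file) already exists in `dir`, skip the
# download and run only the post-fetch conversion.
function resolve_native(dir, filename, download_fn, post_fetch_fn)
    path = joinpath(dir, filename)
    if !isfile(path)
        download_fn(path)   # fetch only when the file is missing
    end
    post_fetch_fn(path)     # always convert to the comfortable format
    return path
end
```

The point of the sketch is only the control flow: the user-provided "dir" is checked first, so a pre-downloaded file short-circuits the fetch but not the conversion.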

@oxinabox
Owner

oxinabox commented Mar 5, 2018

Let's focus on the first part (linked to 1).

It seems like the second part is more like the thoughts around #31.
Though perhaps they end up being linked in the end, as both are linked to knowing more about what is inside the datadep.
I think #31 might be better motivated.

@oxinabox
Owner

oxinabox commented Mar 5, 2018

So are you thinking maybe it would be good to have a
ParentDataDep that can depend on other datadeps?

Something like

RegisterDataDep("Child1", "msg1", url1)
RegisterDataDep("Child2", "msg2", url2)
RegisterDataDep("Parent", "msg3", DependsOn("Child1", "Child2"))

?

@Evizero
Collaborator Author

Evizero commented Mar 5, 2018

I am unsure about a good architecture.

From an interface standpoint the following would be nice:

  • Ability to request the download of either a complete dataset or just parts of it; either way, only a single prompt is displayed.
  • The dataset shares a common flat directory.
  • Maybe the parent datadep defines the typical prompt display text, and each child just adds one or two additional sentences describing its specific file. The displayed prompt is then a smart accumulation of those, depending on what is downloaded.
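The prompt-accumulation idea in the last bullet could be sketched like this (hypothetical function and argument names, nothing here exists in DataDeps.jl): the parent supplies the main message and each requested child appends its one-or-two-sentence description.

```julia
# Sketch of prompt accumulation: the parent's message comes first, then
# one short description per child that is actually being downloaded.
function build_prompt(parent_msg::String, child_msgs::Dict{String,String},
                      requested::Vector{String})
    parts = [parent_msg]
    for name in requested
        push!(parts, child_msgs[name])
    end
    return join(parts, "\n")
end
```

So requesting only "train" would show the parent text plus the train sentence, in a single prompt, and never mention the "extra" file.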

@oxinabox
Owner

On sharing a common flat directory:
I see two paths, within the idea of having child/chained dependencies.

The easy one would be to try to symlink the child's files in (falling back to copying on Windows).

The problem is Windows. Apparently it has symlinks, but no way to make them from Julia/libuv?

The hard one would be enhancing the location logic to be more aware of sub-dataset relationships, a kind of virtual filesystem abstraction.

@oxinabox
Owner

JuliaLang/julia#24667

@oxinabox
Owner

A key conceptual question, I guess, is:

Is it a DataDep that depends on another DataDep?
That is what I had been thinking up until now.
It matches well with what is true for package and binary dependencies:
dependencies can have dependencies,
which will often be external work by other authors.

But maybe that does not map well to data dependencies,
because data is non-executable, and thus independent.
Data that would be child data is almost certainly from the same general source as the other data that "depends" (or co-depends) on it.

So maybe looking at it as structured components is more reasonable/useful.

The registration blocks for each component would then normally be written by the same author as the main one (if there even are child blocks at all).

DataDeps.jl currently supports an arbitrary tree structure for remote_path (and for checksum, post_fetch_method, and fetch_method, but not currently for message).
This structure is implemented via tuples and nested tuples.

Maybe we can have named structure, using dicts instead of tuples,
and the sub-names would need to be remembered and used like a datadep name is now.
Such names would need to correspond to subfolders, I think.

But that gives us two kinds of subfolders: ones that have sub-names attached (and so can be resolved if not found) and ones that do not (because they were generated by the post_fetch_method).
Maybe that is not really a problem, though.
Only missing data needs handling; if the data is there, then the data is there.

So maybe the logic for when a file is not found can be extended:
if the file is not found, attempt to resolve its folder as a sub-name; if that does not work, attempt to resolve that sub-name's parent folder.

It is complicated, but it does let one download less data...
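The lookup cascade described above might look roughly like this (a sketch of the proposed behaviour only; `locate` and the resolver functions are hypothetical): check the file itself, then try each resolution strategy in order.

```julia
# Sketch of the proposed lookup cascade: return the path if the file
# exists, otherwise try each resolver in turn (e.g. resolve the folder as
# a sub-name, then resolve the sub-name's parent folder). A resolver
# returns `nothing` when it cannot resolve the path.
function locate(path, resolvers::Vector{Function})
    isfile(path) && return path
    for resolve in resolvers
        result = resolve(path)
        result !== nothing && return result
    end
    return nothing   # unresolvable: surface an error to the user
end
```

Ordering the resolvers from most to least specific is what makes it possible to download only the missing sub-dataset rather than the whole parent.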
