
Allow fine grained download control and make post_fetch_method more general #35

Evizero opened this issue Mar 5, 2018 · 6 comments

@Evizero
Collaborator

Evizero commented Mar 5, 2018

Context

So I have recently added the SVHN dataset (format 2) to MLDatasets. The interesting thing about that dataset is that it is only available as three decently sized .mat files (the "extra" file is 1.2 GB, while train and test together are only about 200 MB). Technically, neither the sizes nor the .mat format is a problem, because MAT.jl can read them. Both properties have a few drawbacks, though.

  1. Many people may not need the huge "extra" file, so it would be nice if I could split the download into a "sub-datadep". I saw the MNIST example in your tests, but the issue there is that my SVHN.download method (which should be able to download all files) would need to display multiple download prompts, which is quite ugly.
  2. Reading .mat files is much slower than reading a simple binary format like the one MNIST or CIFAR uses.
  3. It's not possible to read just specific observations; one must always load a full split (train, test, or extra).

Problem Description

Concerning the sub-datadeps: do you have any thoughts here? I do think having some fine-grained control over which files to download on demand would be quite nice, provided it doesn't require a lot of overhead or workarounds. Maybe it's worth considering making something like that a first-class concept.

Concerning the .mat files, I think it would be nice if the first thing MLDatasets does after downloading them is to transform them into a more comfortable format. I could already do that with post_fetch_method, but the problem is that it wouldn't quite work for my use case. As you know, all the MLDatasets methods allow the user to specify a custom "dir" where the data can be found. The idea is that a user can pre-download the native files from the website and just tell the package where those files are. So what I would need for my use case is for the package to check whether the .mat files already exist, and if so, only perform the post_fetch_method without the download. Is this something you'd consider out of scope for DataDeps.jl?
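To make the requested behaviour concrete, here is a minimal sketch (hypothetical helper names, not the DataDeps.jl API): download only when the native file is missing from the user-supplied directory, but always run the post-fetch conversion step.

```julia
# Hypothetical sketch of the desired resolution logic: if the native file
# (e.g. a pre-downloaded .mat file) already exists in `dir`, skip the
# download and run only the post-fetch conversion.
function resolve_native(dir, filename, download_fn, post_fetch_fn)
    path = joinpath(dir, filename)
    if !isfile(path)
        download_fn(path)   # fetch only when the file is missing
    end
    post_fetch_fn(path)     # always convert to the comfortable format
    return path
end
```

The point of the sketch is only the control flow: the user-provided "dir" is checked first, so a pre-downloaded file short-circuits the fetch but not the conversion.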

@oxinabox
Owner

oxinabox commented Mar 5, 2018

Let's focus on the first part (linked to 1).

It seems like the second part is more like the thoughts around #31.
Though perhaps they end up being linked in the end, as both are linked to knowing more about what is inside the datadep.
I think #31 might be better motivated.

@oxinabox
Owner

oxinabox commented Mar 5, 2018

So are you thinking maybe it would be good to have a
ParentDataDep that can depend on other datadeps?

Something like

RegisterDataDep("Child1", "msg1", url1)
RegisterDataDep("Child2", "msg2", url2)
RegisterDataDep("Parent", "msg3", DependsOn("Child1", "Child2"))

?

@Evizero
Collaborator Author

Evizero commented Mar 5, 2018

I am unsure about a good architecture.

From an interface standpoint the following would be nice:

  • Ability to request the download of either a complete dataset or just parts of it; either way, only a single prompt is displayed.
  • The dataset shares a common flat directory.
  • Maybe the parent datadep defines the typical prompt display text, and each child just adds one or two additional sentences describing its specific file. The displayed prompt is then a smart accumulation of those, depending on what is downloaded.
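The prompt-accumulation idea in the last bullet could be sketched like this (hypothetical function and argument names, nothing here exists in DataDeps.jl): the parent supplies the main message and each requested child appends its one-or-two-sentence description.

```julia
# Sketch of prompt accumulation: the parent's message comes first, then
# one short description per child that is actually being downloaded.
function build_prompt(parent_msg::String, child_msgs::Dict{String,String},
                      requested::Vector{String})
    parts = [parent_msg]
    for name in requested
        push!(parts, child_msgs[name])
    end
    return join(parts, "\n")
end
```

So requesting only "train" would show the parent text plus the train sentence, in a single prompt, and never mention the "extra" file.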

@oxinabox
Owner

On sharing a common flat directory:
I see two paths, within the idea of having child/chained dependencies.

The easy one would be to try to symlink the child's files in (falling back to copying on Windows).

The problem is Windows. Apparently it has symlinks, but no way to make them from Julia/libuv?

The hard one would be enhancing the location logic to be more aware of sub-dataset relationships, a kind of virtual filesystem abstraction.

@oxinabox
Owner

JuliaLang/julia#24667

@oxinabox
Owner

A key conceptual question, I guess, is:

Is it a DataDep that depends on another DataDep?
That is what I had been thinking up until now.
It matches well with what is true for package and binary dependencies:
dependencies can have dependencies,
which will often be external work by other authors.

But maybe that does not map well to data dependencies,
because data is non-executable, and thus independent.
Data that would be child data is almost certainly from the same general source as the other data that "depends" (or co-depends) on it.

So maybe looking at it as structured components is more reasonable/useful.

The registration blocks for each component would then normally be written by the same author as the main one (if there even are child blocks at all).

DataDeps.jl currently supports an arbitrary tree structure for remote_path (and for checksum, post_fetch_method, and fetch_method, but not currently for message).
This structure is implemented via tuples and nested tuples.

Maybe we can have named structure, using dicts instead of tuples,
and the sub-names would need to be remembered and used like a datadep name is now.
Such names would need to correspond to subfolders, I think.

But that gives us two kinds of subfolders: ones that have sub-names attached (and so can be resolved if not found) and ones that do not (because they were generated by the post_fetch_method).
Maybe that is not really a problem, though.
Only missing data needs handling; if the data is there, then the data is there.

So maybe the logic for when a file is not found can be extended:
if the file is not found, attempt to resolve its folder as a sub-name; if that does not work, attempt to resolve that sub-name's parent folder.

It is complicated, but it does let one download less data...
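The lookup cascade described above might look roughly like this (a sketch of the proposed behaviour only; `locate` and the resolver functions are hypothetical): check the file itself, then try each resolution strategy in order.

```julia
# Sketch of the proposed lookup cascade: return the path if the file
# exists, otherwise try each resolver in turn (e.g. resolve the folder as
# a sub-name, then resolve the sub-name's parent folder). A resolver
# returns `nothing` when it cannot resolve the path.
function locate(path, resolvers::Vector{Function})
    isfile(path) && return path
    for resolve in resolvers
        result = resolve(path)
        result !== nothing && return result
    end
    return nothing   # unresolvable: surface an error to the user
end
```

Ordering the resolvers from most to least specific is what makes it possible to download only the missing sub-dataset rather than the whole parent.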
