
Registering halfway products #31

Open
yakir12 opened this issue Feb 7, 2018 · 8 comments

Comments

@yakir12
Contributor

yakir12 commented Feb 7, 2018

So I'm using this wonderful package, and that saves me the need to redownload stuff every time I want to reevaluate things. Great. But this got me thinking:
Usually we have this static dataset we want to process. Most often this means there will be some processed data files that result from this initial processing. We then want to do some analysis on those processed files. But it's irritating to have to take care of those halfway processing products. It would be amazing if there could be a way to register these midway processed files, so that next time we need them we won't need to recalculate them.
I think all the facilities are already here (e.g. supplying an alternative download method, one that process the files), but I would appreciate an example made just for this use case and I'll argue that many people would love this functionality just as much as the intended use of this awesome package.

One glaring problem is the need to have something similar to a make file, to check if the source files are newer than the ones the processed files rely on...
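The make-style check could come down to comparing modification times. A minimal sketch (the `is_stale` helper is hypothetical, not part of DataDeps.jl):

```julia
# Hypothetical helper, not part of DataDeps.jl: a product is "stale" if it
# is missing, or if any of its source files has a newer modification time.
function is_stale(product, sources)
    isfile(product) || return true
    prod_time = mtime(product)
    return any(src -> mtime(src) > prod_time, sources)
end

# Usage: regenerate only when needed
# is_stale("processed.csv", ["raw_a.csv", "raw_b.csv"]) && reprocess()
```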

@oxinabox
Owner

oxinabox commented Feb 8, 2018

I agree, I think it is definitely an avenue worth exploring.
It is indeed possible generally with post_fetch_method.
What is not (trivially) possible is to synthesize new files out of multiple distinct files (e.g. merging two files into one).

Definitely worth exploring, with some examples.

> One glaring problem is the need to have something similar to a make file, to check if the source files are newer than the ones the processed files rely on...

Non-trivial dependency chains -- like BinDeps.
We don't support them, but maybe we can indeed get away with just always running all computations when a dependency is fetched.

Registering the halfway files yourself, on your own server (or dropbox or something), as a second datadep is an option, but then as you say synchronizing them becomes an issue.

Also it should be noted that DataDeps.jl doesn't know anything about files, except between the fetch_method and post_fetch_method invocations.
It doesn't know what files a datadep contains -- it is folder oriented.

@yakir12
Contributor Author

yakir12 commented Feb 8, 2018

DataDeps works best when the dependent data is static. It knows nothing about what generated the data it fetches. I think such a "limitation" is totally acceptable in this scenario as well: a dependency on some halfway files doesn't need to know whether the data they were derived from has changed. We can leave that responsibility to the user. We could add a convenience pre_fetch_method (btw, also to this package) that lets the user supply some function that runs before the "to-be-fetched" data is resolved -- namely the processing of the data DataDeps originally fetched, which produces the halfway files... Sorry, this is very difficult to describe; I hope you understand me.
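The proposed order of operations might look like this. A sketch only -- `pre_fetch_method` is not an existing DataDeps.jl feature, and `resolve_with_prefetch` is a made-up name for illustration:

```julia
# Sketch only: pre_fetch_method is NOT an existing DataDeps.jl feature.
# This just illustrates the proposed control flow around a fetch:
# regenerate upstream halfway files, fetch, then post-process as usual.
function resolve_with_prefetch(fetch; pre_fetch = () -> nothing, post_fetch = identity)
    pre_fetch()           # e.g. rerun the processing that makes the halfway files
    path = fetch()        # obtain the data (download, or a local copy)
    post_fetch(path)      # derive further files, as post_fetch_method does today
    return path
end
```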

As to the implementation, maybe I should post a call on Discourse?

@oxinabox
Owner

oxinabox commented Feb 8, 2018

A call for what?
(I do not in general care for Discourse. Things are better on GitHub, where they are linked to the package, or on Slack, where responses are immediate.)

@yakir12
Contributor Author

yakir12 commented Feb 8, 2018

A call for help with the actual implementation :)

My experience is that sometimes people get very excited by an idea for a package and the package gets built quickly thereafter. Because my Julia-fu is probably not good enough, I won't be able to do much except some pointed help here and there. I assume you are probably very busy...

@oxinabox
Owner

oxinabox commented Feb 8, 2018

Not too busy to maintain my own packages, no.
Particularly not ones that I think are important to other people.
(I mean, I am crazy busy, but maintaining packages is part of the reason why, rather than the thing I don't have time for on top of everything else.)

@yakir12
Contributor Author

yakir12 commented Feb 8, 2018

Awesome. As an academic I'm completely convinced this package is super useful. And if you feel you can bake the halfway-deps functionality into this one, then awesome!

I guess the most common use case is that the halfway files are stored locally: the product of the analysis resides on the computer where the analysis occurs. Maybe not always, but at least in a large proportion of the cases. So it would make sense to register the files as local. I guess we need a test case to see how it would look. And I guess we need that pre_fetch_method.
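Fetching from a local directory could reuse the existing fetch_method hook, which DataDeps.jl calls as fetch_method(remote_path, local_dir). A sketch under that assumption -- the name "HalfwayFiles" and the source path below are made up:

```julia
# Sketch of fetching from a local directory instead of a URL. DataDeps.jl
# calls fetch_method(remote_path, local_dir); for a local source, a plain
# copy suffices.
local_fetch(src, dir) = cp(src, joinpath(dir, basename(src)))

# The registration would then look something like (illustrative only;
# the name and path are hypothetical):
# using DataDeps
# RegisterDataDep("HalfwayFiles",
#     "Locally generated intermediate files.",
#     "/path/to/halfway/processed.csv";
#     fetch_method = local_fetch)
```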

@oxinabox
Owner

oxinabox commented Feb 8, 2018

Just so there is an example of the extent of what is currently possible:
the following downloads one file,
but after post_fetch_method is done there will be two,
as the post_fetch_method generates the second one.

```julia
using DataDeps

RegisterDataDep("TrumpTweets",
"""
Tweets from 538's article:
[The World’s Favorite Donald Trump Tweets](https://fivethirtyeight.com/features/the-worlds-favorite-donald-trump-tweets/)

Includes a filtered view that is
the tweets filtered to remove any tweets that @mention anyone, so no conversations etc, just announcements of opinions/thoughts.

Used under Creative Commons Attribution 4.0 International License.
""",
"https://raw.githack.com/fivethirtyeight/data/master/trump-twitter/realDonaldTrump_poll_tweets.csv",
"5a63b6cb2503a20517b5d41bd73e821ffbfdddd5cdc1977a547f1c925790bb15",
post_fetch_method = function(in_fn) # multiline anonymous function
    out_fn = "filtered_" * basename(in_fn)
    println(out_fn)
    open(out_fn, "w") do out_fh
        for line in eachline(in_fn)
            if !contains(line, "@")
                println(out_fh, line)
            end
        end
    end
end
)

# Read the file that we are generating
for line in eachline(datadep"TrumpTweets/filtered_realDonaldTrump_poll_tweets.csv")
    println(line)
    println()
end
```

@yakir12
Contributor Author

yakir12 commented Feb 8, 2018

That is mega cool. Sorry for being slow.
