
Registering halfway products #31

Open
yakir12 opened this issue Feb 7, 2018 · 8 comments

Comments

@yakir12
Contributor

yakir12 commented Feb 7, 2018

So I'm using this wonderful package, and that saves me the need to redownload stuff every time I want to reevaluate things. Great. But this got me thinking:
Usually we have this static dataset we want to process. Most often this means there will be some processed data files that result from this initial processing. We then want to do some analysis on those processed files. But it's irritating to have to take care of those halfway processing products. It would be amazing if there could be a way to register these midway processed files, so that next time we need them we won't need to recalculate them.
I think all the facilities are already here (e.g. supplying an alternative download method, one that process the files), but I would appreciate an example made just for this use case and I'll argue that many people would love this functionality just as much as the intended use of this awesome package.

One glaring problem is the need to have something similar to a make file, to check if the source files are newer than the ones the processed files rely on...
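The make-style check could come down to comparing modification times. A minimal sketch (the `is_stale` helper is hypothetical, not part of DataDeps.jl):

```julia
# Hypothetical helper, not part of DataDeps.jl: a product is "stale" if it
# is missing, or if any of its source files has a newer modification time.
function is_stale(product, sources)
    isfile(product) || return true
    prod_time = mtime(product)
    return any(src -> mtime(src) > prod_time, sources)
end

# Usage: regenerate only when needed
# is_stale("processed.csv", ["raw_a.csv", "raw_b.csv"]) && reprocess()
```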

@oxinabox
Owner

oxinabox commented Feb 8, 2018

I agree, I think it is definitely an avenue worth exploring.
It is indeed possible generally with post_fetch_method.
What is not (trivially) possible is to synthesize new files out of multiple distinct files (e.g. merging two files into one).

Definitely worth exploring, with some examples.

> One glaring problem is the need to have something similar to a make file, to check if the source files are newer than the ones the processed files rely on...

Non-trivial dependency chains -- like BinDeps.
We don't support them, but maybe we can indeed get away with just always running all computations when a dependency is fetched.

Registering the halfway files yourself, on your own server (or dropbox or something), as a second datadep is an option, but then as you say synchronizing them becomes an issue.

Also it should be noted that DataDeps.jl doesn't know anything about files, except between the fetch_method and post_fetch_method invocations.
It doesn't know what files a datadep contains -- it is folder oriented.

@yakir12
Contributor Author

yakir12 commented Feb 8, 2018

DataDeps works best when the dependent data is static. It knows nothing about what generated the data it fetches. I think such a "limitation" is totally acceptable in this scenario as well: a dependency on some halfway files doesn't need to know whether the data they were derived from has changed. We can leave that responsibility to the user. We could add a convenience pre_fetch_method (btw, also to this package) that lets the user supply some function that runs before the "to-be-fetched" data is resolved -- namely the processing of the data DataDeps originally fetched, which produces the halfway files... Sorry, this is very difficult to describe; I hope you understand me.
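The proposed order of operations might look like this. A sketch only -- `pre_fetch_method` is not an existing DataDeps.jl feature, and `resolve_with_prefetch` is a made-up name for illustration:

```julia
# Sketch only: pre_fetch_method is NOT an existing DataDeps.jl feature.
# This just illustrates the proposed control flow around a fetch:
# regenerate upstream halfway files, fetch, then post-process as usual.
function resolve_with_prefetch(fetch; pre_fetch = () -> nothing, post_fetch = identity)
    pre_fetch()           # e.g. rerun the processing that makes the halfway files
    path = fetch()        # obtain the data (download, or a local copy)
    post_fetch(path)      # derive further files, as post_fetch_method does today
    return path
end
```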

As to the implementation, maybe I should post a call on Discourse?

@oxinabox
Owner

oxinabox commented Feb 8, 2018

A call for what?
(I do not in general care for Discourse. Things are better on GitHub, where they are linked to the package, or on Slack, where responses are immediate.)

@yakir12
Contributor Author

yakir12 commented Feb 8, 2018

A call for help with the actual implementation :)

My experience is that sometimes people get very excited by an idea for a package and the package gets built quickly thereafter. Because my Julia-fu is probably not good enough, I won't be able to do much except some pointed help here and there. I assume you are probably very busy...

@oxinabox
Owner

oxinabox commented Feb 8, 2018

Not too busy to maintain my own packages, no.
Particularly not ones that I think are important to other people.
(I mean, I am crazy busy, but maintaining packages is part of the reason why, rather than the thing I don't have time for on top of everything else.)

@yakir12
Contributor Author

yakir12 commented Feb 8, 2018

Awesome. As an academic I'm completely convinced this package is super useful. And if you feel you can bake the halfway-deps functionality into this one, then awesome!

I guess the most common use case is that the halfway files are stored locally: the product of the analysis resides on the computer where the analysis occurs. Maybe not always, but at least in a large proportion of the cases. So it would make sense to register the files as local. I guess we need a test case to see how it would look. And I guess we need that pre_fetch_method.
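Fetching from a local directory could reuse the existing fetch_method hook, which DataDeps.jl calls as fetch_method(remote_path, local_dir). A sketch under that assumption -- the name "HalfwayFiles" and the source path below are made up:

```julia
# Sketch of fetching from a local directory instead of a URL. DataDeps.jl
# calls fetch_method(remote_path, local_dir); for a local source, a plain
# copy suffices.
local_fetch(src, dir) = cp(src, joinpath(dir, basename(src)))

# The registration would then look something like (illustrative only;
# the name and path are hypothetical):
# using DataDeps
# RegisterDataDep("HalfwayFiles",
#     "Locally generated intermediate files.",
#     "/path/to/halfway/processed.csv";
#     fetch_method = local_fetch)
```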

@oxinabox
Owner

oxinabox commented Feb 8, 2018

Just so there is an example of the extent of what is currently possible:
the following downloads one file,
but after post_fetch_method is done there will be two,
as the post_fetch_method generates the second one.

```julia
using DataDeps

RegisterDataDep("TrumpTweets",
"""
Tweets from 538's article:
[The World’s Favorite Donald Trump Tweets](https://fivethirtyeight.com/features/the-worlds-favorite-donald-trump-tweets/)

Includes a filtered view that is
the tweets filtered to remove any tweets that @mention anyone, so no conversations etc, just announcements of opinions/thoughts.

Used under Creative Commons Attribution 4.0 International License.
""",
"https://raw.githack.com/fivethirtyeight/data/master/trump-twitter/realDonaldTrump_poll_tweets.csv",
"5a63b6cb2503a20517b5d41bd73e821ffbfdddd5cdc1977a547f1c925790bb15",
post_fetch_method = function(in_fn) # multiline anonymous function
    out_fn = "filtered_" * basename(in_fn)
    println(out_fn)
    open(out_fn, "w") do out_fh
        for line in eachline(in_fn)
            if !contains(line, "@")
                println(out_fh, line)
            end
        end
    end
end
)

# Read the file that we are generating
for line in eachline(datadep"TrumpTweets/filtered_realDonaldTrump_poll_tweets.csv")
    println(line)
    println()
end
```

@yakir12
Contributor Author

yakir12 commented Feb 8, 2018

That is mega cool. Sorry for being slow.
