
Brittle coop #249

Merged
merged 22 commits into from
Dec 14, 2021
Conversation

jordansread

@jordansread jordansread commented Dec 10, 2021

Getting this PR up now because it has ind/build changes; I will add more context later this morning. It is tied to #231 and the things @padilla410 discovered in #248.

update

The focus of this PR is to make working on new temperature data sources more developer-friendly by clearing out several gotchas in target connections and triggers that, in hindsight, explain a number of mystery builds from the past.

  • I've pulled in changes from Julie's open PR so that we're not colliding on indicator-file changes. I will keep up with updates to (or the merge of) that PR and get them into this PR too. Pulling in changes from an open PR is generally not something we'd do, but in scipiper/shared-cache repos it is sometimes necessary.
  • I moved the "forever stale" target from #242 (Create an "always stale" target that generates the hash table of parser functions) to 1_fetch so it can easily be used by other steps.
  • I added the stale trigger file to two targets in 6_coop: the one that indexes the files on Google Drive (previously triggered by a dummy string) and the coop_wants target that relies on the index ind. Targets must be paired to use the trigger file, otherwise it doesn't work correctly (which explains why coop_wants uses it too).
  • Since the Google Drive index will now happen with every build that depends on it, and that call to Google is expensive (we could get API-limited if we access it too often?), I put a time check on the indicator for the file index (see below for freshened: 2021-12-10 05:44:22 -0600): it won't re-index Google Drive unless the last build was more than trigger_wait minutes ago (I set this to 30). This seems a little questionable, but I feel confident it is a major improvement over what was happening in the past.
  • The NTL targets were removed, to address #248 (Troubleshooting initial pipeline build). Those files are in the in folder anyhow and we don't need to re-download them. I did go ahead and update one of the files since a newer version was available, which explains the file-name and parser change in here.
  • I linked the downloaded file index to the 7a_ munge task table, since those weren't connected before. That was likely another source of the confusing double builds that happened in the past.
  • I added NULL as the default on the trigger-file targets, so leaving out that arg skips the stale-trigger step in case you want to avoid it when developing.
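The trigger_wait check described above can be sketched roughly like this (a minimal illustration assuming the .ind file is YAML with hash and freshened fields; the function name maybe_freshen_gd_index is hypothetical, gd_freshen_dir is this pipeline's, and the body is illustrative rather than the real source):

```r
# Sketch: skip the expensive Google Drive re-index if the last index
# is newer than `trigger_wait` minutes. Field names follow the .ind
# diff shown in this PR (hash + freshened); logic is illustrative.
maybe_freshen_gd_index <- function(out_ind, gd_path, trigger_wait = 30) {
  ind_data <- yaml::yaml.load_file(out_ind)
  if (is.null(ind_data$freshened) ||
      difftime(Sys.time(), as.POSIXct(ind_data$freshened), units = 'mins') > trigger_wait) {
    gd_freshen_dir(out_ind, gd_path)  # re-index and stamp a new `freshened` time
  } else {
    message('skipping google drive index because it still smells fresh')
  }
}
```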

@@ -1,2 +1,3 @@
-hash: 59795c7efcc7526f0462cdac07698f21
+hash: 7566d6ac6bfe10e47a9a3a90065e827e
+freshened: 2021-12-10 05:44:22 -0600
Author

going to scdel() this and rename from .ind to .tind (time-indicator) to avoid confusion with the extra information

Author

Never mind, that isn't possible because the various scipiper functions check that it has a valid name for an indicator file.

Author

Modified this with an updated commit

@jordansread jordansread marked this pull request as ready for review December 10, 2021 13:48
@limnoliver
Contributor
Nice fixes to some questionable choices I made 🙈! Just a few comments. I still don't understand why you need the always-stale trigger in both targets (but I see your note, so I trust you've thought it through).

@@ -15,7 +15,7 @@ parse_mendota_daily_buoy <- function(inind, outind) {
   sc_indicate(ind_file = outind, data_file = outfile)
 }

-parse_long_term_ntl <- function(inind, outind) {
+parse_ntl29_v10_0 <- function(inind, outind) {
Author

will impact #247 open PR if it re-pulls google drive files


 coop_wants:
-  command: filter_coop_all('6_temp_coop_fetch/out/coop_all_files.rds.ind')
+  command: filter_coop_all('6_temp_coop_fetch/out/coop_all_files.rds.ind',
+    trigger_file = '1_crosswalk_fetch/out/always_stale_time.txt')
Contributor

Do you need this trigger when the target relies on 6_temp_coop_fetch/out/coop_all_files.rds.ind, which also has an always-stale trigger?

Author

As far as I can tell, this is related to the way remake tracks build changes while things are executing. If only one target in the dependency chain has the trigger file, it oddly is not considered stale on the second build (even though you would expect it to be, since the trigger_file is secretly modified as a side effect of the build). I can't remember exactly what I determined here; perhaps if you don't have two trigger_file-associated targets, you only get rebuilds every other time, which is frustrating and confusing. So my solution is to always pair them, which is reliable 100% of the time.
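The pairing can be sketched in remake.yml terms like this (coop_inventory is a hypothetical name standing in for whatever builds the index; the coop_wants lines follow the diff in this PR):

```yaml
# Both the index target and its downstream consumer name the trigger file;
# with only one of them, remake only marks things stale every other build.
6_temp_coop_fetch/out/coop_all_files.rds.ind:
  command: coop_inventory(target_name,
    trigger_file = '1_crosswalk_fetch/out/always_stale_time.txt')

coop_wants:
  command: filter_coop_all('6_temp_coop_fetch/out/coop_all_files.rds.ind',
    trigger_file = '1_crosswalk_fetch/out/always_stale_time.txt')
```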


# do we trigger a rebuild if the file contents changes? not just the names?
Contributor

Hmmm, I think we were only assuming file name changes (or file adds) would trigger rebuilds.

6_temp_coop_fetch/in/Winnie_temp_2_2009.xlsx.ind: c19ce966b9c5a6072c0cf1d1516dd2f2
6_temp_coop_fetch/in/DNRdatarequest_Secchi_DO_and_Temp_1083_2016_AllLakes_JL.xlsx.ind: 22c8cc6fc3af2311c2d90edf9940fedb
Contributor

Did some of these just change order? I see the same hash. Could we add a sort to the files_in_drive function that does the GD inventory?

Author

I created this scipiper issue, and I think the ordering that matters here is in sc_indicate rather than in the result from drive_ls, since this .ind file is the result of the task table and not of the GD inventory function.

But yes, you are seeing re-ordering, not a new hash. It is distracting to see diffs when the hashes aren't changing. I'm with ya

(difftime(Sys.time(), as.POSIXct(ind_data$freshened), units = 'mins')) > trigger_wait){
gd_freshen_dir(out_ind, gd_path)
} else {
message('skipping google drive index because it still smells fresh')
Contributor
ha! love it.

- 6_temp_coop_fetch/log/6_temp_coop_fetch_tasks.ind

# -- download LTER files and push to GD -- #
coop_file_upload_location:
command: as_id(I('1dutCiFEOoRObXLn6n3BRKhZDDAijK65Q'))

6_temp_coop_fetch/downloads/long_term_ntl.csv.ind:
command: download_and_indicate(in_ind = target_name,
url = I("https://lter.limnology.wisc.edu/sites/default/files/data/ntl29_v5.csv"),
Contributor

Is the download URL still being documented somewhere? It's now on Google Drive in the "in" folder, so is that our new canonical source for this data?

Author

It was originally going into that in folder automatically too, via this line in the download_and_indicate() function:

googledrive::drive_upload(media = local_file, path = push_location)

where push_location is the Google Drive location for the in folder. So the file was going into both downloads and in; in is where it became part of the coop parsers, and I think the downloads folder is ignored by this pipeline.
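The whole flow might look roughly like this (a sketch of download_and_indicate() as described above, not the repo's actual function body; argument names beyond in_ind and the use of as_data_file are assumptions):

```r
# Sketch: the download lands in downloads/ locally, gets pushed to the
# Google Drive "in" location, and an indicator file is written.
# Body is illustrative, not the real source.
download_and_indicate <- function(in_ind, url, push_location) {
  local_file <- scipiper::as_data_file(in_ind)  # e.g. 6_temp_coop_fetch/downloads/long_term_ntl.csv
  download.file(url, destfile = local_file)
  googledrive::drive_upload(media = local_file, path = push_location)  # the GD "in" folder
  scipiper::sc_indicate(ind_file = in_ind, data_file = local_file)
}
```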

Is the download URL still being documented somewhere?
Yes, I added an explainer file for this target.

@jordansread jordansread merged commit 531f7cb into DOI-USGS:main Dec 14, 2021