
Brittle coop #249

Merged
merged 22 commits into from
Dec 14, 2021
Conversation

jordansread

@jordansread jordansread commented Dec 10, 2021

Getting this PR up now because it has ind/build changes; I will add more context later this morning. It is tied to #231 and the things @padilla410 discovered in #248.

update

The focus of this PR is to make working on new temperature data sources more developer-friendly by clearing out several gotchas in target connections and triggers that, in hindsight, explain a number of mystery builds from the past.

  • I've pulled in changes from Julie's open PR so that we're not colliding on indicator-file changes. I will keep up with updates to (or the merge of) that PR and get them into this PR too. Pulling in changes from an open PR is generally not something we'd do, but in scipiper/shared-cache repos it is sometimes necessary.
  • I moved the "forever stale" target from #242 (Create an "always stale" target that generates the hash table of parser functions) to 1_fetch so it can easily be used by other steps.
  • I added the stale trigger file to two targets in 6_coop: the one that indexes the files on Google Drive (previously triggered by a dummy string) and the coop_wants target that relies on the index ind. Targets must be paired to use the trigger file, otherwise it doesn't work correctly (which explains why coop_wants uses it too).
  • Since the Google Drive index will now happen with every build that depends on it, and that call to Google is expensive (we could get API-limited if we access it too often?), I put a time check on the indicator for the file index (see below for freshened: 2021-12-10 05:44:22 -0600): it won't re-index Google Drive unless the last build was more than trigger_wait minutes ago (I set this to 30). This seems a little questionable, but I feel confident it is a major improvement over what was happening in the past.
  • The NTL targets were removed, to address #248 (Troubleshooting initial pipeline build). Those files are in the in folder anyhow and we don't need to re-download them. I did go ahead and update one of the files since a newer version was available, which explains the file-name and parser change in here.
  • I linked the downloaded file index to the 7a_ munge task table, since those weren't connected before. That was likely another source of the confusing double builds that happened in the past.
  • I added NULL as the default on the trigger-file targets, so leaving out that arg skips the stale-trigger step in case you want to avoid it when developing.
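The trigger_wait check described above can be sketched roughly like this (a minimal illustration assuming the .ind file is YAML with hash and freshened fields; the function name maybe_freshen_gd_index is hypothetical, gd_freshen_dir is this pipeline's, and the body is illustrative rather than the real source):

```r
# Sketch: skip the expensive Google Drive re-index if the last index
# is newer than `trigger_wait` minutes. Field names follow the .ind
# diff shown in this PR (hash + freshened); logic is illustrative.
maybe_freshen_gd_index <- function(out_ind, gd_path, trigger_wait = 30) {
  ind_data <- yaml::yaml.load_file(out_ind)
  if (is.null(ind_data$freshened) ||
      difftime(Sys.time(), as.POSIXct(ind_data$freshened), units = 'mins') > trigger_wait) {
    gd_freshen_dir(out_ind, gd_path)  # re-index and stamp a new `freshened` time
  } else {
    message('skipping google drive index because it still smells fresh')
  }
}
```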

@@ -1,2 +1,3 @@
-hash: 59795c7efcc7526f0462cdac07698f21
+hash: 7566d6ac6bfe10e47a9a3a90065e827e
+freshened: 2021-12-10 05:44:22 -0600
Author

going to scdel() this and rename from .ind to .tind (time-indicator) to avoid confusion with the extra information

Author

Never mind, that isn't possible because the various scipiper functions check that it has a valid name for an indicator file.

Author

Modified this with an updated commit

@jordansread jordansread marked this pull request as ready for review December 10, 2021 13:48
@limnoliver
Contributor
Nice fixes to some questionable choices I made 🙈! Just a few comments. I still don't understand why you need the always-stale trigger in both targets (but I see your note, so I trust you've thought it through).

@@ -15,7 +15,7 @@ parse_mendota_daily_buoy <- function(inind, outind) {
   sc_indicate(ind_file = outind, data_file = outfile)
 }

-parse_long_term_ntl <- function(inind, outind) {
+parse_ntl29_v10_0 <- function(inind, outind) {
Author

will impact #247 open PR if it re-pulls google drive files


 coop_wants:
-  command: filter_coop_all('6_temp_coop_fetch/out/coop_all_files.rds.ind')
+  command: filter_coop_all('6_temp_coop_fetch/out/coop_all_files.rds.ind',
+    trigger_file = '1_crosswalk_fetch/out/always_stale_time.txt')
Contributor

Do you need this trigger when the target relies on 6_temp_coop_fetch/out/coop_all_files.rds.ind, which also has an always-stale trigger?

Author

As far as I can tell, this is related to the way remake tracks build changes while things are executing. If only one target in the dependency chain has the trigger file, it oddly is not considered stale on the second build (even though you would expect it to be, since the trigger_file is secretly modified as a side effect of the build). I can't remember exactly what I determined here; perhaps if you don't have two trigger_file-associated targets, you only get rebuilds every other time, which is frustrating and confusing. So my solution is to always pair them, which is reliable 100% of the time.
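The pairing can be sketched in remake.yml terms like this (coop_inventory is a hypothetical name standing in for whatever builds the index; the coop_wants lines follow the diff in this PR):

```yaml
# Both the index target and its downstream consumer name the trigger file;
# with only one of them, remake only marks things stale every other build.
6_temp_coop_fetch/out/coop_all_files.rds.ind:
  command: coop_inventory(target_name,
    trigger_file = '1_crosswalk_fetch/out/always_stale_time.txt')

coop_wants:
  command: filter_coop_all('6_temp_coop_fetch/out/coop_all_files.rds.ind',
    trigger_file = '1_crosswalk_fetch/out/always_stale_time.txt')
```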


# do we trigger a rebuild if the file contents changes? not just the names?
Contributor

Hmmm, I think we were only assuming file name changes (or file adds) would trigger rebuilds.

6_temp_coop_fetch/in/Winnie_temp_2_2009.xlsx.ind: c19ce966b9c5a6072c0cf1d1516dd2f2
6_temp_coop_fetch/in/DNRdatarequest_Secchi_DO_and_Temp_1083_2016_AllLakes_JL.xlsx.ind: 22c8cc6fc3af2311c2d90edf9940fedb
Contributor

Did some of these just change order? I see the same hash. Could we add a sort to the files_in_drive function that does the GD inventory?

Author

I created this scipiper issue, and I think the ordering that matters here is in sc_indicate rather than in the result from drive_ls, since this .ind file is the result of the task table and not of the GD inventory function.

But yes, you are seeing re-ordering, not a new hash. It is distracting to see diffs when the hashes aren't changing. I'm with ya

(difftime(Sys.time(), as.POSIXct(ind_data$freshened), units = 'mins')) > trigger_wait){
gd_freshen_dir(out_ind, gd_path)
} else {
message('skipping google drive index because it still smells fresh')
Contributor
ha! love it.

- 6_temp_coop_fetch/log/6_temp_coop_fetch_tasks.ind

# -- download LTER files and push to GD -- #
coop_file_upload_location:
command: as_id(I('1dutCiFEOoRObXLn6n3BRKhZDDAijK65Q'))

6_temp_coop_fetch/downloads/long_term_ntl.csv.ind:
command: download_and_indicate(in_ind = target_name,
url = I("https://lter.limnology.wisc.edu/sites/default/files/data/ntl29_v5.csv"),
Contributor

Is the download URL still being documented somewhere? It's now on Google Drive in the "in" folder, so is that our new canonical source for this data?

Author

It was originally going into that in folder automatically too, via this line in the download_and_indicate() function:

googledrive::drive_upload(media = local_file, path = push_location)

where push_location is the Google Drive location for the in folder. So the file was going into both downloads and in; in is where it became part of the coop parsers, and I think the downloads folder is ignored by this pipeline.
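The whole flow might look roughly like this (a sketch of download_and_indicate() as described above, not the repo's actual function body; argument names beyond in_ind and the use of as_data_file are assumptions):

```r
# Sketch: the download lands in downloads/ locally, gets pushed to the
# Google Drive "in" location, and an indicator file is written.
# Body is illustrative, not the real source.
download_and_indicate <- function(in_ind, url, push_location) {
  local_file <- scipiper::as_data_file(in_ind)  # e.g. 6_temp_coop_fetch/downloads/long_term_ntl.csv
  download.file(url, destfile = local_file)
  googledrive::drive_upload(media = local_file, path = push_location)  # the GD "in" folder
  scipiper::sc_indicate(ind_file = in_ind, data_file = local_file)
}
```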

Is the download URL still being documented somewhere?
Yes, I added an explainer file for this target.

@jordansread jordansread merged commit 531f7cb into DOI-USGS:main Dec 14, 2021