This repository has been archived by the owner on Jun 1, 2023. It is now read-only.

Pipeline cleanup #23

Merged: 13 commits merged into USGS-R:master from pipeline_cleanup on Aug 20, 2020

Conversation

@limnoliver (Member) commented Jul 30, 2020

This PR cleans up several outstanding issues, including:

  1. Includes the new EcoSHEDS file sent by Jeff Walker on 7/23, which updates the data through June 2020.
  2. Moves all getter targets to getters.yml to avoid unnecessary builds, per Alison's suggestion. Closes #19 (Move data targets to getters.yml to avoid double builds).
  3. Follows the new pull patterns developed in the national-flow-observations pipeline, mainly using readNWISuv and readNWISdv so we can leave the start and end dates open and get all available data (which avoids having to reset dates, something I have forgotten in the past!). Also uses dummy dates to trigger repulls, and uses those dates in the pull IDs; see the sketch after this list. Closes #18 (Follow new pull patterns in flow pipeline).
  4. Limits the NWIS pull to streams and springs, which avoids the largest lake/reservoir sites that were causing problems. Closes #14 (write a catch for uv sites with too much data). I also suggest adding an NWIS pull to lake-temperature-model-prep to add value to that dataset, but limiting lake pulls to only the areas where you need them.
  5. Repulls the data!
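
A minimal sketch of the open-date pull pattern described in item 3, using dataRetrieval's readNWISdv and readNWISuv with empty start and end dates. The site numbers and parameter code below are placeholders for illustration; the real site lists and pull IDs come from the pipeline targets.

library(dataRetrieval)

# Placeholder sites and parameter code (00010 = water temperature, deg C)
sites <- c("01491000", "01645762")
temp_pcode <- "00010"

# Empty start/end dates ask the web service for all available data,
# so the date range never needs to be reset by hand
dv_dat <- readNWISdv(siteNumbers = sites, parameterCd = temp_pcode,
                     startDate = "", endDate = "")
uv_dat <- readNWISuv(siteNumbers = sites, parameterCd = temp_pcode,
                     startDate = "", endDate = "")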

Of course, tinkering around caused new issues:

  1. We are now missing WQP sites that have no state label, because we switched to looping through states when calling whatWQPdata to retrieve the observation counts used to partition data pulls. Some of these sites should be missing because they are outside the U.S. and would get filtered downstream anyway; however, some are in the U.S. and are truly just missing state labels. See Missing WQP sites #22 for a description and possible solutions.
  2. The UV data munge was causing memory failures in R. My solution was to reduce to daily mean values in the combine step (sketched below), which means the raw data is not preserved in the shared cache. I think this approach is OK, since we have a reproducible pipeline and are not using the raw data.
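
A minimal sketch of that combine-step reduction, assuming the bound UV data has already been renamed to generic site_no / dateTime / temperature columns (the actual munge functions and column names live in the pipeline):

library(dplyr)

# uv_bound: hypothetical bound-and-renamed UV data frame
daily_means <- uv_bound %>%
  mutate(date = as.Date(dateTime)) %>%
  group_by(site_no, date) %>%
  summarize(n_obs = n(),
            daily_mean_temp = mean(temperature, na.rm = TRUE),
            .groups = "drop")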

Commit messages (truncated):

  - …asier/safer rebuilds, and modified calls to whatWQPdata since national pulls no longer work. Now query by state, and by counties if a state fails.
  - … using readNWISdv and readNWISuv (instead of readNWISdata). This allows start and end dates to be set to "".
  - …xed algorithm for duplicate site/date records so that it ensures each date where a measurement was made is preserved.
  - …o handle data types. Wrote a new error handler for POST. Fewer WQP sites were expected because we are not getting sites that don't have a state designated (~20k sites). Fewer NWIS sites were expected because we are not pulling lakes/reservoirs.
  - …alues. I was running into memory issues when trying to reduce the data after it was bound together. The downside of this is that we don't keep high-frequency data in the shared cache (just locally, for whoever did the pull).
  - …mp column to use. UV data was bogging down memory when I was trying to filter the bound data later on. This fixes the memory issues, but does not preserve the raw data.
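
One of those commits reworks the handling of duplicate site/date records; a rough sketch of the intent, with hypothetical data frame and column names (the real fix lives in the pipeline's munge functions), is to keep one record per site/date so that no date with a measurement gets dropped:

library(dplyr)

# all_dat: hypothetical bound data with possible duplicate site/date rows
deduped <- all_dat %>%
  group_by(site_no, date) %>%
  slice(1) %>%   # keep a single record per site/date
  ungroup()
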
@aappling-usgs (Member) left a comment

I don't have time to fully review the PR, but below are some discussion points; I didn't spot anything that really needs changing.

"n_obs","n_sites"
51620919,355286
n_obs,n_sites
49640108,340901

Member:

Nice to have this summary file in the repo. Is the loss of observations mostly due to losing those sites without state codes, you think? About 15k of them?

Member Author:

Yes, this magnitude was expected based on my comparison with the old pull and the missing state codes.

wqp_out <- try(wqp_post_try(wqp_args))

if (class(wqp_out) == 'try-error') {
message("Error with call to POST, trying dataRetrieval::readWQPdata")

Member:
Curious: when does this error seem to occur, and when it does, does readWQPdata usually fix it?

Member Author:

This was fixing an old behavior (occasional failures of the POST call) and could probably be removed. It looked like all calls were successful through POST this time around (a message is printed during the pull about which method was used).


# first try the full state pull, wrapped in try so function does not fail
# with an error.
temp_dat <- try(wqp_call(whatWQPdata, wqp_args[c('characteristicName', 'statecode')]))

Member:
I wonder if it'd someday be worth our time (and/or Laura's) to add some fault tolerance directly into dataRetrieval. Seems like to get through big pulls we always need stuff like this.
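
For the sake of discussion, a rough sketch of the kind of fault tolerance this would need: retry a dataRetrieval call a few times before giving up, so one transient web-service failure doesn't sink a big pull. The retry_call wrapper below is hypothetical; only whatWQPdata and the wqp_args subsetting come from the snippet above.

retry_call <- function(fun, args, max_tries = 3, wait_sec = 10) {
  for (i in seq_len(max_tries)) {
    out <- try(do.call(fun, args), silent = TRUE)
    if (!inherits(out, 'try-error')) return(out)
    message(sprintf('Attempt %d of %d failed; retrying in %d seconds', i, max_tries, wait_sec))
    Sys.sleep(wait_sec)
  }
  stop('All ', max_tries, ' attempts failed')
}

# e.g., inventory one state's sites, retrying transient failures:
# temp_dat <- retry_call(whatWQPdata, wqp_args[c('characteristicName', 'statecode')])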

"n_sites","n_records","earliest","latest"
1223,1244622,1989-10-17,2019-12-31
n_sites,n_records,earliest,latest
1290,1339113,1989-10-17,2020-07-27

Member:
I'm impressed that we gained 67 sites between December and July!

"n_obs","n_sites"
83725239,919
n_obs,n_sites
656498,617

Member:
Big loss in sites here. Is this due to no longer pulling lake data?

Member Author:

Yes, that's my only explanation for the gain in the UV inventory but the loss in the data (sites get filtered after the inventory).

  out_ind = target_name)

4_other_sources/out/ecosheds_sites.rds.ind:
  command: unzip_extract_sites(
-   zip_ind = '4_other_sources/in/sheds-public-data-20200622.zip.ind',
+   zip_ind = '4_other_sources/in/sheds-public-data-20200723.zip.ind',

Member:

Be careful with committing the zip files themselves (below in this PR); it looks like you avoided it for 'sheds-public-data-20200723.zip'. If we start to hit repo size limits for this repo, I think '4_other_sources/in/sheds-public-data-20200622.zip' would be an excellent candidate for deep removal from the repo history (https://docs.github.com/en/github/authenticating-to-github/removing-sensitive-data-from-a-repository).

Member Author:

Yeah, I saw that mistake as well; I did not mean to commit it the first time around!

!is.na(`ActivityDepthHeightMeasure/MeasureValue`) ~ `ActivityDepthHeightMeasure/MeasureUnitCode`,
is.na(`ActivityDepthHeightMeasure/MeasureValue`) & !is.na(`ResultDepthHeightMeasure/MeasureValue`) ~ `ResultDepthHeightMeasure/MeasureUnitCode`,
is.na(`ActivityDepthHeightMeasure/MeasureValue`) & is.na(`ResultDepthHeightMeasure/MeasureValue`) & !is.na(`ActivityTopDepthHeightMeasure/MeasureValue`) ~ `ActivityTopDepthHeightMeasure/MeasureUnitCode`,
is.na(`ActivityDepthHeightMeasure/MeasureValue`) & is.na(`ResultDepthHeightMeasure/MeasureValue`) & is.na(`ActivityTopDepthHeightMeasure/MeasureValue`) ~ `ActivityBottomDepthHeightMeasure/MeasureUnitCode`

Member:
This guidance in ?case_when makes me think you could get away with a simpler code block above.

# Like an if statement, the arguments are evaluated in order, so you must
# proceed from the most specific to the most general. This won't work:
case_when(
  TRUE ~ as.character(x),
  x %%  5 == 0 ~ "fizz",
  x %%  7 == 0 ~ "buzz",
  x %% 35 == 0 ~ "fizz buzz"
)

So I'm fairly confident you could do

dat <- mutate(dat, sample_depth_unit_code = case_when(
     !is.na(`ActivityDepthHeightMeasure/MeasureValue`) ~ `ActivityDepthHeightMeasure/MeasureUnitCode`,
     !is.na(`ResultDepthHeightMeasure/MeasureValue`) ~ `ResultDepthHeightMeasure/MeasureUnitCode`,
     !is.na(`ActivityTopDepthHeightMeasure/MeasureValue`) ~ `ActivityTopDepthHeightMeasure/MeasureUnitCode`,
     TRUE ~ `ActivityBottomDepthHeightMeasure/MeasureUnitCode`
))


Member:

Double-checking with a more similar example:

> tibble(a=c(1, NA, NA, NA, 1), b=c(NA, 2, NA, 2, NA), c=c(NA, NA, 3, 3, 3)) %>%
+     mutate(best = case_when(!is.na(a) ~ a, !is.na(b) ~ b, !is.na(c) ~ c, TRUE ~ 0))
# A tibble: 5 x 4
      a     b     c  best
  <dbl> <dbl> <dbl> <dbl>
1     1    NA    NA     1
2    NA     2    NA     2
3    NA    NA     3     3
4    NA     2     3     2
5     1    NA     3     1

@wdwatkins (Collaborator):
@limnoliver I'm ready to run the pipeline with the new sites, if this PR is ready to go

@limnoliver merged commit 1496ffd into USGS-R:master on Aug 20, 2020.
@limnoliver deleted the pipeline_cleanup branch on January 22, 2021.