This repository has been archived by the owner on Jun 1, 2023. It is now read-only.

Pipeline cleanup #23

Merged: 13 commits merged into USGS-R:master from pipeline_cleanup on Aug 20, 2020

Conversation

@limnoliver (Member) commented Jul 30, 2020

This PR cleans up several outstanding issues, including:

  1. Includes the new EcoSHEDS file sent by Jeff Walker on 7/23, which updates the data through June 2020.
  2. Moves all getter targets to getters.yml to avoid unnecessary builds, per Alison's suggestion. Closes #19 (Move data targets to getters.yml to avoid double builds).
  3. Follows the new pull patterns developed in the national-flow-observations pipeline, mainly using readNWISuv and readNWISdv so we can leave the start and end dates open and get all available data (which avoids having to reset dates, something I have forgotten in the past!). Also uses dummy dates to trigger repulls, and uses those dates in the pull IDs; see the sketch after this list. Closes #18 (Follow new pull patterns in flow pipeline).
  4. Limits the NWIS pull to streams and springs, which avoids the largest lake/reservoir sites that were causing problems. Closes #14 (write a catch for uv sites with too much data). I also suggest adding an NWIS pull to lake-temperature-model-prep to add value to that dataset, but limiting lake pulls to only the areas where you need them.
  5. Repulls the data!
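
A minimal sketch of the open-date pull pattern described in item 3, using dataRetrieval's readNWISdv and readNWISuv with empty start and end dates. The site numbers and parameter code below are placeholders for illustration; the real site lists and pull IDs come from the pipeline targets.

library(dataRetrieval)

# Placeholder sites and parameter code (00010 = water temperature, deg C)
sites <- c("01491000", "01645762")
temp_pcode <- "00010"

# Empty start/end dates ask the web service for all available data,
# so the date range never needs to be reset by hand
dv_dat <- readNWISdv(siteNumbers = sites, parameterCd = temp_pcode,
                     startDate = "", endDate = "")
uv_dat <- readNWISuv(siteNumbers = sites, parameterCd = temp_pcode,
                     startDate = "", endDate = "")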

Of course, tinkering around caused new issues:

  1. We are now missing WQP sites that have no state label, because we switched to looping through states when calling whatWQPdata to retrieve the observation counts used to partition data pulls. Some of these sites should be missing because they are outside the U.S. and would get filtered downstream anyway; however, some are in the U.S. and are truly just missing state labels. See Missing WQP sites #22 for a description and possible solutions.
  2. The UV data munge was causing memory failures in R. My solution was to reduce to daily mean values in the combine step (sketched below), which means the raw data is not preserved in the shared cache. I think this approach is OK, since we have a reproducible pipeline and are not using the raw data.
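
A minimal sketch of that combine-step reduction, assuming the bound UV data has already been renamed to generic site_no / dateTime / temperature columns (the actual munge functions and column names live in the pipeline):

library(dplyr)

# uv_bound: hypothetical bound-and-renamed UV data frame
daily_means <- uv_bound %>%
  mutate(date = as.Date(dateTime)) %>%
  group_by(site_no, date) %>%
  summarize(n_obs = n(),
            daily_mean_temp = mean(temperature, na.rm = TRUE),
            .groups = "drop")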

Commit messages (truncated):

  - …asier/safer rebuilds, and modified calls to whatWQPdata since national pulls no longer work. Now query by state, and by counties if a state fails.
  - … using readNWISdv and readNWISuv (instead of readNWISdata). This allows start and end dates to be set to "".
  - …xed algorithm for duplicate site/date records so that it ensures each date where a measurement was made is preserved.
  - …o handle data types. Wrote a new error handler for POST. Fewer WQP sites were expected because we are not getting sites that don't have a state designated (~20k sites). Fewer NWIS sites were expected because we are not pulling lakes/reservoirs.
  - …alues. I was running into memory issues when trying to reduce the data after it was bound together. The downside of this is that we don't keep high-frequency data in the shared cache (just locally, for whoever did the pull).
  - …mp column to use. UV data was bogging down memory when I was trying to filter the bound data later on. This fixes the memory issues, but does not preserve the raw data.
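
One of those commits reworks the handling of duplicate site/date records; a rough sketch of the intent, with hypothetical data frame and column names (the real fix lives in the pipeline's munge functions), is to keep one record per site/date so that no date with a measurement gets dropped:

library(dplyr)

# all_dat: hypothetical bound data with possible duplicate site/date rows
deduped <- all_dat %>%
  group_by(site_no, date) %>%
  slice(1) %>%   # keep a single record per site/date
  ungroup()
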
@aappling-usgs (Member) left a comment

I don't have time to fully review the PR, but below are some discussion points; I didn't spot anything that really needs changing.

"n_obs","n_sites"
51620919,355286
n_obs,n_sites
49640108,340901

Member:

Nice to have this summary file in the repo. Is the loss of observations mostly due to losing those sites without state codes, you think? About 15k of them?

Member Author:

Yes, this magnitude was expected based on my comparison with the old pull and the missing state codes.

wqp_out <- try(wqp_post_try(wqp_args))

if (class(wqp_out) == 'try-error') {
message("Error with call to POST, trying dataRetrieval::readWQPdata")

Member:
Curious: when does this error seem to occur, and when it does, does readWQPdata usually fix it?

Member Author:

This was fixing an old behavior (occasional failures of the POST call) and could probably be removed. It looked like all calls were successful through POST this time around (a message is printed during the pull about which method was used).


# first try the full state pull, wrapped in try so function does not fail
# with an error.
temp_dat <- try(wqp_call(whatWQPdata, wqp_args[c('characteristicName', 'statecode')]))

Member:
I wonder if it'd someday be worth our time (and/or Laura's) to add some fault tolerance directly into dataRetrieval. Seems like to get through big pulls we always need stuff like this.
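
For the sake of discussion, a rough sketch of the kind of fault tolerance this would need: retry a dataRetrieval call a few times before giving up, so one transient web-service failure doesn't sink a big pull. The retry_call wrapper below is hypothetical; only whatWQPdata and the wqp_args subsetting come from the snippet above.

retry_call <- function(fun, args, max_tries = 3, wait_sec = 10) {
  for (i in seq_len(max_tries)) {
    out <- try(do.call(fun, args), silent = TRUE)
    if (!inherits(out, 'try-error')) return(out)
    message(sprintf('Attempt %d of %d failed; retrying in %d seconds', i, max_tries, wait_sec))
    Sys.sleep(wait_sec)
  }
  stop('All ', max_tries, ' attempts failed')
}

# e.g., inventory one state's sites, retrying transient failures:
# temp_dat <- retry_call(whatWQPdata, wqp_args[c('characteristicName', 'statecode')])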

"n_sites","n_records","earliest","latest"
1223,1244622,1989-10-17,2019-12-31
n_sites,n_records,earliest,latest
1290,1339113,1989-10-17,2020-07-27

Member:
I'm impressed that we gained 67 sites between December and July!

"n_obs","n_sites"
83725239,919
n_obs,n_sites
656498,617

Member:
Big loss in sites here. Is this due to no longer pulling lake data?

Member Author:

Yes, that's my only explanation for the gain in the UV inventory but the loss in the data (sites get filtered after the inventory).

  out_ind = target_name)

4_other_sources/out/ecosheds_sites.rds.ind:
  command: unzip_extract_sites(
-   zip_ind = '4_other_sources/in/sheds-public-data-20200622.zip.ind',
+   zip_ind = '4_other_sources/in/sheds-public-data-20200723.zip.ind',

Member:

Be careful with committing the zip files themselves (below in this PR); it looks like you avoided it for 'sheds-public-data-20200723.zip'. If we start to hit repo size limits for this repo, I think '4_other_sources/in/sheds-public-data-20200622.zip' would be an excellent candidate for deep removal from the repo history (https://docs.github.com/en/github/authenticating-to-github/removing-sensitive-data-from-a-repository).

Member Author:

Yeah, I saw that mistake as well; I did not mean to commit it the first time around!

!is.na(`ActivityDepthHeightMeasure/MeasureValue`) ~ `ActivityDepthHeightMeasure/MeasureUnitCode`,
is.na(`ActivityDepthHeightMeasure/MeasureValue`) & !is.na(`ResultDepthHeightMeasure/MeasureValue`) ~ `ResultDepthHeightMeasure/MeasureUnitCode`,
is.na(`ActivityDepthHeightMeasure/MeasureValue`) & is.na(`ResultDepthHeightMeasure/MeasureValue`) & !is.na(`ActivityTopDepthHeightMeasure/MeasureValue`) ~ `ActivityTopDepthHeightMeasure/MeasureUnitCode`,
is.na(`ActivityDepthHeightMeasure/MeasureValue`) & is.na(`ResultDepthHeightMeasure/MeasureValue`) & is.na(`ActivityTopDepthHeightMeasure/MeasureValue`) ~ `ActivityBottomDepthHeightMeasure/MeasureUnitCode`

Member:
This guidance in ?case_when makes me think you could get away with a simpler code block above.

# Like an if statement, the arguments are evaluated in order, so you must
# proceed from the most specific to the most general. This won't work:
case_when(
  TRUE ~ as.character(x),
  x %%  5 == 0 ~ "fizz",
  x %%  7 == 0 ~ "buzz",
  x %% 35 == 0 ~ "fizz buzz"
)

So I'm fairly confident you could do

dat <- mutate(dat, sample_depth_unit_code = case_when(
     !is.na(`ActivityDepthHeightMeasure/MeasureValue`) ~ `ActivityDepthHeightMeasure/MeasureUnitCode`,
     !is.na(`ResultDepthHeightMeasure/MeasureValue`) ~ `ResultDepthHeightMeasure/MeasureUnitCode`,
     !is.na(`ActivityTopDepthHeightMeasure/MeasureValue`) ~ `ActivityTopDepthHeightMeasure/MeasureUnitCode`,
     TRUE ~ `ActivityBottomDepthHeightMeasure/MeasureUnitCode`
))


Member:

Double-checking with a more similar example:

> tibble(a=c(1, NA, NA, NA, 1), b=c(NA, 2, NA, 2, NA), c=c(NA, NA, 3, 3, 3)) %>%
+     mutate(best = case_when(!is.na(a) ~ a, !is.na(b) ~ b, !is.na(c) ~ c, TRUE ~ 0))
# A tibble: 5 x 4
      a     b     c  best
  <dbl> <dbl> <dbl> <dbl>
1     1    NA    NA     1
2    NA     2    NA     2
3    NA    NA     3     3
4    NA     2     3     2
5     1    NA     3     1

@wdwatkins (Collaborator):
@limnoliver I'm ready to run the pipeline with the new sites, if this PR is ready to go

@limnoliver merged commit 1496ffd into USGS-R:master on Aug 20, 2020.
@limnoliver deleted the pipeline_cleanup branch on January 22, 2021.