…asier/safer rebuilds, and modified calls to whatWQPdata since national pulls no longer work. Now queries by state, falling back to counties if the state query fails.
… using readNWISdv and readNWISuv (instead of readNWISdata). This allows start and end dates to be set to "".
…xed algorithm for duplicate site/date records so that each date where a measurement was made is preserved.
…o handle data types. Wrote a new error handler for POST. Fewer WQP sites were expected because we are not getting sites that don't have a state designated (~20k sites). Fewer NWIS sites were expected because we are not pulling lakes/reservoirs.
…alues. I was running into memory issues when trying to reduce the data after it was bound together. The downside is that we don't keep high-frequency data in the shared cache (just locally for whoever did the pull).
…mp column to use. UV data was causing memory bogdowns when I was trying to filter the bound data later on. This fixes the memory issues, but does not preserve raw data.
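For context, the kind of daily reduction described in these commits could look roughly like the sketch below. This is not the repo's actual code: the site number, parameter code, and the positional rename of the value column are illustrative assumptions (00010 is the water-temperature parameter code, and empty start/end dates tell readNWISuv to return the full period of record).

library(dplyr)
library(dataRetrieval)

# pull all available instantaneous (UV) water temperature for one example site;
# "" start/end dates mean the full period of record
uv_raw <- readNWISuv(siteNumbers = "01491000", parameterCd = "00010",
                     startDate = "", endDate = "")

# immediately collapse to daily min/mean/max per site so the high-frequency
# records never need to be carried through later joins/filters
uv_daily <- uv_raw %>%
  rename(temp_c = 4) %>%              # value column; position assumed, may differ
  mutate(date = as.Date(dateTime)) %>%
  group_by(site_no, date) %>%
  summarize(temp_min  = min(temp_c, na.rm = TRUE),
            temp_mean = mean(temp_c, na.rm = TRUE),
            temp_max  = max(temp_c, na.rm = TRUE),
            n_obs     = n(),
            .groups   = "drop")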
I don't have time to fully review the PR, but below are some discussion points; I didn't spot anything that really needs changing.
"n_obs","n_sites" | ||
51620919,355286 | ||
n_obs,n_sites | ||
49640108,340901 |
Nice to have this summary file in the repo. Do you think the loss of observations is mostly due to losing those sites without state codes? About 15k of them?
Yes - this magnitude was expected based on my comparison with the old pull and missing state codes.
wqp_out <- try(wqp_post_try(wqp_args))

if (class(wqp_out) == 'try-error') {
  message("Error with call to POST, trying dataRetrieval::readWQPdata")
Curious: when does this error seem to occur, and when it does, does readWQPdata usually fix it?
This was fixing an old behavior (occasional failures of the POST call) that could probably be removed. It looked like all calls this time around were successful through POST (a message is printed during the pull about which method was used).
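For readers skimming the thread, the fallback being discussed is essentially the pattern sketched below (not the exact repo code; wqp_post_try() is the repo's POST helper, and the do.call() form assumes wqp_args is a named list of WQP query arguments):

wqp_out <- try(wqp_post_try(wqp_args))
if (inherits(wqp_out, 'try-error')) {   # inherits() is a safer check than class() ==
  message("Error with call to POST, trying dataRetrieval::readWQPdata")
  wqp_out <- do.call(dataRetrieval::readWQPdata, wqp_args)
}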
# first try the full state pull, wrapped in try so function does not fail
# with an error.
temp_dat <- try(wqp_call(whatWQPdata, wqp_args[c('characteristicName', 'statecode')]))
I wonder if it'd someday be worth our time (and/or Laura's) to add some fault tolerance directly into dataRetrieval. Seems like to get through big pulls we always need stuff like this.
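As a concrete illustration of that kind of fault tolerance, the state-then-county fallback mentioned in the commit notes could be wrapped up roughly as below. This is a hypothetical sketch, not the repo's code; county_codes is assumed to be a vector of full WQP county codes such as "US:55:025".

library(dataRetrieval)

inventory_with_fallback <- function(wqp_args, county_codes) {
  # first try the full state inventory
  temp_dat <- try(do.call(whatWQPdata, wqp_args[c('characteristicName', 'statecode')]))
  if (inherits(temp_dat, 'try-error')) {
    message('State-level inventory failed; retrying county by county')
    per_county <- lapply(county_codes, function(cc) {
      do.call(whatWQPdata, c(wqp_args['characteristicName'], list(countycode = cc)))
    })
    temp_dat <- do.call(rbind, per_county)
  }
  temp_dat
}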
"n_sites","n_records","earliest","latest" | ||
1223,1244622,1989-10-17,2019-12-31 | ||
n_sites,n_records,earliest,latest | ||
1290,1339113,1989-10-17,2020-07-27 |
I'm impressed that we gained 67 sites between December and July!
"n_obs","n_sites" | ||
83725239,919 | ||
n_obs,n_sites | ||
656498,617 |
Big loss in sites here. Is this due to no longer pulling lake data?
Yes, that's my only explanation for the gain in the UV inventory but loss in data (sites get filtered after the inventory).
    out_ind = target_name)

  4_other_sources/out/ecosheds_sites.rds.ind:
    command: unzip_extract_sites(
-     zip_ind = '4_other_sources/in/sheds-public-data-20200622.zip.ind',
+     zip_ind = '4_other_sources/in/sheds-public-data-20200723.zip.ind',
Be careful with committing the zip files themselves (below in this PR) - looks like you avoided it for 'sheds-public-data-20200723.zip'. If we start to hit repo size limits for this repo, I think '4_other_sources/in/sheds-public-data-20200622.zip' would be an excellent candidate for deep removal from the repo history (https://docs.github.com/en/github/authenticating-to-github/removing-sensitive-data-from-a-repository).
Yeah, I saw that mistake as well; I did not mean to commit it the first time around!
!is.na(`ActivityDepthHeightMeasure/MeasureValue`) ~ `ActivityDepthHeightMeasure/MeasureUnitCode`,
is.na(`ActivityDepthHeightMeasure/MeasureValue`) & !is.na(`ResultDepthHeightMeasure/MeasureValue`) ~ `ResultDepthHeightMeasure/MeasureUnitCode`,
is.na(`ActivityDepthHeightMeasure/MeasureValue`) & is.na(`ResultDepthHeightMeasure/MeasureValue`) & !is.na(`ActivityTopDepthHeightMeasure/MeasureValue`) ~ `ActivityTopDepthHeightMeasure/MeasureUnitCode`,
is.na(`ActivityDepthHeightMeasure/MeasureValue`) & is.na(`ResultDepthHeightMeasure/MeasureValue`) & is.na(`ActivityTopDepthHeightMeasure/MeasureValue`) ~ `ActivityBottomDepthHeightMeasure/MeasureUnitCode`
This guidance in ?case_when makes me think you could get away with a simpler code block above.
# Like an if statement, the arguments are evaluated in order, so you must
# proceed from the most specific to the most general. This won't work:
case_when(
TRUE ~ as.character(x),
x %% 5 == 0 ~ "fizz",
x %% 7 == 0 ~ "buzz",
x %% 35 == 0 ~ "fizz buzz"
)
So I'm fairly confident you could do
dat <- mutate(dat, sample_depth_unit_code = case_when(
!is.na(`ActivityDepthHeightMeasure/MeasureValue`) ~ `ActivityDepthHeightMeasure/MeasureUnitCode`,
!is.na(`ResultDepthHeightMeasure/MeasureValue`) ~ `ResultDepthHeightMeasure/MeasureUnitCode`,
!is.na(`ActivityTopDepthHeightMeasure/MeasureValue`) ~ `ActivityTopDepthHeightMeasure/MeasureUnitCode`,
TRUE ~ `ActivityBottomDepthHeightMeasure/MeasureUnitCode`
))
double checking with a more similar example:
> tibble(a=c(1, NA, NA, NA, 1), b=c(NA, 2, NA, 2, NA), c=c(NA, NA, 3, 3, 3)) %>%
+ mutate(best = case_when(!is.na(a) ~ a, !is.na(b) ~ b, !is.na(c) ~ c, TRUE ~ 0))
# A tibble: 5 x 4
a b c best
<dbl> <dbl> <dbl> <dbl>
1 1 NA NA 1
2 NA 2 NA 2
3 NA NA 3 3
4 NA 2 3 2
5 1 NA 3 1
@limnoliver I'm ready to run the pipeline with the new sites, if this PR is ready to go
This PR cleans up several outstanding issues, including:

- Moves gets to `getters.yml` to avoid unnecessary builds per Alison's suggestion, and closes Move data targets to getters.yml to avoid double builds (#19).
- Uses `readNWISuv` and `readNWISdv` so we can use open start and end times to get all data (avoids having to reset dates, which I have forgotten in the past!). Also uses dummy dates to trigger repulls, and uses those dates in pull IDs. This closes Follow new pull patterns in flow pipeline (#18).
- … `lake-temperature-model-prep` to add value to that dataset, but to also limit lake pulls to only areas where you need them.

Of course, tinkering around caused new issues:

- Some sites are missing from the `whatWQPdata` call used to retrieve observation counts to partition data pulls (a rough sketch of partitioning by those counts follows below). Some of these should be missing because they are outside of the U.S. and would get filtered downstream anyway. However, some of these are in the U.S. and are truly just missing state labels. See Missing WQP sites (#22) for a description and possible solutions.
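To illustrate that partitioning step, one simple approach is sketched here. partition_pulls(), n_results, and pull_group are placeholder names, and the real per-site count column returned by whatWQPdata may be named differently:

library(dplyr)

partition_pulls <- function(inventory, max_results_per_pull = 250000) {
  inventory %>%
    arrange(desc(n_results)) %>%
    # start a new pull group roughly every max_results_per_pull observations,
    # walking the cumulative per-site counts
    mutate(pull_group = 1 + cumsum(n_results) %/% max_results_per_pull)
}

Each resulting pull_group could then be passed to its own data-pull call, keeping any single request to a manageable size.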