This repository has been archived by the owner on Jun 1, 2023. It is now read-only.

Wqp fixes #58

Merged
merged 6 commits into from
Apr 25, 2023


limnoliver
Member

Some WQP fixes that the trends group pointed out to us. This PR handles the StatisticalBaseCode field as well as some date parsing issues, where cooperators were reporting means across several dates, or where dates were reported as the start of the project rather than the date of collection. It also rebuilds all data downstream of these changes.

…s only min/mean/max statcodes and uses them appropriately, drops observations where activity start date does not equal end date (except where there is only one obs per site-date) or resolves date issues when there are collection dates in comments. Also does not calculate min and max when there is only a single observation.
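
As an illustration of the date and single-observation rules described above, here is a minimal sketch in R with dplyr. It is not the code from this PR: the toy data are invented, the `temperature_mean_daily` column name is an assumption, and the sketch simply drops every row whose start and end dates differ, leaving out the single-observation exception and the statcode handling.

```r
library(dplyr)

# Toy observations; column names follow WQP conventions, values are invented.
obs <- data.frame(
  MonitoringLocationIdentifier = c("A", "A", "A", "B", "C"),
  ActivityStartDate = as.Date(c("2020-07-01", "2020-07-01", "2020-07-01",
                                "2020-07-01", "2020-07-01")),
  ActivityEndDate = as.Date(c("2020-07-01", "2020-07-01", "2020-07-01",
                              "2020-07-01", "2020-09-30")),
  ResultMeasureValue = c(18.2, 20.1, 19.0, 15.5, 17.0),
  stringsAsFactors = FALSE
)

daily <- obs %>%
  # Simplification: drop any row whose start and end dates disagree.
  filter(ActivityStartDate == ActivityEndDate) %>%
  group_by(MonitoringLocationIdentifier, ActivityStartDate) %>%
  summarize(
    n_obs = n(),
    temperature_mean_daily = mean(ResultMeasureValue),
    # Only report a daily min and max when more than one observation exists.
    temperature_min_daily = if (n() > 1) min(ResultMeasureValue) else NA_real_,
    temperature_max_daily = if (n() > 1) max(ResultMeasureValue) else NA_real_,
    .groups = "drop"
  )
```

In this toy case site C is dropped because its end date differs from its start date, and site B keeps a mean but gets `NA` for the daily min and max.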
@padilla410
Collaborator

@lekoenig, FYI

@limnoliver limnoliver requested a review from lekoenig March 16, 2023 19:08
@limnoliver
Member Author

@lekoenig, FYI

@lekoenig -- are you interested in reviewing these changes? Thought that would make sense since you might look at them anyway :). No rush, and let me know if I should pass along to someone else.

@lekoenig
Collaborator

@lekoenig -- are you interested in reviewing these changes?

Sure, I can definitely take a look. Does a review early next week work for you, @limnoliver?

@limnoliver
Member Author

@lekoenig -- are you interested in reviewing these changes?

Sure, I can definitely take a look. Does a review early next week work for you, @limnoliver?

Yep, sure does. Thanks Lauren.

Collaborator

@lekoenig lekoenig left a comment

This looks good to me, @limnoliver! I reviewed the code but did not run it myself or inspect the output since I do not have access to the shared data cache for this repo. I've added a few questions - mostly for my own reference. I made one suggestion but it is very minor and shouldn't change the output so I've gone ahead and approved this PR.

For reference, do you mind adding somewhere in this PR what sort of numbers you're seeing as a result of these code changes (percent or number of records dropped at the national scale)?

1_wqp_pull/src/get_wqp_data.R (resolved)
5_data_munge/src/munge_wqp_files.R (resolved)
temperature_min_daily = min(ResultMeasureValue),
temperature_max_daily = max(ResultMeasureValue),
n_obs = n(),
dat_reduced_statcode <- ungroup(dat_reduced) %>%
Collaborator

What is the grouping structure for dat_reduced? You could consider moving the ungroup() step to line 95 to avoid repeating that step in line 96 and then again in line 109.
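
A minimal sketch of the suggestion above, assuming dplyr. The toy data frame and the `step_one`, `step_two`, and `value_doubled` names are hypothetical stand-ins, not taken from `munge_wqp_files.R`. It shows how to inspect the grouping with `group_vars()` and how ungrouping once lets later steps share a flat copy.

```r
library(dplyr)

# Hypothetical stand-in for dat_reduced; the real grouping columns may differ.
dat_reduced <- data.frame(
  site = c("A", "A", "B"),
  date = as.Date(c("2020-07-01", "2020-07-02", "2020-07-01")),
  value = c(1, 2, 3)
) %>%
  group_by(site, date)

# Inspect the grouping structure before deciding where to ungroup.
group_vars(dat_reduced)

# Ungroup once, then reuse the flat data frame in each later step.
dat_flat <- ungroup(dat_reduced)

step_one <- dat_flat %>% filter(value > 1)
step_two <- dat_flat %>% mutate(value_doubled = value * 2)
```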

5_data_munge/src/munge_wqp_files.R (resolved)

resolve_statcodes <- function(in_ind, out_ind) {
dat <- readRDS(sc_retrieve(in_ind, remake_file = 'getters.yml')) %>%
ungroup() %>%
Collaborator

Similar to my comment above, I wonder about ungrouping the data frame earlier on in the pipeline. Is it expected that '5_data_munge/out/wqp_data_streams.rds.ind' has some grouping structure that should be maintained?

This is a minor formatting suggestion and I don't expect it to change the output of the pipeline, so feel free to take or leave this suggestion.
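
One way to answer the question about grouping in a saved file is to check the object after reading it back. The sketch below is an assumption-laden illustration: it uses a temporary file and a toy data frame rather than the real `wqp_data_streams.rds`, and relies only on `saveRDS()`, `readRDS()`, and dplyr's `is_grouped_df()`.

```r
library(dplyr)

# Toy grouped data frame written to a temporary .rds file.
grouped <- data.frame(site = c("A", "A", "B"), value = c(1, 2, 3)) %>%
  group_by(site)

tmp <- tempfile(fileext = ".rds")
saveRDS(grouped, tmp)

# The grouping survives the round trip, so it can be checked on read.
restored <- readRDS(tmp)
is_grouped_df(restored)
group_vars(restored)

# If the grouping is not needed downstream, drop it right after reading.
restored_flat <- ungroup(restored)
```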

5_data_munge/src/munge_wqp_files.R (resolved)
message(paste(nrow_o - nrow(dat), 'observations were dropped due to estimation, blank correction, or statcode that was not mean, min, max'))
# for some data, the start and end dates are different, and data providers
# seem to be using these as a date range of the whole dataset
# sometimes, the proper collection date is in the comment field
Collaborator

sometimes, the proper collection date is in the comment field

😢

`ActivityStartTime/Time` = newActivityStartTime)

# print message about date recoveries
message(paste(nrow(range_dates), 'observations with mismatching start/end dates were recovered by extracting collection dates from comments'))
Collaborator

Nice, way to recover this info!

Member Author

For posterity: 2115448 observations with mismatching start/end dates were recovered by extracting collection dates from comments

Collaborator

Wow, that number is higher than I was expecting!
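
For readers curious how a recovery like the one discussed in this thread might work, here is a rough base R sketch. It is not the PR's implementation: the `comment` and `collection_date` column names, the `mm/dd/yyyy` date format, and the toy rows are all assumptions made for illustration.

```r
# Toy records; `comment` is a hypothetical column standing in for a WQP comment field.
obs <- data.frame(
  ActivityStartDate = as.Date(c("2010-01-01", "2010-01-01", "2015-06-03")),
  ActivityEndDate = as.Date(c("2010-12-31", "2010-12-31", "2015-06-03")),
  comment = c("Collected 07/14/2010 at bridge", "no date given", ""),
  stringsAsFactors = FALSE
)

# Assumed date format in comments: month/day/four-digit year.
date_pattern <- "[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}"
found <- regexpr(date_pattern, obs$comment)

# Parse the first matching date in each comment; NA where nothing matches.
comment_date <- rep(as.Date(NA), nrow(obs))
comment_date[found > 0] <- as.Date(regmatches(obs$comment, found),
                                   format = "%m/%d/%Y")

mismatched <- obs$ActivityStartDate != obs$ActivityEndDate
recoverable <- mismatched & !is.na(comment_date)

# Use the comment date where start and end dates disagree; drop unrecoverable rows.
obs$collection_date <- obs$ActivityStartDate
obs$collection_date[recoverable] <- comment_date[recoverable]
obs <- obs[!mismatched | recoverable, ]

message(paste(sum(recoverable), "observations recovered from comment dates"))
```

Here the first row is recovered from its comment, the second is dropped because no date can be found, and the third is kept unchanged because its start and end dates already agree.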

5_data_munge/src/munge_wqp_files.R (resolved)
…have to read in big file twice. Not sure why indicators rebuilt, but confident there weren't downstream changes because I did an scmake on the final target in 5_munge and it did not rebuild.
@limnoliver limnoliver merged commit 21528cb into USGS-R:main Apr 25, 2023
@limnoliver limnoliver deleted the wqp_fixes branch April 25, 2023 18:05