Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ebirdchecklist() #108

Merged
merged 12 commits into from
Mar 23, 2024
Merged

Conversation

RichardLitt
Copy link
Contributor

Unfortunately deleted #107 by accident misusing force push, while trying to squash. This is another attempt at fixing #93.

@RichardLitt RichardLitt mentioned this pull request Feb 2, 2024
@slager
Copy link
Collaborator

slager commented Feb 2, 2024

Thanks for the work on this, @RichardLitt. Yeah I think force pushing can mess up PRs since GitHub may be trying to reference specific previous commits internally in PRs, and force pushing removes those.

If you see R CMD check errors in the CI jobs you can often also troubleshoot these locally with devtools::check()

@RichardLitt
Copy link
Contributor Author

I think I got it. I didn't know I needed to run devtools::document() as part of the build.

@RichardLitt
Copy link
Contributor Author

Let me know if anything needs editing.

@slager
Copy link
Collaborator

slager commented Feb 3, 2024

I took this for a spin locally and it's a great start, I have a few suggestions

  1. In the tests, expect_error(ebirdchecklist(invalid_checklist_id)) fails because this does not actually throw an error. Instead, it returns a data.frame containing the error description. There should be some code that checks the JSON to see if it is an error result, and throws an actual error there (e.g. with stop()).
  2. There should be some discussion with @sebpardo on the structure of the result, since the API returned object is rather complex. Currently it's returned from ebirdchecklist() as a nested tibble, where things like obs, obsAux, and subAux are returned nested. Is that what we want and is it consistent with the rest of the package outputs? Should these all be unnested into a wide data.frame? Or should the result be unnested and returned as a list containing a checklist-level table and an observation-level table? The latter might be the most tidy. Additional tests are probably needed to check the output structure once that's decided.
  3. The documentation doesn't include any description of what the function returns. Once item number 2 is addressed, this could be fleshed out, see other functions in the package for examples.
  4. @sebpardo?

@RichardLitt
Copy link
Contributor Author

Let me know what you think makes sense for the output. Happy to change as needed.

@RichardLitt
Copy link
Contributor Author

Small note: @sebpardo, I'm waiting on you for thoughts on this before moving forward. Let me know what you think!

@sebpardo
Copy link
Contributor

Hi @RichardLitt, apologies for the delay in looking into this. I just had played around with it locally and it looks great!

I agree with @slager that we need to think about how to return the output of this API call as the call itself returns a complex object. Saying this, I think it can be easily flattened into a data frame even in the special case of a checklist with breeding codes and/or attached media. See the following example:

library(dplyr)
cl <- ebirdchecklist("S101576784")
glimpse(cl)
## Rows: 17
## Columns: 22
## $ projId                      <chr> "EBIRD", "EBIRD", "EBIRD", "EBIRD", "EBIRD", "E…
## $ subId                       <chr> "S101576784", "S101576784", "S101576784", "S101…
## $ protocolId                  <chr> "P22", "P22", "P22", "P22", "P22", "P22", "P22"…
## $ locId                       <chr> "L17570646", "L17570646", "L17570646", "L175706…
## $ durationHrs                 <dbl> 3.367, 3.367, 3.367, 3.367, 3.367, 3.367, 3.367…
## $ allObsReported              <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
## $ creationDt                  <chr> "2022-01-28 10:01", "2022-01-28 10:01", "2022-0…
## $ lastEditedDt                <chr> "2023-03-17 15:49", "2023-03-17 15:49", "2023-0…
## $ obsDt                       <chr> "2022-01-23 17:02", "2022-01-23 17:02", "2022-0…
## $ obsTimeValid                <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
## $ checklistId                 <chr> "CL30239", "CL30239", "CL30239", "CL30239", "CL…
## $ numObservers                <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ effortDistanceKm            <dbl> 1.9, 1.9, 1.9, 1.9, 1.9, 1.9, 1.9, 1.9, 1.9, 1.…
## $ effortDistanceEnteredUnit   <chr> "km", "km", "km", "km", "km", "km", "km", "km",…
## $ subnational1Code            <chr> "CL-VS", "CL-VS", "CL-VS", "CL-VS", "CL-VS", "C…
## $ submissionMethodCode        <chr> "EBIRD_android", "EBIRD_android", "EBIRD_androi…
## $ submissionMethodVersion     <chr> "2.8.8_SDK28", "2.8.8_SDK28", "2.8.8_SDK28", "2…
## $ userDisplayName             <chr> "Sebastián Pardo", "Sebastián Pardo", "Sebastiá…
## $ numSpecies                  <int> 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17,…
## $ submissionMethodVersionDisp <chr> "2.8.8", "2.8.8", "2.8.8", "2.8.8", "2.8.8", "2…
## $ subAux                      <df[,4]> <data.frame[17 x 4]>
## $ obs                         <df[,13]> <data.frame[17 x 13]>
glimpse(cl$subAux)
## Rows: 17
## Columns: 4
##  $ subId           <chr> "S101576784", "S101576784", "S101576784", "S101576784", "S1…
##  $ fieldName       <chr> "nocturnal", "nocturnal", "nocturnal", "nocturnal", "noctur…
##  $ entryMethodCode <chr> "ebird_nocturnal", "ebird_nocturnal", "ebird_nocturnal", "e…
##  $ auxCode         <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0",…
glimpse(cl$obs)
## Rows: 17
## Columns: 13
##  $ speciesCode      <chr> "ameoys", "whimbr", "kelgul", "perboo1", "neocor", "perpel…
##  $ hideFlags        <list> [], [], [], [], [], [], [], [], [], [], [], [], [], [], […
##  $ obsDt            <chr> "2022-01-23 17:02", "2022-01-23 17:02", "2022-01-23 17:02"…
##  $ subnational1Code <chr> "CL-VS", "CL-VS", "CL-VS", "CL-VS", "CL-VS", "CL-VS", "CL-…
##  $ howManyAtleast   <int> 2, 1, 20, 50, 3, 25, 2, 2, 1, 1, 2, 2, 2, 2, 1, 3, 4
##  $ howManyAtmost    <int> 2, 1, 20, 50, 3, 25, 2, 2, 1, 1, 2, 2, 2, 2, 1, 3, 4
##  $ subId            <chr> "S101576784", "S101576784", "S101576784", "S101576784", "S…
##  $ projId           <chr> "EBIRD", "EBIRD", "EBIRD", "EBIRD", "EBIRD", "EBIRD", "EBI…
##  $ obsId            <chr> "OBS1328796217", "OBS1328796219", "OBS1328796215", "OBS132…
##  $ howManyStr       <chr> "2", "1", "20", "50", "3", "25", "2", "2", "1", "1", "2", …
##  $ present          <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
##  $ mediaCounts      <df[,1]> <data.frame[17 x 1]>
##  $ obsAux           <list> <NULL>, <NULL>, <NULL>, [<data.frame[1 x 7]>], <NULL>, <NU…
glimpse(cl$obs$obsAux)
## List of 17
##  $ : NULL
##  $ : NULL
##  $ : NULL
##  $ :'data.frame':	1 obs. of  7 variables:
##   ..$ subId          : chr "S101576784"
##   ..$ fieldName      : chr "breeding_code"
##   ..$ entryMethodCode: chr "api_breeding_code"
##   ..$ auxCode        : chr "FL"
##   ..$ obsId          : chr "OBS1328796224"
##   ..$ speciesCode    : chr "perboo1"
##   ..$ value          : chr "FL"
##  $ : NULL
##  $ : NULL
##  $ : NULL
##  $ : NULL
##  $ : NULL
##  $ : NULL
##  $ : NULL
##  $ : NULL
##  $ : NULL
##  $ : NULL
##  $ : NULL
##  $ : NULL
##  $ : NULL
glimpse(cl$obs$mediaCounts)
## Rows: 17
## Columns: 1
## $ P <int> NA, NA, NA, 2, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA

The nested data frames .$subAux and .$obs can be easily joined to the main data frame as they have the same number of rows (Hopefully this is always the case!). Also note that we can simplify what we return and remove several columns as they are either duplicates or obsolete according to eBird's API documentation on this request:

In the fields for each observation, the following fields are duplicates or obsolete and will be removed at a future date: howManyAtleast, howManyAtmost, hideFlags, projId, subId, subnational1Code and present.

I also think we should ignore the columns beginning with submissionMethod* given they're about the observers' hardware instead of birds, and perhaps everything in .$subAux as well given it's not very informative and potentially wrong (I submitted that checklist today in the day but that output says it's a nocturnal checklist...).

With regards to the breeding code, I think the easiest solution would be to have an argument in the function along the lines of breeding_codes = TRUE, for which extra columns are returned with some the output of .$obs$obsAux, perhaps a column names breeding_code with a value equal to .$obs$obsAux$value? (and returning NAs for the species without that data). We could do the same with .$obs$mediaCounts, although I wouldn't be opposed to ignoring this output.

Haven't looked deep into the tests yet, but I tried entering two checklist identifiers and the function should throw an error with a more informative message:

cl2 <- ebirdchecklist(c("S162668032", "S162668033"))
### Error in parse_url(url) : length(url) == 1 is not TRUE

Lastly, I also agree with @slager in that the documentation needs to provide information on every column being returned.

Despite my long response, I don't think it should be too onerous to sort this out! @RichardLitt, let us know if you'd like any help with these suggestions.

@slager
Copy link
Collaborator

slager commented Feb 25, 2024

Here's an attempt to wrangle the returned object into a clean output data.frame:

cl <- ebirdchecklist("S101576784")

# extract sub df
col_is_df <- vapply(cl, is.data.frame, TRUE)
sub_df <- cl[1, !col_is_df]

# extract subAux df
subAux_df <- cl$subAux[1,]
# seems empty, and names conflict with breeding codes
subAux_df$auxCode <- NULL
subAux_df$entryMethodCode <- NULL

# extract obsAux df
obsAux_list <- cl$obs$obsAux
# find the list entry that contains the data
col_is_df <- vapply(obsAux_list, is.data.frame, TRUE)
obsAux_df <- obsAux_list[[which(col_is_df)]]
# redundant columns from sub_df
obsAux_df$subId <- NULL
obsAux_df$speciesCode <- NULL
# duplicate info with uninformative name
obsAux_df$value <- NULL
# names conflict with sub_df, and not very important
obsAux_df$fieldName <- NULL
obsAux_df$entryMethodCode <- NULL

# extract obs df
obs_df <- cl$obs
obs_df$obsAux <- NULL
# hideFlags might be useful, but its structure is currently undocumented
obs_df$hideFlags <- NULL
# remove redundant sub-level columns already in sub_df
obs_df$subnational1Code <- NULL
obs_df$obsDt <- NULL
obs_df$projId <- NULL
# mediaCounts appears to just be a nested integer vector (?)
obs_df$mediaCounts <- Reduce(c, obs_df$mediaCounts)

# join to get result df
sub_df %>%
  dplyr::left_join(subAux_df, by = 'subId') %>%
  dplyr::left_join(obs_df, by = 'subId') %>%
  dplyr::left_join(obsAux_df, by = 'obsId')


If something like this looks like a good start, I'm happy to put in a PR to ropensci/rebird once Richard's PR gets merged.

Note @RichardLitt that @sebpardo bumped the version again on master, so you'll want to merge recent changes to master back into your development branch.

@RichardLitt
Copy link
Contributor Author

If something like this looks like a good start, I'm happy to put in a PR to ropensci/rebird once Richard's PR gets merged.

Thanks, @slager. I think you'd still like me to make the above changes that @sebpardo suggested though, correct?

Thanks, @sebpardo, for the long response. One thing:

I also think we should ignore the columns beginning with submissionMethod* given they're about the observers' hardware instead of birds, and perhaps everything in .$subAux as well given it's not very informative and potentially wrong (I submitted that checklist today in the day but that output says it's a nocturnal checklist...).

I'm not sure it's a good practice to remove anything we get from the API in our wrapper. I think we should include it, even if it's potentially wrong, because it's what is gotten from the API and it'd be confusing to users to have that information not be there.

I'm happy to implement these changes, but I won't get to it this week, at least.

@slager
Copy link
Collaborator

slager commented Feb 27, 2024

@RichardLitt Yes, re: implementing the suggestions from @sebpardo

@sebpardo sebpardo mentioned this pull request Mar 6, 2024
@RichardLitt
Copy link
Contributor Author

I think I solved the issue with #108 (comment) problem 1. @sebpardo, it wasn't clear to me if you wanted to implement all of the changes mentioned in 2, as you started to in #108 (comment). What do you think?

@slager
Copy link
Collaborator

slager commented Mar 21, 2024

Great! I will try to find some time to take another look at this soon.

@slager
Copy link
Collaborator

slager commented Mar 23, 2024

Thanks much @RichardLitt, merging! Feel free to test and tweak this further, @sebpardo.

@slager slager merged commit bb965ab into ropensci:master Mar 23, 2024
8 checks passed
@sebpardo
Copy link
Contributor

Thank you, @RichardLitt, looks great!

@RichardLitt
Copy link
Contributor Author

Thank you, @sebpardo and @slager! I appreciate the extra effort as I waded through this.

We're using this endpoint as eBird admins to find checklists that slip past the normal filters, but which shouldn't be in the database due a variety of errors. Hopefully this addition will help us find those easier. Thank you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants