auk: eBird Data Extraction with AWK #136

Closed
11 of 14 tasks
mstrimas opened this issue Jul 19, 2017 · 56 comments

Comments

@mstrimas

mstrimas commented Jul 19, 2017

Summary

  • What does this package do? (explain in 50 words or less):

Access to the eBird database, consisting of over 400 million observations, is provided via a huge (>150 GB) text file. The auk package extracts records from this file and imports them into R for analysis. Both presence only and presence/absence data can be generated.

  • Paste the full DESCRIPTION file inside a code block below:
Package: auk
Title: eBird Data Extraction with AWK
Version: 0.0.2.900
Date: 2017-07-05
Authors@R: c(
  person("Matthew", "Strimas-Mackey", email = "[email protected]", role = c("aut", "cre")),
  person("Eliot", "Miller", role = "aut"),
  person("Wesley", "Hochachka", role = "aut"),
  person("Cornell Lab of Ornithology", role = "cph")
  )
URL: https://github.com/CornellLabofOrnithology/auk, http://CornellLabofOrnithology.github.io/auk/
BugReports: https://github.com/CornellLabofOrnithology/auk/issues
Description: Extract and process bird sightings records from eBird 
    (<http://ebird.org>), an online tool for recording bird observations. 
    Public access to the full eBird database is via the eBird Basic Dataset 
    (EBD; see <http://ebird.org/ebird/data/download> for access), a downloadable 
    text file. This package is an interface to AWK for extracting data from the 
    EBD based on taxonomic, spatial, or temporal filters, to produce a 
    manageable file size that can be imported into R.
Depends: R (>= 3.1.0)
License: GPL-3
Encoding: UTF-8
LazyData: true
Imports:
    assertthat,
    stringr,
    stringi,
    magrittr,
    countrycode,
    tidyr
RoxygenNote: 6.0.1
Roxygen: list(markdown = TRUE)
Suggests:
    readr,
    data.table,
    knitr,
    rmarkdown,
    testthat,
    covr
VignetteBuilder: knitr

This package falls somewhere at the intersection of data retrieval and extraction. It provides access to the eBird database; however, it does so by processing a text file downloaded from eBird that contains the full database.

  • Who is the target audience?

Anyone looking to work with eBird data for science or conservation.

rebird provides access to eBird data via the eBird API; however, this only gives access to the last 30 days of data. This package is the only one giving access to the full eBird database.

Requirements

Confirm each of the following by checking the box. This package:

  • does not violate the Terms of Service of any service it interacts with.
  • has a CRAN and OSI accepted license.
  • contains a README with instructions for installing the development version.
  • includes documentation with examples for all functions.
  • contains a vignette with examples of its essential functions and uses.
  • has a test suite.
  • has continuous integration, including reporting of test coverage, using services such as Travis CI, Coveralls and/or CodeCov.
  • I agree to abide by rOpenSci's Code of Conduct during the review process and in maintaining my package should it be accepted.

Publication options

  • Do you intend for this package to go on CRAN?
  • Do you wish to automatically submit to the Journal of Open Source Software? If so:
    • The package contains a paper.md with a high-level description in the package root or in inst/.
    • The package is deposited in a long-term repository with the DOI:
    • (Do not submit your package separately to JOSS)

Detail

  • Does R CMD check (or devtools::check()) succeed? Paste and describe any errors or warnings:

  • Does the package conform to rOpenSci packaging guidelines? Please describe any exceptions:

  • If this is a resubmission following rejection, please explain the change in circumstances:

  • If possible, please provide recommendations of reviewers - those with experience with similar packages and/or likely users of your package - and their GitHub user names:

@karthik
Member

karthik commented Jul 28, 2017

Hi @mstrimas,
Thank you for the submission. Sorry for the delay, but I am doing some initial editorial checks before locating suitable reviewers.

Editor checks:

  • Fit: The package meets criteria for fit and overlap
  • Automated tests: Package has a testing suite and is tested via Travis-CI or another CI service.
  • License: The package has a CRAN or OSI accepted license
  • Repository: The repository link resolves correctly
  • Archive (JOSS only, may be post-review): The repository DOI resolves correctly
  • Version (JOSS only, may be post-review): Does the release version given match the GitHub release (v1.0.0)?

Here are some notes from a goodpractice::gp() check (Please do run it yourself as you fix these issues or explain why you are unable to fix them).

It is good practice to

  ✖ write unit tests for all functions, and all package code
    in general. 80% of code lines are covered by test cases.

    R/auk-clean.r:50:NA
    R/auk-clean.r:51:NA
    R/auk-clean.r:52:NA
    R/auk-clean.r:53:NA
    R/auk-clean.r:54:NA
    ... and 161 more lines

  ✖ omit "Date" in DESCRIPTION. It is not required and it
    gets invalid quite often. A build date will be added to the package
    when you perform `R CMD build` on it.
  ✖ use '<-' for assignment instead of '='. '<-' is the
    standard, and R users and developers are used it and it is easier
    to read your code for them if you use '<-'.

    R/utils.r:20:13
    R/utils.r:74:39
    R/utils.r:75:36
    R/utils.r:77:15
    R/utils.r:83:39

  ✖ fix this R CMD check WARNING: LaTeX errors when creating
    PDF version. This typically indicates Rd problems.
  ✖ fix this R CMD check ERROR: Re-running with no
    redirection of stdout/stderr. Hmm ... looks like a package You may
    want to clean up by 'rm -rf /tmp/Rtmp6kBa0r/Rd2pdf2a896cf4bd4a'
──────────────────────────────────────────────────────────────────────────────── 
Warning messages:
1: In readLines(filename) :
  incomplete final line found on '/root/foo/auk/R/auk-rollup.r'
2: In readLines(filename) :
  incomplete final line found on '/root/foo/auk/R/read.r'
3: In readLines(filename) :
  incomplete final line found on '/root/foo/auk/tests/testthat/test_auk-rollup.r'
4: In readLines(filename) :
  incomplete final line found on '/root/foo/auk/tests/testthat/test_ebird-species.r'

@mstrimas
Author

mstrimas commented Jul 31, 2017

Thanks, @karthik, I just fixed several of these, but the following remain:

  ✖ write unit tests for all functions, and all package code in general. 80% of code lines are covered
    by test cases.

    R/auk-clean.r:50:NA
    R/auk-clean.r:51:NA
    R/auk-clean.r:52:NA
    R/auk-clean.r:53:NA
    R/auk-clean.r:54:NA
    ... and 161 more lines

  ✖ fix this R CMD check WARNING: LaTeX errors when creating PDF version. This typically indicates Rd
    problems.
  ✖ fix this R CMD check ERROR: Re-running with no redirection of stdout/stderr. Hmm ... looks like a
    package You may want to clean up by 'rm -rf
    /var/folders/mg/qh40qmqd7376xn8qxd6hm5lwjyy0h2/T//RtmpsPAVpf/Rd2pdf5dcb4abe733d'
  • Test coverage: not sure where the 80% number comes from, I have 93% coverage on codecov.io, which I believe is good relative to other rOpenSci packages
  • For the R CMD check warning and error: these don't arise when I use devtools::check() or when I submit to CRAN, and I haven't had any luck figuring out what the issue is or how to resolve it

@karthik
Member

karthik commented Aug 3, 2017

@mstrimas No worries. Thank you for fixing the warnings. Regarding those warnings, have you tried adding a blank line at the end of those files? That should make the warnings go away.
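
The "incomplete final line" warnings come from readLines() reading a file whose last line has no newline terminator. A minimal base R sketch of detecting and repairing this follows; the helper name `ensure_final_newline` is made up for illustration and is not part of auk or goodpractice.

```r
# Hypothetical helper (not part of auk): append a newline to a file
# whose final line is unterminated. Returns TRUE if the file was changed.
ensure_final_newline <- function(path) {
  size <- file.size(path)
  if (is.na(size) || size == 0) {
    return(invisible(FALSE))
  }
  # Source files are small, so reading every byte is acceptable here
  bytes <- readBin(path, what = "raw", n = size)
  if (identical(bytes[length(bytes)], as.raw(10))) {
    return(invisible(FALSE))  # last byte is already "\n"
  }
  cat("\n", file = path, append = TRUE)
  invisible(TRUE)
}

# Demonstration on a temporary file written without a trailing newline
demo_file <- tempfile(fileext = ".R")
cat("x <- 1", file = demo_file)
fixed <- ensure_final_newline(demo_file)        # a newline is appended
fixed_again <- ensure_final_newline(demo_file)  # nothing left to do
```

Most editors can also be configured to end every file with a newline on save, which avoids the problem at the source.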

@karthik
Member

karthik commented Aug 3, 2017

Reviewer 1 is @aurielfournier
Review due: August 23 (Auriel noted that she might need an additional week due to travel)

@karthik
Member

karthik commented Aug 6, 2017

Reviewer 2 is @emhart
Review due: August 27

@aurielfournier

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README
  • Installation instructions: for the development version of package and any non-standard dependencies in README
  • Vignette(s) demonstrating major functionality that runs successfully locally
  • Function Documentation: for all exported functions in R help
  • Examples for all exported functions in R Help that run successfully locally
  • Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software have been confirmed.
  • Performance: Any performance claims of the software have been confirmed.
  • Automated tests: Unit tests cover essential functions of the package
    and a reasonable range of inputs and conditions. All tests pass on the local machine.

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 6 (this is my first package review, so I spent more time than I suspect I will on future reviews)


Reviewer Comments

auk does a great job of removing much of the pain and frustration of working with raw eBird data, which has been a limiting factor for many who want to take advantage of the vast data resources available through eBird. While there are other eBird packages, this is the only one I am aware of that allows you to work with the raw data downloaded from Cornell, as opposed to working with the summary data that can be gleaned from the eBird website.

The package is solid, though since I am not well versed in AWK by any means I'm unable to comment on those fine details.

The vignette is quite extensive, which is fantastic!

I'm reading this vignette thinking about 'the average eBird data user', who isn't necessarily someone with an extensive R background. While this is a very detailed vignette, and I think the detail is good and important, it might be better if it were rearranged so that the heavy technical detail comes towards the end and the 'how to use this package' material is more up front. Heavy users will keep reading, but less experienced users may get overwhelmed by details that are not essential to using the package.

My biggest suggestion would be to remove unlink() from all the function help examples and the vignette. If the user just runs the whole chunk at once, like I did the first several times, then the output file isn't there, since R just created it and then deleted it. I understand why you have it there, to avoid having lots of files in your own directory, but I think keeping unlink() will create more issues than it solves, especially for less experienced R users.
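
The effect is easy to reproduce in base R. The sketch below uses a temporary file as a stand-in for the filtered output; the file contents are made up for illustration.

```r
# A temporary file standing in for the filtered output file
out_file <- tempfile(fileext = ".txt")
writeLines(c("species\tcount", "Gray Jay\t2"), out_file)
exists_before <- file.exists(out_file)

# Running the whole chunk executes this line too, so the file is
# gone before the user can inspect it
unlink(out_file)
exists_after <- file.exists(out_file)
```

Writing example output under tempfile() keeps the working directory clean without unlink(), because R removes the session's temporary directory when the session ends.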

I would encourage you to avoid abbreviations since most people aren't going to read the vignette word for word, and consider not using EBD, and just saying 'basic dataset' or something along those lines instead. It will be much more readable/skim-able this way.

Throughout the vignette and function documentation you use pipes, which is great, I like pipes, but lots of people don't. In some cases because they don't like them and in others because they find them confusing. I think it would be valuable to also include examples of how the functions would be used without pipes in the vignette and in the function specific help files.
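
To illustrate the kind of side-by-side example being asked for, here is a small sketch on a made-up data frame rather than on auk functions. It assumes R >= 4.1 for the native `|>` pipe; the magrittr `%>%` pipe used in the package documentation reads the same way for a chain like this.

```r
# Made-up observations for illustration only
obs <- data.frame(
  species = c("Gray Jay", "Blue Jay", "Gray Jay"),
  count = c(2, 1, 5)
)

# With pipes: each step feeds the next
total_piped <- obs |>
  subset(species == "Gray Jay") |>
  with(sum(count))

# Without pipes: the same steps with a named intermediate object
gray_jays <- subset(obs, species == "Gray Jay")
total_unpiped <- sum(gray_jays$count)
```

Both versions give the same total; the pipe-free form trades a little verbosity for intermediate objects that can be inspected one step at a time.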

Since it is not good practice to overwrite an object by reusing the same name (ebd), I would suggest editing your examples to avoid this, as it could cause issues for people running the examples piecemeal and not following every step.

Function specific feedback

The function help examples in the different filter functions don't include auk_filter at the end of the pipeline. I think it would make sense to include auk_filter in all the examples since you mention in the function description that you need to include auk_filter to finish the process. That way the example is demonstrating the function within its full context.

Build/Install

I don't check build/installation on things very often. So this is not going to be the high point of my review. devtools::check() returned the following. If I am understanding this correctly there aren't any major issues on my machine.

Updating auk documentation
Loading auk
Setting env vars -----------------------------------------
CFLAGS  : -Wall -pedantic
CXXFLAGS: -Wall -pedantic
Building auk ---------------------------------------------
"C:/PROGRA~1/R/R-34~1.1/bin/x64/R" --no-site-file  \
  --no-environ --no-save --no-restore --quiet CMD build  \
  "C:\Users\amf698\Documents\R\win-library\3.4\auk"  \
  --no-resave-data --no-manual 

* checking for file 'C:\Users\amf698\Documents\R\win-library\3.4\auk/DESCRIPTION' ... OK
* preparing 'auk':
* checking DESCRIPTION meta-information ... OK
* checking whether 'INDEX' is up-to-date ... NO
* use '--force' to remove the existing 'INDEX'
* excluding invalid files
Subdirectory 'R' contains invalid file names:
  'auk' 'auk.rdb' 'auk.rdx'
* checking for LF line-endings in source and make files
* checking for empty or unneeded directories
Removed empty directory 'auk/R'
Removed empty directory 'auk/man'
WARNING: Removing directory 'auk/Meta' which should only occur in an
  installed package
WARNING: Removing directory 'auk/help' which should only occur in an
  installed package
WARNING: Removing directory 'auk/html' which should only occur in an
  installed package
* looking to see if a 'data/datalist' file should be added
* building 'auk_0.0.2.tar.gz'

Setting env vars -----------------------------------------
_R_CHECK_CRAN_INCOMING_ : FALSE
_R_CHECK_FORCE_SUGGESTS_: FALSE
Checking auk ---------------------------------------------
"C:/PROGRA~1/R/R-34~1.1/bin/x64/R" --no-site-file  \
  --no-environ --no-save --no-restore --quiet CMD check  \
  "C:\Users\amf698\AppData\Local\Temp\RtmpOCSZT8/auk_0.0.2.tar.gz"  \
  --as-cran --timings --no-manual 

* using log directory 'C:/Users/amf698/AppData/Local/Temp/RtmpOCSZT8/auk.Rcheck'
* using R version 3.4.1 (2017-06-30)
* using platform: x86_64-w64-mingw32 (64-bit)
* using session charset: ISO8859-1
* using options '--no-manual --as-cran'
* checking for file 'auk/DESCRIPTION' ... OK
* this is package 'auk' version '0.0.2'
* package encoding: UTF-8
* checking package namespace information ... OK
* checking package dependencies ... NOTE
Package suggested but not available for checking: 'covr'
* checking if this is a source package ... ERROR
Only *source* packages can be checked.
* DONE

Status: 1 ERROR, 1 NOTE
See
  'C:/Users/amf698/AppData/Local/Temp/RtmpOCSZT8/auk.Rcheck/00check.log'
for details.

R CMD check results
1 error  | 0 warnings | 1 note 
checking if this is a source package ... ERROR
Only *source* packages can be checked.

checking package dependencies ... NOTE
Package suggested but not available for checking: 'covr'

@emhart

emhart commented Aug 24, 2017

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

  • As the reviewer I confirm that there are no conflicts of interest for me to review this work (If you are unsure whether you are in conflict, please speak to your editor before starting your review).

Documentation

The package includes all the following forms of documentation:

  • A statement of need clearly stating problems the software is designed to solve and its target audience in README

  • Installation instructions: for the development version of package and any non-standard dependencies in README

  • Vignette(s) demonstrating major functionality that runs successfully locally

  • Function Documentation: for all exported functions in R help

  • Examples for all exported functions in R Help that run successfully locally

  • Community guidelines including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software have been confirmed.
  • Performance: Any performance claims of the software have been confirmed.
  • Automated tests: Unit tests cover essential functions of the package
    and a reasonable range of inputs and conditions. All tests pass on the local machine.
  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines

Final approval (post-review)

  • The author has responded to my review and made changes to my satisfaction. I recommend approving this package.

Estimated hours spent reviewing: 5

Review Comments

The authors present a very elegant solution to a difficult problem in R, how to handle a very large data set such as the eBird data (larger than most people's personal computers could load into RAM) when most users only need a subset of the data in the full file. Their solution is to provide a way to automatically build an Awk script, execute it, and write a new output file. While this could be done without this R package, they make the data accessible to a much larger audience.

Overall I found the code to be very comprehensive and elegant and think it will be a good addition to the rOpenSci package suite. The authors largely adhere to the rOpenSci package guidelines and are exceedingly diligent in their error handling in each function. I was also impressed with the coverage of use cases in their tests. They went above and beyond in writing an exhaustive suite of tests for each function. I find no major issues with how their code is written.

I do think there are a couple of minor areas for improvement in making it easier for end users. The biggest issue I had was initially grokking that this is a multi-step workflow that involves writing a file to disk. My first impression was that I could simply run a bunch of filters and the ebd variable (in the README) would actually be a data frame. If there is a way to make the workflow more explicit, especially in the README, I think that would be helpful. Another thought: would it be possible to hide this multi-step process behind a single function that loads the ebd, runs the filters, writes the file, and reads it all into a tibble? That way the end user could sidestep ever running read_ebd(). Another minor issue I had was the somewhat mixed handling of what I think of as "user standards laziness". For instance, you insist on ISO date standards, but countries can be mixed case and don't require the ISO country code. I see how on the one hand you're making it easier for users, but I found myself a bit confused about where I could relax my standards when it came to input. I was honestly surprised that I could enter "gray jay" but not "Robin" (though "American Robin" worked fine).

Minor comments

  • Throughout your vignette and README you specify units for everything, but not extent. While I assume that it's decimal degrees, I think it would be good to be explicit.

  • You note that for this to work on Windows, users need Cygwin installed in a specific directory. I could foresee this being difficult; is there any way to specify the path to make it easier for Windows users?

  • Consider adding a CITATION file so your package can be cited.

  • Would it be too much to include a column filter option to your workflow? Maybe that's a 0.0.3 feature, but seems like it would be a nice addition

  • There were some instances where you used system variable names as local variable names in functions, e.g. line 36 in auk_time.r time <- paste0(ifelse(nchar(time) == 4, "0", ""), time). It obviously doesn't cause an issue as is, but maybe it could down the line.
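
For the last point, a minimal sketch of the same padding logic with a non-clashing variable name. The helper name `pad_time` and its argument name are hypothetical; `time` is also the name of a function in the stats package, which is presumably the clash being referred to.

```r
# Hypothetical helper: pad "H:MM" times to "HH:MM" without reusing the
# name `time`, which is also a function in the stats package
pad_time <- function(time_string) {
  paste0(ifelse(nchar(time_string) == 4, "0", ""), time_string)
}

padded <- pad_time(c("7:30", "17:45"))
```

Only four-character inputs gain a leading zero, so times already in "HH:MM" form pass through unchanged.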

Community guidelines

A CONTRIBUTING file, or a section on how to contribute in the README, is not present. Consider adding contributor guidelines.

Examples

I ran all examples using devtools::run_examples and all ran without error.

Tests

I ran all tests with devtools::test() and all tests were passed.

Checks

I built and checked the package on the following system using devtools::check(cran = TRUE):
R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

All checks passed with no notes, errors, or warnings.

Test coverage

I checked the amount of test coverage using covr::package_coverage() and it was 80.9%.

Furthermore, I reviewed all the tests in tests/testthat. Not only was there good test coverage, but the range of scenarios was exhaustive. I was very impressed with the breadth of cases tested.

sessionInfo()

Just so you can see what versions of packages I used to run my tests:

Session info ----------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.4.1 (2017-06-30)
 system   x86_64, darwin15.6.0        
 ui       RStudio (1.0.153)           
 language (EN)                        
 collate  en_US.UTF-8                 
 tz       America/Los_Angeles         
 date     2017-08-23                  

Packages --------------------------------------------------------------------------------------------
 package      * version   date       source                                   
 assertthat     0.2.0     2017-04-11 CRAN (R 3.4.1)                           
 auk          * 0.0.2.901 <NA>       local                                    
 backports      1.1.0     2017-05-22 CRAN (R 3.4.1)                           
 base         * 3.4.1     2017-07-07 local                                    
 bindr          0.1       2016-11-13 CRAN (R 3.4.1)                           
 bindrcpp     * 0.2       2017-06-17 CRAN (R 3.4.1)                           
 callr          1.0.0     2016-06-18 CRAN (R 3.4.0)                           
 clisymbols     1.2.0     2017-08-24 Github (gaborcsardi/clisymbols@e49b4f5)  
 commonmark     1.2       2017-03-01 CRAN (R 3.4.1)                           
 compiler       3.4.1     2017-07-07 local                                    
 countrycode    0.19      2017-02-06 CRAN (R 3.4.0)                           
 covr         * 3.0.0     2017-06-26 CRAN (R 3.4.1)                           
 crayon         1.3.2     2016-06-28 CRAN (R 3.4.1)                           
 cyclocomp      1.1.0     2017-08-24 Github (MangoTheCat/cyclocomp@6156a12)   
 data.table     1.10.4    2017-02-01 CRAN (R 3.4.0)                           
 datasets     * 3.4.1     2017-07-07 local                                    
 desc           1.1.1     2017-08-03 CRAN (R 3.4.1)                           
 devtools     * 1.13.3    2017-08-02 CRAN (R 3.4.1)                           
 digest         0.6.12    2017-01-27 CRAN (R 3.4.1)                           
 dplyr          0.7.2     2017-07-20 CRAN (R 3.4.1)                           
 evaluate       0.10.1    2017-06-24 CRAN (R 3.4.1)                           
 glue           1.1.1     2017-06-21 CRAN (R 3.4.1)                           
 goodpractice   1.0.0     2017-08-24 Github (MangoTheCat/goodpractice@9969799)
 graphics     * 3.4.1     2017-07-07 local                                    
 grDevices    * 3.4.1     2017-07-07 local                                    
 hms            0.3       2016-11-22 CRAN (R 3.4.0)                           
 httr           1.3.1     2017-08-20 CRAN (R 3.4.1)                           
 igraph         1.1.2     2017-07-21 CRAN (R 3.4.1)                           
 jsonlite       1.5       2017-06-01 CRAN (R 3.4.1)                           
 knitr          1.17      2017-08-10 CRAN (R 3.4.1)                           
 lazyeval       0.2.0     2016-06-12 CRAN (R 3.4.0)                           
 lintr          1.0.1     2017-08-10 CRAN (R 3.4.1)                           
 magrittr       1.5       2014-11-22 CRAN (R 3.4.1)                           
 memoise        1.1.0     2017-04-21 CRAN (R 3.4.1)                           
 methods      * 3.4.1     2017-07-07 local                                    
 pkgconfig      2.0.1     2017-03-21 CRAN (R 3.4.1)                           
 praise         1.0.0     2015-08-11 CRAN (R 3.4.0)                           
 purrr          0.2.3     2017-08-02 CRAN (R 3.4.1)                           
 R6             2.2.2     2017-06-17 CRAN (R 3.4.1)                           
 rcmdcheck      1.2.1     2016-09-28 CRAN (R 3.4.0)                           
 Rcpp           0.12.12   2017-07-15 CRAN (R 3.4.1)                           
 readr          1.1.1     2017-05-16 CRAN (R 3.4.0)                           
 remotes        1.1.0     2017-07-09 CRAN (R 3.4.1)                           
 rex            1.1.1     2016-12-05 CRAN (R 3.4.0)                           
 rlang          0.1.2     2017-08-09 CRAN (R 3.4.1)                           
 roxygen2       6.0.1     2017-02-06 CRAN (R 3.4.1)                           
 rprojroot      1.2       2017-01-16 CRAN (R 3.4.1)                           
 rstudioapi     0.6       2016-06-27 CRAN (R 3.4.1)                           
 stats        * 3.4.1     2017-07-07 local                                    
 stringi        1.1.5     2017-04-07 CRAN (R 3.4.1)                           
 stringr        1.2.0     2017-02-18 CRAN (R 3.4.1)                           
 testthat     * 1.0.2     2016-04-23 CRAN (R 3.4.0)                           
 tibble         1.3.4     2017-08-22 CRAN (R 3.4.1)                           
 tidyr          0.7.0     2017-08-16 CRAN (R 3.4.1)                           
 tools          3.4.1     2017-07-07 local                                    
 utils        * 3.4.1     2017-07-07 local                                    
 whoami         1.1.1     2015-07-13 CRAN (R 3.4.0)                           
 withr          2.0.0     2017-07-28 CRAN (R 3.4.1)                           
 xml2           1.1.1     2017-01-24 CRAN (R 3.4.1)                           
 xmlparsedata   1.0.1     2016-06-18 CRAN (R 3.4.0)      

@emhart

emhart commented Aug 24, 2017

@mstrimas As an aside from my review, I wanted to say that this is a really cool solution to a big problem in R that I often encounter in my work. I might have a dataset that's 20-30 GB and I don't want to actually crunch the whole thing in R. So I take a slightly more hacky approach, which is to do some filtering in the data export phase (in SQL), then use some basic shell commands to sample and trim it down further, and then read it into R to do things like model POC. Do you think there's a way to make this package completely generic?

I'm imagining a scenario where I input the file location, column header names, a series of generic filters, and then the same basic workflow happens, awk executes the script and then writes an output file. Then this same workflow could work on any large text file. It seems like that would be a really powerful tool that would extend the functionality of this approach beyond eBird. Do you think that would be feasible?
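
A rough sketch of what such a generic tool might look like, under stated assumptions: the input is tab-separated with a header row, filters are simple equality tests, and the function name `build_awk_filter` is invented here. It is not how auk builds its scripts, and it does not escape quotes inside filter values.

```r
# Hypothetical sketch (not auk's implementation): translate named
# equality filters into an awk program for a tab-separated file
build_awk_filter <- function(columns, filters) {
  stopifnot(all(names(filters) %in% columns))
  # awk refers to fields by position, so map column names to indices
  idx <- match(names(filters), columns)
  conditions <- sprintf("$%d == \"%s\"", idx, unlist(filters))
  # NR == 1 keeps the header row; the remaining conditions are ANDed
  sprintf(
    "BEGIN { FS = \"\\t\" } NR == 1 || (%s) { print }",
    paste(conditions, collapse = " && ")
  )
}

cols <- c("species", "country", "date")
program <- build_awk_filter(cols, list(country = "CA", species = "Gray Jay"))
```

The resulting string could then be handed to awk, for example through system2() with the input path as a further argument and stdout redirected to the output file, which mirrors the build, execute, write sequence described above.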

@mstrimas
Author

Thanks for all the helpful feedback! I'll start working through your suggestions and incorporating them.

@emhart yes, I think there is potential to make a more general AWK package for working with large files. In fact, I did originally consider doing that first, then making auk depend on the more general package, but just didn't have the time. There may also be better options than AWK that I'm not aware of. In any case, I think it would be useful to have a tool for processing text files that are too large to handle directly in R.

@karthik
Member

karthik commented Aug 30, 2017

@aurielfournier A gentle ping 🙏

@karthik
Member

karthik commented Aug 30, 2017

@aurielfournier Sorry I totally missed that your review was above Ted's. My apologies.

@aurielfournier

No problem @karthik! Our reviews came in within a few hours of each other, so it was easy to miss. I appreciate the gentle reminder; those are often necessary to keep me on top of things.

@mstrimas
Author

mstrimas commented Sep 5, 2017

Finally getting to this, here are responses to @aurielfournier comments:

  • details later: I've put a "Quick Start" section at the top of the vignette that gives a simple example of what the package does before delving into the details. I also put this section at the top of the readme. Does this address what you were getting at?
  • remove unlink: done!
  • don't use "EBD": replaced all instances of EBD with non-acronyms.
  • reusing variable names: fixed, the same variable name is no longer reused repeatedly.
  • use of pipes: I don't want every example in the readme and vignette to be duplicated, with and without pipes. So, I've just given one example at the top where I write the code both ways. Do you think I should write all examples in function documentation with and without pipes?
  • no auk_filter() in function help: this was by design. Since auk_filter() is the only function in the package that has an external dependency (i.e. AWK) that isn't installed on some systems (e.g. Windows), I decided to use it minimally in the help examples. I can include it, but then I'll need to enclose all examples in \dontrun{} blocks. Maybe this isn't a big deal though. @emhart @aurielfournier, what are your thoughts: is having auk_filter() important enough to warrant using \dontrun{}?

Thanks for the code review!!!

@mstrimas
Author

mstrimas commented Sep 5, 2017

Here are my responses to @emhart:

  • multi-step process: I agree, this is confusing, I've added a "Quick start" section at the top of the vignette and readme, which tries to explain this. Do you think this clarifies things?
  • wrapper function: I think an all-in-one function is an interesting idea. I initially wanted to have a clear distinction between defining filters and running filters because running on the full dataset takes several hours. Also, it's possible the filtered text file will be huge and not easily readable into R, which is why reading is also separated out. I'll look into a wrapper function that does everything at once though; it could be handy for users.
  • user laziness: an interesting point that I hadn't thought of. I tried to allow users to be as lazy as possible without it being too hard to guess what they're doing. For dates, the user can provide a string (ISO format required) or a Date object. The latter option allows users to first define dates in any format they like using as.Date(). Do you have any thoughts on how to guess which format the text date is in? Species are tough. "robin" doesn't work because there are so many different species of robin and no way to know you mean "american robin". More tricky is different common or scientific names for different species. Currently, for simplicity, the user must give the English common name or scientific name (case insensitive) as used in the eBird taxonomy. Ideally, we'd like to allow users to look up species using other taxonomies, names in different languages, alternate spellings, etc., but this is a daunting task that we probably won't get to for some time.
  • extent units: yes, decimal degrees. fixed.
  • cygwin on windows: Do you think allowing users to specify an environment variable would solve this?
  • citation file: added.
  • column filter: do you mean a way to specify a subset of columns to write to the text file? users can always dplyr::filter() after import, but I guess outputting fewer columns to the text file results in a smaller file size. I can think about this.
  • time variable: oops, fixed this.
  • CONTRIBUTING: I don't have a CONTRIBUTING file but I do have a CONDUCT file. Is this the same, should I rename this CONTRIBUTING?

Thanks!!!

@mstrimas
Author

mstrimas commented Sep 6, 2017

@emhart Just added ability to manually set awk path by setting the AWK_PATH environment variable in .Renviron. Should work on Mac or Windows, though I don't have a Windows machine to test.
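For anyone following along, the lookup described here could be sketched roughly as below. This is only an illustration in base R, not the code in auk: the function name find_awk() is hypothetical, and the fallback to Sys.which() is an assumption about sensible behaviour when AWK_PATH is unset.

```r
# Hypothetical sketch (not auk's actual code): prefer AWK_PATH when it is
# set, otherwise fall back to whatever "awk" is found on the system PATH.
find_awk <- function() {
  path <- Sys.getenv("AWK_PATH", unset = "")
  if (nzchar(path)) {
    # an explicit setting that points nowhere is treated as a user error
    if (!file.exists(path)) {
      stop("AWK_PATH is set to '", path, "' but no file exists there")
    }
    return(path)
  }
  # Sys.which() returns "" when the executable cannot be found
  path <- unname(Sys.which("awk"))
  if (!nzchar(path)) {
    return(NA_character_)
  }
  path
}

find_awk()
```

The variable itself would go in .Renviron as a single line of the form AWK_PATH=<path to the awk executable>, and R needs to be restarted for it to take effect.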

@aurielfournier

@mstrimas

The Quick Start is great, exactly what I was looking for.

I think one example is sufficient for the with and without pipes.

I see what you mean about auk_filter() now, I guess I could go either way on that one. Do you have thoughts @emhart ?

@mstrimas
Author

@aurielfournier I've added pipe-free examples to all functions for pipe haters.

@karthik what's the next step here?

The eBird taxonomy and EBD were just updated and my intention is to submit a new version of auk to CRAN in the next few days reflecting the taxonomy changes and the suggestions from the reviewers.

@emhart

emhart commented Sep 26, 2017

Sorry for the delay @mstrimas here are a few quick thoughts:

  • Do you think this clarifies things? - yes
  • Do you have any thoughts on how to guess which format the text date is in? - No, not really, honestly I'm impressed with the ease of use built in. I would usually specify a format, and then do a check for the format and throw an error, I don't think you need to guess
  • Do you think allowing users to specify an environment variable would solve this? - Sounds like the answer is yes, although I also don't have a windows machine
  • do you mean a way to specify a subset of columns to write to the text file? - yes, that is what I meant, I guess I was thinking since you're writing it out to disk, instead of having the user read all the data in, why not let them just write the columns they want. I think this falls under the "nice to have" features, not a necessary one
  • is this the same, should I rename this CONTRIBUTING? - My understanding is that the CONTRIBUTING file gives guidelines for how users can contribute to the project. Here is an example vs a CONDUCT file
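As a rough illustration of the "specify a format, check it, and throw an error" approach to dates mentioned above, here is a minimal base R sketch. The helper name parse_iso_date() is hypothetical and is not a function in auk; it simply assumes that only Date objects and YYYY-MM-DD strings are acceptable.

```r
# Hypothetical helper (not part of auk): accept a Date object or an
# ISO 8601 "YYYY-MM-DD" string and fail loudly on anything else.
parse_iso_date <- function(x) {
  if (inherits(x, "Date")) {
    return(x)
  }
  if (!is.character(x) || length(x) != 1 ||
      !grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", x)) {
    stop("date must be a Date object or a string in YYYY-MM-DD format")
  }
  # as.Date() returns NA for impossible dates such as "2017-02-30"
  d <- as.Date(x, format = "%Y-%m-%d")
  if (is.na(d)) {
    stop("'", x, "' is not a valid calendar date")
  }
  d
}

# an ISO string is parsed directly
parse_iso_date("2017-09-26")
# any other format goes through as.Date() first, then passes as a Date
parse_iso_date(as.Date("26/09/2017", format = "%d/%m/%Y"))
```

No guessing is involved: anything that is not already a Date or an ISO string is rejected with an error, which keeps the behaviour predictable.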

@mstrimas
Author

@emhart thanks! I like the idea of column subsetting and will start looking at the best way to implement that.

@mstrimas
Author

mstrimas commented Oct 3, 2017

auk_filter() now has additional arguments keep and drop that users can use to specify which columns are output.
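To make the idea concrete, the column selection could work along these lines. This is a sketch in base R rather than the implementation in auk: select_columns() is a hypothetical helper, the header is a small illustrative subset of EBD column names, and the sketch assumes that keep and drop are mutually exclusive and that names are matched ignoring case, with spaces and underscores treated alike.

```r
# Hypothetical helper (not auk's implementation): given the column names in a
# file header, return the names to write out, honouring either keep or drop.
select_columns <- function(header, keep = NULL, drop = NULL) {
  if (!is.null(keep) && !is.null(drop)) {
    stop("supply keep or drop, not both")
  }
  # compare names case-insensitively, treating spaces and underscores alike
  normalise <- function(x) tolower(gsub("[ _]+", "_", trimws(x)))
  requested <- if (!is.null(keep)) keep else drop
  if (is.null(requested)) {
    return(header)
  }
  unknown <- setdiff(normalise(requested), normalise(header))
  if (length(unknown) > 0) {
    stop("unknown column(s): ", paste(unknown, collapse = ", "))
  }
  matched <- normalise(header) %in% normalise(requested)
  if (!is.null(keep)) header[matched] else header[!matched]
}

# illustrative subset of an EBD header
header <- c("COMMON NAME", "OBSERVATION COUNT", "LATITUDE", "LONGITUDE")
select_columns(header, keep = c("common_name", "latitude"))
select_columns(header, drop = "Observation Count")
```

Writing fewer columns to the output text file is what keeps the filtered file small, which is the point of doing this at the filtering step rather than after import.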

@mstrimas
Author

@karthik just released a new version to CRAN with most of the changes suggested by the review process included, as well as a variety of other new features and bug fixes. let me know what the next steps are to get this up on rOpenSci. Thanks!

@karthik
Member

karthik commented Dec 4, 2017

@emhart @aurielfournier Could you two take a look at the recent updates and let me know if you are ready to sign off? 🙏

@aurielfournier

I will do my best to get to this in the next two weeks. Things are a bit swamped on my end at the moment.

@karthik
Member

karthik commented Dec 8, 2017

Thank you @aurielfournier! much appreciated. 🙏

@karthik
Member

karthik commented Jan 30, 2018

Congrats on your package being accepted @mstrimas! 🎉 🎈
And a huge thanks to @aurielfournier and @emhart for their expertise and time on this review! 🙏

Here are your next steps:

  • Please add a badge to indicate the peer review status.
[![](https://badges.ropensci.org/136_status.svg)](https://github.com/ropensci/onboarding/issues/136)

Please also add a footer to the bottom of your README

[![](http://www.ropensci.org/public_images/github_footer.png)](http://ropensci.org)
  • Please accept my invitation to join a team on the ropensci github org. This will allow you to transfer the repo. Once it's transferred, I'll give you write access there so you can update the CI badges.
  • Fix any links in badges for CI and coverage to point to the ropensci URL. (We'll turn on the services on our end as needed)

Once moved, please re-run all checks in preparation for submission to CRAN. I can help with this if you run into any issues.

Welcome aboard! We'd also love a blog post about your package, either a short-form intro to it (https://ropensci.org/tech-notes/) or a long-form post with more narrative about its development (https://ropensci.org/blog/). If you are interested, @stefaniebutland will be in touch about content and timing.

@mstrimas
Author

Hi @karthik,
In my original conversations with @noamross we agreed to host the package on the Cornell Lab of Ornithology's GitHub page and have a read-only mirror at rOpenSci. I've added the footer and badge to the readme; however, our intention is to keep links and CI pointing to our organization's page. What's the best way to proceed with setting up a read-only mirror on rOpenSci? Thanks!

@karthik
Member

karthik commented Jan 30, 2018

@mstrimas Hi Matt, understood. You can skip the transfer step and I'll look into the best way for setting up a mirror and get back to you with further details.

@stefaniebutland
Member

Hello @mstrimas. Congratulations on auk acceptance! We would love to host a post about it, so if you're interested, have a look at the editorial and technical info here https://github.com/ropensci/roweb2#contributing-a-blog-post and let me know if you are considering it.

@mstrimas
Author

@stefaniebutland sure, I'm happy to modify the vignette into a blog post. Probably won't be able to get to it for a week or two though.

@stefaniebutland
Member

No problem @mstrimas. I have Tues Feb 27 available for a post and we typically ask for a draft for review at least a week before the post date. What do you think about Tues Feb 20 to submit a draft via pull request?

@mstrimas
Author

That works for me, @stefaniebutland, thanks!

@karthik
Member

karthik commented Feb 13, 2018

@mstrimas Just wanted to give you a quick update. We are still working out a good way to do the mirroring, but I'll let you know soon. Also I am about to travel for a bit, so I'll update the thread upon my return (late Feb). 🙏

@mstrimas
Author

Hi @stefaniebutland, looks like there's going to be a large update to the data underlying this package in mid March that will require some changes to the package that break backward compatibility. Would you be open to pushing the blog post back until the new version is released? If this messes things up for you, no worries, I can proceed with the post as is and just avoid the features that will get broken.

@stefaniebutland
Member

@mstrimas It's whatever you think is best. Blog post timing is flexible. You only really get the audience once for this kind of thing though, so if you prefer to publish after updates to avoid frustrating people once they're engaged, then we can postpone. Perhaps you can draft your post ideas for yourself now before they go stale ;-) and then fill in soon after you've made the required changes to the package.

I'll mark my calendar to check in with you in late March.

@mstrimas
Author

Ok, that's what I'm thinking, only one chance to catch people's eye so better to have the package in tip top shape. Late March sounds good. Thanks!

@stefaniebutland
Member

@mstrimas

looks like there's going to be a large update to the data underlying this package in mid March

Checking in to see if the timing is right to draft a blog post - no rush if the pkg is not updated yet

@maelle
Member

maelle commented Apr 18, 2018

By the way, it'd be grand if the blog post explained a bit how to choose between using auk and @sckott's rebird depending on the use case. 😺 The information could also be in the READMEs of both packages. Thinking of this because this week I was at a loss which of the two to recommend. 😀

@mstrimas
Author

@stefaniebutland I think the package is ready, I can start putting something together this week. Thanks for the reminder!

@maelle rebird is an interface to the eBird API, which gives access to a very limited subset of the data, e.g. the last 30 days of observations from a location. I think of rebird as being useful for building tools and visualizations for birders; however, for most ecological applications (e.g. distribution modeling) you'll want access to the full eBird database (~500 million records).

@maelle
Member

maelle commented Apr 18, 2018

Thanks a lot for the explanations @mstrimas! It'd be a nice footnote as well in my opinion (of the post and READMEs).

When you say 30 days of observation you mean for raw occurrence data right? For frequency derived from it it seems you can get older data e.g. https://github.com/stephhazlitt/ruhu-ebird-observations/blob/master/R/ruhu-ebird-observations.md

@mstrimas
Author

mstrimas commented Apr 18, 2018

@maelle I wasn't aware of the ebirdfreq() function, that's cool! Seems all the other functions are "recent" observations, but that one does give access to historical data at state, county, and hotspot level. It's also worth noting that rebird is easier to use and much faster, so if your data needs can be met by rebird, I'd say it's definitely preferred.

I'll add something to the README explaining the difference, thanks for the suggestion!

@maelle
Member

maelle commented Apr 18, 2018

Awesome! It'll be super useful to guide users who find either of the two packages first! I wonder if the info should also live in the vignette because of people installing from CRAN and therefore not having the README 🤔

@mstrimas
Author

@maelle updated the README and vignette as per your suggestion

@maelle
Member

maelle commented Apr 18, 2018

Fantastic! Speaking of other rOpenSci packages, I am also wondering whether/how one could use bowerbird (not an ornithology package despite its name) and auk to keep, update and use a local copy of the eBird dataset. I might ping you if I ever try to write such a use case.

@mstrimas
Author

@stefaniebutland here's a first draft of a blog post.

What topicid and date should I use? Also, is there somewhere in the website repo I can put a couple data files (~ 3 MB). If there isn't a good spot, I'll just leave them in my GitHub repo.

If this looks good I can submit a pull request to the rOpenSci website repo.

@stefaniebutland
Member

@mstrimas I apologize for my long delay in responding. I am temporarily putting most blog post reviews and scheduling on hold until after our unconference. Rest assured your post will get a proper review and be published on our blog. I anticipate getting to this in early June.

What topicid and date should I use?

Leave topicid blank. I add that immediately before publication and that links comments to our discussion forum. We'll determine the date when I review your draft. Sorry about this delay. We really appreciate it when package authors do the extra work of contributing a post!

@stefaniebutland
Member

Hi @mstrimas. I hope you're still up for a post about auk. Right now I have a spot open for publication on 2018-07-24 so you could use that in your draft and if an earlier spot opens we can publish sooner and change the date.

is there somewhere in the website repo I can put a couple data files (~ 3 MB). If there isn't a good spot, I'll just leave them in my GitHub repo.

Scott says good to leave them in your GitHub repo.

Happy to answer any questions

@stefaniebutland
Member

@mstrimas Are you still interested in writing a post or a short tech note about auk?

@mstrimas
Author

Sorry @stefaniebutland, I've been out of town off and on the last couple weeks, and won't be back until the end of next week, so this slipped through the cracks. If you feel what I already put together looks good for a blog post, I'd be happy to use that: https://github.com/mstrimas/auk-blog-post/blob/master/auk.md

What needs to be done to that to get it ready for posting?

@stefaniebutland
Member

That reads nicely as a blog post @mstrimas. Please read on when you have time.

I have a couple of suggestions for edits. When you're ready, please submit a pull request as outlined here: https://github.com/ropensci/roweb2#contributing-a-blog-post. Use date 2018-08-07 and if it is ready sooner and a spot opens up, we can change the date and publish sooner.

  • would be helpful at start to note some of the types of info eBird contains - after this text I think "eBird is an online tool for recording bird observations. The eBird database currently contains over 500 million records of bird sightings, spanning every country and over 98% of species, making it an extremely valuable resource for bird research and conservation. " If not too much, it might entice readers if you give an example or two of the types of things birders would want to know from the data. Maybe refer to the examples you give at the end e.g. map species, so the reader knows there's a nice outcome at the end of the drier parts of the code explanations.

  • In 2nd para of Accessing the data, the word "access" is used a lot. I know it's necessary but maybe can be reduced.

  • Perhaps number the headings according to the numbers of 5 tasks under "Using auk".

  • turn "Applications", "Presence-only data", "Zero-filled data" into more enticing headings that relate to your examples

  • "One of the most obvious things to do with the presence data is make a map!" Please explain what your examples show with a line or two of text after the images. And be clear what you mean by effort in "The above map doesn't account for effort"

  • I see you have a YouTube video about using auk. Would be good to find a spot to link to it.

@maelle is going to refer to using auk in an upcoming blog post so she will also review your draft, possibly once it's submitted as a pull request.

Hope this is helpful. Happy to answer any questions.

@maelle
Member

maelle commented Jul 12, 2018

Ahah no I actually reviewed the post simultaneously, great minds meet @stefaniebutland! Thanks again for your interesting post @mstrimas, my comments are over at https://github.com/mstrimas/auk-blog-post/issues/1

@karthik karthik closed this as completed Aug 26, 2018