-
-
Notifications
You must be signed in to change notification settings - Fork 134
Adding a new model to rstanarm
Update: please see the developer notes at mc-stan.org/rstanarm/dev-notes/ for more recent instructions than those on this page.
Adding a new model to the rstanarm package is not difficult, but it is tedious to make all the necessary changes to all the required files that are spread throughout the rstanarm package. The following attempts to enumerate all these details, using an instrumental variables model as an example.
Create a git branch for your work. If you use the command line version of git, it would be something like
git checkout master
git checkout -b feature/ivreg
where the name of the branch depends on the R function you are trying to emulate in Step 1.
The first step is to locate an existing function in R or an R package that you want to emulate in rstanarm. In fact, rstanarm abides by the principle that the stan_foo
function should specify the same likelihood of the data as the foo
function. Moreover, the core arguments of the foo
function that are necessary to specify the likelihood of the data should be arguments of the stan_foo
function with the same names, defaults, and ordering. In this example, we want to emulate the ivreg
function in the AER package, whose arguments are
> args(AER::ivreg)
function (formula, instruments, data, subset, na.action, weights,
offset, contrasts = NULL, model = TRUE, y = TRUE, x = FALSE,
...)
Consequently, when I create a stan_ivreg.R function in the R/ subdirectory of the rstanarm package, I start with
stan_ivreg <- function (formula, instruments, data, subset, na.action, weights,
offset, contrasts = NULL, model = TRUE, y = TRUE, x = FALSE,
...) # there will be more arguments
In the rstanarm package, we position the ...
to divide the arguments that we adopt from the emulated function from the arguments we add, which often pertain to the priors and cannot be abbreviated when called. In addition, the ...
serves an important purpose because any additional arguments are passed on to the underlying fitting function in the rstan package, which is usually sampling
. For example, the user could request 8 chains by calling stan_ivreg
with chains = 8
and chains
will be passed onto sampling
.
In order to guarantee that the likelihood of the data is the same in stan_ivreg
as in AER::ivreg
, it is necessary for stan_ivreg
to parse the data exactly like AER::ivreg
does. The easiest way to do so is often to call a minimal version of the emulated function, and this approach is all-but-necessary when the emulated function calls unexported functions to parse the data. Consequently, the first few lines of stan_ivreg
employ the usual R trick of creating a call, changing the symbol name of the function to be evaluated, and NULL
ifying unnecessary arguments.
mc <- match.call(expand.dots = FALSE)
mc$model <- mc$y <- mc$x <- TRUE
# NULLify any Stan specific arguments in mc now
mc[[1L]] <- quote(AER::ivreg)
if (!requireNamespace("AER")) stop("the AER package is needed by 'stan_ivreg'")
mf <- eval(mc, parent.frame())
At this point the mf
object is the list documented in the Value section of help(ivreg, package = "AER")
, so, I can use the x
, y
, and other elements of the mf
list inside stan_ivreg
.
In this case, the Stan program will need the data in a somewhat different format, so further processing will be necessary. However, it is good for x
and y
to be available in the original format for post-estimation functions.
In the case of AER::ivreg
, there is a closed-form solution involving matrix algebra only, so it is computationally cheap to call AER::ivreg
. In some cases, the emulated function will use numerical optimization to find the optimum, in which case it is best to call the emulated function with zero iterations if possible, which was done in the stan_glmer
function. If so, there is probably some internal function that defines the log-likelihood of the data (perhaps multiplied by -2 or something) that you can refer to when implementing your Stan program in Step 4. However, if you call a function in another package then you need to add that package --- in this case AER --- to the Suggests
section of the DESCRIPTION
file of rstanarm.
Many model-fitting functions in the stats package that come with R have a high-level version of the function that takes a formula
, a data.frame
, etc. and a low-level version of the function suffixed with .fit
that is more intended for programmers and inputs an outcome vector, a design matrix, etc. For example, there is glm
and glm.fit
. The same convention is adopted in the rstanarm package, even if the function being emulated does not have an exported .fit
counterpart. Thus, I create a stan_ivreg.fit
function in a stan_ivreg.fit.R file in the R/ directory of the rstanarm package whose initial arguments are the same as those in AER::ivreg
:
stan_ivreg.fit <- function (x, y, z, weights, offset, ...) { # there will be more arguments
stopifnot(is.matrix(X), is.vector(y), is.matrix(z), length(y) == nrow(X), nrow(X) == nrow(Z))
}
The stan_ivreg.fit
function will ultimately call the appropriate fitting function in the rstan package and return a stanfit
object to the calling function, which is typically stan_ivreg
. If there were no ivreg.fit
function in the AER package, we have more freedom to specify the required form of the arguments to the stan_ivreg.fit
function.
At this point, we are ready to write a Stan program for this model, which will define what objects the stan_ivreg.fit
function will need to pass to the data
block. While it is preferable to add to an existing .stan program if possible, in this case a new ivreg.stan file is needed because there is a matrix of endogenous variables (whereas the previously written .stan programs only take a vector or a one-dimensional array for the outcome). Some general rules for writing Stan programs that go into the rstanarm package are
- Put the Stan program into the exec/ subdirectory of the rstanarm package and use a .stan extension. In this case, it is ivreg.stan.
- As much as possible, utilize custom functions that are defined in the
functions
block of the Stan function because custom functions can (and must) be tested in R using theexpose_stan_functions
function in the rstan package. See Step 12 for details on that. - As much as possible, utilize include directives in the Stan program that point to re-usable chunks of Stan syntax, such as
#include 'make_eta.stan'
which forms the linear predictor in a generalized linear model. - If you need to make a new re-usable chunk of Stan syntax, put it under inst/chunks and use a .stan extension.
- Try to make the Stan program as general as possible, sacrificing readability if necessary.
- Calculate the sample average of the posterior predictive distribution in generated quantities.
The stan_ivreg.fit
function will return an object of S4 class stanfit
which is defined in the rstan package. The stan_ivreg
function needs to take the stanfit
object and use it to create a stanreg
object which is an S3 class. This should be similar to the stanreg
object produced by stan_glm
and / or similar to the object produced by the emulated function. In particular, its minimal list elements for are documented in help("stanreg-objects", package = "rstanarm")
.
Go back and add arguments to the main R functions after the ...
that input prior information about the parameters in the .stan program you wrote in Step 4. The ideal here is to have one hyperparameter that either takes a reasonable default value or can easily be required of the user, which conveys some weakly informative prior information. Beyond that you can add more prior options for advanced users, which may require some iteration between Step 4 and Step 6. Some priors are already implemented and documented in R/priors.R.
In the case of stan_ivreg
and stan_ivreg.fit
...
There is a print
method for a stanreg
object that uses branch logic to customize the output depending on what model-fitting function estimated it. In general, the output of print.stanreg
should resemble the output when print
or summary
is called on an object produced by the emulated function but there can be less or more output.
The predict
function can be used when the model is estimated by maximum (penalized) likelihood and utilizes the point estimates of the coefficients. The posterior_predict
method is preferable and is used when the model is estimated by MCMC or ADVI and utilizes all the draws. There is now a posterior_linpred
function that is exposed but mostly used internally to produce the posterior (not predictive) distribution of the linear predictor, which can then be used to generate the predictive distribution. There are helper functions to be written, but it should mostly be clear from the existing R/posterior_predict.R file.
The log_lik
function returns the log-likelihood for each observation, rather than summing them up over the entire dataset. This allows the log-likelihood to be evaluated for newdata
and with a set of draws from the posterior distribution. In turn, this allows for the Leave-One-Out (LOO) Information Criterion to be evaluated in a memory efficient way. Again, the helper functions in R/loo.R should provide enough guidance as to how to implement this for a new model.
The R package that contains the emulated function will often have post-estimation methods that users will be accustomed to calling. Thus, where appropriate, those same methods should be reimplemented in the rstanarm package. In the case of AER::ivreg
, candidate methods can be found by executing
methods(class = "ivreg")
[1] anova bread estfun hatvalues model.matrix predict print
[8] summary terms vcov
Of these, anova
, bread
, estfun
, and hatvalues
only make sense in a frequentist context where there is no prior information and can probably be ignored for the purposes of rstanarm. We have already covered the predict
and print
methods. There is already a general summary.stanreg
method so a customized one is usually not necessary. That leaves model.matrix
and terms
, which can be implemented simply by calling the appropriate functions in the AER package, and vcov
, which is calculated by the internal stanreg
function.
We use the roxygen2 package, so the documentation is embedded in the .R files. See the documentation for the roxygen2 package. If you need to use or create any snippets, they are in the man-roxygen/ subdirectory. This form of documentation allows you to embed examples, but remember that the examples for each function should execute in five seconds or less. This usually requires that the examples utilize ADVI rather than MCMC or (less preferably) have to be embedded in a \dontrun
block.
The tests/testthat directory contains unit tests. The test_stan_functions.R file compiles all the user-defined Stan functions in rstanarm, so be sure not to duplicate the name of an existing user-defined Stan function, which can be prevented by placing any user-defined Stan functions that are used in multiple Stan programs into inst/chunks/common_functions.stan. Add to test_stan_functions.R to ensure that the user-defined Stan functions you wrote produce the same output as their R counterparts.
You should create another file in tests/testthat called test_foo.R that tests the new functions more holistically. We often try to use optimization and no prior information here in order to compare the results to an existing frequentist R function. However, the results are often more different than one would think. Although it is suboptimal for unit-testing, it is a good idea to combine several things that you expect to pass into a single call in order to conserve time when calling R CMD check.
There are many examples in the vignettes/ subdirectory. We use the html_vignette format for output. Here we emphasize MCMC almost exclusively. Thus, the unit tests (ideally) involve optimization, the examples (reluctantly) involve ADVI, and the vignettes involve MCMC. There are some child chunks in the vignettes directory that you can include in your vignette.