-
-
Notifications
You must be signed in to change notification settings - Fork 34
/
Copy pathloo2-with-rstan.Rmd
234 lines (182 loc) · 8.19 KB
/
loo2-with-rstan.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
---
title: "Writing Stan programs for use with the loo package"
author: "Aki Vehtari and Jonah Gabry"
date: "`r Sys.Date()`"
output:
html_vignette:
toc: yes
params:
EVAL: !r identical(Sys.getenv("NOT_CRAN"), "true")
---
<!--
%\VignetteEngine{knitr::rmarkdown}
%\VignetteIndexEntry{Writing Stan programs for use with the loo package}
-->
```{r settings, child="children/SETTINGS-knitr.txt"}
```
```{r, child="children/SEE-ONLINE.txt", eval = if (isTRUE(exists("params"))) !params$EVAL else TRUE}
```
# Introduction
This vignette demonstrates how to write a Stan program that computes and stores
the pointwise log-likelihood required for using the __loo__ package. The other
vignettes included with the package demonstrate additional functionality.
Some sections from this vignette are excerpted from our papers
* Vehtari, A., Gelman, A., and Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. _Statistics and Computing_. 27(5), 1413--1432. \doi:10.1007/s11222-016-9696-4. Links: [published](https://link.springer.com/article/10.1007/s11222-016-9696-4) | [arXiv preprint](https://arxiv.org/abs/1507.04544).
* Vehtari, A., Simpson, D., Gelman, A., Yao, Y., and Gabry, J. (2019). Pareto smoothed importance sampling. [arXiv preprint arXiv:1507.04544](https://arxiv.org/abs/1507.04544).
which provide important background for understanding the methods implemented in
the package.
# Example: Well water in Bangladesh
This example comes from a survey of residents from a small area in Bangladesh
that was affected by arsenic in drinking water. Respondents with elevated
arsenic levels in their wells were asked if they were interested in getting
water from a neighbor's well, and a series of logistic regressions were fit to
predict this binary response given various information about the households
(Gelman and Hill, 2007). Here we fit a model for the well-switching response
given two predictors: the arsenic level of the water in the resident's home, and
the distance of the house from the nearest safe well.
The sample size in this example is $N=3020$, which is not huge but is large
enough that it is important to have a computational method for LOO that is fast
for each data point. On the plus side, with such a large dataset, the influence
of any given observation is small, and so the computations should be stable.
## Coding the Stan model
Here is the Stan code for fitting the logistic regression model, which
we save in a file called `logistic.stan`:
```
data {
int<lower=0> N; // number of data points
int<lower=0> P; // number of predictors (including intercept)
matrix[N,P] X; // predictors (including 1s for intercept)
int<lower=0,upper=1> y[N]; // binary outcome
}
parameters {
vector[P] beta;
}
model {
beta ~ normal(0, 1);
y ~ bernoulli_logit(X * beta);
}
generated quantities {
vector[N] log_lik;
for (n in 1:N) {
log_lik[n] = bernoulli_logit_lpmf(y[n] | X[n] * beta);
}
}
```
We have defined the log likelihood as a vector named `log_lik` in the generated
quantities block so that the individual terms will be saved by Stan. After
running Stan, `log_lik` can be extracted (using the `extract_log_lik` function
provided in the **loo** package) as an $S \times N$ matrix, where $S$ is the
number of simulations (posterior draws) and $N$ is the number of data points.
## Fitting the model with RStan
Next we fit the model in Stan using the **rstan** package:
```{r, eval=FALSE}
library("rstan")
# Prepare data
url <- "http://stat.columbia.edu/~gelman/arm/examples/arsenic/wells.dat"
wells <- read.table(url)
wells$dist100 <- with(wells, dist / 100)
X <- model.matrix(~ dist100 + arsenic, wells)
standata <- list(y = wells$switch, X = X, N = nrow(X), P = ncol(X))
# Fit model
fit_1 <- stan("logistic.stan", data = standata)
print(fit_1, pars = "beta")
```
```
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
beta[1] 0.00 0 0.08 -0.16 -0.05 0.00 0.05 0.15 1964 1
beta[2] -0.89 0 0.10 -1.09 -0.96 -0.89 -0.82 -0.68 2048 1
beta[3] 0.46 0 0.04 0.38 0.43 0.46 0.49 0.54 2198 1
```
## Computing approximate leave-one-out cross-validation using PSIS-LOO
We can then use the **loo** package to compute the efficient PSIS-LOO
approximation to exact LOO-CV:
```{r, eval=FALSE}
library("loo")
# Extract pointwise log-likelihood
# using merge_chains=FALSE returns an array, which is easier to
# use with relative_eff()
log_lik_1 <- extract_log_lik(fit_1, merge_chains = FALSE)
# as of loo v2.0.0 we can optionally provide relative effective sample sizes
# when calling loo, which allows for better estimates of the PSIS effective
# sample sizes and Monte Carlo error
r_eff <- relative_eff(exp(log_lik_1), cores = 2)
# preferably use more than 2 cores (as many cores as possible)
# will use value of 'mc.cores' option if cores is not specified
loo_1 <- loo(log_lik_1, r_eff = r_eff, cores = 2)
print(loo_1)
```
```
Computed from 4000 by 3020 log-likelihood matrix
Estimate SE
elpd_loo -1968.5 15.6
p_loo 3.2 0.1
looic 3937.0 31.2
------
Monte Carlo SE of elpd_loo is 0.0.
All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.
```
The printed output from the `loo` function shows the estimates
$\widehat{\mbox{elpd}}_{\rm loo}$ (expected log predictive density),
$\widehat{p}_{\rm loo}$ (effective number of parameters), and ${\rm looic} =-2\,
\widehat{\mbox{elpd}}_{\rm loo}$ (the LOO information criterion).
The line at the bottom of the printed output provides information about the
reliability of the LOO approximation (the interpretation of the $k$ parameter is
explained in `help('pareto-k-diagnostic')` and in greater detail
in Vehtari, Simpson, Gelman, Yao, and Gabry (2019)). In this case the message tells us that
all of the estimates for $k$ are fine.
## Comparing models
To compare this model to an alternative model for the same data we can use the
`loo_compare` function in the **loo** package. First we'll fit a second model to the
well-switching data, using `log(arsenic)` instead of `arsenic` as a predictor:
```{r, eval=FALSE}
standata$X[, "arsenic"] <- log(standata$X[, "arsenic"])
fit_2 <- stan(fit = fit_1, data = standata)
log_lik_2 <- extract_log_lik(fit_2, merge_chains = FALSE)
r_eff_2 <- relative_eff(exp(log_lik_2))
loo_2 <- loo(log_lik_2, r_eff = r_eff_2, cores = 2)
print(loo_2)
```
```
Computed from 4000 by 3020 log-likelihood matrix
Estimate SE
elpd_loo -1952.3 16.2
p_loo 3.1 0.1
looic 3904.6 32.4
------
Monte Carlo SE of elpd_loo is 0.0.
All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.
```
We can now compare the models on LOO using the `loo_compare` function:
```{r, eval=FALSE}
# Compare
comp <- loo_compare(loo_1, loo_2)
```
This new object, `comp`, contains the estimated difference of expected
leave-one-out prediction errors between the two models, along with the standard
error:
```{r, eval=FALSE}
print(comp) # can set simplify=FALSE for more detailed print output
```
```
elpd_diff se_diff
model2 0.0 0.0
model1 -16.3 4.4
```
The first column shows the difference in ELPD relative to the model with
the largest ELPD. In this case, the difference in `elpd` and its scale relative
to the approximate standard error of the difference) indicates a
preference for the second model (`model2`).
# References
Gelman, A., and Hill, J. (2007). *Data Analysis Using Regression and Multilevel Hierarchical Models.* Cambridge University Press.
Stan Development Team (2017). _The Stan C++ Library, Version 2.17.0._ https://mc-stan.org/
Stan Development Team (2018) _RStan: the R interface to Stan, Version 2.17.3._ https://mc-stan.org/
Vehtari, A., Gelman, A., and Gabry, J. (2017). Practical Bayesian model
evaluation using leave-one-out cross-validation and WAIC. _Statistics and
Computing_. 27(5), 1413--1432. \doi:10.1007/s11222-016-9696-4.
[online](https://link.springer.com/article/10.1007/s11222-016-9696-4),
[arXiv preprint arXiv:1507.04544](https://arxiv.org/abs/1507.04544).
Vehtari, A., Simpson, D., Gelman, A., Yao, Y., and Gabry, J. (2019). Pareto
smoothed importance sampling.
[arXiv preprint arXiv:1507.02646](https://arxiv.org/abs/1507.02646).