# Monte Carlo Methods
Monte Carlo methods are used to compute high-dimensional integrals
corresponding to expectations. For example, if $Y \in \mathbb{R}$ is
a real-valued random variable and $f:\mathbb{R} \rightarrow
\mathbb{R}$ is a real-valued function, then the expectation of the
function applied to the random variable $Y$ is
$$
\mathbb{E}[f(Y)]
= \int_{\mathbb{R}} f(y) \cdot p_Y(y) \, \textrm{d} y,
$$
where $p_Y(y)$ is the density function for $Y$.
## Introduction to Monte Carlo Methods
Suppose it is possible to generate a sequence of random draws
$$
y^{(1)}, \ldots, y^{(m)}, \ldots \sim p_Y(y)
$$
distributed according to $p_Y(y)$. With these random draws, the
expectation may be reformulated as the limit of the average
value of the function applied to the draws,
$$
\mathbb{E}[f(Y)]
=
\lim_{M \rightarrow \infty} \frac{1}{M} \sum_{m=1}^M f(y^{(m)}).
$$
Given finite computation time, an approximate result can be computed
for some fixed $M$ as
$$
\mathbb{E}[f(y)]
\approx \frac{1}{M} \sum_{m=1}^M f(y^{(m)}).
$$
This is known as a Monte Carlo method, after the casino of that name
in Monaco. If the draws are independent or come from a geometrically
ergodic Markov chain, the central limit theorem ensures that the
approximation error shrinks at a rate of $\mathcal{O}(1 / \sqrt{M}).$
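As a concrete illustration, here is a minimal R sketch that estimates
an expectation by averaging over simulated draws; the target
$\mathbb{E}[Y^2]$ for $Y \sim \textrm{normal}(1, 2)$ is an arbitrary
choice with the known answer $\mu^2 + \sigma^2 = 5$.

```{r}
set.seed(1234)                    # for reproducibility
M <- 1e5                          # number of Monte Carlo draws
y <- rnorm(M, mean = 1, sd = 2)   # y^(1), ..., y^(M) ~ normal(1, 2)
f <- function(y) y^2              # function whose expectation is sought
mean(f(y))                        # MC estimate of E[f(Y)]; exact value is 5
sd(f(y)) / sqrt(M)                # MC standard error, shrinking as 1 / sqrt(M)
```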
## Sensitivities of expectations
Suppose that $p(y \mid \theta)$ is the distribution of a random
variable $Y \in \mathbb{R}$, and that the quantity of interest is the
expectation of some function $f(Y)$ at fixed parameter values
$\theta$,
$$
\mathbb{E}[f(Y) \mid \theta]
= \int f(y) \cdot p_{Y \mid \Theta}(y \mid \theta) \, \textrm{d}y.
$$
The derivative of this expectation with respect to the parameters
$\theta$,
$$
\nabla \mathbb{E}\left[f(Y) \mid \theta\right],
$$
quantifies the rate of change in the expectation of $f(Y)$ as $\theta$
changes.
## The change-of-variables approach
In simple cases, the expectation can be expressed as $\mathbb{E}[f(Y,
\theta)]$ with gradients computed as
\begin{eqnarray*}
\nabla_{\theta} \, \mathbb{E}[f(Y, \theta)]
& = &
\nabla_{\theta} \int f(y, \theta) \cdot p(y) \textrm{d}y
\\[4pt]
& = &
\int \left( \nabla_{\theta} f(y, \theta) \right)
\cdot p(y)
\textrm{d}y
\\[4pt]
& \approx &
\frac{1}{M} \cdot
\sum_{m = 1}^M
\nabla_{\theta} f(y^{(m)}, \theta) \qquad \textrm{where} \ y^{(m)} \sim
p(y).
\end{eqnarray*}
This is particularly useful when it is possible to express $Y$ using a
change of variables parameterized by $\theta$ from a simpler random
variable $Z$. More specifically, suppose $Y = g(Z, \theta)$ for a
function $g$ that is smooth in $\theta$, and that it is easy to draw
$$
z^{(m)} \sim p_{Z}(z).
$$
Given the easy-to-simulate variable $Z$ and the smooth function $Y = g(Z, \theta),$
$$
\mathbb{E}[f(Y)]
=
\mathbb{E}[f(g(Z, \theta))].
$$
Gradients follow from the chain rule,
\begin{eqnarray*}
\nabla_{\theta} \mathbb{E}[f(g(Z, \theta))]
& = &
\nabla_{\theta}
\int f(g(z, \theta)) \cdot p_Z(z)
\, \textrm{d}z
\\[4pt]
& = &
\int \left( \nabla_{\theta} f(g(z, \theta)) \right) \cdot p_Z(z)
\, \textrm{d}z
\\[4pt]
& = &
\int f'(g(z, \theta))
\cdot \left( \nabla_{\theta} \, g(z, \theta) \right)
\cdot p_Z(z)
\, \textrm{d}z
\\[4pt]
& \approx &
\frac{1}{M}
\cdot \sum_{m = 1}^M
f'(g(z^{(m)}, \theta)) \cdot \nabla_{\theta} g(z^{(m)}, \theta)
\qquad \textrm{where} \ z^{(m)} \sim p_Z(z).
\end{eqnarray*}
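As a concrete check, here is a minimal R sketch of this estimator for
a hypothetical location-shift model $Y = g(Z, \theta) = \theta + Z$
with $f(y) = y^2$, chosen because the exact gradient
$\nabla_{\theta} \, \mathbb{E}[f(Y)] = 2\theta$ is available for
comparison.

```{r}
set.seed(1234)
M <- 1e5
theta <- 1.5
z <- rnorm(M)                        # z^(m) ~ p_Z(z) = normal(0, 1)
g <- function(z, theta) theta + z    # change of variables Y = g(Z, theta)
f_prime <- function(y) 2 * y         # f(y) = y^2, so f'(y) = 2y
grad_g <- 1                          # d/dtheta g(z, theta) = 1 for a location shift
mean(f_prime(g(z, theta)) * grad_g)  # MC gradient estimate; exact value is 2 * theta = 3
```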
## Example: expectations of functions of multivariate normal variates
For example, suppose $Y \sim \textrm{normal}(\mu, \Sigma)$ and the
target expectation is $\mathbb{E}[f(Y) \mid \mu, \Sigma].$ A change of variables
approach to calculating the expectation draws a standard normal
variate $Z \sim \textrm{normal}(0, \textrm{I}),$ where $\textrm{I}$ is
the identity matrix, then transforms it to $Y = g(Z, \mu, \Sigma),$
where $g$ is the smooth function
$$
g(z, \mu, \Sigma) = \mu + \textrm{cholesky}(\Sigma) \cdot z.
$$
The Cholesky decomposition is defined so that
$\textrm{cholesky}(\Sigma) = L_{\Sigma}$ is the unique lower
triangular matrix satisfying
$$
\Sigma = L_{\Sigma} \cdot L_{\Sigma}^{\top}.
$$
Now if $z \sim \textrm{normal}(0, \textrm{I})$ and
$y = g(z, \mu, \Sigma) = \mu + L_{\Sigma} \cdot z,$ then
$y \sim \textrm{normal}(\mu, \Sigma),$ because an affine transform of a
normal variate is normal, with mean $\mu$ and covariance
$L_{\Sigma} \cdot \textrm{I} \cdot L_{\Sigma}^{\top} = \Sigma.$
The gradient of the target expectation can then be computed by the
same change of variables,
\begin{eqnarray*}
\nabla_{\mu, \Sigma} \, \mathbb{E}[f(Y) \mid \mu, \Sigma]
& = &
\nabla_{\mu, \Sigma}
\int f(y) \cdot \textrm{normal}(y \mid \mu, \Sigma)
\, \textrm{d}y
\\[4pt]
& = &
\nabla_{\mu, \Sigma}
\int f(g(z, \mu, \Sigma))
\cdot \textrm{normal}(z \mid 0, \textrm{I})
\, \textrm{d}z
\\[4pt]
& = &
\int \left( \nabla_{\mu, \Sigma} \,
f(g(z, \mu, \Sigma)) \right)
\cdot \textrm{normal}(z \mid 0, \textrm{I})
\, \textrm{d}z
\\[4pt]
& \approx &
\frac{1}{M} \cdot \sum_{m = 1}^M
\nabla_{\mu, \Sigma} \, f(g(z^{(m)}, \mu, \Sigma))
\\[4pt]
& = &
\frac{1}{M} \cdot \sum_{m = 1}^M
f'(g(z^{(m)}, \mu, \Sigma)) \cdot \nabla_{\mu, \Sigma} \, g(z^{(m)}, \mu, \Sigma),
\end{eqnarray*}
where $z^{(m)} \sim \textrm{normal}(0, \textrm{I})$. Because $g$ is
defined to be smooth and automatically differentiable, as long as the function
$f(y)$ can be automatically differentiated, so can
$\mathbb{E}[f(Y) \mid \mu, \Sigma] = \mathbb{E}[f(g(Z, \mu, \Sigma))].$
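A minimal R sketch of this construction follows, with arbitrary
illustrative values of $\mu$ and $\Sigma$ and with
$f(y) = y^{\top} y$, so that
$\mathbb{E}[f(Y)] = \mu^{\top} \mu + \textrm{trace}(\Sigma)$ is known
exactly. Note that R's `chol()` returns the upper triangular factor,
hence the transpose.

```{r}
set.seed(1234)
M <- 1e4
mu <- c(1, -1)
Sigma <- matrix(c(2, 0.5, 0.5, 1), nrow = 2)
L <- t(chol(Sigma))        # transpose upper factor to get lower triangular L
f <- function(y) sum(y^2)  # E[f(Y)] = sum(mu^2) + trace(Sigma) = 2 + 3 = 5
draws <- replicate(M, {
  z <- rnorm(2)            # z ~ normal(0, I)
  f(mu + L %*% z)          # y = g(z, mu, Sigma) ~ normal(mu, Sigma)
})
mean(draws)                # MC estimate; exact value is 5
```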
## The score function approach
The *score function* for the variable $Y$, which depends on parameters
$\Theta$, is
$$
\textrm{score}(\theta) = \nabla \log p_{Y \mid \Theta}(y \mid \theta).
$$
By the chain rule,
$$
\nabla \log p_{Y \mid \Theta}(y \mid \theta)
= \frac{1}{\displaystyle p_{Y \mid \Theta}(y \mid \theta)}
\cdot \nabla p_{Y \mid \Theta}(y \mid \theta).
$$
Rearranging terms expresses the gradient of the density in terms of
the score function and density,
$$
\nabla p_{Y \mid \Theta}(y \mid \theta)
= p(y \mid \theta) \cdot \nabla \log p(y \mid \theta).
$$
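For example, in the simple univariate case
$Y \sim \textrm{normal}(\theta, 1)$ (a choice made here purely for
illustration), the score function has a familiar closed form,
$$
\nabla_{\theta} \log \textrm{normal}(y \mid \theta, 1)
= \nabla_{\theta} \left( -\frac{1}{2} (y - \theta)^2
  - \log \sqrt{2 \pi} \right)
= y - \theta.
$$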
The following derivation provides the basis for using Monte Carlo
methods to calculate gradients.
\begin{eqnarray*}
{\large \nabla}_{\!\theta} \ \mathbb{E}\!\left[f(Y) \mid \theta\right]
& = &
{\large \nabla}_{\!\theta} \int f(y)
\cdot p_{Y \mid \Theta}(y \mid \theta)
\, \textrm{d} y
\\[4pt]
& = &
\int f(y)
\cdot \left( {\large \nabla}_{\!\theta} \, p_{Y \mid \Theta}(y \mid \theta) \right)
\, \textrm{d} y
\\[4pt]
& = &
\int f(y)
\cdot \left( {\large \nabla}_{\!\theta} \log \, p_{Y \mid \Theta}(y \mid \theta) \right)
\cdot p_{Y \mid \Theta}(y \mid \theta)
\, \textrm{d} y
\\[4pt]
& = &
\mathbb{E}\!\left[f(Y) \cdot {\large \nabla}_{\!\theta} \log \, p_{Y \mid
\Theta}(Y \mid \theta) \, \Big| \, \theta \right]
\\[4pt]
& = &
\lim_{M \rightarrow \infty}
\frac{1}{M}
\cdot \sum_{m = 1}^M f(y^{(m)})
\cdot {\large \nabla}_{\!\theta} \log \, p_{Y \mid \Theta}(y^{(m)}
\mid \theta)
\\[4pt]
& \approx &
\frac{1}{M}
\cdot \sum_{m = 1}^M f(y^{(m)})
\cdot {\large \nabla}_{\!\theta} \log \, p_{Y \mid \Theta}(y^{(m)}
\mid \theta),
\end{eqnarray*}
where $y^{(m)} \sim p_{Y \mid \Theta}(y \mid \theta).$ Because $Y$ is
random and $\mathbb{E}\left[f(Y) \mid \theta\right]$ is a conditional
expectation, the gradient is taken with respect to $\theta.$ The first
line follows from the definition of expectation. The second step
uses the dominated convergence theorem to distribute the gradient into
the integral (past the constant $f(y)$). The third step uses the
score function derivation of the gradient of the density. The fourth
step is again the definition of an expectation, where again $\theta$ is
fixed. The penultimate step involves the fundamental rule of Monte
Carlo integration, assuming $y^{(m)} \sim p_{Y \mid \Theta}(y \mid
\theta).$ The final step is the approximation resulting from using
only finitely many draws.
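Here is a minimal R sketch of the score-function estimator for a
univariate example chosen so the answer is checkable:
$Y \sim \textrm{normal}(\theta, 1)$ and $f(y) = y^2$, for which
$\mathbb{E}[f(Y) \mid \theta] = \theta^2 + 1$ and the exact gradient
is $2\theta$.

```{r}
set.seed(1234)
M <- 1e5
theta <- 1.5
y <- rnorm(M, mean = theta)  # y^(m) ~ p(y | theta) = normal(theta, 1)
f <- function(y) y^2
score <- y - theta           # grad_theta log normal(y | theta, 1) = y - theta
mean(f(y) * score)           # score-function estimate; exact value is 2 * theta = 3
```

The score-function estimator typically has higher variance than the
change-of-variables estimator, but it applies even when $f$ is not
differentiable or $Y$ is discrete.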
## Example: Monte Carlo expectation maximization (MC EM)
Given observed data $y$, unobserved (latent or missing) data $z$,
parameters $\theta$, and total data likelihood function $p(y, z \mid
\theta),$ the *maximum marginal likelihood estimate*, where it exists, is
$$
\textrm{arg max}_{\theta} \ p(y \mid \theta).
$$
The missing data $z$ is marginalized out of the total data likelihood
to produce the likelihood of the observed data,
$$
p(y \mid \theta) = \int p(y, z \mid \theta) \, \textrm{d}z.
$$
The expectation maximization algorithm works iteratively by selecting
an initial $\theta^{(1)}$ and then setting
$$
\theta^{(n + 1)}
= \textrm{arg max}_{\theta} \ Q(\theta, \theta^{(n)}),
$$
where the function $Q$ is defined by
$$
Q(\theta, \theta^{(n)})
= \int_Z
\log p(y, z \mid \theta)
\cdot p(z \mid \theta^{(n)}, y) \,
\textrm{d}z.
$$
In order to find $\theta^{(n+1)}$ it is helpful to have gradients
of the objective function $Q(\theta, \theta^{(n)})$ being optimized,
\begin{eqnarray*}
\nabla_{\theta} \, Q(\theta, \theta^{(n)})
& = &
\nabla_{\theta}
\int_Z
\log p(y, z \mid \theta)
\cdot p(z \mid \theta^{(n)}, y) \,
\textrm{d}z
\\[4pt]
& = &
\int_Z
\left( \nabla_{\theta} \log p(y, z \mid \theta) \right)
\cdot p(z \mid \theta^{(n)}, y) \,
\textrm{d}z
\\[4pt]
& \approx &
\frac{1}{M}
\cdot \sum_{m=1}^M
\nabla_{\theta} \log p(y, z^{(m)} \mid \theta),
\end{eqnarray*}
where $z^{(m)} \sim p(z \mid \theta^{(n)}, y).$
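Below is a minimal R sketch of MC EM for a hypothetical right-censored
normal model: $x_i \sim \textrm{normal}(\theta, 1)$, observed only up
to a censoring point, with the censored values playing the role of the
latent data $z$. The M-step for a normal mean has a closed form, so no
gradient-based optimizer is needed in this case.

```{r}
set.seed(1234)
theta_true <- 2; c_pt <- 2.5; N <- 500
x <- rnorm(N, theta_true, 1)
obs <- pmin(x, c_pt)                   # observed data y, censored at c_pt
cens <- x >= c_pt                      # censoring indicators
M <- 100                               # Monte Carlo draws per E-step
theta <- mean(obs)                     # initialize theta^(1)
for (n in 1:50) {
  # MC E-step: z^(m) ~ p(z | theta^(n), y), a normal(theta, 1) truncated
  # to (c_pt, Inf), drawn by the inverse-CDF method
  u <- matrix(runif(sum(cens) * M, pnorm(c_pt - theta), 1), ncol = M)
  z <- theta + qnorm(u)
  # M-step: the maximizer of Q for a normal mean averages the observed
  # values with the estimated expectations of the latent values
  theta <- (sum(obs[!cens]) + sum(rowMeans(z))) / N
}
theta                                  # should be close to theta_true = 2
```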
## Example: automatic differentiation variational inference (ADVI)
In a Bayesian model, the focus is on the posterior distribution
$p(\theta \mid y),$ where $y$ is observed data and $\theta$ are the
model parameters. Variational inference seeks to find a parametric
distribution $p(\theta \mid \phi)$ with parameters $\phi$ that
approximates the posterior $p(\theta \mid y)$. The goodness of fit is
measured by the *Kullback-Leibler (KL) divergence* from the approximating
distribution to the true posterior. The goal is to find the
parameters $\phi$ that minimize this divergence,
$$
\hat{\phi}
=
\textrm{arg min}_{\phi}
\ \textrm{KL}\!\left[p(\theta \mid \phi) \, \big|\big| \ p(\theta \mid y)\right].
$$
KL-divergence is defined by
$$
\textrm{KL}\!\left[\, p(\theta \mid \phi) \, \big|\big| \ p(\theta \mid y) \right]
= \int_{\Theta}
p(\theta \mid \phi)
\cdot \log \frac{p(\theta \mid \phi)}
{p(\theta \mid y)}
\, \textrm{d}\theta.
$$
Differentiating yields
\begin{eqnarray*}
\nabla_{\phi}
\
\textrm{KL}\!\left[\, p(\theta \mid \phi) \, \big|\big| \ p(\theta \mid y) \right]
=
\nabla_{\phi}
\int_{\Theta}
p(\theta \mid \phi)
\cdot \log \frac{p(\theta \mid \phi)}
{p(\theta \mid y)}
\, \textrm{d}\theta.
\end{eqnarray*}
For *automatic differentiation variational inference* (ADVI), the
approximating distribution is assumed to be multivariate normal
with mean and covariance parameters $\phi = (\mu, \Sigma)$,
$$
p(\theta \mid \phi) = \textrm{normal}(\theta \mid \mu, \Sigma).
$$
Using the normal change of variables approach described above
with $z \sim \textrm{normal}(0, \textrm{I})$ and
$\theta = g(z, \mu, \Sigma) = \mu + \textrm{cholesky}(\Sigma) \cdot
z,$ the gradient may be calculated by Monte Carlo methods as
\begin{eqnarray*}
\nabla_{\phi}
\
\textrm{KL}\!\left[\, p(\theta \mid \phi) \, \big|\big| \ p(\theta \mid y) \right]
& = &
\nabla_{\mu, \Sigma}
\int_{\Theta}
\textrm{normal}(\theta \mid \mu, \Sigma)
\cdot \log \frac{\textrm{normal}(\theta \mid \mu, \Sigma)}
{p(\theta \mid y)}
\, \textrm{d}\theta
\\[4pt]
& = &
\nabla_{\mu, \Sigma}
\int_{Z}
\textrm{normal}(z \mid 0, \textrm{I})
\cdot \log \frac{\textrm{normal}(g(z, \mu, \Sigma) \mid \mu, \Sigma)}
{p(g(z, \mu, \Sigma) \mid y)}
\, \textrm{d}z
\\[4pt]
& = &
\int_{Z}
\textrm{normal}(z \mid 0, \textrm{I})
\cdot \nabla_{\mu, \Sigma}
\log \frac{\textrm{normal}(g(z, \mu, \Sigma) \mid \mu, \Sigma)}
{p(g(z, \mu, \Sigma) \mid y)}
\, \textrm{d}z
\\[4pt]
& \approx &
\frac{1}{M} \cdot \sum_{m = 1}^M
\nabla_{\mu, \Sigma}
\log \frac{\textrm{normal}(g(z^{(m)}, \mu, \Sigma) \mid \mu, \Sigma)}
{p(g(z^{(m)}, \mu, \Sigma) \mid y)},
\end{eqnarray*}
where the $z^{(m)} \sim \textrm{normal}(0, \textrm{I}).$ The change of
variables in the second step uses the fact that
$\textrm{normal}(g(z, \mu, \Sigma) \mid \mu, \Sigma)$ multiplied by the
absolute Jacobian determinant of $g$ with respect to $z$ is exactly
$\textrm{normal}(z \mid 0, \textrm{I}).$ Because the function $g$ was
constructed to be smooth, the gradient of the expectation can be
calculated using automatic differentiation as long as the log posterior
density function $\log p(\theta \mid y)$ can be automatically
differentiated.
This brute-force derivation ignores the fact that $\mathbb{E}[\log
\textrm{normal}(\Theta \mid \mu, \Sigma) \mid \mu, \Sigma]$ is the
negative entropy of a normally distributed random variable $\Theta \sim
\textrm{normal}(\mu, \Sigma)$, the gradient of which is available in
closed form.
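To make this concrete, here is a minimal R sketch of the ADVI gradient
for a deliberately trivial one-dimensional case where the posterior is
itself a known normal, $p(\theta \mid y) = \textrm{normal}(\theta \mid
3, 2)$, so the optimum is $\mu = 3$, $\sigma = 2$; the gradients of the
log ratio are written out by hand here rather than by automatic
differentiation.

```{r}
set.seed(1234)
m <- 3; s <- 2          # hypothetical known posterior normal(m, s)
mu <- 0; sigma <- 1     # initial variational parameters
M <- 100                # draws per gradient estimate
lr <- 0.05              # gradient descent step size
for (iter in 1:2000) {
  z <- rnorm(M)                  # z^(m) ~ normal(0, 1)
  theta <- mu + sigma * z        # reparameterization theta = g(z, mu, sigma)
  # grad of log[normal(theta | mu, sigma) / p(theta | y)] at fixed z:
  # the numerator contributes only -1/sigma (to the sigma gradient);
  # the denominator contributes (theta - m) / s^2 times dtheta/d(mu, sigma)
  grad_mu <- mean((theta - m) / s^2)
  grad_sigma <- mean(-1 / sigma + (theta - m) / s^2 * z)
  mu <- mu - lr * grad_mu
  sigma <- sigma - lr * grad_sigma
}
c(mu, sigma)            # should be close to (3, 2)
```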
## References
[@mohamed:2019] is a thorough reference to using Monte Carlo gradients
in statistical and machine-learning algorithms. [@wei:1990] provides a
concise overview of gradients in the Monte Carlo EM
algorithm. [@kucukelbir:2017] introduces automatic differentiation
variational inference.