Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve step size adaptation target #3105

Open
nhuurre opened this issue Feb 22, 2022 · 17 comments
Open

Improve step size adaptation target #3105

nhuurre opened this issue Feb 22, 2022 · 17 comments

Comments

@nhuurre
Copy link
Contributor

nhuurre commented Feb 22, 2022

Summary:

The step size adaptation targets a suboptimal "acceptance statistic".

Description:

About two and half years ago issue #2789 pointed out a problem with the calculation of the adaptation target accept_stat. The issue was closed after a pull request with a proposed fix was completed but unfortunately the pull request was rejected and the problem is still present in Stan 2.29. This also came up on the forum last summer.

Background reading: https://arxiv.org/abs/1411.6669 finds the optimal stepsize for static HMC. In short: the ratio of accepted proposals to gradient evaluations is maximized when the average Metropolis acceptance probability is 0.8.
The acceptance probability is related to the energy error in the trajectory

w = exp(-H(q,p))
accept_prob = min(1,w_2/w_1)

where w_1 is the Boltzmann weight of the initial point and w_2 of the final point. The sampler selects the candidate sample with probability accept_prob and adaptation adjusts the step size up or down depending on whether accept_prob is above or below 0.8.

Before we try to apply this to Stan, I'll mention one thing that bugs me about this use of Metropolis acceptance probability as the adaptation target for static HMC. And that is: a significant number (about 1/3) of transitions have w_2 > w_1 and an acceptance probability saturating to 1. But the magnitude of the energy error should be informative even in this case; it would be more efficient to not ignore it.
To remedy this recall that the transition kernel is reversible: if q_1 -> q_2 is possible then so is q_2 -> q_1 and the rates at which these will happen are proportional to the Boltzmann weights w_1 and w_2. Taking the weighted average of acceptance probabilities in both directions gives a symmetric acceptance statistic

accept_stat = 2*min(w_1, w_2)/(w_1 + w_2)

which has the same mean as the saturating acceptance probability but smaller variance.

Now, of course, Stan's sampler is not static HMC. Instead of considering only the last point for acceptance or rejection, Stan chooses a Boltzmann-weighted multinomial sample from the entire trajectory, with bias towards the last doubling. Adaptation strategy is similar but now the acceptance statistic cannot simply be a Metropolis acceptance probability.

The original Hoffmann & Gelman NUTS paper proposed using the average of accept_prob for all points in the last half of the trajectory as the proxy accept statistic and that's how Stan calculated it until #352 changed it to be the average over all points in the entire trajectory (even the unconditionally rejected ones, should the trajectory terminate early).
I'm not sure which is better. On one hand short jumps that don't reach the last doubling should not be ignored, on the other hand they contribute much less to the effective sample size and I don't think you should count them as a full accepted sample when trying to maximize the number of accepted samples per computational effort.

Either way, I'm going to point out that--just like with static HMC--you can reduce the variance without changing the mean by replacing the Metropolis probability with the symmetric acceptance statistic.

Next up, issue #2789. The proposal there is that instead of a simple arithmetic average, accept_stat should be calculated as a weighted average. This makes some amount of sense but there are two issues that concern me.

  1. the proposed weights are Boltzmann weights without considering the last-doubling bias so they do not really correspond to selection probabilities
  2. the problem of counting short jumps as full acceptances may be even worse if very short distances tend to have smaller energy errors (and hence, higher weights)

Also, it is not obvious whether the statistic to average should be the (saturating) Metropolis acceptance probability or the symmetrized acceptance statistic--unlike with the previous accept_stat proxies these give different results with Boltzmann weighting.

Finally, let's consider an alternative idea: what if you use the probability of selecting from the last doubling as the accept_stat proxy?
The sampler implements the last-doubling-bias in a very Metropolis-like fashion: if the total weight of the second subtree is greater than the first select a sample from the second, otherwise select it with probability W_2/W_1 (where Ws are summed Boltzmann weights over the entire subtree). You could use this probability as the adaptation proxy with the rationale that the last doubling is what gives you most effective samples.
There's two problems with this idea:

  1. there's no theoretical guidance for selecting the optimal target value--the aforementioned static HMC analysis only applies to pointwise energy errors but this new accept_stat depends on trajectory-averaged Boltzmann weighs
  2. empirically, it's totally unsuitable as an adaptation target because it's wildly non-monotonic

acceptstat

I think the non-monotonicity in this plot mostly results from the symmetry of the 50-dimensional normal distribution. In a normal distribution the sampler moves in circles around the mean; at first the energy error increases until the simulated particle has reached 90 degrees from the starting point; but then starts decreasing and returns to almost zero when the movement is at 180 degrees around. If the stepsize is a power-of-two fraction of a full cycle then the two halves of the final trajectory are always symmetrical and have equal weight regardless of pointwise energy errors within them.

So, what now? The last doubling acceptance rate is an attractive proxy for effective sample size per computational cost (in fact, in the above example effective sample size also goes up and down in sync with fluctuations in the acceptance rate) if you could somehow make it strictly monotonic. One possible monotonic proxy can be created by taking the symmetric last doubling acceptance statistic

accept_stat = 2*min(W_1, W_2)/(W_1 + W_2)

and applying Jensen's inequality to derive a lower bound for it

lower_bound_stat = 2*sum_i(min(w_1, w_i))/(W_1 + W_2)

where w_1 is the average weight in the first subtree and i ranges over the second subtree.

Current Version:

v2.29.0

@betanalpha
Copy link
Contributor

betanalpha commented Mar 1, 2022 via email

@nhuurre
Copy link
Contributor Author

nhuurre commented Mar 31, 2022

(oops, took longer than I expected to get back to this...)

To be precise the derivation in that paper is defined not just for static integration times but also for the original Hybrid Monte Carlo implementation of Hamiltonian Monte Carlo where only the final state of the numerical Hamiltonian trajectory is considered.

Yes, sorry, the important difference is not static vs dynamic integration time but multinomial sampling from the whole trajectory vs endpoints only.

the entire calculation is predicated around that particular mathematical form. In particular the Metropolis form defines the objective function

The optimality criterion depends only the expectation value of the objective function. There are many different objective functions that give the same expectation value.

  • a0 = 1 if the proposal was accepted else 0
  • a1 = min(1,w_2/w_1)
  • a2 = 2*min(w_1,w_2)/(w_1+w_2)
  • any linear mixture of the above

These all give the same adaptated stepsize but a2 has the smallest variance and hence the fastest convergence.

These also differ when trying to generalize to multinomial selection.

  • a0 is not suitable for averaging over many proxy transitions
  • if the weights are uniform then a1 and a2 lead to the same target stepsize but a2 again has a smaller variance
  • if the weights are Boltzmann-weights then a1 always leads to a target stepsize that is larger than that of a2

The proposed change in #2789 went from uniform weights across this ensemble of proxy Metropolis transitions to the multinomial weights, allowing the final states that are more likely until the actual multinomial sampling to influence the adaptation statistic more. The intuition is that this provides a better approximation of the exact multinomial transition as an ensemble of Metropolis transitions

The part I don't understand is why use a weighting that provides "a better approximation" (than uniform) instead of just using the exact multinomial sampling probabilities as weights. The next paragraph ("The challenge [...] is combining all of the conditional sub-tree probabilities") sounds to me like you meant to say "yes, it would be better to weight by the actual selection probabilities but those are too difficult to compute" but I'm not sure.

think that the first step in considering a more heuristic acceptance statistic/proxy that isn’t an average of weighted Metropolis acceptance probabilities, and hence somewhat connected to the original analysis for why 0.8 is the more robustly optimal adaptation target, is an empirical study similar to what I performed for #2789

While "last-doubling-lower-bound accept_stat" was derived from different considerations, numerically it is fairly similar to the accept_stat in the original NUTS paper. It can be computed by substituting the mean weight of the first subtree for the weight of the initial point in the expression for the original accept_stat.

Taking inspiration from the tests you reported in https://discourse.mc-stan.org/t/request-for-volunteers-to-test-adaptation-tweak/9532/53 here's some plots. All with diagonal mass matrix, 4 chains and 2000 adaptation iterations.

normal

Here symmetric is the average of a2 from before as opposed to Stan's current "average of a1" accept_stat. The apparent difference between them is due to incomplete adaptation; with term_buffer=200 Stan's accept_stat yields a plot indistinguishable from the above symmetric.

I wasn't sure if "correlated normal rho=0.5" meant every off-diagonal entry in the correlation matrix is 0.5, or only adjacent variables have correlation 0.5 and long-distance correlations fall off accordingly. Looking at these results, it was the latter.

multinormal075

multinormal025

student3

Let's throw in Neal's funnel too, x[1] ~ normal(0,1), x[2:] ~ normal(0, exp(x[1])). A red dot means there are post-warmup divergences and this time adapt_delta=0.9.

funnel

@betanalpha
Copy link
Contributor

betanalpha commented May 4, 2022 via email

@nhuurre
Copy link
Contributor Author

nhuurre commented May 5, 2022

do you have a reference for E[a2] = E[a1] and Var[a2] < Var[a1]?

My thinking on this is that HMC generates not just a Markov chain of points

q_1 -> q_2 -> q_3 -> q_4 -> ...

but also a Markov chain of trajectories

T_1 -> T_2 -> T_3 -> T_4 -> ...

with q_i,q_{i+1} \in T_i and the trajectory T_i is thought of as an unordered set.
The trajectory distribution is a more general object than the target distribution because a multinomial sample from a random trajectory is a random sample from the target distribution.
The acceptance statistic is a function a1(q_i, T_i) so more precisely you need the Markov chain

(q_1,T_1) -> (q_2,T_2) -> (q_3,T_3) -> (q_4,T_4) -> ...

but I'm pretty sure the distribution this samples from in the long run is the same as sampling T from the trajectory distribution and then drawing an initial point q conditional on that trajectory P(q|T) \propto w(q).
In this form marginalization is straightforward.

a2(T) = sum_i[a1(q_i, T)w_i] / sum_i[w_i]

where the sums range over all q_i \in T. The previous a2(q_1,q_2) = 2 min(w_1,w_2)/(w_1+w_2) is the result when each trajectory contains exactly two points.

(so, to answer your question, no, I don't have a reference, just a "I'm pretty sure...")

I think the confusion may be in to what approximation I was referring.

The confusion was what "multinomial" refers to as I would have said "unbiased multinomial weights" for what you call "multinomial sampling weights" and "biased multinomial weights" for subtree-probability adjusted weights. (I don't even know why it's called multinomial since there's only one draw per subtree...)

Keeping track of the exact multinomial weights is straightforward, but the subtree probabilities are harder to propagate through the recursion.

It's not insurmountable. The current recursion accumulates subtree weights and sums after every doubling.

sum_accept_stat += sum_accept_stat_subtree;
weight += weight_subtree;

and the final accept_stat = sum_accept_stat/weight. But the update can be written in more clearly recursive manner as

accept_stat = (1-p)*accept_stat + p*sum_accept_stat_subtree/weight_subtree;

where p = weight_subtree/(weight + weight_subtree) is the (unbiased multinomial) probability of accepting the new subtree.
Replacing that p with the true subtree acceptance probability gives an adjusted accept_stat that counts every point with the exact selection probability.

How many runs were done for each dimension? Are you plotting the average?

Only one run per dimension; I figured adjacent dimensions are similar enough to be approximate replicates and the zig-zag gives a feel for run-to-run variability.
n_leapfrog and accept_stat are averages, step size is for the first chain, and ESS is the first dimension (effectively a random direction, except for the funnel where it's the funnel direction (probably the slowest mixing?))

What does the effective sample size vs number of gradient evaluations look like?

ess-leapfrog

Where there E-FMI warnings for the higher dimensionality targets?

I forgot to check that. Running again with arviz.bfmi() shows

funnel-efmi

So, very bad E-FMI. (Unsurprising, I guess, since ESS was also quite low.)


I was going to write more about adapting arxiv:1411.6669 to multinomial selections but re-reading it I realized I misunderstood it and am utterly confused.
The cost measure used in the paper is (page 3)

The number of attempts required to produce an accepted proposal follows a geometric distribution with the probability of success a(q)

and computes the cost as

the expected cost of generating an accepted proposal is given by averaging the expected number of rejections over the position space,

This is the expected cost to the first accepted proposal starting from a random initial state. But isn't the quantity of interest the total cost divided by the total number of acceptances in a long-running chain? First time reading, I thought these were the same thing but

  • The expected number of proposals to first acceptance is the expected residence time E[1/a(q)]. It is also the expected number of proposals to the next acceptance when you start from a random point in a very long chain
  • The number of acceptances in a chain is proportional to the probability that, when picking a random point from the chain, the next proposal is accepted. The frequency of the state q in the chain is p(q) so this number is sum_q(p(q)a(q))=E[a(q)] and consequently the long-term average proposals per acceptances is 1/E[a(q)]

They're not equal! Why does the paper (and Beskos et al) optimize residence time instead of the total number of acceptances?

@nhuurre
Copy link
Contributor Author

nhuurre commented May 7, 2022

do you have a reference for E[a2] = E[a1] and Var[a2] < Var[a1]?

no, I don't have a reference, just a "I'm pretty sure..."

If you want a paywalled article, the abstract of https://doi.org/10.1016/S0378-3758(99)00079-8 begins

We introduce a form of Rao–Blackwellization for Markov chains which uses the transition distribution for conditioning. We show that for reversible Markov chains, this form of Rao–Blackwellization always reduces the asymptotic variance

which sounds a lot like filling in the details for the "I'm pretty sure" part but I can't promise you'll get your money's worth if you pay Elsevier $9.50 for 24 hours of access.

@bob-carpenter
Copy link
Contributor

You can find a pdf on Google Scholar.

@betanalpha
Copy link
Contributor

betanalpha commented Jun 1, 2022 via email

@nhuurre
Copy link
Contributor Author

nhuurre commented Jul 6, 2022

in my opinion the acceptance statistic in the old PR is the best match to the current sampler behavior both theoretically and empirically.

I proposed two changes

  1. Metropolis adjusted weights instead of raw Boltzmann weights
  2. Rao-Blackwellized accept_prob for the proxy transitions

In my opinion (1) takes us closer to the current sampler behavior and the effect of (2) is not theoretically understood but could plausibly be an improvement.

Were we to use that Rao-Blackwellized estimator then I agree that your suggested “acceptance statistic” would be appropriate. Because we don’t use that estimator, however, I think that that acceptance statistic would in general lead to an over-aggressive adaptation relative to the estimator that we do use.

I must say, Michael, sometimes I find it very difficult to follow your logic. Rao-Blackwellization does not change the expected value of the estimator and previously you said

I agree that the step size optimization procedure will hold for any objective function with the same expectation value as the Metropolis acceptance probability

but now you've decided that Rao-Blackwellizing the adaptation objective is appropriate only if every other estimator is Rao-Blackwellized as well.

Let me respond a little bit out of order.

I don't think piecemeal replies to individual statements is getting us on the same page so I'll just recap the whole theory behind the step size adaptation.

Random-Walk Metropolis-Hastings MCMC

The story begins with the Metropolis-Hastings algorithm. The algorithm samples from the distribution $p\left(q\right)\propto e^{-V\left(q\right)}$. The proposal for the next state in the Markov chain is $q^{\star}=q+d$ where $d\sim N\left(0,\sigma\right)$ and it is accepted with probability $e^{V\left(q\right)-V\left(q^{\star}\right)} = e^{-\Delta\left(q,q^{\star}\right)}$

The theory for the optimal “step size” $\sigma$ is based on the paper

Roberts, Gelman, Gilks: Weak Convergence and Optimal Scaling of Random Walk Metropolis Algorithms. (The Annals of Applied Probability, 1997)

They assume a target distribution that decomposes into a very large number of IID variables $\left(q_{i}\right)_{i=0}^{N}$ . I'm going to skip the details of the argument and just give a brief overview of the key points.

Due to the IID assumption the potential is a sum
$$V\left(q\right) = \sum_{i=0}^{N}V\left(q_{i}\right)$$
They separate a single component from the ensemble for closer study. The acceptance probability is determined by
$$\Delta\left(q,q^{\star}\right) = \Delta_{0} + \sum_{i=1}^{N}\Delta_{i} = \Delta_{0}+\Delta_{\mathcal{E}}$$

The ensemble is a sum of independent random variables so (assuming suitable regularity conditions) central limit theorem applies and the ensemble $\Delta_{\mathcal{E}}$ is normally distributed; one just needs to sum the means and variances of the components. Each component can be approximated with Taylor expansion
$$\Delta_{i}\left(q_{i},q_{i}^{\star}\right) \approx V^{\prime}\left(q_{i}\right)d_{i} + \frac{1}{2}V^{\prime\prime}\left(q_{i}\right)d_{i}^{2}$$
that is a random variable with mean $\frac{1}{2}V^{\prime\prime}\left(q_{i}\right)\sigma^{2}$ and variance $V^{\prime}\left(q_{i}\right)^{2}\sigma^{2}+O\left(\sigma^{4}\right)$ . Furthermore, once our Markov chain has converged, $q_{i}$ are samples from the target distribution so the sums can be approximated by expected values
$$\frac{1}{N}\sum_{i}F\left(q_{i}\right)\approx\int\mathrm{d}q p\left(q\right)F\left(q\right)$$
The two integrals are easily show to be equal; define
$$I\overset{\mathrm{def}}{=}\int\mathrm{d}qp\left(q\right)V^{\prime}\left(q\right)^{2}=\int\mathrm{d}q p\left(q\right)V^{\prime\prime}\left(q\right)$$

All that is to say, the ensemble follows the normal distribution $\Delta_{\mathcal{E}}\sim N\left(\frac{1}{2}NI\sigma^{2},NI\sigma^{2}\right)$
(For the acceptance probability to be non-negligible we must have $\Delta_{\mathcal{E}} = O\left(1\right) $ and $\Delta_{0} = O\left(\frac{1}{\sqrt{N}}\right) \ll 1$ )
Now it's straightforward to integrate out $\Delta_{\mathcal{E}}$ in $a=1\wedge e^{-\Delta_{0}-\Delta_{\mathcal{E}}}$ . The marginal acceptance probability is
$$a\left(q,\Delta_{0}\right)\approx2\Phi\left(-\frac{1}{2}\sqrt{NI}\sigma\right)\left(1-\frac{1}{2}\Delta_{0}\right)$$
The $\Delta_{0}$ dependence here changes the overall acceptance probability less than you might think; it's effect is mainly that conditioning on acceptance biases the distribution of $d_{0}$ so that it's approximately $N\left(-\frac{1}{2}V^{\prime}\left(q_{0}\right)\sigma^{2},\sigma^{2}\right)$ .

In the limit $N\to\infty $ (where also $\sigma\to0$ ) this (biased) random walk becomes a diffusion process whose speed is determined by the diffusion coefficient D aka the mean squared jump distance $D=a\sigma^{2}\propto a\Phi^{-1}\left(\frac{a}{2}\right)^{2}$ . This is maximized when a=0.234.

Varying acceptance probability

In the above discussion the Metropolis acceptance probability is entirely determined by the ensemble average and does not meaningfully depend on any individual component. Consequently the diffusion coefficient is constant across the entire typical set and (as Roberts, Gelman, and Gilks point out) all measures of efficiency are equivalent.
However, in finite dimensions $a\left(q\right) $ may vary across the typical set and the “effective diffusion speed” must be some kind of average of the diffusion coefficient, for example $\mathbb{E_{\mathnormal{q}}}\left[D\left(q\right)\right] $ or $\mathbb{E_{\mathnormal{q}}}\left[\sqrt{D\left(q\right)}\right]^{2} $ or whatever. The preferred choice in the literature is the harmonic mean (aka mean residence time) $\mathbb{E_{\mathnormal{q}}}\left[D\left(q\right)^{-1}\right]^{-1}$ although I haven't seen any explanation as to why. After thinking about it a bit and doing a couple of experiments, I'm moderately certain this probably is the best choice. Since everyone else seems to think it's too obvious to explain I'm not going to discuss it more.

Rao-Blackwellization

One more thing before we move on to Hamiltonian Monte Carlo. The adaptation algorithm needs an estimate of the average acceptance probability. The obvious estimator is just the acceptance probability for the current transition $a\left(q,q^{\star}\right)=1\wedge\frac{w^{\star}}{w}$ but that is not necessarily the optimal one.
A paper I linked earlier in this thread discussed a way to Rao-Blackwellize MCMC estimators. It's not exactly the same thing but here's my idea: compute the average of the acceptance probabilities for transitions in both directions
$$a_{\mathrm{RB}}\left(q,q^{\star}\right)=\frac{w}{w+w^{\star}}a\left(q,q^{\star}\right)+\frac{w^{\star}}{w+w^{\star}}a\left(q^{\star},q\right)=\frac{2\left(w\wedge w^{\star}\right)}{w+w^{\star}}$$

This should be a more stable estimator with the same mean. I haven't tested it in RWMH but this marginalization can also be applied to Stan's current accept_stat. (It's not obvious at all that the same formula would be correct with a more complicated algorithm like NUTS; that it works in specific case is a convenient coincidence.) My previous plots labelled this the “symmetric” accept_stat. Here's a direct comparison:

symmetric

My takeaways are:

  1. Stan's default step size adaptation window (term_buffer=50) is too short for the step size to converge.
  2. Rao-Blackwellized objective converges to the same asymptotic step size significantly faster.

Metropolis HMC

Now, let's get to HMC. “Metropolis HMC” (a term I just made up, usually called just HMC) is like RWMH but the proposal $q^{\star}$ is constructed as the endpoint of a trajectory (of fixed length L) that follows the Hamiltonian flow. This allows the point to move much farther with high acceptance probability but requires multiple evaluations of the target density along the trajectory.

The relevant paper is

Betancourt, Byrne, Girolami: Optimizing The Integrator Step Size for Hamiltonian Monte Carlo (2014)

They analyze the Hamiltonian flow and find that in the limit of either infinite dimensions or vanishing step size the distribution of the energy error converges to normal.

$$-\Delta\left(q,q^{\star}\right)\sim N\left(-\frac{1}{2}\alpha\epsilon^{4},\alpha\epsilon^{4}\right) $$

(though they silently switch to (equivalent) $N\left(-\alpha\epsilon^{4},2\alpha\epsilon^{4}\right)$ in subsequent calculations, probably to throw off anyone trying to follow along.)

The cost of building the trajectory is $\frac{L}{\epsilon}$ and, provided L is short enough that movement is diffusive, effective samples per iteration is determined by the diffusion coefficient $D\left(q\right)=\mathbb{E_{\mathnormal{p}}}\left[a\left(q,p\right)L^{2}\right]$ (just like it was with RWMH) (Of course, we'd like to have L long enough to avoid diffusion but hopefully that won't change the optimality criterion—if there's an argument that justifies this cost measure without reference to a diffusion process then I've missed it.)
The total cost per effective sample is then $\propto\frac{1}{\epsilon}\mathbb{E_{\mathnormal{q}}}\left[\frac{1}{D\left(q\right)}\right]\propto\frac{1}{\epsilon}\mathbb{E_{\mathnormal{q}}}\left[\frac{1}{\mathbb{E_{\mathnormal{p}}}\left[a\left(q,p\right)\right]}\right]$ . They then point out that this can be bounded from above and below

$$\frac{1}{\mathbb{E_{\mathnormal{q}}}\left[\mathbb{E_{\mathnormal{p}}}\left[a\left(q,p\right)\right]\right]}\leq\mathbb{E_{\mathnormal{q}}}\left[\frac{1}{\mathbb{E_{\mathnormal{p}}}\left[a\left(q,p\right)\right]}\right]\leq\mathbb{E_{\mathnormal{q}}}\left[\mathbb{E_{\mathnormal{p}}}\left[\frac{1}{a\left(q,p\right)}\right]\right] $$

and the expectations $\mathbb{E_{\mathnormal{q}}}\left[\mathbb{E_{\mathnormal{p}}}\left[\cdot\right]\right] $ can be computed from the previous result $-\Delta\left(q,q^{\star}\right)\sim N\left(-\frac{1}{2}\alpha\epsilon^{4},\alpha\epsilon^{4}\right) $ .

Figure 3b of the paper compares these theoretical bounds to empirical efficiency estimates:

figure3b

You may notice that the d=1000 the line follows the lower bound very closely. I believe that is because in high dimensions $a\left(q\right)=\mathbb{E_{\mathnormal{p}}}\left[a\left(q,p\right)\right] $ is constant across the typical set and consequently $\mathbb{E_{\mathnormal{q}}}\left[a\left(q\right)^{-1}\right]=\mathbb{E_{\mathnormal{q}}}\left[a\left(q\right)\right]^{-1} $ (i.e. the so-called lower bound is actually exact.)

Their recommended adaptation target is the minimum of the cost upper bound at a=0.8.

In theory, Rao-Blackwellizing the acceptance probability estimator should work the same here as with RWMH.

Multinomial HMC

An HMC trajectory contains $\frac{L}{\epsilon}$ points but Metropolis HMC only selects either endpoint. Stan, however, selects a random point from the entire trajectory with multinomial probabilities. Ensuring reversibility and detailed balance in this situation is a bit complicated but the specifics of the algorithm don't matter for our purposes.

The theory for optimal tuning of multinomial transitions is actually pretty simple: there's no theory, we just take the initial point and the selected final point, then pretend this was a Metropolis HMC transition all along.

This can be Rao-Blackwellized in a couple of different ways:

  1. Don't use just the selected final point but marginalize over all potential final points in the trajectory. This is what Stan currently approximates (poorly) and what the old “Boltzmann-weights” pull request approximates better.
  2. Sum over not only all possible final points but also all possible initial points that could have created this trajectory. Such marginalization would give even more stable estimator but is difficult to implement efficiently.
  3. When computing the fake Metropolis HMC acceptance probability, use Rao-Blackwellized $a_{\mathrm{RB}}\left(q,q^{\star}\right)$ instead of the usual one. Unlike with true Metropolis HMC this changes the expected value because the fake Metropolis weights do not really match the multinomial acceptance probabilities.

Multinomial HMC, take two

Let's try that again.

The total cost of building the trajectory is $\frac{L}{\epsilon} $ and, provided L is short enough that movement is diffusive, effective samples per iteration is determined by the diffusion coefficient $D\left(q\right)=\mathbb{E_{\mathnormal{p}}}\left[l^{2}\left(q,p\right)\right] $ where $l^{2}\left(q,p\right) $ is the expected squared distance of the transition. With Metropolis HMC we had $l^{2}\left(q,p\right)=a\left(q,p\right)L^{2} $ but multinomial sampling allows for a more complex relationship. The previous bounds still apply

$$\frac{1}{\mathbb{E_{\mathnormal{q}}}\left[\mathbb{E_{\mathnormal{p}}}\left[l^{2}\left(q,p\right)\right]\right]}\leq\mathbb{E_{\mathnormal{q}}}\left[\frac{1}{\mathbb{E_{\mathnormal{p}}}\left[l^{2}\left(q,p\right)\right]}\right]\leq\mathbb{E_{\mathnormal{q}}}\left[\mathbb{E_{\mathnormal{p}}}\left[\frac{1}{l^{2}\left(q,p\right)}\right]\right]$$

(Of course, the challenge is finding the distribution of $l^{2}\left(q,p\right)$ )

The precise distance is difficult to keep track of but if we replace “metric distance” with “integration time” and approximate it as being constant inside a subtree then we get a “multinomial acceptance statistic”
$$\frac{l^{2}\left(q,p\right)}{L^2}\approx a_{M}\left(q,p\right)\overset{\mathrm{def}}{=}p_{1}+\frac{1}{4}p_{2}+\frac{1}{16}p_{3}+\frac{1}{64}p_{4}+\cdots$$
where $p_{1}$ is the probability of selecting from the last doubling, $p_{2}$ the probability for the second-to-last doubling, $p_{3}$ for the third-to-last doubling, and so on. This partitioning into doublings follows Stan's recursive sampling algorithm closely and thus is easy to compute on the fly. (In the first post of this thread I suggested using the acceptance probability of the last doubling as a proxy; it is almost equal and in fact the plots labeled “last doubling accept stat” were actually this slightly larger value.)

I'll end this post with a sampling cost vs accept stat plot similar to the “Optimizing the step size” figure 3b.

The lower and upper bounds in that paper were for Metropolis HMC with fixed integration time but NUTS chooses a power-of-two multiple of the step size. Changing the integration time by a factor of two may change the efficiency also by a factor of two so I've plotted the bounds as bands instead of lines.

The blue line is 100 dimensional iid normal, the orange is also 100 dimensional normal but correlated.

Top left accept stat is Stan's current one; top right is the one from the old pull request; bottom left is the "multinomial" one described in this post, and bottom right is the same but with the last doubling modified as I described in the first post of this thread.

cost

Effective sample size is measured with arviz.ess(method='mean') although since the marginals are normal bulk-ESS should be equal.

The lower and upper bounds in the top plots are those from the “Optimizing the step size” paper. The lower and upper bounds in the bottom plots are $\frac{1}{\epsilon\mathbb{E}\left[a_{M}\left(q,p\right)\right]} $ and $\frac{1}{\epsilon}\mathbb{E}\left[\frac{1}{a_{M}\left(q,p\right)}\right] $ estimated via a messy numerical calculation built on dubious assumptions about the shape and distribution of Hamiltonian trajectories.

@SteveBronder
Copy link
Collaborator

Thanks @nhuurre this is neat! Quick Q. For the last graph would it be better to show the 100 IID normal and correlated normal on two seperate graphs instead of breaking them up into 4 graphs? (i.e. all 100 IID normal on one graph color coded by the different acceptance statistics and all correlated normal on another graph).

Also are all of the graphs on the same y axis scale?

Am I intepreting that bottom left graph correctly that it tends to not do as well as the other accept stats almost everywhere? I'm not sure how to think about the blue line there zig zagging. If you could talk more about what you infer from the last graph that would be nice as I wasn't totally sure what I should be pulling it from it.

Also wanted to clarify, are the bands in that graph the approximate bounds of the step size? So for the bottom two its taking smaller step sizes at lower acceptance stats?

For the second to rao blackwellization in the last graph, does the final ESS / n_leapfrog have to do twice of leapfrogs because it computes both sides? I think I might just be misunderstanding that graph / math.

Also is this all written with Stan's HMC? If so would it be possible to see ESS/second? If not then I can understand leaving it out as raw speed is going to be different depending on the language.

Also if you have a script for all of this it might be neat to apply it to the models in posteriordb just to get an idea of how all this works across more models

Do you have a PR that combines these ideas together? It sounds like that is what you would like to have put into the algorithms

@nhuurre
Copy link
Contributor Author

nhuurre commented Jul 12, 2022

For the last graph would it be better to show the 100 IID normal and correlated normal on two seperate graphs instead of breaking them up into 4 graphs?

The different "accept stats" aren't exactly measuring the same thing so I wouldn't put them all on the x-axis of the same graph.
Putting them on the y-axis though is fine. The graph may be more interpretable when everything is presented as a function of step size (that's how they were generated anyway)

stepsize-acceptstat

The previous horizontal zig-zag is now seen as vertical zig-zag in the top left plot. It is due to step size resonances that cause a "surprising" increase in acceptance probability (and a small increase in ESS).

If you could talk more about what you infer from the last graph that would be nice as I wasn't totally sure what I should be pulling it from it.

Yeah, I'm not sure really. I guess to me it looks like there's not much difference between the different accept_stats, in every case the empirical lines are roughly consistent with the theoretical bounds, with the minimum cost / maximum efficiency predicted around $a\approx0.8$.

Do you have a PR that combines these ideas together?

I have a local branch that computes various accepts_stats. I pushed the last one on this branch https://github.com/nhuurre/stan/tree/ldlb-accept-stat

@betanalpha
Copy link
Contributor

betanalpha commented Aug 22, 2022

It's hard not to respond piecemeal because there are multiple misunderstandings threaded throughout that complicate inline responses. Let me try to clarify a few important points that I think may be causing some of the confusion.

Cost Functions For Optimizing Metropolis-Hastings Transitions

The preferred choice in the literature is the harmonic mean (aka mean residence time) although I haven't seen any explanation as to why.

The standard argument for this form doesn't really consider the diffusion limit at all. Instead it introduces a cost function based on how expensive it will be to generate an accepted proposal. At the initial point $q$ the probability of an acceptance is

$$ E_{q'} [ a(q', q) ] $$

and the number of rejections until an acceptance $R$ follows a geometric distribution with mean

$$ E[ R | q ] = \frac{ 1 }{ E_{q'} [ a(q', q) ] }. $$

The average number of rejections across all initial points is then that harmonic mean,

$$ E[R] = E[ E[R | q] ] = E_{q} \left[ \frac{ 1 }{ E_{q'} [ a(q', q) ] } \right]. $$

(Of course, we'd like to have L long enough to avoid diffusion but hopefully that won't change the optimality criterion—if there's an argument that justifies this cost measure without reference to a diffusion process then I've missed it.)

Betancourt et al (2014) doesn't make any mention of a diffusion limit in motivating its cost function. On page 3 it instead lays out the above argument, adding in the cost of generating each numerical trajectory in those intermediate rejections. Consequently the results to apply for any trajectory length within the assumption that the cost of intermediate rejections dominates the overall performance of the sampler.

This last assumption is not to be taken for granted. This cost function is missing any quantification of the effectiveness of an accepted proposal, instead treating all acceptances as the same. In cases where there is a related diffusion limit this makes sense, but there's no guarantee that it will generalize. That said, it does seem to capture the optimal behavior in practice even when Hamiltonian trajectories are long enough that they really can't be considered diffusive by any metric.

The Many Meanings of Rao-Blackwell

One of the sources of confusion here is the term "Rao-Blackwell" being used to describe similar but formally distinct operations.

For example "Rao-Blackwell" is often used to describe a Markov transition that generalizes the standard Metropolis-Hastings transition. These generalized transitions attempt auxiliary proposals after each rejection, drastically increasing the probability of an acceptance. Note that this also fundamentally changes the relationship between the proposal step size and the average acceptance probability, and hence what the optimal step size or optimal average acceptance probability will be.
The various Hamiltonian Monte Carlo samplers under discussion can kind of sort of be considered from this perspective, but it's complicated; see for example http://arxiv.org/abs/2012.14881v1.

"Rao-Blackwell" can also be used to describe Markov chain Monte Carlo estimators that take advance of multiple intermediate states generated in a proposal. For example consider Hamiltonian Monte Carlo. Any Hamiltonian Monte Carlo method will generate an entire numerical Hamiltonian trajectory with each state offering useful information. Even if the final Markov transition only uses one of these points (the end point for the original Hybrid Monte Carlo sampler, a random point based on the Boltzmann weights for multinomial sampler) those intermediate points can be used in the construction of Markov chain Monte Carlo estimators with a little bit of care.

When implemented correctly these Rao-Blackwellized Markov chain Monte Carlo estimators will be more precise than the standard Markov chain Monte Carlo estimators that use only the Markov chain states, which then modifies the relationship between the sampler configuration and the estimator effective sample sizes. In particular these Rao-Blackwellized estimators can afford a large step size, and hence lower cost, without compromising the effective sample size, and hence overall sampler performance.

The subtlety here is that there are now two notions of a "Rao-Blackwellized average acceptance probability estimator".

Notion One

One notion arises when one endeavors to estimate an appropriate optimization statistic for a given sampler -- for example the Hybrid Monte Carlo sampler or the sliced NUTS sampler or the multinomial sampler -- using the entire intermediate numerical Hamiltonian trajectory instead of just the initial and accepted states. This first requires, however, the construction of an optimization statistic appropriate to the desired sampler. The basis of this issue is that Stan currently uses an estimator for an optimization statistics appropriate for the Hybrid Monte Carlo sampler not the multinomial sampler, and this results in suboptimal behavior when trying to optimize the multinomial sampler.

In my original pull request I introduced an estimator of a modified optimization statistic appropriate to the multinomial sampler. This estimator uses all of the states in each numerical Hamiltonian trajectory and I demonstrated that it yielded better performance after adaptation. Because it uses all of the states in each numerical trajectory it could be considered a Rao-Blackwell estimator.

@nhuurre you introduced another estimator which you claimed, at least according to my understanding, estimated the same multinomial optimization statistic but with a smaller variance by squeezing more information from each numerical trajectory (or generating more states entirely). You haven't yet shown any calculations demonstrating these claims -- same expectation but smaller variance for yours -- and the numerical experiments have yet to directly compare my proposed method to yours. As I mentioned earlier in the thread the smoking gun for "Rao-Blackwellization" would be the same asymptotic sampler performance but faster convergence for the estimator with the smaller variance.

Notion Two

The second notion that we might consider is how to optimize performance when using Rao-Blackwellized Markov chain Monte Carlo estimators, that is instead of using

$$ \hat{f} = \frac{1}{N} \sum_{n = 1}^{N} f(q_n) $$

as the final Markov chain Monte Carlo output using

$$
\hat{f}{\mathrm{RB}} = \frac{1}{N} \sum{n = 1}^{N} \frac{ \sum_{z \in t_{n} } w(z) f(q(z)) }{ \sum_{z \in t_{n} } w(z) }
$$

where $t_{n}$ is the entire numerical trajectory used to generate the $n$th Markov transition.

The analyses in Gilks et al (1997) and Betancourt et al (2014) concern how to optimize Metropolis-Hastings samplers when the output is $\hat{f}$. When using $\hat{f}_{\mathrm{RB}}$ these analyses are not appropriate -- in general they will lead to step sizes that aren't aggressive enough.

That said one might consider using a modified optimization statistic that would lead to approximately optimal behavior for the $\hat{f}_{\mathrm{RB}}$ Markov chain Monte Carlo estimators, similar to using the Boltzmann weighted Metropolis acceptance probability to approximately optimize the multinomial sampler using the Metropolis-based results of Betancourt et al (2014).

Which Notion?

My previous claim is that the optimization statistic introduced by @nhuurre is well suited to the particular optimization problem of Notion 2. In other words I don't think that its expectation is actually equivalent to the multinomial optimization statistic, in which case it would not be a Rao-Blackwellized average acceptance probability estimator in the sense of Notion 1. Instead it would be, or so I claim, a Rao-Blackwellized average acceptance probability in the sense of Notion 2.

I agree that a proper Rao-Blackwellization in the sence of Notion 1 will always be better, but I don't think that there has been any verification that any of these optimization statistics can be considered Rao-Blackwellizations of others.

Empirical Results

The n_leapfrog / ESS verses average acceptance probability (really the "optimization statistic" since most of these aren't actually Metropolis acceptance probabilities" is great. I also think that it very clearly supports the claims of my original pull request.

In the top-left plot we see that for these two problems the current Stan behavior is optimized not for the default adapt_delta = 0.8 but rather a value closer to 0.5-0.6. This is more in line with the theoretical Hybrid Monte Carlo upper bound, which is optimized around 0.6, than the theoretical lower bound which is optimized around 0.8.

The top-right plot shows that, again at least for these problems, the modified optimization statistic in my original pull request results in optimal behavior right around adapt_delta = 0.8 in line with the current default value. This is consistent with the original empirical studies I included in that pull request showing nearly uniform improvement in performance for the default adaptation configuration. By carefully modifying the optimization statistic we better connect the behavior of the multinomial sampler with the theoretical bounds developed in Betancourt et al (2014) which then allows for more effective adaptation.

In my opinion the spikes exhibited in the "last doubling accept rate" in the lower-left would be far too fragile for practical application, and the "last doubling lower bound" gives similar results to the top-right within the resolution of the experiments. To flip the results around -- what if anything suggests that "last doubling lower bound" optimization statistic is better than the "Boltzmann weights"?

Odds and Ends

Stan's default step size adaptation window (term_buffer=50) is too short for the step size to converge.

Only one adaptation window uses a single term_buffer -- the first window in Phase II where both the stepsize and inverse metric are adapted. The goal of that first window is more to quickly construct a decent if crude guess at the inverse metric.

@betanalpha
Copy link
Contributor

Does GitHub's mathjax support need triple back tick blocks to render correctly? The docs imply that $$ blocks should be sufficient but it doesn't look like they're rendering.

@bob-carpenter
Copy link
Contributor

Does GitHub's mathjax support need triple back tick blocks to render correctly?

As far as I can tell, there's no way to get a displayed equation in Git flavored markdown. The double $$ does the right thing in quarto and Rmarkdown, but not in Git flavored markdown.

@rok-cesnovar
Copy link
Member

@bob-carpenter
Copy link
Contributor

Thanks for the tip, Rok. I didn't think it worked in markdown, because I was trying to use it like Michael. The trick is putting blank lines before and after. So really dumb and completely undocumented, just like the rest of markdown. Well, not undocumented, just misleadingly documented:

To add a math expression as a block, start a new line and delimit the expression with two dollar symbols $$.

No, that doesn't work, as this example shows:

abc
$$
e^x
$$
efg

It just produces

abc
$$
e^x
$$
efg

but if you offset the display math with blank lines,

abc 

$$
e^x
$$

def

it produces

abc

$$ e^x $$

def

@betanalpha
Copy link
Contributor

Thanks, @rok-cesnovar! The mathjax rendering doesn't seem to be perfect in some of the more elaborate equations but hopefully the edits are easier to read.

@nhuurre
Copy link
Contributor Author

nhuurre commented Aug 23, 2022

Github's MathJax is a huge pain in the neck. I had great difficulty getting my previous post render correctly in the preview and apparently some of it has now broken anyway.

@betanalpha
You forgot to reply to what I'd consider the most important part; the changes to your old PR:

  1. Metropolis adjusted weights instead of raw Boltzmann weights
  2. Rao-Blackwellized accept_prob for the proxy transitions

In my opinion (1) takes us closer to the current sampler behavior and the effect of (2) is not theoretically understood but could plausibly be an improvement.

If we can agree on this then at least some progress has been made.

Instead it introduces a cost function based on how expensive it will be to generate an accepted proposal.

This is highly misleading phrasing. The average cost per accepted proposal in a long chain is irrelevant; the "standard argument" you speak of only considers the cost to the first accepted proposal in a new chain. (Which is surprising because that's not how the sampler actually behaves.)

The average number of rejections across all initial points is then that harmonic mean,

The harmonic mean is the average number of rejections before the first acceptance across all initial points. That omission was a major stumbling block for me because for a long time I held the misunderstanding that these papers were minimizing the total number of rejections in a long-running chain (which, in retrospect, is obviously not a harmonic mean but at the time it seemed so self-evidently correct cost function that I did not stop and think.)

Betancourt et al (2014) doesn't make any mention of a diffusion limit in motivating its cost function.

Betancourt et al (2014) has only a brief motivation that does not adequately justify the use of harmonic mean rather than arithmetic mean. When questioned on this, Betancourt (2022, pers. comm.) referred to Roberts et al (1997) which does use diffusion limit and does not distinguish between harmonic and arithmetic mean.

adding in the cost of generating each numerical trajectory in those intermediate rejections. Consequently the results to apply for any trajectory length within the assumption that the cost of intermediate rejections dominates the overall performance of the sampler.
This cost function is missing any quantification of the effectiveness of an accepted proposal, instead treating all acceptances as the same.

Once again, this type of language makes it sound as if the quantity of interest is the asymptotic number of acceptances, but that would be the arithmetic mean of the acceptance probability.

If the argument is (as I now suspect it is) that the asymptotic variance of the MCMC estimate is proportional to the variance of the lengths of the sequences of rejections then you should say the important thing is not the mean of the geometric distribution but its variance-to-mean ratio which coincidentally happens to be equal to the mean.

you introduced another estimator which you claimed, at least according to my understanding, estimated the same multinomial optimization statistic but with a smaller variance by squeezing more information from each numerical trajectory (or generating more states entirely). You haven't yet shown any calculations demonstrating these claims -- same expectation but smaller variance

I omitted the calculation for multinomial sampling but I did show one for Metropolis-Hasting and Hybrid Monte Carlo schemes, and that is what "the proposed change to the old PR" number 2 uses.
Here it is again:
The acceptance probability for a proposed transition from $q \to q^\star$ is $1\wedge \frac{w^\star}{w}$. But note that when you see that proposal you know that at some future time the sampler will be at $q^\star$ and propose $q^\star \to q$. The long-term average acceptance probability can be estimated by the weighted average of these two known transitions

$$ \frac{w}{w+w^\star}a\left(q\to q^\star\right) + \frac{w^\star}{w+w^\star}a\left(q^\star\to q\right) = \frac{w}{w+w^\star}\left(1\wedge \frac{w^\star}{w}\right) + \frac{w^\star}{w+w^\star}\left(1\wedge \frac{w}{w^\star}\right) = \frac{2\left(w\wedge w^\star\right)}{w+w^\star} $$

Do you agree that this is a valid Rao-Blackwellization?

Now, for multinomial sampling Stan currently uses "Gelman-Hoffman" acceptance statistic

$$ a_{\mathrm{GH}}\left(q_0\right) = \frac{1}{N-1}\sum_{j\neq 0}1\wedge\frac{w_j}{w_0} $$

The obvious "Rao-Blackwellization" is to sum over all potential initial points in the trajectory

$$ \sum_i \frac{w_i}{W} \frac{1}{N-1}\sum_{j\neq i}1\wedge\frac{w_j}{w_i} = \frac{1}{W\left(N-1\right)}\sum_{i,j\neq i}w_i\wedge w_j $$

But that's inconvenient because it requires summing all N(N-1)/2 pairs. Instead, the plots were made with a symmetrized Gelman-Hoffman statistic

$$ a_{\mathrm{GH*}}\left(q_0\right) = \frac{1}{N-1} \sum_{j \neq 0}\frac{2\left(w_0\wedge w_j\right)}{w_0+w_j} $$

This has the same mean because it has the same "Rao-Blackwellization" (relabel the indices in one sum):

$$ \sum_i \frac{w_i}{W}\frac{1}{N-1}\sum_{j\neq i}\frac{2\left(w_i\wedge w_j\right)}{w_i+w_j} = \frac{1}{W\left(N-1\right)}\left(\sum_{i,j\neq i}\frac{w_{i}\left(w_{i}\wedge w_{j}\right)}{w_{i}+w_{j}}+\sum_{i,j\neq i}\frac{w_{i}\left(w_{i}\wedge w_{j}\right)}{w_{i}+w_{j}}\right) = \frac{1}{W\left(N-1\right)}\sum_{i,j\neq i}w_i\wedge w_j $$

When using $\hat{f}_{\mathrm{RB}}$ these analyses are not appropriate -- in general they will lead to step sizes that aren't aggressive enough.

It's not clear to me how these analyses depend on the specifics of the MCMC estimators used. Could you elaborate on why that makes a difference and what the analysis looks like for $\hat{f}_{\mathrm{RB}}$ ?

This is more in line with the theoretical Hybrid Monte Carlo upper bound, which is optimized around 0.6, than the theoretical lower bound which is optimized around 0.8.

I've tried to follow Betancourt et al. (2014) who say the lower bound is maximized at a=0.6 and the upper bound is minimized at a=0.8. However this terminology is confusing because in my opinion their plot is upside down, showing inefficiency rather than (more natural) efficiency.

To flip the results around -- what if anything suggests that "last doubling lower bound" optimization statistic is better than the "Boltzmann weights"?

All else equal, longer transitions are better than shorter ones and I think the acceptance statistic should take that into account somehow. "Boltzmann weights" treats all points equally while "last doubling lower bound" has a strong preference for the far end of the trajectory. That makes it more robust in pathological examples like one I posted on the forums way back.

Only one adaptation window uses a single term_buffer -- the first window in Phase II where both the stepsize and inverse metric are adapted.

term_buffer is the length of the Phase III, the final tuning of the step size after the metric has been adapted. I think what's happening is that the volatility of accept_stat does not allow the step size to settle down and biases it downward.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants