Why should you care about Markov Chain Monte Carlo methods?
→ They are in the list of "Top 10 Algorithms of 20th Century"
→ They allow you to make inference with Bayesian Networks
→ They are used everywhere in Machine Learning and Statistics
Markov Chain Monte Carlo methods are a class of algorithms used to sample from complicated distributions, typically posterior distributions in Bayesian Networks (Belief Networks).
These slides cover the following topics.
→ Motivation and Practical Examples (Bayesian Networks)
→ Basic Principles of MCMC
→ Gibbs Sampling
→ Metropolis–Hastings
→ Hamiltonian Monte Carlo
→ Reversible-Jump Markov Chain Monte Carlo
Markov Chain Monte Carlo Methods Explained
1. Markov Chain Monte Carlo
→ Gibbs Sampling → Metropolis–Hastings
→ Hamiltonian Monte Carlo → Reversible-Jump MCMC
Francesco Casalegno
2. Francesco Casalegno – Markov Chain Monte Carlo
Outline
1. Motivation
2. Basic Principles of MCMC
3. Gibbs Sampling
4. Metropolis–Hastings
5. Hamiltonian Monte Carlo
6. Reversible-Jump Markov Chain Monte Carlo
7. Conclusion
1. Motivation
● “Computing in Science and Engineering” put MCMC among the Top 10 Algorithms of the 20th Century (together with the Fast Fourier Transform, Quicksort, the Simplex Method, …)
● MCMC methods are used to draw samples from complex distributions: X1 … XM ~ p(x)
○ Why complex distributions? → For simple distributions we could use Inverse Transform Sampling or Rejection Sampling; here p(x) is high-dimensional, multi-modal, or known only up to a constant…
○ Why Markov Chain? → Samples are drawn in a sequence!
○ Why Monte Carlo? → Samples are used to approximate the pdf or to compute mean/variance/…
● OK, but where do we find these complex distributions?
○ Typical scenario: sample from a posterior, i.e. use observations Y = (y1 … yN) to make inference on θ
○ So we want to sample θ ~ p(θ|Y) = p(Y|θ) p(θ) / p(Y) ∝ p(Y|θ) p(θ)
■ θ is often high-dimensional
■ The normalization constant, i.e. the evidence p(Y) = ∫ p(Y|θ) p(θ) dθ, is computationally intractable
○ MCMC methods are an essential tool to make inference with Bayesian Networks
● In the next slides we see Bayesian Networks where MCMC can be applied with success!
1. Motivation
● Bayesian Networks (aka Belief Networks) are powerful models representing variables and their conditional dependencies in a DAG (directed acyclic graph).
● Notation in the graph: observed quantities, unobserved variables, fixed parameters
● Inference on unobserved variables is done by computing posterior distributions (Bayes’ Theorem)
● Posterior distributions are computed using the following tools:
○ Law of Total Probability
○ Chain Rule (aka Product Rule)
○ MCMC methods
1. Example: Hierarchical Regression
● Observations
○ Different countries c = 1 … 4, with different numbers of samples Nc
○ Each sample is xi = mother longevity, yi = child longevity
● Model
○ Linear regression, but samples are given per country…
■ No Pooling: treat each country independently, fit 4 independent θc
■ Complete Pooling: forget about the country, fit one θ on all samples together
■ Hierarchical Regression: there are 4 different θc, but they are related!
→ Best approach, in particular for countries with few samples (c = 3)!
● How can we make inference on θ1 … θC, μθ, and σθ?
○ Note: in Bayesian Nets, all parameters of interest have priors!
1. Example: Mine Fatality (Change-Point Model)
● Observations
○ Number of coal mine fatalities yt during year t = 1900 … 1960
● Model
○ The number of fatalities follows a Poisson law
○ The fatality rate changed (e.g. due to a new OSH law) at some point
● How can we compute posteriors for ν, λ1, λ2?
○ Year ν when the rate changed
○ Fatality rate λ1 before the change-point
○ Fatality rate λ2 after the change-point
1. Example: Latent Dirichlet Allocation
● Observations: words from D documents
● Model
○ Assume there are T topics in total
○ Assume Bag-Of-Words (only word counts matter, not order)
○ φt = distribution of words in topic t ∊ {1 … T}
○ θd = distribution of topics in document d ∊ {1 … D}
○ zd,n = topic of word n ∊ {1 … Nd} within document d ∊ {1 … D}
○ wd,n = word appearing at position n ∊ {1 … Nd} of document d ∊ {1 … D}
● How to automatically discover (infer the posterior distribution of)
○ topic content, in terms of the words associated with each topic?
○ document content, in terms of its topic distribution?
2. Basic Principles of MCMC
● As we have seen, to use Bayesian Networks we need to sample from θ ~ p(θ|Y).
But MCMC methods are generic: in the following we just talk about sampling from p.
● MCMC methods build a Markov Chain of samples X1, X2, … converging to p. We need:
1. An initial sample x0
2. A simple way to draw a new Xn+1 given Xn = xn (i.e. the Markov process)
3. A mathematical proof that, for n large enough, the process generates samples Xn ~ p
We will see how different MCMC methods differ in the way they draw Xn+1 given Xn = xn.
● In this way we draw X1, X2, … ~ p, and we then compute Monte Carlo approximations like
I_M = (1/M) Σ_{m=1..M} f(Xm) ≈ E[f(X)]
● And the error in I_M? X1, X2, … are correlated, so it is worse than for standard Monte Carlo:
Var(I_M) ≈ Var(f(X)) / M*, where M* = M / (1 + 2 Σ_{k≥1} ρk)
M* is known as the effective sample size, and ρk is the k-lag autocorrelation of X1, X2, …
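To make the effective sample size concrete, here is a minimal sketch (not from the slides; function names are ours) that estimates M* from a chain by summing the sample autocorrelations ρk and truncating the sum at the first non-positive estimate, a common heuristic:

```python
import numpy as np

def effective_sample_size(chain, max_lag=200):
    """Estimate M* = M / (1 + 2 * sum_k rho_k), where rho_k is the k-lag
    autocorrelation; the sum is truncated at the first non-positive
    estimate (a common heuristic)."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    M = len(x)
    denom = np.dot(x, x)
    s = 0.0
    for k in range(1, min(max_lag, M - 1)):
        rho_k = np.dot(x[:-k], x[k:]) / denom
        if rho_k <= 0:
            break
        s += rho_k
    return M / (1.0 + 2.0 * s)

# AR(1) chain with strong autocorrelation: far fewer effective samples
rng = np.random.default_rng(0)
x, chain = 0.0, []
for _ in range(5000):
    x = 0.9 * x + rng.normal()
    chain.append(x)
print(effective_sample_size(chain))  # much smaller than M = 5000
```

For an i.i.d. sequence the same estimator stays close to M, which is exactly the gap between the two panels on the next slide.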
2. Basic Principles of MCMC
● X1, X2, … drawn i.i.d. from p:
○ looks like noise
○ pdf approximation is quite good
○ no autocorrelation
● X1, X2, … generated by MCMC:
○ must wait for convergence to p: discard the burn-in!
○ pdf approximation is worse (effective sample size M*!)
○ strong autocorrelation of samples
2. Basic Principles of MCMC
● But how do we draw a new Xn+1 given Xn = xn?
○ There is no single solution: depending on the situation we can use a different MCMC method!
○ Each MCMC method has its own way of drawing a new Xn+1 given Xn = xn
● In these slides we will present the most important MCMC methods:
○ Gibbs Sampling
○ Metropolis–Hastings
○ Hamiltonian Monte Carlo
○ Reversible-Jump MCMC
3. Gibbs Sampling
● Gibbs sampling is an MCMC method used to draw from a multivariate distribution p when
1. sampling from the joint distribution p(x) = p(x1 … xD) is difficult → so we need MCMC!
2. sampling from the (univariate) conditionals p(x1 | x2 … xD), p(x2 | x1, x3 … xD), …, p(xD | x1 … xD-1) is easy
● Algorithm
→ Choose an initial point x^(0) and find a way to draw from the conditionals (e.g. by inverse sampling)
→ For n = 0, 1, 2, …
→ Draw x1^(n+1) ~ p(x1 | x2^(n) … xD^(n))
→ Draw x2^(n+1) ~ p(x2 | x1^(n+1), x3^(n) … xD^(n))
…
→ Draw xD^(n+1) ~ p(xD | x1^(n+1) … xD-1^(n+1))
→ Set x^(n+1) = (x1^(n+1) … xD^(n+1))
3. Gibbs Sampling: Example
● Let us sample from a bivariate normal distribution.
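The bivariate-normal example can be sketched in Python/NumPy (an illustration, not the slides' original code): for (X1, X2) ~ Normal(0, [[1, ρ], [ρ, 1]]), each full conditional is the univariate normal X1 | X2 = x2 ~ N(ρ·x2, 1 − ρ²), so every Gibbs step is an easy draw:

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
    """Gibbs sampling for (X1, X2) ~ Normal(0, [[1, rho], [rho, 1]]).
    Each full conditional is univariate: X1 | X2=x2 ~ N(rho*x2, 1-rho^2),
    and symmetrically for X2, so every step is an easy draw."""
    rng = np.random.default_rng(seed)
    sd = np.sqrt(1.0 - rho**2)
    x1, x2 = 0.0, 0.0
    samples = []
    for n in range(burn_in + n_samples):
        x1 = rng.normal(rho * x2, sd)     # draw x1 | x2
        x2 = rng.normal(rho * x1, sd)     # draw x2 | (new) x1
        if n >= burn_in:                  # discard burn-in
            samples.append((x1, x2))
    return np.array(samples)

samples = gibbs_bivariate_normal(rho=0.8, n_samples=20000)
print(np.corrcoef(samples.T)[0, 1])       # close to the true rho = 0.8
```

Note that each sweep updates the components in turn, always conditioning on the most recent values.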
3. Gibbs Sampling: Mine Fatality
● The joint posterior p(λ1, λ2, ν | y) is proportional to the product of the Poisson likelihood and the priors on λ1, λ2, ν.
● Direct sampling from this multivariate, mixed (discrete–continuous) distribution would be too hard!
→ Use Gibbs sampling: draw from the conditional posteriors!
○ The conditional posterior for λ1 is a Gamma distribution (with conjugate Gamma priors on the rates)
○ The conditional posterior for λ2 is likewise a Gamma distribution
○ The conditional posterior for ν is a discrete distribution over the years, with probabilities known explicitly,
from which we can draw using Inverse Transform Sampling.
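The draw for ν can be illustrated with a generic helper (hypothetical toy weights below; the actual unnormalized conditional probabilities depend on the chosen priors): Inverse Transform Sampling from a discrete distribution known only up to a normalizing constant:

```python
import numpy as np

def sample_discrete(unnorm_probs, rng):
    """Inverse Transform Sampling from a discrete distribution whose
    probabilities are known only up to a normalizing constant."""
    p = np.asarray(unnorm_probs, dtype=float)
    cdf = np.cumsum(p / p.sum())                  # normalized CDF
    u = rng.uniform()                             # U ~ Uniform(0, 1)
    # smallest index k with CDF[k] >= u (clipped for float round-off)
    return int(min(np.searchsorted(cdf, u), len(p) - 1))

rng = np.random.default_rng(0)
draws = [sample_discrete([1.0, 3.0, 6.0], rng) for _ in range(10000)]
# empirical frequencies approach (0.1, 0.3, 0.6)
```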
3. Gibbs Sampling: Pros and Cons
● Pros
1. Unlike other MCMC methods, x^(n+1) is always accepted as the next step (no rejection)
2. Useful to treat very high-dimensional problems
3. Useful if we have both continuous and discrete components, to work with fully discrete/continuous separate conditionals
● Cons
1. All the conditionals must be known → often known only up to a normalizing constant!
2. Must know how to sample from the conditionals → if it is hard, sample from the conditionals with another MCMC method, such as “Metropolis-within-Gibbs”
3. If the components are strongly correlated, the Markov chain converges slowly and has highly auto-correlated samples
4. Metropolis–Hastings
● Metropolis–Hastings generates a chain of samples from p by using the following ideas.
○ Draw a new candidate x* for Xn+1 given Xn = xn using some proposal distribution Q(x* | xn)
○ Accept the candidate (xn+1 = x*) with some acceptance probability A(x*, xn), otherwise reject.
● Algorithm
→ Choose an initial point x0 and a proposal distribution Q(x* | xn)
→ For n = 0, 1, 2, …
→ Draw a new candidate x* ~ Q(x* | xn)
→ Compute the acceptance probability A(x*, xn) = min{1, [p(x*) Q(xn | x*)] / [p(xn) Q(x* | xn)]}
→ Accept the candidate with probability A(x*, xn): set xn+1 = x*.
If the candidate is rejected, set xn+1 = xn (the current sample is repeated).
Note: to compute the acceptance probability we only need to know p up to a multiplicative constant → typical Bayesian posterior!
● How do we choose the proposal distribution Q(x* | xn)?
○ A common choice is x* ~ Normal(xn, σ²)
○ This is called Random Walk Metropolis, as we can also write x* = xn + ε with ε ~ Normal(0, σ²)
○ Large σ² → low auto-correlation between samples (“big jumps”), but high rejection rate
○ Small σ² → high auto-correlation between samples (“small jumps”), but low rejection rate
○ If we use a symmetric proposal distribution (e.g. Q = Normal), the ratio simplifies to A(x*, xn) = min{1, p(x*) / p(xn)}
○ This is called the Metropolis method, historically invented before Metropolis–Hastings
○ But sometimes we need an asymmetric proposal, e.g. for one-tailed target distributions (e.g. p = Gamma(α, β))
4. Metropolis–Hastings: Example
● Let us sample from a bivariate normal distribution using a Normal proposal distribution.
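A minimal Random Walk Metropolis sketch for this example (illustrative code, not the slides' original): the proposal is symmetric, so the acceptance probability reduces to min(1, p(x*)/p(xn)), and the log-density only needs to be known up to an additive constant:

```python
import numpy as np

def random_walk_metropolis(log_p, x0, sigma, n_samples, seed=0):
    """Random Walk Metropolis: symmetric proposal x* = x_n + eps with
    eps ~ Normal(0, sigma^2 I), so A = min(1, p(x*)/p(x_n)); log_p may
    be the log-density up to an additive constant."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    samples, n_accept = [], 0
    for _ in range(n_samples):
        x_star = x + rng.normal(0.0, sigma, size=x.shape)
        # accept with probability min(1, p(x*)/p(x_n)), in log space
        if np.log(rng.uniform()) < log_p(x_star) - log_p(x):
            x, n_accept = x_star, n_accept + 1
        samples.append(x.copy())          # on rejection, x_{n+1} = x_n
    return np.array(samples), n_accept / n_samples

# Target: bivariate standard normal, known up to a multiplicative constant
log_p = lambda x: -0.5 * np.dot(x, x)
samples, acc_rate = random_walk_metropolis(log_p, np.zeros(2), 1.0, 20000)
```

The step size σ trades off the two failure modes listed above: large jumps decorrelate the chain but get rejected more often.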
4. Metropolis–Hastings: Covariation Model
● The posterior distribution p(ρ | y1:N) does not look like anything familiar.
● Using Inverse Transform or Rejection Sampling would be difficult in this case. So we use Metropolis–Hastings.
○ p(ρ | y1:N) is known up to a constant, and that is OK
○ Q(ρ* | ρn) cannot be a Normal, as our domain is bounded: ρ ∊ (-1, +1)
→ Cannot use Random Walk Metropolis!
○ Use Q(ρ* | ρn) = TruncatedNormal(-1, +1): an asymmetric proposal
→ The pdf of TruncatedNormal(-1, +1) will appear in the acceptance probability!
4. Metropolis–Hastings: Pros and Cons
● Pros
1. Works also if we know p only up to a multiplicative constant
○ Can sample from a Bayesian posterior without computing the evidence p(Y)
2. Can be used within Gibbs sampling (“Metropolis-within-Gibbs”):
○ Gibbs splits the joint into conditionals
○ Sample x1^(n+1) ~ p(x1 | x2^(n)), x2^(n+1) ~ p(x2 | x1^(n+1)) using Metropolis–Hastings
3. Can be used when it is not practical to derive all conditional posteriors
● Cons
1. Choice of the best proposal distribution?
2. Choice of the variance of the proposal distribution?
○ too small → high autocorrelation
○ too large → high rejection rate
5. Hamiltonian Monte Carlo
● Hamiltonian Monte Carlo has two advantages with respect to other MCMC methods:
○ Little or no autocorrelation of samples
○ Fast mixing, i.e. the chain converges quickly to the distribution p
● Hamiltonian Monte Carlo is based on the Hamiltonian (total energy) H(x, v) = U(x) + K(v)
○ Imagine a ball in a space with potential energy U(x) = −log p(x), and put the ball at initial position xn
○ Give the ball an initial random velocity v ~ q and define its kinetic energy K(v) = −log q(v)
○ Compute the trajectory for a time T, then take the final position: x(T) = xn+1
● Algorithm
→ Choose an initial point x0 and a velocity distribution q(v)
→ For n = 0, 1, 2, …
→ Set the initial position to x(t=0) = xn
→ Draw a new random initial velocity v(t=0) ~ q(v)
→ Numerically integrate the trajectory with total energy H(x, v) = −log p(x) − log q(v) for a time T
→ Set xn+1 = x(t=T)
● How do we choose the distribution q(v) for the velocity?
○ A common choice is v ~ Normal(0, Σ), so that the kinetic energy reads K(v) = ½ vᵀ Σ⁻¹ v
○ If we have an understanding of p(x) we can choose Σ in a smart way; otherwise just set Σ = σ² I
5. Hamiltonian Monte Carlo: Trajectories
● In a system with energy H(x, v) = U(x) + K(v), position x and velocity v evolve according to Hamilton’s equations:
dx/dt = +∂H/∂v = ∂K/∂v
dv/dt = −∂H/∂x = −∂U/∂x
● In most cases these equations cannot be solved exactly, so we use a numerical scheme:
○ Choose a discrete time step τ
○ Compute the numerical solution using the Leapfrog Method (or another symplectic method)
○ The energy H(x, v) = U(x) + K(v) should be preserved over time, but we use a numerical discretization…
○ Symplectic methods are good because they preserve H(x, v) up to O(τ^s), with s = 2 for the Leapfrog Method
○ When using numerical methods to compute the trajectories, accept xn+1 = x(t=T) with acceptance probability
A = min{1, exp(H(xn, vn) − H(xn+1, vn+1))}
Notice that H(xn+1, vn+1) = H(xn, vn) + O(τ^s), so the acceptance probability is ≈ 1 for τ small enough.
5. Hamiltonian Monte Carlo: Example
● Let us sample from a bivariate normal distribution.
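A minimal HMC sketch for this example (illustrative code, not the slides' original), with K(v) = ½|v|², Leapfrog integration, and the Metropolis correction from the previous slide:

```python
import numpy as np

def leapfrog(x, v, grad_U, tau, n_steps):
    """Leapfrog (symplectic) integration of H(x, v) = U(x) + 0.5*|v|^2."""
    x, v = x.copy(), v.copy()
    v -= 0.5 * tau * grad_U(x)            # initial half step for velocity
    for _ in range(n_steps - 1):
        x += tau * v                      # full step for position
        v -= tau * grad_U(x)              # full step for velocity
    x += tau * v
    v -= 0.5 * tau * grad_U(x)            # final half step for velocity
    return x, v

def hmc(U, grad_U, x0, n_samples, tau=0.1, n_steps=20, seed=0):
    """HMC with v ~ Normal(0, I), i.e. K(v) = 0.5*|v|^2; the Metropolis
    correction accepts with probability min(1, exp(H_old - H_new))."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    samples = []
    for _ in range(n_samples):
        v = rng.normal(size=x.shape)      # fresh random velocity each step
        x_new, v_new = leapfrog(x, v, grad_U, tau, n_steps)
        dH = (U(x_new) + 0.5 * v_new @ v_new) - (U(x) + 0.5 * v @ v)
        if np.log(rng.uniform()) < -dH:   # accept w.p. min(1, exp(-dH))
            x = x_new
        samples.append(x.copy())
    return np.array(samples)

# Bivariate standard normal target: U(x) = 0.5*|x|^2, grad U(x) = x
samples = hmc(lambda x: 0.5 * x @ x, lambda x: x, np.zeros(2), 5000)
```

Because the Leapfrog scheme nearly conserves H, almost every trajectory is accepted, and the long jumps give samples with very low autocorrelation.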
5. Hamiltonian Monte Carlo: Pros and Cons
● Pros
1. Best method for continuous distributions
2. Samples have almost zero autocorrelation
3. Only requires knowing p up to a constant
4. Can be extended to have the velocity depend on the location, q = q(v|x), but then K = K(x, v)
● Cons
1. Choice of the symplectic integrator and of τ?
○ τ too small → slow integration
○ τ too large → higher rejection rate
→ adaptive methods automatically choose τ
2. Choice of q(v)? If q(v) = Normal(0, Σ), choice of Σ?
3. Choice of the integration time T?
○ T too small → samples may be correlated
○ T too large → Hamiltonian trajectories are closed, so we waste computation
→ the NUTS method automatically chooses T!
4. Must evaluate the derivatives of U(x) = −log p(x) and K(v) = −log q(v)
5. Works only for continuous distributions
6. Reversible-Jump MCMC
● Reversible-Jump MCMC extends MCMC methods to the case where the variable space has an unknown/variable number of dimensions.
○ Hierarchical Regression. In the example we fit lines, i.e. we used θ ∊ ℝ². We could also decide to use polynomials of another degree k, so that θ ∊ ℝ^(k+1) → how do we choose k?
○ Change-Point Model. In the example we assumed that the mine fatality rate changed at some point. We could also assume that the rate changed k times, so we need inference on the change-points ν1 … νk as well as on the rates λ1 … λk+1 → how do we choose k?
● Reversible-Jump MCMC is a powerful method for model selection!
○ It also works for multiple hyper-parameters k1 … km
6. Reversible-Jump MCMC
● Consider the meta-space ⋃k {k} × ℝ^(dk), where k is the model index and dk is the dimension of that model’s parameter space
○ k is treated as just another variable in the meta-space
○ For our change-point model, k = number of change-points, dk = 2k+1
● How do we jump from dimension dk to dk’?
○ Sample an extra random variable u ~ Q(u)
○ If dk < dk’ it is called a “birth”; if dk > dk’ it is called a “death”
● Algorithm
a. Draw a jump u ~ Q(u)
b. Compute the proposal x* = g(xn, u)
c. Compute the reverse jump u* s.t. xn = g(x*, u*)
d. Accept the proposal with acceptance probability
A = min{1, [p(x*) Q(u*)] / [p(xn) Q(u)] · |det ∂g(xn, u)/∂(xn, u)|}
Conclusions
1. Bayesian Networks are a powerful tool of Machine Learning and Statistical Modelling.
2. Bayesian Networks use MCMC to sample from computationally intractable posteriors.
3. Gibbs Sampling reduces drawing from a hard joint posterior to drawing from easy conditionals.
4. Metropolis–Hastings is useful when the posterior has no closed form / is known only up to a constant.
5. Hamiltonian Monte Carlo is the best choice for the continuous case: low correlation, low rejection.
6. Reversible-Jump MCMC is an extension used when the number of parameters is unknown/variable.