11.11 Appendix B: Intro to Bayesian MCMC Models
Additional background information only; not part of the exam syllabus
Definition 11.1 Markov Chain
Random process where the transition to the next state depends only on its current state and not on prior states
A Markov chain \(X_t\) for \(t=1,2,...\) is a sequence of vectors satisfying the property that
\[\Pr(X_{t+1} = x \mid X_1 = x_1, X_2 = x_2,...,X_t = x_t) = \Pr(X_{t+1} = x \mid X_t = x_t)\]
Key properties of Markov chain for Bayesian MCMC
Ergodic class of Markov chains: those for which the distribution of the vectors \(\{X_t\}\) approaches a limiting distribution
As \(T\) increases, the distribution of \(X_t\) for all \(t>T\) approaches a unique limiting distribution
Markov chains used in Bayesian MCMC (e.g. the Metropolis-Hastings algorithm) are members of the ergodic class
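A minimal R sketch of a two-state ergodic chain (the transition probabilities are illustrative, not from the text): whatever the starting state, the long-run state frequencies approach the same limiting distribution.

```r
# Two-state ergodic Markov chain with illustrative transition probabilities
set.seed(1)
P <- matrix(c(0.9, 0.1,    # transitions from state 1
              0.3, 0.7),   # transitions from state 2
            nrow = 2, byrow = TRUE)

n_steps  <- 100000
state    <- numeric(n_steps)
state[1] <- 1                   # starting state (the choice does not matter in the long run)
for (t in 2:n_steps) {
  state[t] <- sample(1:2, size = 1, prob = P[state[t - 1], ])
}
table(state) / n_steps          # approaches the limiting distribution (0.75, 0.25)
```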
Let \(x\) be a vector of observations and let \(y\) be a vector of parameters in a model
In Bayesian MCMC, the Markov chain is defined in terms of the prior distribution \(p(y)\) and the conditional distribution \(f(x \mid y)\)
The limiting distribution is the posterior distribution \(f(y \mid x)\)
If we let the chain run long enough, it will randomly visit all states with a frequency proportional to their posterior probabilities
The operative phrase above is “long enough”, which in practice means:
Develop an algorithm for obtaining a chain that is long enough as quickly as possible
Develop criteria for determining when the chain is long enough
11.11.1 How Bayesian MCMC works in practice
User specifies \(p(y)\) and \(f(x \mid y)\)
User selects a starting parameter vector \(y_1\)
Using computer simulation, the user runs the Markov chain through a sufficiently large number, \(t_1\), of iterations
This first phase of the simulation is called the “adaptive” phase
The algorithm is automatically modified to increase its efficiency
See below on the Metropolis-Hastings algorithm
User runs an additional \(t_2\) iterations
This is the “burn-in” phase
\(t_2\) is selected to be large enough so that a sample taken from the subsequent \(t_3\) iterations represents the posterior distribution
User then runs an additional \(t_3\) iterations and takes a sample, \(\{ y_t \}\), from the \((t_1 + t_2 + 1)\)th step to the \((t_1 + t_2 + t_3)\)th step to represent the posterior distribution \(f(y \mid x)\)
From the sample, we can construct various statistics of interest that are relevant to the problem addressed by the analysis
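A minimal sketch of this phase structure in R, assuming a chain has already been simulated and stored in a numeric vector `chain` (the phase lengths below are purely illustrative):

```r
# Illustrative phase lengths (assumed values, not prescribed by the text)
t1 <- 1000    # adaptive iterations
t2 <- 5000    # burn-in iterations
t3 <- 10000   # sampling iterations

# `chain` is assumed to hold t1 + t2 + t3 simulated parameter values;
# the posterior sample is the final t3 draws
posterior_sample <- chain[(t1 + t2 + 1):(t1 + t2 + t3)]

# Statistics of interest, e.g. the posterior mean and a 95% credible interval
mean(posterior_sample)
quantile(posterior_sample, c(0.025, 0.975))
```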
11.11.2 Metropolis-Hastings Algorithm
Most common algorithms for generating Bayesian Markov chains are variants of the Metropolis-Hastings algorithm
Given: \(p(y)\) and \(f(x \mid y)\)
The algorithm introduces a 3rd distribution \(J(y_t \mid y_{t-1})\):
- “Proposal” or “jumping” distribution
Given a parameter vector \(y_{t-1}\) the algorithm generates a Markov chain by the following steps:
Select a candidate value \(y^*\), at random from the proposal distribution \(J(y_t \mid y_{t-1})\)
Compute the ratio
\[R \equiv R_1 \times R_2 = \dfrac{f(x \mid y^*) \cdot p(y^*)}{f(x \mid y_{t-1}) \cdot p(y_{t-1})} \times \dfrac{J(y_{t-1} \mid y^*)}{J(y^* \mid y_{t-1})}\]
Select \(U\) at random from \(U(0,1)\) distribution
If \(U < R\) then set \(y_t = y^*\), else \(y_t = y_{t-1}\)
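A minimal R sketch of these steps for a one-parameter model, assuming lognormal claims with a known \(\sigma\), a normal prior on \(\mu\), and a symmetric normal random-walk proposal (so \(R_2 = 1\)); all data and constants below are illustrative:

```r
set.seed(123)
x     <- rlnorm(50, meanlog = 7.5, sdlog = 1.5)   # simulated "observed" claims
sigma <- 1.5                                      # treated as known for simplicity

# log of f(x | mu) * p(mu), worked in log space to avoid underflow
log_post <- function(mu) {
  sum(dlnorm(x, meanlog = mu, sdlog = sigma, log = TRUE)) +
    dnorm(mu, mean = 7, sd = 2, log = TRUE)
}

n_iter   <- 20000
prop_sd  <- 0.2                  # scale of the proposal distribution J
chain    <- numeric(n_iter)
chain[1] <- 7                    # starting value y_1
accepted <- 0

for (t in 2:n_iter) {
  y_star <- rnorm(1, mean = chain[t - 1], sd = prop_sd)   # candidate from J
  logR   <- log_post(y_star) - log_post(chain[t - 1])     # log of R (R_2 = 1 here)
  if (log(runif(1)) < logR) {                             # accept with probability min(1, R)
    chain[t] <- y_star
    accepted <- accepted + 1
  } else {
    chain[t] <- chain[t - 1]
  }
}

accepted / (n_iter - 1)          # acceptance rate; near 50% is good for 1 parameter
```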
Remark.
\(R_1\) represents the ratio of the posterior probability of the proposal \(y^*\) to the posterior probability of \(y_{t-1}\)
The higher the value of \(R_1\), the more likely \(y^*\) will be accepted into the chain
- Regardless of how the proposal density is chosen, the distribution of \(y_t\) can be regarded as a sample from the posterior distribution, after a suitable burn-in period
Example
We can use trace plots to examine the value of the parameter as the chain progresses (see the sketch below)
See the paper for more on how convergence might break down
The key is to be able to scale the proposal distribution to minimize autocorrelation
- This is difficult with many parameters
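As a quick illustration, assuming the `chain` vector from the Metropolis-Hastings sketch above:

```r
# Trace plot: parameter value as the chain progresses
plot(chain, type = "l", xlab = "iteration", ylab = expression(mu),
     main = "Trace plot")

# Autocorrelation of successive draws; high autocorrelation suggests the
# proposal scale needs adjusting (or the chain needs to run longer)
acf(chain)
```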
Minimizing autocorrelation in Metropolis-Hastings
A good statistic to look at is the acceptance rate of \(y^*\)
50% is near optimal for a one-parameter model, and the optimal rate decreases to about 25% as we increase the number of parameters in the model
There are methods to automatically adjust the proposal density function in Metropolis-Hastings
All these have been mechanized in software like JAGS and STAN
The phase of generating the Markov chain where the proposal density function is optimized is called the “adaptive” phase (as discussed above)
As models become more complex, adaptive MCMC may not be good enough to eliminate the autocorrelation
Theory on Markov chain convergence will still hold, but there is no guarantee on how fast the chain will converge
If there is significant autocorrelation after the best scaling effort, the next best practice is to increase \(t_3\) until there are a sufficient number of ups and downs in the trace plot and then take a subsample (e.g. every \(k\)th draw) of the \((t_1 + t_2 + 1)\)th to \((t_1 + t_2 + t_3)\)th iterations
- This process is known as “thinning”
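A minimal sketch of thinning, again assuming the `chain` vector and illustrative phase lengths from the earlier sketches:

```r
burn_in <- 5000    # illustrative t1 + t2
thin_by <- 10      # keep every 10th draw

posterior_sample <- chain[seq(burn_in + 1, length(chain), by = thin_by)]
acf(posterior_sample)   # autocorrelation should be much lower after thinning
```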
11.11.3 Use Case Example for Actuaries
As an example of how the posterior distribution of \(\mu\) might be of interest:
Consider a lognormal distribution fitted to a set of claims in order to determine the cost of an XS layer
Given \(\mu\) and \(\sigma\) of the lognormal we can determine the cost of an XS layer (see Klugman, Panjer, and Willmot and the actuar package)
As the posterior distribution of \(\mu\) reflects the parameter risk in the model, it is also possible to reflect the parameter risk in the expected cost of a layer by calculating the expected cost of the layer for each \(\mu\) in the simulated posterior distribution
It is possible to simulate an actual outcome of a loss \(X\) in a layer given \(\mu\) in the posterior distribution
The distribution of \(X\) calculated this way reflects both the parameter risk and the process risk in the model
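A sketch of this calculation in R, assuming \(\sigma\) is known, `posterior_sample` holds posterior draws of \(\mu\) from the MCMC sketches above, and the layer attaches at 1m with a 2m upper limit (all figures illustrative):

```r
att   <- 1e6        # layer attachment point
lim   <- 2e6        # layer upper limit
sigma <- 1.5        # treated as known, as in the earlier sketch

# Limited expected value of a lognormal, E[min(X, d)]
# (cf. Klugman, Panjer, and Willmot; actuar's levlnorm() gives the same quantity)
lev_lnorm <- function(d, mu, sigma) {
  exp(mu + sigma^2 / 2) * pnorm((log(d) - mu - sigma^2) / sigma) +
    d * (1 - plnorm(d, meanlog = mu, sdlog = sigma))
}

# Expected cost of the layer for each posterior draw of mu -> parameter risk
layer_cost <- lev_lnorm(lim, posterior_sample, sigma) -
              lev_lnorm(att, posterior_sample, sigma)
mean(layer_cost)
quantile(layer_cost, c(0.025, 0.975))

# One simulated loss per posterior draw, capped to the layer -> parameter + process risk
x_sim      <- rlnorm(length(posterior_sample), meanlog = posterior_sample, sdlog = sigma)
layer_loss <- pmin(pmax(x_sim - att, 0), lim - att)
```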
11.11.4 Bayesian Inference Using Gibbs Sampling (BUGS)
WinBUGS, JAGS are examples of software using Gibbs sampling
Sample work process:
Read data into R and call JAGS script with R package “runjags”
Fetch the sample of the posterior back into R to calculate statistics of interest
JAGS performs the Metropolis-Hastings algorithm described above
It also has a number of convergence diagnostics
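A minimal sketch of this workflow with “runjags”, assuming the claims vector `x` from the earlier sketch; the model string, priors, and run lengths below are illustrative assumptions:

```r
library(runjags)

model_string <- "
model {
  for (i in 1:N) {
    x[i] ~ dlnorm(mu, tau)     # lognormal likelihood (JAGS parameterizes by precision)
  }
  mu    ~ dnorm(7, 0.25)       # prior on mu (precision 0.25, i.e. sd = 2)
  tau   ~ dgamma(0.001, 0.001) # prior on the precision
  sigma <- 1 / sqrt(tau)
}"

fit <- run.jags(model    = model_string,
                monitor  = c("mu", "sigma"),
                data     = list(x = x, N = length(x)),
                n.chains = 2,
                adapt    = 1000,     # adaptive phase (t1)
                burnin   = 5000,     # burn-in phase (t2)
                sample   = 10000)    # posterior sample (t3)

summary(fit)                                 # includes convergence diagnostics
posterior <- as.matrix(as.mcmc.list(fit))    # fetch the posterior sample into R
```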