11.11 Appendix B: Intro to Bayesian MCMC Models

Additional background information only; not part of the exam syllabus

Definition 11.1 Markov Chain

A random process in which the transition to the next state depends only on the current state and not on any prior states

A Markov chain \(X_t\) for \(t=1,2,...\) is a sequence of vectors satisfying the property that

\[\Pr(X_{t+1} = x \mid X_1 = x_1, X_2 = x_2, \ldots, X_t = x_t) = \Pr(X_{t+1} = x \mid X_t = x_t)\]

Key properties of Markov chains for Bayesian MCMC

  • The ergodic class of Markov chains, for which the vectors \(\{X_t\}\) approach a limiting distribution

    As \(T\) increases, the distribution of \(X_t\) for all \(t > T\) approaches a unique limiting distribution

  • Markov chains used in Bayesian MCMC (e.g. the Metropolis-Hastings algorithm) are members of the ergodic class
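As a toy illustration (not from the source), a two-state chain in R converges to the same limiting distribution regardless of its starting state; the transition matrix here is made up:

```r
# Two-state Markov chain: row i of P gives the transition
# probabilities out of state i
P <- matrix(c(0.9, 0.1,
              0.2, 0.8), nrow = 2, byrow = TRUE)

dist <- c(1, 0)               # start with certainty in state 1
for (t in 1:100) dist <- dist %*% P
print(dist)                   # approaches the stationary distribution

# The stationary distribution solves pi %*% P = pi; here it is (2/3, 1/3),
# and the chain reaches it no matter which state it starts in
```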

Let \(x\) be a vector of observations and let \(y\) be a vector of parameters in a model

  • In Bayesian MCMC, the Markov chain is defined in terms of the prior distribution \(p(y)\) and the conditional distribution \(f(x \mid y)\)

  • The limiting distribution is the posterior distribution \(f(y \mid x)\)

  • If we let the chain run long enough, it will randomly visit all states with a frequency proportional to their posterior probabilities

The operative phrase above is “long enough”, which in practice means:

  1. Develop an algorithm for obtaining a chain that is long enough as quickly as possible

  2. Develop criteria for deciding when a chain is long enough

11.11.1 How Bayesian MCMC works in practice

  1. User specifies \(p(y)\) and \(f(x \mid y)\)

  2. User selects a starting parameter vector \(y_1\)

    Using computer simulation, the user runs the Markov chain through a sufficiently large number, \(t_1\), of iterations

    This first phase of the simulation is called the “adaptive” phase

  3. User runs an additional \(t_2\) iterations

    This is the “burn-in” phase

    \(t_2\) is selected to be large enough that a sample taken from the subsequent \(t_3\) iterations represents the posterior distribution

  4. User then runs an additional \(t_3\) iterations and takes the sample \(\{y_t\}\) from the \((t_1 + t_2 + 1)\)th step to the \((t_1 + t_2 + t_3)\)th step to represent the posterior distribution \(f(y \mid x)\)

  5. From this sample, we can construct various statistics of interest that are relevant to the problem addressed by the analysis (see the skeleton below)
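A skeleton of this three-phase process in R; `one_step` is a hypothetical stand-in for a single Markov chain update, such as the Metropolis-Hastings step of the next section:

```r
# Illustrative three-phase MCMC loop (t1 = adaptive, t2 = burn-in,
# t3 = sampling); `one_step` is a placeholder for one chain update
run_chain <- function(y1, one_step, t1, t2, t3) {
  y <- y1
  for (i in 1:t1) y <- one_step(y)   # adaptive phase: sampler is tuned here
  for (i in 1:t2) y <- one_step(y)   # burn-in phase: forget the start
  draws <- vector("list", t3)        # sampling phase: keep these draws
  for (i in 1:t3) {
    y <- one_step(y)
    draws[[i]] <- y
  }
  draws                              # represents the posterior f(y | x)
}
```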

11.11.2 Metropolis-Hastings Algorithm

The most common algorithms for generating Bayesian Markov chains are variants of the Metropolis-Hastings algorithm

Given: \(p(y)\) and \(f(x \mid y)\)

The algorithm introduces a third distribution \(J(y_t \mid y_{t-1})\):

  • “Proposal” or “jumping” distribution

Given a parameter vector \(y_{t-1}\), the algorithm generates the next state of the Markov chain by the following steps (a minimal R sketch follows the remark below):

  1. Select a candidate value \(y^*\) at random from the proposal distribution \(J(y_t \mid y_{t-1})\)

  2. Compute the ratio

\[R \equiv R_1 \times R_2 = \dfrac{f(x \mid y^*) \cdot p(y^*)}{f(x \mid y_{t-1}) \cdot p(y_{t-1})} \times \dfrac{J(y_{t-1} \mid y^*)}{J(y^* \mid y_{t-1})}\]

  3. Select \(U\) at random from the \(U(0,1)\) distribution

  4. If \(U < R\), set \(y_t = y^*\); else set \(y_t = y_{t-1}\)

Remark.

  • \(R_1\) represents the ratio of the posterior probability of the proposal \(y^*\) to the posterior probability of \(y_{t-1}\)

  • The higher the value of \(R_1\), the more likely \(y^*\) will be accepted into the chain

  • Regardless of how the proposal density distribution is chosen, the distribution of \(y_t\) can be regarded as a sample from the posterior distribution, after a suitable burn-in period
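As an illustration, here is a minimal Metropolis-Hastings sketch in R for a one-parameter model; the data, the normal likelihood and prior, and all settings are assumptions for illustration, not from the source. Because the normal proposal is symmetric, \(R_2 = 1\) and only \(R_1\) needs to be computed:

```r
# Minimal Metropolis-Hastings sketch: x ~ Normal(mu, 1) with
# prior mu ~ Normal(0, 10); all settings are illustrative
set.seed(1)
x <- rnorm(50, mean = 2, sd = 1)           # made-up observed data

log_post <- function(mu) {                 # log f(x|mu) + log p(mu)
  sum(dnorm(x, mean = mu, sd = 1, log = TRUE)) +
    dnorm(mu, mean = 0, sd = 10, log = TRUE)
}

n_iter   <- 5000
prop_sd  <- 0.5                            # scale of the proposal J
chain    <- numeric(n_iter)
chain[1] <- 0                              # starting value y_1
accepted <- logical(n_iter)

for (t in 2:n_iter) {
  y_star <- rnorm(1, mean = chain[t - 1], sd = prop_sd)   # step 1
  # Step 2: symmetric proposal, so R_2 = 1 and only R_1 remains
  logR <- log_post(y_star) - log_post(chain[t - 1])
  U <- runif(1)                                           # step 3
  if (log(U) < logR) {                                    # step 4
    chain[t]    <- y_star
    accepted[t] <- TRUE
  } else {
    chain[t] <- chain[t - 1]
  }
}

post <- chain[1001:n_iter]   # drop a burn-in; the rest samples the posterior
```

Working on the log scale avoids numerical underflow, since the likelihood is a product of many small densities.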

Example

  • We can examine trace plots to see the value of the parameter as the chain progresses

  • More on how the chain might break down can be found in the paper

  • The key is to be able to scale the proposal distribution to minimize autocorrelation

    • This is difficult with many parameters

Minimizing autocorrelation in Metropolis-Hastings

  • A good statistic to monitor is the acceptance rate of \(y^*\)

  • 50% is near optimal for a one-parameter model, and the optimal rate decreases to about 25% as the number of parameters in the model increases

  • There are methods to automatically adjust the proposal density function in Metropolis-Hastings

  • All of these have been mechanized in software such as JAGS and Stan

  • The phase of generating the Markov chain during which the proposal density function is optimized is called the “adaptive” phase (as discussed above)

As models become more complex, adaptive MCMC may not be good enough to eliminate the autocorrelation

  • The theory of Markov chain convergence still holds, but there is no guarantee of how fast the chain will converge

  • If significant autocorrelation remains after the best scaling effort, the next best practice is to increase \(t_3\) until there is a sufficient number of ups and downs in the trace plot, and then take a systematic sample (e.g. every \(k\)th iteration) of the \((t_1 + t_2 + 1)\)th to \((t_1 + t_2 + t_3)\)th iterations

    • This process is known as “thinning” (see the diagnostics sketch below)
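Continuing the earlier Metropolis-Hastings sketch, these diagnostics are straightforward to compute (`chain`, `accepted`, and `post` are the illustrative objects defined there):

```r
# Acceptance rate of y*; roughly 50% is near optimal for a
# one-parameter model
mean(accepted[-1])

plot(chain, type = "l")   # trace plot: look for rapid ups and downs
acf(post)                 # autocorrelation of the retained sample

# Thinning: keep every 10th iteration to reduce autocorrelation
thinned <- post[seq(1, length(post), by = 10)]
```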

11.11.3 Use Case Example for Actuaries

Consider how the posterior distribution of \(\mu\) might be of interest

Suppose we fit a lognormal distribution to a set of claims to determine the cost of an excess (XS) layer

Given the \(\mu\) and \(\sigma\) of the lognormal, we can determine the cost of an XS layer (see Klugman, Panjer, and Willmot and the actuar package)

Since the posterior distribution of \(\mu\) reflects the parameter risk in the model, we can also reflect that parameter risk in the expected cost of a layer by calculating the expected cost of the layer for each \(\mu\) in the simulated posterior distribution

It is also possible to simulate an actual outcome of a loss \(X\) in the layer given each \(\mu\) in the posterior distribution

The distribution of \(X\) calculated this way reflects both the parameter risk and the process risk in the model (see the sketch below)
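A sketch of both calculations in R, with \(\sigma\) held fixed for simplicity; the posterior sample `post_mu`, the attachment, and the limit are made-up inputs. The limited expected value formula is the standard lognormal result from Klugman, Panjer, and Willmot (the actuar package provides it as `levlnorm`):

```r
sigma   <- 1.5
att     <- 1e6         # attachment point of the XS layer
lim     <- 1e6         # layer width (layer covers att to att + lim)
post_mu <- rnorm(5000, mean = 12, sd = 0.05)  # stand-in posterior sample

# Limited expected value of a lognormal at d: E[X ^ d]
lev_lnorm <- function(d, mu, sigma) {
  exp(mu + sigma^2 / 2) * pnorm((log(d) - mu - sigma^2) / sigma) +
    d * (1 - pnorm((log(d) - mu) / sigma))
}

# Expected layer cost for each posterior mu: reflects parameter risk
layer_cost <- lev_lnorm(att + lim, post_mu, sigma) -
              lev_lnorm(att, post_mu, sigma)
mean(layer_cost); quantile(layer_cost, c(0.05, 0.95))

# One simulated loss per posterior mu: the layer losses reflect both
# parameter risk (draw of mu) and process risk (draw of X given mu)
X <- rlnorm(length(post_mu), meanlog = post_mu, sdlog = sigma)
layer_loss <- pmin(pmax(X - att, 0), lim)
```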

11.11.4 Bayesian Inference Using Gibbs Sampling (BUGS)

WinBUGS and JAGS are examples of software using Gibbs sampling

Sample work process:

  1. Read the data into R and call a JAGS script with the R package “runjags”

  2. Fetch the sample of the posterior back into R to calculate statistics of interest

JAGS performs the Metropolis-Hastings algorithm we described above

It also has a number of convergence diagnostics (a minimal sketch of the work process follows)
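A minimal sketch of this work process, assuming a simple lognormal severity model; the model string, priors, data, and settings are all illustrative:

```r
library(runjags)

# Illustrative JAGS model: lognormal severity with vague priors
model_string <- "
model {
  for (i in 1:N) {
    x[i] ~ dlnorm(mu, tau)      # JAGS parameterizes by precision tau
  }
  mu    ~ dnorm(0, 1.0E-4)
  tau   ~ dgamma(0.001, 0.001)
  sigma <- 1 / sqrt(tau)
}"

x <- rlnorm(100, meanlog = 8, sdlog = 1.5)    # stand-in claim data

results <- run.jags(model    = model_string,
                    monitor  = c("mu", "sigma"),
                    data     = list(x = x, N = length(x)),
                    n.chains = 2,
                    adapt    = 1000,    # t1: adaptive phase
                    burnin   = 2000,    # t2: burn-in phase
                    sample   = 5000)    # t3: retained sample

summary(results)                  # includes convergence diagnostics
post <- as.matrix(results$mcmc)   # posterior sample, one column per
                                  # monitored parameter, back in R
```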