Module 19: Fitting Models
  • Module 19 Objective
  • Fitting a Distribution to Data
    • Method of Moments
      • Univariate Distribution
      • Copulas
      • Features of the method of moments
    • Maximum Likelihood
      • Features of MLE
      • Univariate Distribution
      • N-dimensional copulas
      • Choosing between candidate copulas
  • Fitting a Model to Data
    • Least Squares Regression
      • Ordinary Least Squares
      • Generalized Least Squares
      • Test of the overall fit
      • Test of the individual regression coefficients
    • Methods based on Likelihood Functions
      • Likelihood ratio test
      • Information Criteria (IC)
    • Principal Component Analysis
    • Singular Value Decomposition
      • SVD Application Example
  • Model Selection
  • Model Validation

Module 19 Objective

Recommend a specific choice of model based on the results of both quantitative and qualitative analysis of financial or insurance data


Shortlist a number of possible (candidate) models based on certain features of the data

  • E.g. tail dependence shown in the data will indicate which copulas to consider

This module focuses on how each candidate model might be fitted to the data set

  1. Select the method of fitting

  2. Evaluate the goodness of fit

    1. For the model as a whole

    2. For each of the explanatory variables

    3. Use both quantitative and qualitative methods

  3. Final model selection

First consider the fitting of a distribution to a given data set, then go on to look at the fitting of more complex models to a data set

Exam note:

  • Key objective is to be able to recommend a specific choice of model or models for further analysis of specific problems

  • Recommendation needs to be made on the basis of a variety of quantitative measures as well as qualitative analysis

Fitting a Distribution to Data

Method of Moments

Univariate Distribution

Parameterization of univariate distribution

  • Establish the parameters of a distribution empirically by looking at the sample moments and equating these to the population (or true) moments

    e.g. for a distribution with 3 parameters and n data points x_1,...,x_n

    • Set E(X)=\dfrac{1}{n}\sum x_i; \quad E(X^2)=\dfrac{1}{n}\sum x_i^2; \quad E(X^3)=\dfrac{1}{n}\sum x_i^3

    • Solve the equations simultaneously to obtain the parameter values needed to specify the distribution
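
As an illustration, here is a minimal Python sketch of the method of moments, assuming a two-parameter gamma candidate distribution; the helper name and the use of numpy are illustrative, not from the source.

```python
import numpy as np

def fit_gamma_moments(x):
    """Method-of-moments fit of a gamma(alpha, theta) distribution.

    Equate the first two sample moments to the population moments
    E(X) = alpha*theta and E(X^2) = alpha*(alpha + 1)*theta^2,
    then solve for the two parameters."""
    m1 = np.mean(x)             # sample E(X)
    m2 = np.mean(x ** 2)        # sample E(X^2)
    var = m2 - m1 ** 2          # implied variance = alpha * theta^2
    theta_hat = var / m1        # scale parameter
    alpha_hat = m1 / theta_hat  # shape parameter
    return alpha_hat, theta_hat

# usage: simulate data with known parameters and recover them
rng = np.random.default_rng(0)
sample = rng.gamma(shape=2.0, scale=1.5, size=10_000)
print(fit_gamma_moments(sample))   # roughly (2.0, 1.5)
```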

Copulas

Parameterization of copulas

Generally this is done by using estimates of the rank correlation of the data

  • Set the true, underlying correlation of the copula equal to these estimated correlations

  • Specific circumstances will influence the choice of whether to use Kendall’s τ or Spearman’s ρ

For Archimedean copulas

  • Gumbel, Clayton, and Frank are all characterized by α via their generators

    estimate the value of \alpha based on the observed data

  • See Module 18 for relationships between \alpha and \tau

  • Estimate the value of \tau from the data, set it equal to the expression in \alpha, and solve for \alpha

    e.g. Gumbel: \tau = 1 - \dfrac{1}{\alpha} \Rightarrow \alpha = \dfrac{1}{1 - \tau}
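
A minimal sketch of this tau-inversion for the Gumbel copula, assuming scipy is available for the Kendall's tau estimate; the helper name is illustrative.

```python
from scipy.stats import kendalltau

def gumbel_alpha_from_tau(x, y):
    """Estimate Kendall's tau from paired observations, then invert
    tau = 1 - 1/alpha to obtain alpha = 1 / (1 - tau)."""
    tau_hat, _ = kendalltau(x, y)
    if tau_hat <= 0:
        raise ValueError("Gumbel copula requires tau > 0 (alpha >= 1)")
    return 1.0 / (1.0 - tau_hat)
```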

Features of the method of moments

Advantages

  • Generally more straightforward to use than other alternatives

Disadvantages

  • Parameters not necessarily the most likely ones

  • Parameter values may be outside their acceptable ranges

Maximum Likelihood

Features of MLE

Advantages

  • Only generates parameter values that are within the acceptable ranges

  • Any bias in the parameter estimates reduces as the number of observations increases

  • Distribution of each parameter estimate tends toward the normal distribution

Univariate Distribution

Parameterization of Univariate Distribution

  • Likelihood function:

    Expresses the joint probability of the actual observations (x_1,...,x_T) occurring, given the choice of candidate distribution

  • Maximize (w.r.t. each parameter) the log likelihood function

    \ln L = \sum \limits_{t=1}^T \ln f(x_t)

    • f(x) is the marginal PDF of X

    • Effectively maximizing the joint probability L=\prod_{t=1}^T f(x_t)

    • Involves solving several simultaneous equations of the form \partial (\ln (L)) / \partial p_i = 0, where p_i (i=1,...,N) is one of the N parameters of the candidate distribution
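
A hedged sketch of univariate MLE by numerical optimization, assuming a normal candidate distribution and scipy's general-purpose optimizer; the candidate choice and helper name are illustrative.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def fit_normal_mle(x):
    """Maximize ln L = sum_t ln f(x_t) for a normal candidate by
    minimizing the negative log-likelihood numerically."""
    def neg_log_lik(params):
        mu, log_sigma = params          # log-sigma keeps sigma positive
        return -np.sum(norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

    start = np.array([np.mean(x), np.log(np.std(x))])
    result = minimize(neg_log_lik, start, method="Nelder-Mead")
    mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
    return mu_hat, sigma_hat
```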

N-dimensional copulas

Parameterization of N-dimensional copulas

  • For MLE we require the density function of the copula c(u_1,...,u_N) = \dfrac{\partial^N C(u_1,...,u_N)}{\partial u_1 ... \partial u_N}

    • N is the number of variables (= dimensions of the copula)

    • u_n = F(x_n) for n= 1,...,N

  • Evaluate the log-likelihood function using the T observations

    \begin{align} \ln\left(L(\boldsymbol{\theta})\right) &= \ln \left(\prod_{t=1}^T c_{\theta}(u_{1,t}, u_{2,t},...,u_{N,t})\right) \\ &= \sum \limits_{t=1}^T \ln \left(c_{\theta}(u_{1,t}, u_{2,t},...,u_{N,t})\right) \end{align}

  • Maximization gets more complex as the number of variables increases

    In practice this is done using a suitable computer package and the application of numerical methods

Obtaining the copula density function

Recall: If all distribution functions are continuous then:

c(u_1,...u_N) = \dfrac{f(x_1,...,x_N)}{f(x_1)f(x_2)...f(x_N)}

  • Closed form marginal density function

    If the marginal density functions are available in closed form

    \hookrightarrow Probabilities can be expressed in terms of the unknown parameters (\boldsymbol{\theta}) and the log likelihood function maximized

  • Alternatively

    The values of u_{n,t} = F(X_{n,t}) can be derived empirically from the observations and used to calculate the log likelihood function for the candidate copula, maximizing to determine the optimum parameter values
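
A small sketch of deriving the values u_{n,t} = F(x_{n,t}) empirically from ranks; the (T + 1) scaling, which keeps the values strictly inside (0, 1), is a common convention assumed here rather than stated in the source.

```python
import numpy as np
from scipy.stats import rankdata

def pseudo_observations(data):
    """Empirical estimate of u_{n,t} = F(x_{n,t}) for each variable.

    `data` is a T x N array: T observations of N variables."""
    T = data.shape[0]
    return np.column_stack(
        [rankdata(data[:, n]) / (T + 1) for n in range(data.shape[1])]
    )
```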

Gaussian copula

  • MLE of the sample covariance matrix:

    \hat{\boldsymbol{\Sigma}} = \dfrac{1}{T} \sum \limits_{t=1}^T \boldsymbol{\Phi}_t^{-1} \left(\boldsymbol{\Phi}_t^{-1}\right)'

    Where: \boldsymbol{\Phi}_t^{-1} = \left[ \Phi^{-1}\left( F \left( x_{1,t} \right) \right), \Phi^{-1}\left( F \left( x_{2,t} \right) \right),..., \Phi^{-1}\left( F \left( x_{N,t} \right) \right) \right]'

  • e.g. Derivation of the sample covariance matrix when T=1 and N=2 if the marginal distributions are N(0,1)

    \begin{align} \hat{\boldsymbol{\Sigma}} &= \boldsymbol{\Phi}_t^{-1} \left(\boldsymbol{\Phi}_t^{-1}\right)' \\ &= \begin{bmatrix} \Phi^{-1}_{X_1} \left( F_{X_1} \left( x_{1} \right) \right) \\ \Phi^{-1}_{X_2} \left( F_{X_2} \left( x_{2} \right) \right) \\ \end{bmatrix} \begin{bmatrix} \Phi^{-1}_{X_1} \left( F_{X_1} \left( x_{1} \right) \right) & \Phi^{-1}_{X_2} \left( F_{X_2} \left( x_{2} \right) \right) \\ \end{bmatrix} \\ &= \begin{bmatrix} \left(\Phi^{-1}_{X_1} \left( F_{X_1} \left( x_{1} \right) \right)\right)^2 & \Phi^{-1}_{X_1} \left( F_{X_1} \left( x_{1} \right) \right) \Phi^{-1}_{X_2} \left( F_{X_2} \left( x_{2} \right) \right) \\ \Phi^{-1}_{X_2} \left( F_{X_2} \left( x_{2} \right) \right) \Phi^{-1}_{X_1} \left( F_{X_1} \left( x_{1} \right) \right) & \left(\Phi^{-1}_{X_2} \left( F_{X_2} \left( x_{2} \right) \right)\right)^2 \\ \end{bmatrix} \\ & =\begin{bmatrix} x^2_1 & x_1x_2 \\ x_2x_1 & x^2_2 \\ \end{bmatrix} \end{align}
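
A minimal sketch of the Gaussian copula covariance MLE above, assuming the pseudo-observations u_{n,t} have already been computed (e.g. with the helper sketched earlier); the helper name is illustrative.

```python
import numpy as np
from scipy.stats import norm

def gaussian_copula_cov(u):
    """MLE of the Gaussian copula covariance matrix.

    Each row of the T x N array `u` is mapped to the vector of normal
    scores Phi^{-1}(u_t); the outer products are then averaged:
    Sigma_hat = (1/T) * sum_t z_t z_t'."""
    z = norm.ppf(u)          # T x N matrix of normal scores
    T = z.shape[0]
    return (z.T @ z) / T     # equivalent to averaging the outer products
```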

Choosing between candidate copulas

  1. Determine the optimum parameter values for each candidate copula

  2. Compare the values of the ML functions (evaluated using the observations and the relevant optimum parameters) to select the optimal overall model

Fitting a Model to Data

Least Squares Regression

Model the r.v. Y_t (for t=1,2,...,T) with N independent explanatory variables X_{t,n} (n=1,2,...,N)

Y_t = \beta_1 X_{t,1} + \beta_2 X_{t,2} + \dots + \beta_N X_{t,N} + \epsilon_t

  • \mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}

  • \epsilon_t quantifies the degree to which the variables X_{t,n} fail to fully explain the dependent variable Y_t

  • \beta_1 can be a constant if we set X_{t,1} to be a fixed amount for all t

Ordinary Least Squares

Fitting OLS

  • For OLS the \beta_ns are selected to minimize the SSE:

    \boldsymbol{\epsilon}'\boldsymbol{\epsilon} = \epsilon^2_1 + \epsilon^2_2 + \dots + \epsilon^2_T

  • Closed form solution for the minimization problem:

    \mathbf{b} = \left(\hat{\beta}_1, \hat{\beta}_2, ... ,\hat{\beta}_N \right)' = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}
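
A minimal numpy sketch of the OLS closed-form solution; solving the normal equations rather than explicitly inverting \mathbf{X}'\mathbf{X} is an implementation choice, not from the source.

```python
import numpy as np

def ols(X, Y):
    """Closed-form OLS estimate b = (X'X)^{-1} X'Y.

    X is a T x N design matrix (include a column of ones to fit an
    intercept); Y is a length-T vector of observations."""
    return np.linalg.solve(X.T @ X, X.T @ Y)
```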

Assumptions

  1. Linear relationship between variables (variables with non-linear relationships need to be transformed first before fitting)

  2. Inverse of \mathbf{X}'\mathbf{X} exists

    (i.e. \mathbf{X} has full column rank: no column is a linear transformation or combination of any others)

  3. Error term properties

    1. Not correlated with each other

      (i.e. no serial correlation exists that has not been modeled by the explanatory variables)

    2. Constant and finite variance \sigma^2

    3. Normally distributed

      (Only needed for the significance tests)

Generalized Least Squares

GLS loosens OLS's assumptions on the error terms

  1. Variance of the error terms is not necessarily assumed to be constant

  2. Error terms don’t have to be uncorrelated with each other

Variances and covariances of the error terms = \sigma^2 \boldsymbol{\Omega}

  • \sigma^2: constant

  • \boldsymbol{\Omega}: Matrix of weightings

Then:

  • Uncorrelated error terms with non-constant variance can be modeled by setting:

    \boldsymbol{\Omega} = diag(\Omega_{1,1}, \Omega_{2,2},...,\Omega_{T,T})

    • A diagonal matrix with entries that adjust (scale) \sigma^2 appropriately over time
  • Serial correlated error terms with constant variance can be modeled by:

    Including correlations between observations in the off-diagonal entries of \boldsymbol{\Omega} and setting the diagonal entries all to 1

  • Both heteroskedastic and serial correlation

    \sigma^2 \Omega = \sigma^2 \begin{bmatrix} \Omega_{1,1} & \rho_{1,2} & \dots & \rho_{1,T} \\ \rho_{2,1} & \Omega_{2,2} & \dots & \rho_{2,T} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{T,1} & \rho_{T,2} & \dots & \Omega_{T,T} \\ \end{bmatrix} = \begin{bmatrix} \sigma^2_1 & \sigma_{1,2} & \dots & \sigma_{1,T} \\ \sigma_{2,1} & \sigma^2_2 & \dots & \sigma_{2,T} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{T,1} & \sigma_{T,2} & \dots & \sigma^2_T \\ \end{bmatrix}

Closed form solution of minimizing the SSE:

\mathbf{b} = (b_1,b_2,...,b_N)' = (\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{Y}
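
A minimal numpy sketch of the GLS closed-form solution; the explicit inversion of \boldsymbol{\Omega} is for clarity only.

```python
import numpy as np

def gls(X, Y, Omega):
    """Closed-form GLS estimate b = (X' Omega^{-1} X)^{-1} X' Omega^{-1} Y.

    Omega is the T x T weighting matrix for the error terms
    (diagonal for pure heteroskedasticity, dense off-diagonal entries
    if the errors are also serially correlated)."""
    Omega_inv = np.linalg.inv(Omega)
    XtOi = X.T @ Omega_inv
    return np.linalg.solve(XtOi @ X, XtOi @ Y)
```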

Test of the overall fit

  1. The coefficient of determination [1] (R^2) can be used to assess the overall fit

    • Higher values = better fit

    • Use the adjusted coefficient of determination [2] (R_{\alpha}^2), which does not automatically increase just by adding extra parameters

  2. Can use the F-test [3] to test the overall regression results (if the error terms are normally distributed)

  • H_0 is that all regression coefficients are zero

Test of the individual regression coefficients

Test whether each variable is significant by estimating the variance of the error terms [4]

If we assume normally distributed errors, we can use the t-test

  • H_0 is that \beta_n is 0

  • Tests the level of significance by which the coefficient (b_n) differs from zero
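
A hedged sketch of these coefficient t-tests, assuming OLS with normally distributed errors and using the formulas collected in footnote 4; the helper name is illustrative.

```python
import numpy as np
from scipy import stats

def coefficient_t_tests(X, Y):
    """t-statistics and two-sided p-values for H0: beta_n = 0."""
    T, N = X.shape
    b = np.linalg.solve(X.T @ X, X.T @ Y)       # OLS estimates
    resid = Y - X @ b
    s2 = resid @ resid / (T - N)                # unbiased error variance s^2
    S_b = s2 * np.linalg.inv(X.T @ X)           # covariance matrix of b
    se = np.sqrt(np.diag(S_b))                  # standard errors s_{b_n}
    t_stats = b / se
    p_values = 2 * stats.t.sf(np.abs(t_stats), df=T - N)
    return b, t_stats, p_values
```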

Methods based on Likelihood Functions

Likelihood ratio test

Similar to fitting distributions, but iterative techniques may be required

Nested models:

If the 2nd model contains all the independent variables of the first + some additional variables

Likelihood ratio test:

Test whether the addition of these variables results in significantly improved explanatory power

  • H_0 is that the additional variables do not significantly improve the explanatory power

  • Test statistics:

    LR = -2 \ln(L_1/L_2) \sim \chi^2_{N_2 - N_1}

    • L_1 and L_2 are the values of the likelihood functions for the 2 models

    • N_1 and N_2 are the # of independent variables in each model, including the constant
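
A minimal sketch of the likelihood ratio test, assuming the two maximized log-likelihoods have already been computed for the nested models; the helper name is illustrative.

```python
from scipy.stats import chi2

def likelihood_ratio_test(lnL1, lnL2, N1, N2):
    """LR = -2 ln(L1/L2) = 2 (ln L2 - ln L1), compared against a
    chi-square distribution with N2 - N1 degrees of freedom.

    Model 2 must nest model 1 (contain all its variables plus extras)."""
    LR = 2.0 * (lnL2 - lnL1)
    p_value = chi2.sf(LR, df=N2 - N1)
    return LR, p_value
```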

Information Criteria (IC)

Used to compare alternative models

  • IC are not restricted to the comparison of nested models as is the case with likelihood ratio tests

IC only enable the ranking of alternative models

  • i.e. does not quantify the statistical significance of any differences between the alternatives

Akaike Information Criterion (AIC)

AIC = 2N - 2 \ln(L)

Bayesian Information Criterion (BIC)

BIC = N \ln(T) - 2 \ln(L)

  • N: # of independent variables in the model

  • T: # of observations

The lower the value of the AIC, the better the fit of the model to the data

BIC penalizes the introduction of another independent variable more severely

\hookrightarrow So its use tends to result in less complex models being selected compared to when using the AIC
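
A minimal sketch of the two criteria as defined above; the helper names are illustrative.

```python
import numpy as np

def aic(lnL, N):
    """Akaike Information Criterion: 2N - 2 ln L (lower is better)."""
    return 2 * N - 2 * lnL

def bic(lnL, N, T):
    """Bayesian Information Criterion: N ln T - 2 ln L.
    Penalizes extra parameters more heavily than AIC once ln T > 2."""
    return N * np.log(T) - 2 * lnL
```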

Principal Component Analysis

Fits the data to independent parameters, weighting their relative importance by the size of their eigenvalues (see Module 16)

Singular Value Decomposition

Advantage of PCA

  • Readily facilitates stochastic projections

Disadvantage of PCA

  • Model parameters do not necessarily have any intuitive interpretation (limited explanatory power)

  • Requires identification of the covariance matrix for the chosen N independent explanatory variables

Advantage of SVD over PCA

  • Based on a least squares optimization

    Does not require identification of the covariance matrix

  • Operates on the original data with no requirement to identify independent variables upon which to base a regression

Assumptions for using SVD

  • SVD determines the best linear relationship between the values of a set of N variables X_{1,t},X_{2,t},...,X_{N,t} at each time t

  • If we assume the relationship continues (i.e. if we have a new row of data for a subsequent time), we can use it to predict the value of one or more “missing” future data values for these variables

Process of applying SVD to a set of M observations of N variables

  1. Let \mathbf{X} (M \times N matrix) be a set of M observations of N variables

  2. If \mathbf{X} has column rank R

    \hookrightarrow It can be expressed as the linear combination of R orthogonal matrices

    • Where each matrix cannot be expressed as a linear combination of the others (by definition of orthogonality)
  3. Then each orthogonal matrix can be broken down as the product of 2 vectors:

    \begin{bmatrix} X_{1,1} & X_{1,2} & \dots & X_{1,N} \\ X_{2,1} & X_{2,2} & \dots & X_{2,N} \\ \vdots & \vdots & \ddots & \vdots \\ X_{M,1} & X_{M,2} & \dots & X_{M,N} \\ \end{bmatrix} = \sum \limits_{i=1}^R L_i \begin{bmatrix} U_{1,i} \\ U_{2,i} \\ \vdots \\ U_{M,i} \\ \end{bmatrix} \begin{bmatrix} V_{1,i} & V_{2,i} & \dots & V_{N,i} \\ \end{bmatrix}

    1. Singular values L_i:

      • Square roots of the eigenvalues of \mathbf{X}\mathbf{X}'

      • L_1 is the largest, with subsequent values decreasing down to L_R

    2. Left singular vectors: \mathbf{U}_i

    3. Right singular vectors: \mathbf{V}_i

  • The 3 components above are found from the original dataset using an iterative method (rather than from the covariance matrix as in PCA)

    Similar to PCA, a large proportion of the variation in a series of data might be explained by a small number of factors

    \hookrightarrow So the iterative process might be terminated before all the singular vectors have been determined

    • e.g. when L_r is sufficiently small

    \hookrightarrow We can make good predictions of future data values based on a small subset of the other variables

(See more details in appendix of CMP)
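
A minimal numpy sketch of a truncated SVD of the data matrix, keeping only the first r singular values as described above; the helper name and return layout are illustrative.

```python
import numpy as np

def truncated_svd(X, r):
    """Decompose an M x N data matrix as X ≈ sum_{i=1}^r L_i U_i V_i',
    keeping only the r largest singular values."""
    U, L, Vt = np.linalg.svd(X, full_matrices=False)  # L sorted descending
    X_r = U[:, :r] @ np.diag(L[:r]) @ Vt[:r, :]       # rank-r reconstruction
    return L[:r], U[:, :r], Vt[:r, :].T, X_r
```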

SVD Application Example

Lee-Carter model for mortality rates

Assumes that mortality rates at all ages (x) are determined by: \ln(m_{x,t}) = \alpha_x+ \beta_x \kappa_t + \epsilon_{x,t}

  • \alpha_x:

    Age specific parameter that indicates the average level of \ln(m_{x,t}) over time t

  • \beta_x:

    Age specific parameter that characterises the sensitivity of \ln(m_{x,t}) to changes in a mortality index \kappa_t

  • \epsilon_{x,t}:

    Error term that captures all remaining variations

Parameters on the r.h.s. are unobservable (so OLS or PCA cannot be used directly)

Applying SVD

  1. Set the \hat{\alpha}_x to the average of \ln m_{x,t} over the sample period

  2. Apply SVD to the matrix of \left\{ \ln m_{x,t} - \hat{\alpha}_x \right\}

  3. The resulting first left and right singular vectors enable estimation of \beta_x and \kappa_t respectively

(CMP has more details on the matrices…)
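
A hedged sketch of this SVD-based Lee-Carter fit, assuming ages in rows and calendar years in columns; the normalization of \beta_x is a common convention assumed here, not taken from the source.

```python
import numpy as np

def lee_carter_svd(log_m):
    """Estimate Lee-Carter parameters from a matrix of log mortality
    rates ln(m_{x,t}), with ages x in rows and years t in columns.

    alpha_x = row averages over time; beta_x and kappa_t come from the
    first left and right singular vectors of the centred matrix."""
    alpha = log_m.mean(axis=1)                      # average over time
    centred = log_m - alpha[:, None]
    U, L, Vt = np.linalg.svd(centred, full_matrices=False)
    beta = U[:, 0]
    kappa = L[0] * Vt[0, :]
    # convention: scale so that beta sums to 1 (product beta*kappa unchanged)
    kappa = kappa * beta.sum()
    beta = beta / beta.sum()
    return alpha, beta, kappa
```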

Model Selection

For a complex model to be worth using, we need to make sure the additional complexity is justified by a significant improvement in the maximum log-likelihood

  • If choosing simply by the highest log-likelihood, then the model with the largest # of parameters will prevail (esp. when simpler models are nested within the largest model)

  • AIC and BIC penalize for extra parameters (while BIC is more punitive \Rightarrow less complex models)

  • Other quantitative tests such as the \chi^2 test can be used

Graphical diagnostic tests

  1. QQ plots

  2. Histograms with superimposed fitted density functions

  3. Empirical CDFs with superimposed fitted CDFs

  4. Autocorrelation functions of time series data (ACFs)

Use the above plots and state whether or not a given plot is consistent with a stated hypothesis or model

  • If consistent, should propose additional quantitative tests

  • If not, should be able to interpret the plot and suggest alternative models

Need to be able to recommend a specific choice of copula by applying both quantitative and qualitative analysis using different models

  • Quantitative: test of goodness of fit, AIC, BIC

  • Qualitative: graphical comparisons of data with candidate copulas

Model Validation

The model should be fitted using a training set and then tested against a testing set of comparable size

Type of test depends on the form of the model

  1. Backtesting:

    Fitting a time series model to data from one time period and then testing how well it predicts observed values from a subsequent period

  2. Cross-sectional models:

    Models based on dependent variables can be fitted using one set of the data (the training set) and tested using an independent data set (not necessarily from a different time period)


  1. R^2 = \dfrac{SSR}{SST} = 1 - \dfrac{SSE}{SST} \in [0,1]

    Where:

    \begin{array}{cccccc} &SST &= &SSR &+ &SSE \\ & \sum \limits_{t=1}^T(Y_t - \bar{Y})^2 &= &\sum \limits_{t=1}^T(\hat{Y}_t - \bar{Y})^2 &+ &\sum \limits_{t=1}^T(\underbrace{Y_t - \hat{Y}_t}_{\hat{\epsilon}_t})^2 \\ \end{array}

    • SST = total sum of squares

    • SSR = sum of squares explained by regression

    • SSE = sum of squares error (unexplained deviations)

    • Average of the observation: \bar{Y} = \dfrac{1}{T}\sum \limits_{t=1}^T Y_t

    • Predicted value: \hat{Y}_t = \sum \limits_{n=1}^N b_n x_{t,n}

  2. R_{\alpha}^2 = 1 - \dfrac{SSE/(T-N)}{SST/(T-1)} = 1 - \dfrac{T-1}{T-N}(1-R^2)

  3. \dfrac{SSR/(N-1)}{SSE/(T-N)} = \dfrac{R^2/(N-1)}{(1-R^2)/(T-N)} \sim F^{N-1}_{T-N}

  4. s^2 = \dfrac{SSE}{T-N}

    Where:

    • Deduction of the number of explanatory variables (N) in the denominator ensures that the estimate is unbiased
    • s is called the standard error of the regression

    Sample covariance matrix for the vector of estimates \mathbf{b} is \mathbf{S_b}= s^2(\mathbf{X}'\mathbf{X})^{-1}

    Square root of the nth diagonal element of \mathbf{S_b} is s_{b_n}, the standard error of the estimate b_n

    Assuming normally distributed error term, we can use the t-test with the following statistics:
    \dfrac{b_n - \beta_n}{s_{b_n}} \sim t_{T-N}
    Typical confidence level is 90%