Module 19: Fitting Models
  • Module 19 Objective
  • Fitting a Distribution to Data
    • Method of Moments
      • Univariate Distribution
      • Copulas
      • Features of the method of moments
    • Maximum Likelihood
      • Features of MLE
      • Univariate Distribution
      • N-dimensional copulas
      • Choosing between candidate copulas
  • Fitting a Model to Data
    • Least Squares Regression
      • Ordinary Least Squares
      • Generalized Least Squares
      • Test of the overall fit
      • Test of the individual regression coefficients
    • Methods based on Likelihood Functions
      • Likelihood ratio test
      • Information Criteria (IC)
    • Principal Component Analysis
    • Singular Value Decomposition
      • SVD Application Example
  • Model Selection
  • Model Validation

Module 19 Objective

Recommend a specific choice of model based on the results of both quantitative and qualitative analysis of financial or insurance data


Shortlist a number of possible (candidate) models based on certain features of the data

  • E.g. tail dependence shown in the data will indicate which copulas to consider

This module focuses on how each candidate model might be fitted to the data set

  1. Select the method of fitting

  2. Evaluate the goodness of fit

    1. For the model as a whole

    2. For each of the explanatory variables

    3. Use both quantitative and qualitative methods

  3. Final model selection

First consider the fitting of a distribution to a given data set, then go on to look at the fitting of more complex models to a data set

Exam note:

  • Key objective is to be able to recommend a specific choice of model or models for further analysis of specific problems

  • Recommendation needs to be made on the basis of a variety of quantitative measures as well as qualitative analysis

Fitting a Distribution to Data

Method of Moments

Univariate Distribution

Parameterization of univariate distribution

  • Establish the parameters of a distribution empirically by looking at the sample moments and equating these to the population (or true) moments

    e.g. for a distribution with 3 parameters and n data points x_1,...,x_n

    • Set E(X)=\dfrac{1}{n}\sum x_i; \quad E(X^2)=\dfrac{1}{n}\sum x_i^2; \quad E(X^3)=\dfrac{1}{n}\sum x_i^3

    • Solve the equations simultaneously to obtain the parameter values needed to specify the distribution
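
As an illustration, here is a minimal Python sketch of the method of moments, assuming a two-parameter gamma candidate distribution; the helper name and the use of numpy are illustrative, not from the source.

```python
import numpy as np

def fit_gamma_moments(x):
    """Method-of-moments fit of a gamma(alpha, theta) distribution.

    Equate the first two sample moments to the population moments
    E(X) = alpha*theta and E(X^2) = alpha*(alpha + 1)*theta^2,
    then solve for the two parameters."""
    m1 = np.mean(x)             # sample E(X)
    m2 = np.mean(x ** 2)        # sample E(X^2)
    var = m2 - m1 ** 2          # implied variance = alpha * theta^2
    theta_hat = var / m1        # scale parameter
    alpha_hat = m1 / theta_hat  # shape parameter
    return alpha_hat, theta_hat

# usage: simulate data with known parameters and recover them
rng = np.random.default_rng(0)
sample = rng.gamma(shape=2.0, scale=1.5, size=10_000)
print(fit_gamma_moments(sample))   # roughly (2.0, 1.5)
```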

Copulas

Parameterization of copulas

Generally this is done by using estimates of the rank correlation of the data

  • Set the true, underlying correlation of the copula equal to these estimated correlations

  • Specific circumstances will influence the choice of whether to use Kendall’s τ or Spearman’s ρ

For Archimedean copulas

  • Gumbel, Clayton, and Frank are all characterized by α via their generators

    estimate the value of \alpha based on the observed data

  • See Module 18 for relationships between \alpha and \tau

  • Estimate the value of \tau from the data, set it equal to the expression in \alpha, and solve for \alpha

    e.g. Gumbel: \tau = 1 - \dfrac{1}{\alpha} \Rightarrow \alpha = \dfrac{1}{1 - \tau}
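
A minimal sketch of this tau-inversion for the Gumbel copula, assuming scipy is available for the Kendall's tau estimate; the helper name is illustrative.

```python
from scipy.stats import kendalltau

def gumbel_alpha_from_tau(x, y):
    """Estimate Kendall's tau from paired observations, then invert
    tau = 1 - 1/alpha to obtain alpha = 1 / (1 - tau)."""
    tau_hat, _ = kendalltau(x, y)
    if tau_hat <= 0:
        raise ValueError("Gumbel copula requires tau > 0 (alpha >= 1)")
    return 1.0 / (1.0 - tau_hat)
```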

Features of the method of moments

Advantages

  • Generally more straightforward to use than other alternatives

Disadvantages

  • Parameters not necessarily the most likely ones

  • Parameter values may be outside their acceptable ranges

Maximum Likelihood

Features of MLE

Advantages

  • Only generates parameter values that are within the acceptable ranges

  • Any bias in the parameter estimates reduces as the number of observations increases

  • Distribution of each parameter estimate tends toward the normal distribution

Univariate Distribution

Parameterization of Univariate Distribution

  • Likelihood function:

    Expresses the joint probability of the actual observations (x_1,...,x_T) occurring, given the choice of candidate distribution

  • Maximize (w.r.t. each parameter) the log likelihood function

    \ln L = \sum \limits_{t=1}^T \ln f(x_t)

    • f(x) is the marginal PDF of X

    • Effectively maximizing the joint probability L=\prod_{t=1}^T f(x_t)

    • Involves solving several simultaneous equations of the form \partial (\ln (L)) / \partial p_i = 0, where p_i (i=1,...,N) is one of the N parameters of the candidate distribution
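
A hedged sketch of univariate MLE by numerical optimization, assuming a normal candidate distribution and scipy's general-purpose optimizer; the candidate choice and helper name are illustrative.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def fit_normal_mle(x):
    """Maximize ln L = sum_t ln f(x_t) for a normal candidate by
    minimizing the negative log-likelihood numerically."""
    def neg_log_lik(params):
        mu, log_sigma = params          # log-sigma keeps sigma positive
        return -np.sum(norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

    start = np.array([np.mean(x), np.log(np.std(x))])
    result = minimize(neg_log_lik, start, method="Nelder-Mead")
    mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
    return mu_hat, sigma_hat
```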

N-dimensional copulas

Parameterization of N-dimensional copulas

  • For MLE we require the density function of the copula c(u_1,...,u_N) = \dfrac{\partial^N C(u_1,...,u_N)}{\partial u_1 ... \partial u_N}

    • N is the number of variables (= dimensions of the copula)

    • u_n = F(x_n) for n= 1,...,N

  • Evaluate the log-likelihood function using the T observations

    \begin{align} \ln\left(L(\boldsymbol{\theta})\right) &= \ln \left(\prod_{t=1}^T c_{\theta}(u_{1,t}, u_{2,t},...,u_{N,t})\right) \\ &= \sum \limits_{t=1}^T \ln \left(c_{\theta}(u_{1,t}, u_{2,t},...,u_{N,t})\right) \end{align}

  • Maximization gets more complex as the number of variables increases

    In practice this is done using a suitable computer package and the application of numerical methods

Obtaining the copula density function

Recall: If all distribution functions are continuous then:

c(u_1,...u_N) = \dfrac{f(x_1,...,x_N)}{f(x_1)f(x_2)...f(x_N)}

  • Closed form marginal density function

    If the marginal density functions are available in closed form

    \hookrightarrow Probabilities can be expressed in terms of the unknown parameters (\boldsymbol{\theta}) and the log likelihood function maximized

  • Alternatively

    The values of u_{n,t} = F(X_{n,t}) can be derived empirically from the observations and used to calculate the log likelihood function for the candidate copula, maximizing to determine the optimum parameter values
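
A small sketch of deriving the values u_{n,t} = F(x_{n,t}) empirically from ranks; the (T + 1) scaling, which keeps the values strictly inside (0, 1), is a common convention assumed here rather than stated in the source.

```python
import numpy as np
from scipy.stats import rankdata

def pseudo_observations(data):
    """Empirical estimate of u_{n,t} = F(x_{n,t}) for each variable.

    `data` is a T x N array: T observations of N variables."""
    T = data.shape[0]
    return np.column_stack(
        [rankdata(data[:, n]) / (T + 1) for n in range(data.shape[1])]
    )
```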

Gaussian copula

  • MLE of the sample covariance matrix:

    \hat{\boldsymbol{\Sigma}} = \dfrac{1}{T} \sum \limits_{t=1}^T \boldsymbol{\Phi}_t^{-1} \left(\boldsymbol{\Phi}_t^{-1}\right)'

    Where: \boldsymbol{\Phi}_t^{-1} = \left[ \Phi^{-1}\left( F \left( x_{1,t} \right) \right), \Phi^{-1}\left( F \left( x_{2,t} \right) \right),..., \Phi^{-1}\left( F \left( x_{N,t} \right) \right) \right]'

  • e.g. Derivation of the sample covariance matrix when T=1 and N=2 if the marginal distributions are N(0,1)

    \begin{align} \hat{\boldsymbol{\Sigma}} &= \boldsymbol{\Phi}_t^{-1} \left(\boldsymbol{\Phi}_t^{-1}\right)' \\ &= \begin{bmatrix} \Phi^{-1}_{X_1} \left( F_{X_1} \left( x_{1} \right) \right) \\ \Phi^{-1}_{X_2} \left( F_{X_2} \left( x_{2} \right) \right) \\ \end{bmatrix} \begin{bmatrix} \Phi^{-1}_{X_1} \left( F_{X_1} \left( x_{1} \right) \right) & \Phi^{-1}_{X_2} \left( F_{X_2} \left( x_{2} \right) \right) \\ \end{bmatrix} \\ &= \begin{bmatrix} \left(\Phi^{-1}_{X_1} \left( F_{X_1} \left( x_{1} \right) \right)\right)^2 & \Phi^{-1}_{X_1} \left( F_{X_1} \left( x_{1} \right) \right) \Phi^{-1}_{X_2} \left( F_{X_2} \left( x_{2} \right) \right) \\ \Phi^{-1}_{X_2} \left( F_{X_2} \left( x_{2} \right) \right) \Phi^{-1}_{X_1} \left( F_{X_1} \left( x_{1} \right) \right) & \left(\Phi^{-1}_{X_2} \left( F_{X_2} \left( x_{2} \right) \right)\right)^2 \\ \end{bmatrix} \\ & =\begin{bmatrix} x^2_1 & x_1x_2 \\ x_2x_1 & x^2_2 \\ \end{bmatrix} \end{align}
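
A minimal sketch of the Gaussian copula covariance MLE above, assuming the pseudo-observations u_{n,t} have already been computed (e.g. with the helper sketched earlier); the helper name is illustrative.

```python
import numpy as np
from scipy.stats import norm

def gaussian_copula_cov(u):
    """MLE of the Gaussian copula covariance matrix.

    Each row of the T x N array `u` is mapped to the vector of normal
    scores Phi^{-1}(u_t); the outer products are then averaged:
    Sigma_hat = (1/T) * sum_t z_t z_t'."""
    z = norm.ppf(u)          # T x N matrix of normal scores
    T = z.shape[0]
    return (z.T @ z) / T     # equivalent to averaging the outer products
```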

Choosing between candidate copulas

  1. Determine the optimum parameter values for each candidate copula

  2. Compare the values of the ML functions (evaluated using the observations and the relevant optimum parameters) to select the optimal overall model

Fitting a Model to Data

Least Squares Regression

Model the r.v. Y_t (for t=1,2,...,T) with N independent explanatory variables X_{t,n} (n=1,2,...,N)

Y_t = \beta_1 X_{t,1} + \beta_2 X_{t,2} + \dots + \beta_N X_{t,N} + \epsilon_t

  • \mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}

  • \epsilon_t quantifies the degree to which the variables X_{t,n} fail to fully explain the dependent variable Y_t

  • \beta_1 can be a constant if we set X_{t,1} to be a fixed amount for all t

Ordinary Least Squares

Fitting OLS

  • For OLS the \beta_ns are selected to minimize the SSE:

    \boldsymbol{\epsilon}'\boldsymbol{\epsilon} = \epsilon^2_1 + \epsilon^2_2 + \dots + \epsilon^2_T

  • Closed form solution for the minimization problem:

    \mathbf{b} = \left(\hat{\beta}_1, \hat{\beta}_2, ... ,\hat{\beta}_N \right)' = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}
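
A minimal numpy sketch of the OLS closed-form solution; solving the normal equations rather than explicitly inverting \mathbf{X}'\mathbf{X} is an implementation choice, not from the source.

```python
import numpy as np

def ols(X, Y):
    """Closed-form OLS estimate b = (X'X)^{-1} X'Y.

    X is a T x N design matrix (include a column of ones to fit an
    intercept); Y is a length-T vector of observations."""
    return np.linalg.solve(X.T @ X, X.T @ Y)
```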

Assumptions

  1. Linear relationship between variables (variables with non-linear relationships need to be transformed first before fitting)

  2. Inverse of \mathbf{X}'\mathbf{X} exists

    (i.e. \mathbf{X} has full column rank: no column is a linear transformation or combination of any others)

  3. Error term properties

    1. Not correlated with each other

      (i.e. no serial correlation exists that has not been modeled by the explanatory variables)

    2. Constant and finite variance \sigma^2

    3. Normally distributed

      (Only needed for the significance tests)

Generalized Least Squares

GLS loosens OLS's assumptions on the error terms

  1. Variance of the error terms is not necessarily assumed to be constant

  2. Error terms don’t have to be uncorrelated with each other

Variances and covariances of the error terms = \sigma^2 \boldsymbol{\Omega}

  • \sigma^2: constant

  • \boldsymbol{\Omega}: Matrix of weightings

Then:

  • Uncorrelated error terms with non-constant variance can be modeled by setting:

    \boldsymbol{\Omega} = diag(\Omega_{1,1}, \Omega_{2,2},...,\Omega_{T,T})

    • A diagonal matrix with entries that adjust (scale) \sigma^2 appropriately over time
  • Serial correlated error terms with constant variance can be modeled by:

    Including correlations between observations in the off-diagonal entries of \boldsymbol{\Omega} and setting the diagonal entries all to 1

  • Both heteroskedastic and serial correlation

    \sigma^2 \Omega = \sigma^2 \begin{bmatrix} \Omega_{1,1} & \rho_{1,2} & \dots & \rho_{1,T} \\ \rho_{2,1} & \Omega_{2,2} & \dots & \rho_{2,T} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{T,1} & \rho_{T,2} & \dots & \Omega_{T,T} \\ \end{bmatrix} = \begin{bmatrix} \sigma^2_1 & \sigma_{1,2} & \dots & \sigma_{1,T} \\ \sigma_{2,1} & \sigma^2_2 & \dots & \sigma_{2,T} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{T,1} & \sigma_{T,2} & \dots & \sigma^2_T \\ \end{bmatrix}

Closed form solution of minimizing the SSE:

\mathbf{b} = (b_1,b_2,...,b_N)' = (\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{Y}
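
A minimal numpy sketch of the GLS closed-form solution; the explicit inversion of \boldsymbol{\Omega} is for clarity only.

```python
import numpy as np

def gls(X, Y, Omega):
    """Closed-form GLS estimate b = (X' Omega^{-1} X)^{-1} X' Omega^{-1} Y.

    Omega is the T x T weighting matrix for the error terms
    (diagonal for pure heteroskedasticity, dense off-diagonal entries
    if the errors are also serially correlated)."""
    Omega_inv = np.linalg.inv(Omega)
    XtOi = X.T @ Omega_inv
    return np.linalg.solve(XtOi @ X, XtOi @ Y)
```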

Test of the overall fit

  1. The coefficient of determination [1] (R^2) can be used to assess the overall fit

    • Higher values = better fit

    • Use the adjusted coefficient of determination [2] (R_{\alpha}^2), which does not automatically increase just by adding extra parameters

  2. Can use the F-test [3] to test the overall regression results (if the error terms are normally distributed)

  • H_0 is that all regression coefficients are zero

Test of the individual regression coefficients

Test whether each variable is significant by estimating the variance of the error terms [4]

If we assume normally distributed errors, we can use the t-test

  • H_0 is that \beta_n is 0

  • Tests the level of significance by which the coefficient (b_n) differs from zero
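
A hedged sketch of these coefficient t-tests, assuming OLS with normally distributed errors and using the formulas collected in footnote 4; the helper name is illustrative.

```python
import numpy as np
from scipy import stats

def coefficient_t_tests(X, Y):
    """t-statistics and two-sided p-values for H0: beta_n = 0."""
    T, N = X.shape
    b = np.linalg.solve(X.T @ X, X.T @ Y)       # OLS estimates
    resid = Y - X @ b
    s2 = resid @ resid / (T - N)                # unbiased error variance s^2
    S_b = s2 * np.linalg.inv(X.T @ X)           # covariance matrix of b
    se = np.sqrt(np.diag(S_b))                  # standard errors s_{b_n}
    t_stats = b / se
    p_values = 2 * stats.t.sf(np.abs(t_stats), df=T - N)
    return b, t_stats, p_values
```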

Methods based on Likelihood Functions

Likelihood ratio test

Similar to fitting distributions, but iterative techniques may be required

Nested models:

If the 2nd model contains all the independent variables of the first + some additional variables

Likelihood ratio test:

Test whether the addition of these variables results in significantly improved explanatory power

  • H_0 is that the additional variables do not significantly improve the explanatory power

  • Test statistics:

    LR = -2 \ln(L_1/L_2) \sim \chi^2_{N_2 - N_1}

    • L_1 and L_2 are the values of the likelihood functions for the 2 models

    • N_1 and N_2 are the # of independent variables in each model, including the constant
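
A minimal sketch of the likelihood ratio test, assuming the two maximized log-likelihoods have already been computed for the nested models; the helper name is illustrative.

```python
from scipy.stats import chi2

def likelihood_ratio_test(lnL1, lnL2, N1, N2):
    """LR = -2 ln(L1/L2) = 2 (ln L2 - ln L1), compared against a
    chi-square distribution with N2 - N1 degrees of freedom.

    Model 2 must nest model 1 (contain all its variables plus extras)."""
    LR = 2.0 * (lnL2 - lnL1)
    p_value = chi2.sf(LR, df=N2 - N1)
    return LR, p_value
```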

Information Criteria (IC)

Used to compare alternative models

  • IC are not restricted to the comparison of nested models as is the case with likelihood ratio tests

IC only enable the ranking of alternative models

  • i.e. does not quantify the statistical significance of any differences between the alternatives

Akaike Information Criterion (AIC)

AIC = 2N - 2 \ln(L)

Bayesian Information Criterion (BIC)

BIC = N \ln(T) - 2 \ln(L)

  • N: # of independent variables in the model

  • T: # of observations

The lower the value of the AIC, the better the fit of the model to the data

BIC penalizes the introduction of another independent variable more severely

\hookrightarrow So its use tends to result in less complex models being selected compared to when using the AIC
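
A minimal sketch of the two criteria as defined above; the helper names are illustrative.

```python
import numpy as np

def aic(lnL, N):
    """Akaike Information Criterion: 2N - 2 ln L (lower is better)."""
    return 2 * N - 2 * lnL

def bic(lnL, N, T):
    """Bayesian Information Criterion: N ln T - 2 ln L.
    Penalizes extra parameters more heavily than AIC once ln T > 2."""
    return N * np.log(T) - 2 * lnL
```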

Principal Component Analysis

Fits the data to independent parameters, weighting their relative importance by the size of their eigenvalues (see Module 16)

Singular Value Decomposition

Advantage of PCA

  • Readily facilitates stochastic projections

Disadvantage of PCA

  • Model parameters do not necessarily have any intuitive interpretation (limited explanatory power)

  • Requires identification of the covariance matrix for the chosen N independent explanatory variables

Advantage of SVD over PCA

  • Based on a least squares optimization

    Does not require identification of the covariance matrix

  • Operates on the original data with no requirement to identify independent variables upon which to base a regression

Assumptions for using SVD

  • SVD determines the best linear relationship between the values of a set of N variables X_{1,t},X_{2,t},...,X_{N,t} at each time t

  • If we assume the relationship continues (i.e. if we have a new row of data for a subsequent time), we can use it to predict the value of one or more “missing” future data values for these variables

Process of applying SVD to a set of M observations of N variables

  1. Let \mathbf{X} (M \times N matrix) be a set of M observations of N variables

  2. If \mathbf{X} has column rank R

    \hookrightarrow It can be expressed as the linear combination of R orthogonal matrices

    • Where each matrix cannot be expressed as a linear combination of the others (by definition of orthogonality)
  3. Then each orthogonal matrix can be broken down as the product of 2 vectors:

    \begin{bmatrix} X_{1,1} & X_{1,2} & \dots & X_{1,N} \\ X_{2,1} & X_{2,2} & \dots & X_{2,N} \\ \vdots & \vdots & \ddots & \vdots \\ X_{M,1} & X_{M,2} & \dots & X_{M,N} \\ \end{bmatrix} = \sum \limits_{i=1}^R L_i \begin{bmatrix} U_{1,i} \\ U_{2,i} \\ \vdots \\ U_{M,i} \\ \end{bmatrix} \begin{bmatrix} V_{1,i} & V_{2,i} & \dots & V_{N,i} \\ \end{bmatrix}

    1. Singular values L_i:

      • Square roots of the eigenvalues of \mathbf{X}\mathbf{X}'

      • L_1 is the largest, with subsequent values decreasing down to L_R

    2. Left singular vectors: \mathbf{U}_i

    3. Right singular vectors: \mathbf{V}_i

  • The 3 components above are found from the original dataset using an iterative method (rather than from the covariance matrix as in PCA)

    Similar to PCA, a large proportion of the variation in a series of data might be explained by a small number of factors

    \hookrightarrow So the iterative process might be terminated before all the singular vectors have been determined

    • e.g. when L_r is sufficiently small

    \hookrightarrow We can make good predictions of future data values based on a small subset of the other variables

(See more details in appendix of CMP)
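
A minimal numpy sketch of a truncated SVD of the data matrix, keeping only the first r singular values as described above; the helper name and return layout are illustrative.

```python
import numpy as np

def truncated_svd(X, r):
    """Decompose an M x N data matrix as X ≈ sum_{i=1}^r L_i U_i V_i',
    keeping only the r largest singular values."""
    U, L, Vt = np.linalg.svd(X, full_matrices=False)  # L sorted descending
    X_r = U[:, :r] @ np.diag(L[:r]) @ Vt[:r, :]       # rank-r reconstruction
    return L[:r], U[:, :r], Vt[:r, :].T, X_r
```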

SVD Application Example

Lee-Carter model for mortality rates

Assumes that mortality rates at all ages (x) are determined by: \ln(m_{x,t}) = \alpha_x+ \beta_x \kappa_t + \epsilon_{x,t}

  • \alpha_x:

    Age specific parameter that indicates the average level of \ln(m_{x,t}) over time t

  • \beta_x:

    Age specific parameter that characterises the sensitivity of \ln(m_{x,t}) to changes in a mortality index \kappa_t

  • \epsilon_{x,t}:

    Error term that captures all remaining variations

Parameters on the r.h.s. are unobservable (so OLS or PCA cannot be used directly)

Applying SVD

  1. Set the \hat{\alpha}_x to the average of \ln m_{x,t} over the sample period

  2. Apply SVD to the matrix of \left\{ \ln m_{x,t} - \hat{\alpha}_x \right\}

  3. The resulting first left and right singular vectors enable estimation of \beta_x and \kappa_t respectively

(CMP has more details on the matrices…)
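
A hedged sketch of this SVD-based Lee-Carter fit, assuming ages in rows and calendar years in columns; the normalization of \beta_x is a common convention assumed here, not taken from the source.

```python
import numpy as np

def lee_carter_svd(log_m):
    """Estimate Lee-Carter parameters from a matrix of log mortality
    rates ln(m_{x,t}), with ages x in rows and years t in columns.

    alpha_x = row averages over time; beta_x and kappa_t come from the
    first left and right singular vectors of the centred matrix."""
    alpha = log_m.mean(axis=1)                      # average over time
    centred = log_m - alpha[:, None]
    U, L, Vt = np.linalg.svd(centred, full_matrices=False)
    beta = U[:, 0]
    kappa = L[0] * Vt[0, :]
    # convention: scale so that beta sums to 1 (product beta*kappa unchanged)
    kappa = kappa * beta.sum()
    beta = beta / beta.sum()
    return alpha, beta, kappa
```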

Model Selection

For a complex model to be worth using, we need to make sure the additional complexity is justified by a significant improvement in the maximum log-likelihood

  • If choosing simply by the highest log-likelihood, then the model with the largest # of parameters will prevail (esp. when simpler models are nested within the largest model)

  • AIC and BIC penalize for extra parameters (while BIC is more punitive \Rightarrow less complex models)

  • Other quantitative tests such as the \chi^2 test can be used

Graphical diagnostic tests

  1. QQ plots

  2. Histograms with superimposed fitted density functions

  3. Empirical CDFs with superimposed fitted CDFs

  4. Autocorrelation functions of time series data (ACFs)

Use the above plots and state whether or not a given plot is consistent with a stated hypothesis or model

  • If consistent, should propose additional quantitative tests

  • If not, should be able to interpret the plot and suggest alternative models

Need to be able to recommend a specific choice of copula by applying both quantitative and qualitative analysis using different models

  • Quantitative: test of goodness of fit, AIC, BIC

  • Qualitative: graphical comparisons of data with candidate copulas

Model Validation

The model should be fitted using a training set and then tested against a testing set of comparable size

Type of test depends on the form of the model

  1. Backtesting:

    Fitting a time series model to data from one time period and then testing how well it predicts observed values from a subsequent period

  2. Cross-sectional models:

    Models based on dependent variables can be fitted using one set of the data (the training set) and tested using an independent data set (not necessarily from a different time period)


  1. R^2 = \dfrac{SSR}{SST} = 1 - \dfrac{SSE}{SST} \in [0,1]

    Where:

    \begin{array}{cccccc} &SST &= &SSR &+ &SSE \\ & \sum \limits_{t=1}^T(Y_t - \bar{Y})^2 &= &\sum \limits_{t=1}^T(\hat{Y}_t - \bar{Y})^2 &+ &\sum \limits_{t=1}^T(\underbrace{Y_t - \hat{Y}_t}_{\hat{\epsilon}_t})^2 \\ \end{array}

    • SST = total sum of squares

    • SSR = sum of squares explained by regression

    • SSE = sum of squares error (unexplained deviations)

    • Average of the observation: \bar{Y} = \dfrac{1}{T}\sum \limits_{t=1}^T Y_t

    • Predicted value: \hat{Y}_t = \sum \limits_{n=1}^N b_n x_{t,n}

  2. R_{\alpha}^2 = 1 - \dfrac{SSE/(T-N)}{SST/(T-1)} = 1 - \dfrac{T-1}{T-N}(1-R^2)

  3. \dfrac{SSR/(N-1)}{SSE/(T-N)} = \dfrac{R^2/(N-1)}{(1-R^2)/(T-N)} \sim F^{N-1}_{T-N}

  4. s^2 = \dfrac{SSE}{T-N}

    Where:

    • Deduction of the number of explanatory variables (N) in the denominator ensures that the estimate is unbiased
    • s is called the standard error of the regression

    Sample covariance matrix for the vector of estimates \mathbf{b} is \mathbf{S_b}= s^2(\mathbf{X}'\mathbf{X})^{-1}

    Square root of the nth diagonal element of \mathbf{S_b} is s_{b_n}, the standard error of the estimate b_n

    Assuming normally distributed error term, we can use the t-test with the following statistics:
    \dfrac{b_n - \beta_n}{s_{b_n}} \sim t_{T-N}
    Typical confidence level is 90%