Recommend a specific choice of model based on the results of both quantitative and qualitative analysis of financial or insurance data
Shortlist a number of possible (candidate) models (based on certain features of the data)
This module focuses on how each candidate model might be fitted to the data set
Select the method of fitting
Evaluate the goodness of fit
For the model as a whole
For each of the explanatory variables
Use both quantitative and qualitative methods
Final model selection
First consider the fitting of a distribution to a given data set, then go on to look at the fitting of more complex models to a data set
Exam note:
Key objective is to be able to recommend a specific choice of model or models for further analysis of specific problems
Recommendation needs to be made on the basis of a variety of quantitative measures as well as qualitative analysis
Parameterization of univariate distributions (method of moments)
Establish the parameters of a distribution empirically by looking at the sample moments
and equating these to the population (or true) moments
e.g. for a distribution with 3 parameters and n data points x_1,...,x_n
Set E(X) = \dfrac{1}{n}\sum x_i; \quad E(X^2) = \dfrac{1}{n}\sum x_i^2; \quad E(X^3) = \dfrac{1}{n}\sum x_i^3
Solve the equations simultaneously to obtain the parameter values needed to specify the distribution
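For illustration only, a minimal Python sketch of the method of moments for a two-parameter gamma distribution (so only the first two moment equations are needed); the data values are hypothetical:

```python
import numpy as np

# Hypothetical sample; in practice x would be the observed data set.
x = np.array([1.2, 0.7, 3.4, 2.1, 0.9, 4.8, 1.5, 2.6])

# Sample moments: set E(X) = mean(x) and E(X^2) = mean(x^2).
m1 = x.mean()
m2 = (x ** 2).mean()

# For a gamma(alpha, theta) distribution: E(X) = alpha*theta and
# E(X^2) = alpha*(alpha + 1)*theta^2.  Solving the two equations simultaneously:
var = m2 - m1 ** 2          # = alpha * theta^2
theta_hat = var / m1        # theta = var / E(X)
alpha_hat = m1 ** 2 / var   # alpha = E(X)^2 / var

print(alpha_hat, theta_hat)
```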
Parameterization of copulas
Generally this is done by using estimates of the rank correlation of the data
Set the true, underlying correlation of the copula equal to these estimated correlations
Specific circumstances will influence the choice of whether to use Kendall’s τ or Spearman’s ρ
For Archimedean copulas
Gumbel, Clayton, and Frank are all characterized by a parameter \alpha via their generators
∴ estimate the value of \alpha based on the observed data
See Module 18 for relationships between \alpha and \tau
Estimate the value of \tau from the data, set it equal to the expression in \alpha, and solve
e.g. Gumbel: \tau = 1 - \dfrac{1}{\alpha} \Rightarrow \alpha = \dfrac{1}{1 - \tau}
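For illustration, a minimal Python sketch of this approach on hypothetical paired observations; the Clayton relationship \tau = \alpha/(\alpha+2) is quoted alongside for comparison:

```python
import numpy as np
from scipy.stats import kendalltau

# Hypothetical paired observations of two risks.
x = np.array([1.1, 2.3, 0.8, 3.0, 1.7, 2.9])
y = np.array([0.9, 2.6, 1.0, 3.4, 1.5, 2.2])

# Estimate Kendall's tau from the data.
tau_hat, _ = kendalltau(x, y)

# Gumbel copula: tau = 1 - 1/alpha  =>  alpha = 1 / (1 - tau).
alpha_gumbel = 1.0 / (1.0 - tau_hat)

# Clayton copula (for comparison): tau = alpha / (alpha + 2)  =>  alpha = 2*tau / (1 - tau).
alpha_clayton = 2.0 * tau_hat / (1.0 - tau_hat)
```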
Advantages of the method of moments
Disadvantages of the method of moments
Parameters not necessarily the most likely ones
Parameter values may be outside their acceptable ranges
Advantages of maximum likelihood estimation (MLE)
Only generates parameter values that are within the acceptable ranges
Any bias in the parameter estimates reduces as the number of observations increases
Distribution of each parameter estimate tends toward the normal distribution
Parameterization of univariate distributions (maximum likelihood)
Likelihood function:
Expresses the joint probability of the actual observations (x_1,...,x_T) occurring, given the choice of candidate distribution
Maximize (w.r.t. each parameter) the log likelihood function
\ln L = \sum \limits_{t=1}^T \ln f(x_t)
f(x) is the marginal PDF of X
Effectively maximizing the joint probability L=\prod_{t=1}^T f(x_t)
Involves solving several simultaneous equations of the form \partial (\ln (L)) / \partial p_i = 0 where p_i (i=1,...,N) is one of the N parameters of the candidate distribution
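For illustration, a minimal Python sketch that maximizes \ln L numerically (rather than solving the simultaneous equations by hand), using a hypothetical sample and a gamma candidate distribution:

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

# Hypothetical loss data to which a gamma distribution is being fitted.
x = np.array([1.2, 0.7, 3.4, 2.1, 0.9, 4.8, 1.5, 2.6])

def neg_log_likelihood(params):
    """Negative of ln L = sum_t ln f(x_t) for a gamma(alpha, theta) density."""
    alpha, theta = params
    if alpha <= 0 or theta <= 0:
        return np.inf   # keep the search inside the acceptable parameter range
    return -np.sum(stats.gamma.logpdf(x, a=alpha, scale=theta))

# Maximize ln L by minimizing -ln L numerically.
result = minimize(neg_log_likelihood, x0=[1.0, 1.0], method="Nelder-Mead")
alpha_mle, theta_mle = result.x
```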
Parameterization of N-dimensional copulas
For MLE we require the density function of the copula c(u_1,...,u_N) = \dfrac{\partial^N C(u_1,...,u_N)}{\partial u_1 ... \partial u_N}
N is the number of variables (= dimensions of the copula)
u_n = F(x_n) for n= 1,...,N
Evaluate the log-likelihood function using the T observations
\begin{align} \ln\left(L(\boldsymbol{\theta})\right) &= \ln \left(\prod_{t=1}^T c_{\theta}(u_{1,t}, u_{2,t},...,u_{N,t})\right) \\ &= \sum \limits_{t=1}^T \ln \left(c_{\theta}(u_{1,t}, u_{2,t},...,u_{N,t})\right) \end{align}
Maximization gets more complex as the number of variables increases
In practice this is done using a suitable computer package and the application of numerical methods
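As one illustration of such a numerical approach, a minimal Python sketch fitting a bivariate Clayton copula, whose density has the closed form c(u,v) = (1+\alpha)(uv)^{-\alpha-1}(u^{-\alpha}+v^{-\alpha}-1)^{-2-1/\alpha}; the pseudo-observations u, v are hypothetical and assumed already derived from the data:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical pseudo-observations u_{1,t}, u_{2,t} in (0,1), i.e. F(x_{n,t}).
u = np.array([0.21, 0.55, 0.83, 0.40, 0.67, 0.95, 0.12, 0.74])
v = np.array([0.30, 0.49, 0.91, 0.35, 0.58, 0.88, 0.20, 0.66])

def neg_log_lik(alpha):
    """-ln L(alpha) for the bivariate Clayton copula density, alpha > 0."""
    if alpha <= 0:
        return np.inf
    log_c = (np.log(1 + alpha)
             - (alpha + 1) * (np.log(u) + np.log(v))
             - (2 + 1 / alpha) * np.log(u ** -alpha + v ** -alpha - 1))
    return -np.sum(log_c)

# One-dimensional numerical maximization of the copula log-likelihood.
res = minimize_scalar(neg_log_lik, bounds=(1e-6, 50), method="bounded")
alpha_hat = res.x
```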
Obtaining the copula density function
Recall: If all distribution functions are continuous then:
c(u_1,...,u_N) = \dfrac{f(x_1,...,x_N)}{f(x_1)f(x_2)...f(x_N)}
Closed form marginal density function
If the marginal density functions are available in closed form
\hookrightarrow Probabilities can be expressed in terms of the unknown parameters (\boldsymbol{\theta}) and the log likelihood function maximized
Alternatively
The values of u_{n,t} = F(X_{n,t}) can be derived empirically from the observations and used to calculate the log likelihood function for the candidate copula, maximizing to determine the optimum parameter values
Gaussian copula
MLE of the sample covariance matrix:
\hat{\boldsymbol{\Sigma}} = \dfrac{1}{T} \sum \limits_{t=1}^T \boldsymbol{\Phi}_t^{-1} \left(\boldsymbol{\Phi}_t^{-1}\right)'
Where: \boldsymbol{\Phi}_t^{-1} = \left[ \Phi^{-1}\left( F \left( x_{1,t} \right) \right), \Phi^{-1}\left( F \left( x_{2,t} \right) \right),..., \Phi^{-1}\left( F \left( x_{N,t} \right) \right) \right]'
e.g. Derivation of the sample covariance matrix when T=1 and N=2, if the marginal distributions are N(0,1)
\begin{align} \hat{\boldsymbol{\Sigma}} &= \boldsymbol{\Phi}_t^{-1} \left(\boldsymbol{\Phi}_t^{-1}\right)' \\ &= \begin{bmatrix} \Phi^{-1}_{X_1} \left( F_{X_1} \left( x_{1} \right) \right) \\ \Phi^{-1}_{X_2} \left( F_{X_2} \left( x_{2} \right) \right) \\ \end{bmatrix} \begin{bmatrix} \Phi^{-1}_{X_1} \left( F_{X_1} \left( x_{1} \right) \right) & \Phi^{-1}_{X_2} \left( F_{X_2} \left( x_{2} \right) \right) \\ \end{bmatrix} \\ &= \begin{bmatrix} \left(\Phi^{-1}_{X_1} \left( F_{X_1} \left( x_{1} \right) \right)\right)^2 & \Phi^{-1}_{X_1} \left( F_{X_1} \left( x_{1} \right) \right) \Phi^{-1}_{X_2} \left( F_{X_2} \left( x_{2} \right) \right) \\ \Phi^{-1}_{X_2} \left( F_{X_2} \left( x_{2} \right) \right) \Phi^{-1}_{X_1} \left( F_{X_1} \left( x_{1} \right) \right) & \left(\Phi^{-1}_{X_2} \left( F_{X_2} \left( x_{2} \right) \right)\right)^2 \\ \end{bmatrix} \\ & =\begin{bmatrix} x^2_1 & x_1x_2 \\ x_2x_1 & x^2_2 \\ \end{bmatrix} \end{align}
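For illustration, a minimal Python sketch of this estimator using pseudo-observations derived empirically from a hypothetical T x N data set (the empirical option noted above):

```python
import numpy as np
from scipy.stats import norm, rankdata

# Hypothetical T x N data matrix of observations (T rows, N variables).
X = np.array([[1.2, 0.9],
              [2.3, 2.6],
              [0.8, 1.0],
              [3.0, 3.4],
              [1.7, 1.5]])
T, N = X.shape

# Pseudo-observations u_{n,t} = F(x_{n,t}) from the empirical CDF
# (rank / (T + 1) keeps the values strictly inside (0,1)).
U = rankdata(X, axis=0) / (T + 1)

# Transform to standard normal scores: Phi^{-1}(F(x_{n,t})).
Z = norm.ppf(U)

# Sigma_hat = (1/T) * sum_t Phi_t^{-1} (Phi_t^{-1})'  (each row of Z is (Phi_t^{-1})').
Sigma_hat = Z.T @ Z / T
```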
Determine the optimum parameter values for each candidate copula
Compare the values of the ML functions (evaluated using the observations and the relevant optimum parameters) to select the optimal overall model
Model the r.v. Y_t (for t=1,2,...,T) with N independent explanatory variables X_{t,n} (n=1,2,...,N)
Y_t = \beta_1 X_{t,1} + \beta_2 X_{t,2} + \dots + \beta_N X_{t,N} + \epsilon_t
\mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}
\epsilon_t quantifies the degree to which the variables X_{t,n} fail to fully explain the dependent variable Y_t
\beta_1 acts as a constant (intercept) term if we set X_{t,1} to be a fixed amount for all t
Fitting OLS
For OLS the \beta_ns are selected to minimize the SSE:
\boldsymbol{\epsilon}'\boldsymbol{\epsilon} = \epsilon^2_1 + \epsilon^2_2 + \dots + \epsilon^2_T
Closed form solution for the minimization problem:
\mathbf{b} = \left(\hat{\beta}_1, \hat{\beta}_2, ... ,\hat{\beta}_N \right)' = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}
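For illustration, a minimal Python (NumPy) sketch of the closed-form OLS solution; the design matrix and response are hypothetical, with a column of ones acting as the constant term:

```python
import numpy as np

# Hypothetical design matrix X (T x N) and response vector Y (length T).
X = np.column_stack([np.ones(6),
                     [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]])
Y = np.array([1.1, 1.9, 3.2, 3.8, 5.1, 5.9])

# Closed-form OLS estimate: b = (X'X)^{-1} X'Y.
b = np.linalg.solve(X.T @ X, X.T @ Y)

# Fitted values and residuals.
Y_hat = X @ b
resid = Y - Y_hat
```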
Assumptions
Linear relationship between variables (variables with non-linear relationships need to be transformed first before fitting)
The matrix \mathbf{X}'\mathbf{X} is invertible
(i.e. \mathbf{X} has full column rank: no column is a linear combination of any of the others)
Error term properties
Not correlated with each other
(i.e. no serial correlation exists that has not been modeled by the explanatory variables)
Constant and finite variance \sigma^2
Normally distributed
(Only needed for the significance tests)
GLS loosens OLS's assumptions on the error terms
Variance of the error terms is not necessarily assumed to be constant
Error terms don’t have to be uncorrelated with each other
Variances and covariances of the error terms = \sigma^2 \boldsymbol{\Omega}
\sigma^2: constant
\boldsymbol{\Omega}: Matrix of weightings
Then:
Uncorrelated error terms with non-constant variance can be modeled by setting:
\boldsymbol{\Omega} = diag(\Omega_{1,1}, \Omega_{2,2},...,\Omega_{T,T})
Serially correlated error terms with constant variance can be modeled by:
Including correlations between observations in the off-diagonal entries of \boldsymbol{\Omega} and setting the diagonal entries all to 1
Both heteroskedasticity and serial correlation can be modeled by setting:
\sigma^2 \Omega = \sigma^2 \begin{bmatrix} \Omega_{1,1} & \rho_{1,2} & \dots & \rho_{1,T} \\ \rho_{2,1} & \Omega_{2,2} & \dots & \rho_{2,T} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{T,1} & \rho_{T,2} & \dots & \Omega_{T,T} \\ \end{bmatrix} = \begin{bmatrix} \sigma^2_1 & \sigma_{1,2} & \dots & \sigma_{1,T} \\ \sigma_{2,1} & \sigma^2_2 & \dots & \sigma_{2,T} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{T,1} & \sigma_{T,2} & \dots & \sigma^2_T \\ \end{bmatrix}
Closed-form solution to the generalized least squares minimization:
\mathbf{b} = (b_1,b_2,...,b_N)' = (\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{Y}
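For illustration, a minimal Python sketch of the GLS estimator with a hypothetical diagonal weighting matrix \boldsymbol{\Omega} (non-constant variances, no serial correlation):

```python
import numpy as np

# Continuing the hypothetical X, Y from the OLS sketch, with T = 6 observations.
X = np.column_stack([np.ones(6), [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]])
Y = np.array([1.1, 1.9, 3.2, 3.8, 5.1, 5.9])

# Hypothetical weighting matrix Omega: uncorrelated errors with non-constant variance
# (diagonal entries only); serial correlations could instead be placed off-diagonal.
Omega = np.diag([1.0, 1.0, 1.5, 1.5, 2.0, 2.0])
Omega_inv = np.linalg.inv(Omega)

# Closed-form GLS estimate: b = (X' Omega^{-1} X)^{-1} X' Omega^{-1} Y.
b_gls = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ Y)
```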
Coefficient of determination (R^2) [1] can be used to determine the overall fit
Higher values = better fit
Use the adjusted coefficient of determination (R_{\alpha}^2) [2], which does not automatically increase just by adding extra parameters
Can use the F-test [3] to test the overall regression results (if the error terms are normally distributed)
Test whether each variable is significant by estimating the variance of the error terms [4]
If we assume normally distributed errors, we can use the t-test
H_0 is that \beta_n = 0
Tests the level of significance with which the estimated coefficient (b_n) differs from zero
A similar approach applies when fitting distributions, but iterative techniques may be required
Nested models:
If the 2nd model contains all the independent variables of the first + some additional variables
Likelihood ratio test:
Test whether the addition of these variables results in significantly improved explanatory power
H_0 is that the additional variables do not significantly improve the explanatory power
Test statistics:
LR = -2 \ln(L_1/L_2) \sim \chi^2_{N_2 - N_1}
L_1 and L_2 are the values of the likelihood functions for the 2 models
N_1 and N_2 are the # of independent variables in each model incl. the constants
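For illustration, a minimal Python sketch of the likelihood ratio test using hypothetical maximized log-likelihoods for two nested models:

```python
from scipy.stats import chi2

# Hypothetical maximized log-likelihoods of two nested models and their parameter counts.
lnL1, N1 = -204.3, 3   # smaller model
lnL2, N2 = -198.7, 5   # larger model (contains all variables of the smaller one)

# LR = -2 ln(L1/L2) = 2 (lnL2 - lnL1), compared against chi-squared with N2 - N1 d.o.f.
LR = 2.0 * (lnL2 - lnL1)
p_value = chi2.sf(LR, df=N2 - N1)

# A small p-value => reject H0 that the extra variables add no explanatory power.
```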
Information criteria (IC): used to compare alternative models
IC only enable the ranking of alternative models
Akaike Information Criterion (AIC)
AIC = 2N - 2 \ln(L)
Bayesian Information Criterion (BIC)
BIC = N \ln(T) - 2 \ln(L)
N: # of independent variables in the model
T: # of observations
The lower the value of the AIC, the better the fit of the model to the data
BIC penalizes the introduction of another independent variable more severely
\hookrightarrow So its use tends to result in less complex models being selected than when using the AIC
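For illustration, a minimal Python sketch computing the AIC and BIC for hypothetical fitted models:

```python
import numpy as np

def aic(lnL, N):
    """AIC = 2N - 2 ln(L); lower values indicate a better fit."""
    return 2 * N - 2 * lnL

def bic(lnL, N, T):
    """BIC = N ln(T) - 2 ln(L); penalizes extra parameters more heavily than AIC."""
    return N * np.log(T) - 2 * lnL

# Hypothetical values: compare two candidate models fitted to T = 120 observations.
print(aic(-198.7, 5), bic(-198.7, 5, 120))
print(aic(-204.3, 3), bic(-204.3, 3, 120))
```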
Principal component analysis (PCA): fit the data to independent factors, weighting their relative importance by the size of their eigenvalues (Module 16)
Advantage of PCA
Disadvantage of PCA
Model parameters do not necessarily have any intuitive interpretation (limited explanatory power)
Requires identification of the covariance matrix for the chosen N independent explanatory variables
Advantage of SVD over PCA
Based on a least squares optimization
Does not require identification of the covariance matrix
Operates on the original data with no requirement to identify independent variables upon which to base a regression
Assumptions for using SVD
SVD determines the best linear relationship between the values of a set of N variables X_{1,t},X_{2,t},...,X_{N,t} at each time t
If we assume the relationship continues (i.e. if we have a new row of data for a subsequent time period), we can use this relationship to predict one or more “missing” future data values for these variables
Process of applying SVD to a set of M observations of N variables
Let \mathbf{X} (M \times N matrix) be a set of M observations of N variables
If \mathbf{X} has column rank R
\hookrightarrow It can be expressed as a linear combination of R orthogonal matrices
Then: each of these orthogonal matrices can be broken down as the product of 2 vectors:
\begin{bmatrix} X_{1,1} & X_{1,2} & \dots & X_{1,N} \\ X_{2,1} & X_{2,2} & \dots & X_{2,N} \\ \vdots & \vdots & \ddots & \vdots \\ X_{M,1} & X_{M,2} & \dots & X_{M,N} \\ \end{bmatrix} = \sum \limits_{i=1}^R L_i \begin{bmatrix} U_{1,i} \\ U_{2,i} \\ \vdots \\ U_{M,i} \\ \end{bmatrix} \begin{bmatrix} V_{1,i} & V_{2,i} \dots V_{N,i} \\ \end{bmatrix}
Singular values L_i:
Square roots of the eigenvalues of \mathbf{X}\mathbf{X}'
L_1 is the largest, with subsequent values decreasing down to L_R
Left singular vectors: \mathbf{U}_i
Right singular vectors: \mathbf{V}_i
The 3 components above are found from the original dataset using an iterative method (rather than from the covariance matrix, as in PCA)
Similar to PCA, a large proportion of the variation in a series of data might be explained by a small number of factors
\hookrightarrow So the iterative process might be terminated before all the singular vectors have been determined
\hookrightarrow We can make good predictions of future data values based on a small subset of the other variables
(See more details in appendix of CMP)
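For illustration, a minimal Python (NumPy) sketch of SVD applied to a hypothetical data matrix, truncated to a small number of singular vectors:

```python
import numpy as np

# Hypothetical M x N data matrix.
X = np.random.default_rng(0).normal(size=(10, 4))

# SVD: X = U diag(L) V', with singular values L_1 >= L_2 >= ... >= L_R.
U, L, Vt = np.linalg.svd(X, full_matrices=False)

# Truncate to the first k singular vectors: often a small k explains most of the variation.
k = 2
X_approx = U[:, :k] @ np.diag(L[:k]) @ Vt[:k, :]

# Proportion of the total sum of squared singular values captured by the first k terms.
explained = (L[:k] ** 2).sum() / (L ** 2).sum()
```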
Lee-Carter model for mortality rates
Assumes that mortality rates at all ages (x) are determined by: \ln(m_{x,t}) = \alpha_x+ \beta_x \kappa_t + \epsilon_{x,t}
\alpha_x:
Age specific parameter that indicates the average level of \ln(m_{x,t}) over time t
\beta_x:
Age specific parameter that characterises the sensitivity of \ln(m_{x,t}) to changes in a mortality index \kappa_t
\epsilon_{x,t}:
Error term that captures all remaining variations
Parameters on the r.h.s. are unobservable (so OLS or PCA cannot be used directly)
Applying SVD
Set the \hat{\alpha}_x to the average of \ln m_{x,t} over the sample period
Apply SVD to the matrix of \left\{ \ln m_{x,t} - \hat{\alpha}_x \right\}
The resulting first left and right singular vectors enable estimation of \beta_x and \kappa_t respectively
(CMP has more details on the matrices…)
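For illustration, a minimal Python sketch of this SVD fitting procedure on a hypothetical (and unrealistically small) matrix of log mortality rates; the normalization in the last step is one common convention, not prescribed above:

```python
import numpy as np

# Hypothetical matrix of central mortality rates m_{x,t}: rows = ages x, columns = years t.
m = np.array([[0.0102, 0.0099, 0.0095],
              [0.0201, 0.0195, 0.0188],
              [0.0405, 0.0398, 0.0382]])
log_m = np.log(m)

# Step 1: alpha_x = average of ln(m_{x,t}) over the sample period.
alpha_hat = log_m.mean(axis=1)

# Step 2: apply SVD to the centred matrix {ln(m_{x,t}) - alpha_x}.
A = log_m - alpha_hat[:, None]
U, L, Vt = np.linalg.svd(A, full_matrices=False)

# Step 3: the first left/right singular vectors give (up to scaling and sign)
# estimates of beta_x and kappa_t.
beta_hat = U[:, 0]
kappa_hat = L[0] * Vt[0, :]

# One common normalization rescales so that sum(beta_x) = 1 (the product is unchanged).
scale = beta_hat.sum()
beta_hat, kappa_hat = beta_hat / scale, kappa_hat * scale
```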
For a complex model to be worth using, we need to make sure the additional complexity is justified by a significant improvement in the maximum log-likelihood
If choosing simply by the highest log-likelihood, then the model with the largest # of parameters will prevail (esp. when simpler models are nested within the largest model)
AIC and BIC penalize for extra parameters (while BIC is more punitive \Rightarrow less complex models)
Other quantitative tests such as the \chi^2 test can be used
Graphical diagnostic tests
QQ plots
Histograms with superimposed fitted density functions
Empirical CDFs with superimposed fitted CDFs
Autocorrelation functions of time series data (ACFs)
Use the above plots and state whether or not a given plot is consistent with a stated hypothesis or model
If consistent, should propose additional quantitative tests
If not, should be able to interpret the plot and suggest alternative models
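For illustration, a minimal Python (matplotlib/SciPy) sketch producing three of these diagnostic plots for a hypothetical sample with a fitted normal candidate; an ACF plot for time series data could be produced similarly (e.g. with statsmodels):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical sample; a normal distribution is fitted as the candidate model.
x = np.random.default_rng(1).normal(loc=2.0, scale=1.5, size=200)
mu, sigma = x.mean(), x.std(ddof=1)

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# QQ plot of the standardized data against the standard normal.
stats.probplot((x - mu) / sigma, dist="norm", plot=axes[0])

# Histogram with the fitted density superimposed.
axes[1].hist(x, bins=20, density=True)
grid = np.linspace(x.min(), x.max(), 200)
axes[1].plot(grid, stats.norm.pdf(grid, mu, sigma))

# Empirical CDF with the fitted CDF superimposed.
axes[2].step(np.sort(x), np.arange(1, len(x) + 1) / len(x), where="post")
axes[2].plot(grid, stats.norm.cdf(grid, mu, sigma))

plt.tight_layout()
plt.show()
```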
Need to be able to recommend a specific choice of copula by applying both quantitative and qualitative analysis using different models
Quantitative: test of goodness of fit, AIC, BIC
Qualitative: graphical comparisons of data with candidate copulas
The model should be fitted using a training set and its fit then assessed using a testing set of comparable size
Type of test depends on the form of the model
Backtesting:
Fitting a time series model to data from one time period and then testing how well it predicts observed values from a subsequent period
Cross-sectional models:
Models based on dependent variables can be fitted using one set of the data (the training set) and tested using an independent data set (not necessarily from a different time period)
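For illustration, a minimal Python sketch of fitting on a training set and testing on an independent set of comparable size, using hypothetical cross-sectional data:

```python
import numpy as np

# Hypothetical cross-sectional data set, split into training and testing sets of equal size.
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
Y = 1.0 + 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

idx = rng.permutation(100)
train, test = idx[:50], idx[50:]

# Fit (here by OLS) on the training set only.
b = np.linalg.solve(X[train].T @ X[train], X[train].T @ Y[train])

# Assess the fit on the independent testing set, e.g. via the out-of-sample SSE.
sse_test = np.sum((Y[test] - X[test] @ b) ** 2)
```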
[1] R^2 = \dfrac{SSR}{SST} = 1 - \dfrac{SSE}{SST} \in [0,1]
Where:
\begin{align} SST &= SSR + SSE \\ \sum \limits_{t=1}^T(Y_t - \bar{Y})^2 &= \sum \limits_{t=1}^T(\hat{Y}_t - \bar{Y})^2 + \sum \limits_{t=1}^T \big(\underbrace{Y_t - \hat{Y}_t}_{\epsilon_t}\big)^2 \end{align}
SST = total sum of squares
SSR = sum of squares explained by regression
SSE = sum of squares error (unexplained deviations)
Average of the observations: \bar{Y} = \dfrac{1}{T}\sum \limits_{t=1}^T Y_t
Predicted value: \hat{Y}_t = \sum \limits_{n=1}^N b_n x_{t,n}
[2] R_{\alpha}^2 = 1 - \dfrac{SSE/(T-N)}{SST/(T-1)} = 1 - \dfrac{T-1}{T-N}(1-R^2)
[3] \dfrac{SSR/(N-1)}{SSE/(T-N)} = \dfrac{R^2/(N-1)}{(1-R^2)/(T-N)} \sim F^{N-1}_{T-N}
[4] s^2 = \dfrac{SSE}{T-N}
Where:
Sample covariance matrix for the vector of estimates \mathbf{b} is \mathbf{S_b}= s^2(\mathbf{X}'\mathbf{X})^{-1}
The square root of the nth diagonal element of \mathbf{S_b} is s_{b_n}, the standard error of the estimate b_n
Assuming normally distributed error terms, we can use the t-test with the following statistic:
\dfrac{b_n - \beta_n}{s_{b_n}} \sim t_{T-N}
Typical confidence level is 90%
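For illustration, a minimal Python sketch pulling the footnote formulas together (R^2, adjusted R^2, the F statistic, s^2, standard errors and t statistics) for the hypothetical OLS example used earlier:

```python
import numpy as np
from scipy import stats

# Hypothetical T x N design matrix (first column of ones) and response vector Y.
X = np.column_stack([np.ones(6), [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]])
Y = np.array([1.1, 1.9, 3.2, 3.8, 5.1, 5.9])
T, N = X.shape

b = np.linalg.solve(X.T @ X, X.T @ Y)
Y_hat = X @ b

# Sums of squares: SST = SSR + SSE.
SST = np.sum((Y - Y.mean()) ** 2)
SSE = np.sum((Y - Y_hat) ** 2)
SSR = SST - SSE

# R^2, adjusted R^2 and the F statistic (F^{N-1}_{T-N} under normal errors).
R2 = 1 - SSE / SST
R2_adj = 1 - (T - 1) / (T - N) * (1 - R2)
F = (SSR / (N - 1)) / (SSE / (T - N))

# Error variance, sample covariance matrix of b, standard errors and t statistics.
s2 = SSE / (T - N)
S_b = s2 * np.linalg.inv(X.T @ X)
se = np.sqrt(np.diag(S_b))
t_stats = b / se                                   # tests H0: beta_n = 0
p_values = 2 * stats.t.sf(np.abs(t_stats), df=T - N)
```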