
Variable Selection and Optimization in Default Prediction

Dedy Dwi Prastyo

Wolfgang Karl Härdle

Ladislaus von Bortkiewicz Chair of Statistics
C.A.S.E. Center for Applied Statistics and Economics
Humboldt-Universität zu Berlin
http://lvb.wiwi.hu-berlin.de
http://www.case.hu-berlin.de


Introduction 1-1

Credit-Worthiness Jury

[Cartoon: jury members deliver their verdicts on an applicant: "No", "No", "No", "No", "Yes", "Maybe", "I like her."]

Figure 1: Who is more precise?


Introduction 1-2

Credit rating

Score (S)

Quantitative indicator for customers w.r.t. their individual default risk

Probability of Default (PD)

One-to-one mapping of the score, S → PD(S)

Rating

Classification of customers (private, corporate, sovereign) into groups of equivalent default risk


Introduction 1-3

Default prediction

Time-series data (market data)

- Merton approach (stock price as an estimate of the market value), S = distance to default

Cross-sectional data (i.e. balance sheet)

- Discriminant analysis
- Categorical regression (logit, probit)
- Support Vector Machines (SVM)


Introduction 1-4

Research questions

Which variables (e.g. accounting ratios) contribute significantly to default?

How can the default prediction (classification) be optimized?


Outline

1. Introduction ✓

2. Variable Selection: Regularized GLM, elastic net

3. Evolutionary optimization

4. Genetic Algorithm SVM


Variable selection 2-1

Some problems

Number of predictors greater than the number of observations, p ≫ n

There are correlated variables

Sparsity (elements of predictor matrix X ≈ 0)

VLDS (very large data set)


Variable selection 2-2

Model with convex penalty

Apply a fast algorithm to estimate the parameters

Model

- Linear regression
- Two-class logistic regression

Penalties

- Lasso ($\ell_1$)
- Ridge regression ($\ell_2$)
- Elastic net (mixture of $\ell_1$ and $\ell_2$)


Variable selection 2-3

Linear regression

Suppose $Y \in \mathbb{R}$ and $X \in \mathbb{R}^p$ with

$$E(Y \mid X = x) = \beta_0 + x^\top \beta.$$

With $\theta = (\beta_0, \beta)$, a penalty $P_\alpha(\beta)$ and multiplier $\lambda$,

$$\hat{\theta} = \operatorname*{argmin}_{(\beta_0, \beta)} \; \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^\top \beta \right)^2 + \lambda P_\alpha(\beta) \tag{1}$$


Variable selection 2-4

Elastic net

The penalty is a compromise between ridge and lasso

$$P_\alpha(\beta) = \frac{1}{2}(1 - \alpha)\,\|\beta\|_{\ell_2}^2 + \alpha\,\|\beta\|_{\ell_1} = \sum_{j=1}^{p} \left\{ \frac{1}{2}(1 - \alpha)\beta_j^2 + \alpha|\beta_j| \right\} \tag{2}$$

This reduces to ridge regression for $\alpha = 0$ and to the lasso for $\alpha = 1$. It is useful when $p \gg n$ or when there are many correlated variables.
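The slides draw on the R package glmnet (Friedman et al., 2010). As a hedged illustration only, the same elastic-net logit can be sketched with scikit-learn, whose LogisticRegression supports this penalty via the saga solver; note the different parameterization (l1_ratio plays the role of $\alpha$, and C is an inverse penalty strength, roughly $1/(n\lambda)$). The toy data below are made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 28))     # e.g. 28 financial ratios, cf. slide 5-7
y = (X[:, 0] - X[:, 1] + rng.normal(size=500) > 0).astype(int)

# Elastic-net logit: l1_ratio = alpha in eq. (2), C ~ 1/(n*lambda)
clf = LogisticRegression(penalty="elasticnet", solver="saga",
                         l1_ratio=0.2, C=1.0, max_iter=5000)
clf.fit(X, y)
print("non-zero coefficients:", int(np.sum(clf.coef_ != 0)))
```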


Variable selection 2-5

Figure 2: Profiles of estimated coefficients for the first 10 values of λ, elastic net (α = 0.2), Leukemia data with p = 3571 and n = 72 (Friedman et al., 2010)


Variable selection 2-6

Binary logit

Suppose $p(x_i) = P(Y = 1 \mid x_i)$, with $Y \in \{0, 1\}$,

$$P(Y = 1 \mid x) = \left\{ 1 + e^{-(\beta_0 + x^\top \beta)} \right\}^{-1}, \tag{3}$$

$$P(Y = 0 \mid x) = \left\{ 1 + e^{(\beta_0 + x^\top \beta)} \right\}^{-1},$$

$$\log \frac{P(Y = 1 \mid x)}{P(Y = 0 \mid x)} = \beta_0 + x^\top \beta$$


Variable selection 2-7

Penalized log likelihood

$$\max_{\beta_0, \beta} \left\{ \ell(\beta_0, \beta) - \lambda P_\alpha(\beta) \right\} \tag{4}$$

where

$$\ell(\beta_0, \beta) = \frac{1}{n} \sum_{i=1}^{n} \left\{ y_i (\beta_0 + x_i^\top \beta) - \log\left(1 + e^{\beta_0 + x_i^\top \beta}\right) \right\} \tag{5}$$

is a concave function of the parameters. $\ell(\beta_0, \beta)$ is maximized by iteratively reweighted least squares (IRLS).
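A minimal numpy sketch of the unpenalized IRLS step ($\lambda = 0$), using the working response and weights defined on the next slide; all names here are mine.

```python
import numpy as np

def irls_logit(X, y, n_iter=25, eps=1e-10):
    """Maximize eq. (5) by iteratively reweighted least squares."""
    n, p = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])   # intercept column for beta_0
    theta = np.zeros(p + 1)                # theta = (beta_0, beta)
    for _ in range(n_iter):
        eta = Xb @ theta
        prob = 1.0 / (1.0 + np.exp(-eta))  # p(x_i)
        w = np.clip(prob * (1.0 - prob), eps, None)
        z = eta + (y - prob) / w           # working response z_i
        WX = Xb * w[:, None]
        # weighted least squares: solve (X'WX) theta = X'W z
        theta = np.linalg.solve(Xb.T @ WX, WX.T @ z)
    return theta
```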


Variable selection 2-8

Newton algorithm

If the current parameters are $(\tilde{\beta}_0, \tilde{\beta})$, the quadratic approximation to $\ell(\beta_0, \beta)$ is

$$\ell_Q(\beta_0, \beta) = -\frac{1}{2n} \sum_{i=1}^{n} w_i \left( z_i - \beta_0 - x_i^\top \beta \right)^2 + C(\tilde{\beta}_0, \tilde{\beta})^2 \tag{6}$$

with $C(\tilde{\beta}_0, \tilde{\beta})$ constant in $(\beta_0, \beta)$, and where the working response and weight are

$$z_i = \tilde{\beta}_0 + x_i^\top \tilde{\beta} + \frac{y_i - \tilde{p}(x_i)}{\tilde{p}(x_i)\{1 - \tilde{p}(x_i)\}}, \qquad w_i = \tilde{p}(x_i)\{1 - \tilde{p}(x_i)\}.$$

The Newton update is obtained by maximizing $\ell_Q(\beta_0, \beta)$.


Variable selection 2-9

Friedman approach

Coordinate descent is used to solve the penalized weighted least squares (PWLS) problem

$$\min_{\beta_0, \beta} \left\{ -\ell_Q(\beta_0, \beta) + \lambda P_\alpha(\beta) \right\}$$

Sequence of nested loops:

Outer loop: Decrement λ

Middle loop: Update $\ell_Q$ using the current parameters $(\tilde{\beta}_0, \tilde{\beta})$

Inner loop: Run coordinate descent algorithm on PWLS
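A minimal sketch of the inner loop under the definitions above, using the coordinate-wise soft-threshold update from Friedman et al. (2010); the function names are mine, and glmnet itself adds warm starts, active sets, and covariance updates.

```python
import numpy as np

def soft_threshold(u, t):
    """S(u, t) = sign(u) * max(|u| - t, 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def cd_pwls(X, z, w, lam, alpha, n_sweeps=100):
    """Coordinate descent for min 1/(2n) sum_i w_i (z_i - b0 - x_i'b)^2
    + lam * P_alpha(b), with the intercept left unpenalized."""
    n, p = X.shape
    beta, b0 = np.zeros(p), np.average(z, weights=w)
    for _ in range(n_sweeps):
        for j in range(p):
            # partial residual excluding coordinate j
            r = z - b0 - X @ beta + X[:, j] * beta[j]
            num = soft_threshold(np.sum(w * X[:, j] * r) / n, lam * alpha)
            den = np.sum(w * X[:, j] ** 2) / n + lam * (1.0 - alpha)
            beta[j] = num / den
        b0 = np.average(z - X @ beta, weights=w)
    return b0, beta
```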


Evolutionary optimization 3-1

Global optimum

Coordinate descent searches for a local minimum

How to choose α (and λ)?

Choosing them jointly makes the problem more complicated


Evolutionary optimization 3-2

Evolutionary optimization

Genetic Algorithm (GA)

GA searches for the globally optimal solution parameters


Evolutionary optimization 3-3

What is a Genetic Algorithm?

A genetic algorithm is a search and optimization technique based on Darwin's principle of natural selection (Holland, 1975).


GA-SVM 4-1

Classifier

Figure 3: Linear classifier functions (1 and 2) and a non-linear one (3)


GA-SVM 4-2

SVM

Classification: Data $D_n = \{(x_1, y_1), \dots, (x_n, y_n)\} : \Omega \to (\mathcal{X} \times \mathcal{Y})^n$

$\mathcal{X} \subseteq \mathbb{R}^d$ and $\mathcal{Y} = \{-1, +1\}$

Goal: predict $Y$ for a new observation $x \in \mathcal{X}$, based on the information in $D_n$


GA-SVM 4-3

Linearly (Non-) Separable Case details


Figure 4: Hyperplane and its margin in linearly (non-) separable case


GA-SVM 4-4

Loss function

$L\{y, f(x)\}$                                   Loss type
$\{1 - yf(x)\}^2$                                quadratic loss (ridge regression)
$\{1 - yf(x)\}_+ = \max\{0, 1 - yf(x)\}$         hinge loss (SVM)
$\log\{1 + \exp(-yf(x))\}$                       log-loss (logistic regression)
$\operatorname{sign}\{-yf(x)\}$                  0-1 loss
$1/\{1 + \exp(yf(x))\}$                          sigmoidal loss

Table 1: Types of loss function, where $f(x) = x^\top w + b$ is the score
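A hedged numpy transcription of Table 1 (the function names are mine; the 0-1 loss is written as an indicator rather than the sign form of the table):

```python
import numpy as np

def quadratic(yf):  return (1.0 - yf) ** 2             # ridge regression
def hinge(yf):      return np.maximum(0.0, 1.0 - yf)   # SVM
def log_loss(yf):   return np.log1p(np.exp(-yf))       # logistic regression
def zero_one(yf):   return (yf < 0).astype(float)      # 0-1 loss, I(yf(x) < 0)
def sigmoid(yf):    return 1.0 / (1.0 + np.exp(yf))    # sigmoidal loss

yf = np.linspace(-1.0, 2.0, 7)   # margin values y * f(x), as in Figure 5
print(hinge(yf))
```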


GA-SVM 4-5

Figure 5: Plot of the hinge, quadratic, log, 0-1 and sigmoid loss functions against $yf(x)$ for $y = +1$, with $f(x) = 0.2 x_1$, $x_1 \in [-5, 10]$. The plot for $y = -1$ is mirrored along the $yf(x)$ axis.


GA-SVM 4-6

SVM dual problem

$$\max_\alpha L_D(\alpha) = \max_\alpha \left\{ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^\top x_j \right\},$$

$$\text{s.t. } 0 \le \alpha_i \le C, \quad \sum_{i=1}^{n} \alpha_i y_i = 0$$


GA-SVM 4-7

Figure 6: Mapping the two-dimensional data space into a three-dimensional feature space, $\mathbb{R}^2 \mapsto \mathbb{R}^3$. The transformation $\Psi(x_1, x_2) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)^\top$ corresponds to the kernel $K(x_i, x_j) = (x_i^\top x_j)^2$


GA-SVM 4-8

Non-linear SVM

$$\max_\alpha L_D(\alpha) = \max_\alpha \left\{ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \right\}$$

$$\text{s.t. } 0 \le \alpha_i \le C, \quad \sum_{i=1}^{n} \alpha_i y_i = 0$$

Gaussian RBF kernel: $K(x_i, x_j) = \exp\left(-\frac{1}{\sigma} \|x_i - x_j\|^2\right)$

Polynomial kernel: $K(x_i, x_j) = \left(x_i^\top x_j + 1\right)^p$
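For illustration, an RBF-kernel SVM with this parameterization can be fitted with scikit-learn's SVC; note that sklearn's gamma corresponds to $1/\sigma$ in the kernel above. The toy data are made up, and the parameters follow Figure 18's baseline (C = 1, σ = 1/2, i.e. gamma = 2).

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)  # non-linear boundary

# K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2), gamma = 1/sigma
svm = SVC(kernel="rbf", C=1.0, gamma=2.0)
svm.fit(X, y)
print("support vectors per class:", svm.n_support_,
      "training accuracy:", svm.score(X, y))
```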


GA-SVM 4-9

Structural Risk Minimization (SRM)

Search for the model structure $S_h$,

$$S_{h_1} \subseteq S_{h_2} \subseteq \dots \subseteq S_{h^*} \subseteq \dots \subseteq S_{h_k} = \mathcal{F}$$

such that $f \in S_{h^*}$ minimises the expected risk bound, where $\mathcal{F}$ is the class of linear functions and $h$ is the VC dimension, i.e.

$$\mathrm{SVM}(h_1) \subseteq \dots \subseteq \mathrm{SVM}(h^*) \subseteq \dots \subseteq \mathrm{SVM}(h_k) = \mathcal{F}$$

with $h$ corresponding to the value of the SVM (kernel) parameters


GA-SVM 4-10

GA - SVM

[Diagram: a population of parameter chromosomes is generated and evaluated; each chromosome is used to learn an SVM model on the data, its fitness is computed, and selection, crossover and mutation form the mating pool for the next population.]

Figure 7: Iteration (generation) in GA-SVM
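A compact sketch of the loop in Figure 7. For brevity it uses real-coded chromosomes and truncation selection instead of the binary encoding and roulette-wheel selection detailed in the appendix, and takes 5-fold cross-validated accuracy as the fitness $f_{dp}$; all names, bounds, and operator choices here are mine. The slide 5-7 settings (population 20, 100 generations, mutation rate 0.1) appear as defaults.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)

def fitness(theta, X, y):
    C, gamma = theta
    svm = SVC(kernel="rbf", C=C, gamma=gamma)
    return cross_val_score(svm, X, y, cv=5).mean()   # f_dp: CV accuracy

def ga_svm(X, y, popsize=20, generations=100, n_elite=4, pm=0.1):
    lo = np.array([1e-2, 1e-4])                      # lower bounds (C, gamma)
    hi = np.array([1e3, 1e1])                        # upper bounds (C, gamma)
    pop = rng.uniform(lo, hi, size=(popsize, 2))     # initial population
    for _ in range(generations):
        fit = np.array([fitness(t, X, y) for t in pop])
        order = np.argsort(fit)[::-1]
        parents = pop[order[:popsize // 2]]          # truncation selection
        children = []
        while len(children) < popsize - n_elite:
            i, j = rng.integers(len(parents), size=2)
            w = rng.uniform(size=2)                  # blend crossover
            child = w * parents[i] + (1.0 - w) * parents[j]
            if rng.uniform() < pm:                   # mutation
                child *= rng.lognormal(sigma=0.3, size=2)
            children.append(np.clip(child, lo, hi))
        pop = np.vstack([pop[order[:n_elite]], children])   # elitism
    fit = np.array([fitness(t, X, y) for t in pop])
    return pop[np.argmax(fit)]                       # best (C, gamma) found
```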


Application 5-1

Validation of scores

Discriminatory power (of the score)

- Cumulative Accuracy Profile (CAP) curve
- Receiver Operating Characteristic (ROC) curve
- Accuracy, Specificity, Sensitivity


Application 5-2

[Both panels have axes from 0 to 1 and show the perfect model, the random model, and a model with zero predictive power, together with the model being evaluated (CAP, left) and the ROC curve (right).]

Figure 8: CAP curve (left) and ROC curve (right)


Application 5-3

Discriminatory power

Cumulative Accuracy Profile (CAP) curve

- CAP/Power/Lorenz curve → Accuracy Ratio (AR)
- Total sample vs. default sample

Receiver Operating Characteristic (ROC) curve

- ROC curve → Area Under the Curve (AUC)
- Non-default sample vs. default sample

Relationship: AR = 2 AUC − 1
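A quick numerical check of the AR-AUC relationship with scikit-learn; the scores below are simulated, not from the Creditreform data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
y = rng.integers(0, 2, size=1000)               # 1 = default, 0 = non-default
score = y + rng.normal(scale=1.5, size=1000)    # noisy rating score

auc = roc_auc_score(y, score)
ar = 2.0 * auc - 1.0                            # Accuracy Ratio
print(f"AUC = {auc:.3f}, AR = {ar:.3f}")
```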


Application 5-4

Discriminatory power (cont'd)

                 sample
predicted        default (1)           non-default (-1)
(1)              True Positive (TP)    False Positive (FP)
(-1)             False Negative (FN)   True Negative (TN)
total            P                     N

- Accuracy: $P(\hat{Y} = Y) = \dfrac{TP + TN}{P + N}$
- Specificity: $P(\hat{Y} = -1 \mid Y = -1) = \dfrac{TN}{N}$
- Sensitivity: $P(\hat{Y} = 1 \mid Y = 1) = \dfrac{TP}{P}$
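The three measures in a small helper (a sketch; the function name is mine):

```python
import numpy as np

def diagnostics(y_true, y_pred):
    """Accuracy, specificity and sensitivity from labels in {-1, +1}."""
    tp = np.sum((y_pred == 1) & (y_true == 1))    # true positives
    tn = np.sum((y_pred == -1) & (y_true == -1))  # true negatives
    p = np.sum(y_true == 1)                       # defaults
    n = np.sum(y_true == -1)                      # non-defaults
    return {"accuracy": (tp + tn) / (p + n),
            "specificity": tn / n,
            "sensitivity": tp / p}
```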


Application 5-5

Creditreform data Example

type                   solvent (%)     insolvent (%)   total (%)
Manufacturing          27.37 (26.06)   25.70 (1.22)    27.29
Construction           13.88 (13.22)   39.70 (1.89)    15.11
Wholesale and retail   24.78 (23.60)   20.10 (0.96)    24.56
Real estate            17.28 (16.46)    9.40 (0.45)    16.90
total                  83.31 (79.34)   94.90 (4.52)    83.86
others                 16.69 (15.90)    5.10 (0.24)    16.14
#                      20,000          1,000           21,000

Table 2: Creditreform data


Application 5-6

Pre-processing

year    solvent # (%)   insolvent # (%)   total # (%)
1997     872 ( 9.08)     86 (0.90)         958 ( 9.98)
1998     928 ( 9.66)     92 (0.96)        1020 (10.62)
1999    1005 (10.47)    112 (1.17)        1117 (11.63)
2000    1379 (14.36)    102 (1.06)        1481 (15.42)
2001    1989 (20.71)    111 (1.16)        2100 (21.87)
2002    2791 (29.07)    135 (1.41)        2926 (30.47)
total   8964 (93.36)    638 (6.64)        9602 (100)

Table 3: Pre-processed Creditreform data


Application 5-7

Full model, $X_1, \dots, X_{28}$

Predictors: 28 financial ratio variables detail

Population (# solutions) 20

Evolutionary iteration (generation) 100

Elitism 0.2 of population

Crossover rate 0.5, mutation rate 0.1

Optimal SVM parameters σ = 1/178.75 and C = 63.44


Application 5-8

Scenario Finding

scenario     training set   testing set
Scenario-1   1997           1998
Scenario-2   1997-1998      1999
Scenario-3   1997-1999      2000
Scenario-4   1997-2000      2001
Scenario-5   1997-2001      2002

Table 4: Training and testing data sets


Application 5-9

Quality of classification

training    TE (CV)    testing   TE (CV)
1997        0 (8.98)   1998      0 ( 9.02)
1997-1998   0 (8.99)   1999      0 (10.03)
1997-1999   0 (9.37)   2000      0 ( 6.89)
1997-2000   0 (8.57)   2001      5.43 ( 5.86)
1997-2001   0 (4.55)   2002      4.68 ( 4.61)

Table 5: Percentage of training error (TE) and cross-validation error (5-fold CV)


Current findings 6-1

Current findings

SVM with optimal parameters is more robust to the imbalanced data set

Evolutionary optimization (using a Genetic Algorithm) can find the global solution of the SVM parameter optimization

Variable selection requires further investigation



References 7-1

References

Chen, S., Härdle, W. and Moro, R. Estimation of Default Probabilities with Support Vector Machines. Quantitative Finance, 2011, 11, 135-154.

Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. Least angle regression. The Annals of Statistics, 2004, 32(2), 407-449.

Friedman, J., Hastie, T., Höfling, H. and Tibshirani, R. Pathwise coordinate optimization. The Annals of Applied Statistics, 2007, 1(2), 302-332.


References 7-2

References back

Friedman, J., Hastie, T. and Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 2010, 33(1).

Härdle, W., Lee, Y-J., Schäfer, D. and Yeh, Y-R. Variable Selection and Oversampling in the Use of Smooth Support Vector Machines for Predicting the Default Risk of Companies. Journal of Forecasting, 2009, 28, 512-534.

Holland, J. H. Adaptation in Natural and Artificial Systems. University of Michigan Press, 1975.


References 7-3

References

Karatzoglou, A. and Meyer, D. Support Vector Machines in R. Journal of Statistical Software, 2006, 15(9), 1-28.

Rosset, S. and Zhu, J. Piecewise linear regularized solution paths. The Annals of Statistics, 2007, 35(3), 1012-1030.

Tibshirani, R. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society B, 1996, 58, 267-288.


References 7-4

References

Tseng, P. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 2001, 109, 475-494.

Van der Kooij, A. Prediction accuracy and stability of regression with optimal scaling transformations. Ph.D. thesis, Department of Data Theory, University of Leiden, 2001.


Appendix Linearly Separable Case 8-1

Linearly Separable Case back


Figure 9: Separating hyperplane and its margin in linearly separable case


Appendix Linearly Separable Case 8-2

Choose $f \in \mathcal{F}$ such that the margin $(d_- + d_+)$ is maximal

Separation without error, if all $i = 1, 2, \dots, n$ satisfy

$$x_i^\top w + b \ge +1 \quad \text{for } y_i = +1$$

$$x_i^\top w + b \le -1 \quad \text{for } y_i = -1$$

Both constraints are combined into

$$y_i (x_i^\top w + b) - 1 \ge 0, \quad i = 1, 2, \dots, n$$


Appendix Linearly Separable Case 8-3

The distance between the margins and the separating hyperplane is $d_+ = d_- = 1/\|w\|$

Maximizing the margin, $d_+ + d_- = 2/\|w\|$, can be attained by minimizing $\|w\|$ or $\|w\|^2$

Lagrangian for the primal problem

$$L_P(w, b) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{n} \alpha_i \left\{ y_i (x_i^\top w + b) - 1 \right\}$$


Appendix Linearly Separable Case 8-4

Karush-Kuhn-Tucker (KKT) first-order optimality conditions

$$\frac{\partial L_P}{\partial w_k} = 0: \quad w_k - \sum_{i=1}^{n} \alpha_i y_i x_{ik} = 0, \quad k = 1, \dots, d$$

$$\frac{\partial L_P}{\partial b} = 0: \quad \sum_{i=1}^{n} \alpha_i y_i = 0$$

$$y_i (x_i^\top w + b) - 1 \ge 0, \quad i = 1, \dots, n$$

$$\alpha_i \ge 0$$

$$\alpha_i \left\{ y_i (x_i^\top w + b) - 1 \right\} = 0$$


Appendix Linearly Separable Case 8-5

The solution is $w = \sum_{i=1}^{n} \alpha_i y_i x_i$, therefore

$$\frac{1}{2} \|w\|^2 = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^\top x_j$$

$$-\sum_{i=1}^{n} \alpha_i \left\{ y_i (x_i^\top w + b) - 1 \right\} = -\sum_{i=1}^{n} \alpha_i y_i x_i^\top \sum_{j=1}^{n} \alpha_j y_j x_j + \sum_{i=1}^{n} \alpha_i = -\sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^\top x_j + \sum_{i=1}^{n} \alpha_i$$

Lagrangian for the dual problem

$$L_D(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^\top x_j$$


Appendix Linearly Separable Case 8-6

Primal and dual problems

$$\min_{w, b} L_P(w, b)$$

$$\max_\alpha L_D(\alpha) \quad \text{s.t. } \alpha_i \ge 0, \quad \sum_{i=1}^{n} \alpha_i y_i = 0$$

The optimization problem is convex, therefore the dual and primal formulations give the same solution

Support vector: a point $x_i$ for which $y_i (x_i^\top w + b) = 1$ holds


Appendix Linearly Non-separable Case 9-1

Linearly Non-separable Case back


Figure 10: Hyperplane and its margin in linearly non-separable case


Appendix Linearly Non-separable Case 9-2

Slack variables $\xi_i$ represent the violation of strict separation:

$$x_i^\top w + b \ge +1 - \xi_i \quad \text{for } y_i = +1,$$

$$x_i^\top w + b \le -1 + \xi_i \quad \text{for } y_i = -1,$$

$$\xi_i \ge 0$$

The constraints are combined into

$$y_i (x_i^\top w + b) \ge 1 - \xi_i \quad \text{and} \quad \xi_i \ge 0$$

Allowing for slack ($\xi_i > 0$), the objective function becomes

$$\frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i$$


Appendix Linearly Non-separable Case 9-3

Lagrange function for the primal problem

$$L_P(w, b, \xi) = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \alpha_i \left\{ y_i (x_i^\top w + b) - 1 + \xi_i \right\} - \sum_{i=1}^{n} \mu_i \xi_i,$$

where $\alpha_i \ge 0$ and $\mu_i \ge 0$ are Lagrange multipliers

Primal problem

$$\min_{w, b, \xi} L_P(w, b, \xi)$$


Appendix Linearly Non-separable Case 9-4

First order conditions

$$\frac{\partial L_P}{\partial w_k} = 0: \quad w_k - \sum_{i=1}^{n} \alpha_i y_i x_{ik} = 0$$

$$\frac{\partial L_P}{\partial b} = 0: \quad \sum_{i=1}^{n} \alpha_i y_i = 0$$

$$\frac{\partial L_P}{\partial \xi_i} = 0: \quad C - \alpha_i - \mu_i = 0$$

$$\text{s.t. } \alpha_i \ge 0, \quad \mu_i \ge 0, \quad \mu_i \xi_i = 0$$

$$\alpha_i \left\{ y_i (x_i^\top w + b) - 1 + \xi_i \right\} = 0$$


Appendix Linearly Non-separable Case 9-5

Note that $\sum_{i=1}^{n} \alpha_i y_i b = 0$. Translating the primal problem into

$$L_D(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^\top x_j + \sum_{i=1}^{n} \xi_i (C - \alpha_i - \mu_i)$$

The last term is 0, therefore the dual problem is

$$\max_\alpha L_D(\alpha) = \max_\alpha \left\{ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^\top x_j \right\},$$

$$\text{s.t. } 0 \le \alpha_i \le C, \quad \sum_{i=1}^{n} \alpha_i y_i = 0$$

back


Appendix Genetic Algorithm 10-1

GA Initialization Back

[Diagram: a population of binary-encoded chromosomes is decoded into real-valued solutions and evaluated on the objective function f_obj, which exhibits a global maximum and a global minimum.]

Figure 11: GA at first generation


Appendix Genetic Algorithm 10-2

GA Convergence

Figure 12: Solutions at the 1st generation (left) and the r-th generation (right)


Appendix Genetic Algorithm 10-3

GA Decoding

Figure 13: Decoding

$$\theta = \theta_{\text{lower}} + (\theta_{\text{upper}} - \theta_{\text{lower}}) \frac{\sum_{i=0}^{l-1} a_i 2^i}{2^l}$$

where $\theta$ is the solution (i.e. a parameter) and the $a_i$ are the alleles of a chromosome of length $l$
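The decoding formula in a few lines of Python (names mine; the bound values are only examples):

```python
def decode(bits, lower, upper):
    """theta = lower + (upper - lower) * sum_i a_i 2^i / 2^l
    for a binary chromosome of length l with alleles a_i."""
    l = len(bits)
    integer = sum(a * 2 ** i for i, a in enumerate(bits))
    return lower + (upper - lower) * integer / 2 ** l

# e.g. decode an 8-bit chromosome into a parameter in [0.01, 1000]
print(decode([1, 0, 1, 1, 0, 0, 1, 0], 0.01, 1000.0))
```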


Appendix Genetic Algorithm 10-4

GA Fitness evaluation

Calculate $f(\theta_i)$, $i = 1, \dots, \text{popsize}$

Evaluate fitness $f_{dp}(\theta_i)$

$f_{dp}(\theta_i)$: AR, AUC, accuracy, specificity, sensitivity

Relative fitness: $p_i = f_{dp}(\theta_i) \big/ \sum_{k=1}^{\text{popsize}} f_{dp}(\theta_k)$


Figure 14: Proportion chosen for the next iteration (generation)


Appendix Genetic Algorithm 10-5

GA Roulette wheel


$rand \sim U(0, 1)$

Select the $k$-th chromosome if $\sum_{i=1}^{k-1} p_i < rand \le \sum_{i=1}^{k} p_i$

Repeat popsize times to get popsize new chromosomes
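A sketch of roulette-wheel selection under these definitions (assuming positive fitness values; names mine):

```python
import numpy as np

rng = np.random.default_rng(4)

def roulette_wheel(pop, fitness):
    """Draw popsize chromosomes with probability proportional to fitness."""
    p = fitness / fitness.sum()          # relative fitness p_i
    cum = np.cumsum(p)                   # cumulative sums of the p_i
    idx = [int(np.searchsorted(cum, rng.uniform())) for _ in range(len(pop))]
    return pop[idx]                      # pop is a 2-D numpy array
```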


Appendix Genetic Algorithm 10-6

GA Crossover

Figure 15: Crossover in nature

Figure 16: Randomly chosen one-point crossover (top) and two-point crossover (bottom)


Appendix Genetic Algorithm 10-7

GA Reproductive operator

[Pasted textbook excerpt, shown twice in the original: with crossover rate pc, two selected parents exchange the genes after a random point in [1, l−1] (single-point crossover); each offspring gene is then flipped with mutation rate pm, 1 to 0 or 0 to 1 (bit-flip mutation).]

Figure 17: One-point crossover (top) and bit-flip mutation (bottom)
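Both operators in a short numpy sketch (chromosomes are 0/1 arrays; the rates pc = 0.5 and pm = 0.1 match slide 5-7):

```python
import numpy as np

rng = np.random.default_rng(5)

def single_point_crossover(a, b, pc=0.5):
    """With probability pc, exchange the genes after a random cut point."""
    a, b = a.copy(), b.copy()
    if rng.uniform() < pc:
        point = rng.integers(1, len(a))              # cut point in [1, l-1]
        a[point:], b[point:] = b[point:].copy(), a[point:].copy()
    return a, b

def bit_flip_mutation(chrom, pm=0.1):
    """Flip each gene independently with probability pm (1->0, 0->1)."""
    mask = rng.uniform(size=len(chrom)) < pm
    return np.where(mask, 1 - chrom, chrom)
```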


Appendix Genetic Algorithm 10-8

GA Elitism

The best solution of each iteration is kept in a separate memory location

After the new population replaces the old one, check whether the best solution is still in the population

If not, replace one member of the population with the best solution


Appendix Genetic Algorithm 10-9

Nature to Computer Mapping Back

Nature                    GA-SVM
Population                Set of parameters
Individual (phenotype)    Parameters
Fitness                   Discriminatory power
Chromosome (genotype)     Encoding of parameters
Gene                      Binary encoding
Reproduction              Crossover
Generation                Iteration

Table 6: Nature to GA-SVM mapping


Appendix Genetic Algorithm 10-10

Examples

Small sample: 100 solvent and insolvent companies

Creditreform data

X3 Operating Income / Total Asset

X24 Account Payable / Total Asset


Appendix Genetic Algorithm 10-11

Figure 18: SVM classification plots of x3 against x24: SVM with C = 1 and σ = 1/2, misclassification rate 0.19 (left), and GA-SVM with C = 14.86 and σ = 1/121.61, misclassification rate 0 (right).


Appendix Genetic Algorithm 10-12

Figure 19: GA-SVM (C = 187.93 and σ = 1/195.16) classification plots of X3 against X24 for the training data, misclassification rate 2.38%, and the testing data, misclassification rate 1.37%. Back


Appendix Financial Ratio Variables 11-1

FR: Profitability

Ratio No.   Definition     Ratio
x1          NI/TA          Return on assets (ROA)
x2          NI/Sales       Net profit margin
x3          OI/TA          Operating income/Total assets
x4          OI/Sales       Operating profit margin
x5          EBIT/TA        EBIT/Total assets
x6          (EBIT+AD)/TA   EBITDA
x7          EBIT/Sales     EBIT/Sales

Table 7: Definitions of financial ratios.


Appendix Financial Ratio Variables 11-2

FR: Leverage

Ratio No.   Definition                        Ratio
x8          Equity/TA                         Own funds ratio (simple)
x9          (Equity-ITGA)/(TA-ITGA-Cash-LB)   Own funds ratio (adjusted)
x10         CL/TA                             Current liabilities/Total assets
x11         (CL-Cash)/TA                      Net indebtedness
x12         TL/TA                             Total liabilities/Total assets
x13         Debt/TA                           Debt ratio
x14         EBIT/Interest expenses            Interest coverage ratio

Table 8: Definitions of financial ratios.


Appendix Financial Ratio Variables 11-3

FR: Liquidity

Ratio No.   Definition   Ratio
x15         Cash/TA      Cash/Total assets
x16         Cash/CL      Cash ratio
x17         QA/CL        Quick ratio
x18         CA/CL        Current ratio
x19         WC/TA        Working capital
x20         CL/TL        Current liabilities/Total liabilities

Table 9: Definitions of financial ratios.


Appendix Financial Ratio Variables 11-4

FR: Activity

Ratio No.   Definition   Ratio
x21         TA/Sales     Asset turnover
x22         INV/Sales    Inventory turnover
x23         AR/Sales     Account receivable turnover
x24         AP/Sales     Account payable turnover
x25         Log(TA)      Log(Total assets)

Table 10: Definitions of financial ratios.


Appendix Financial Ratio Variables 11-5

FR back

Ratio No.   Definition
x26         increase (decrease) in inventories / inventories
x27         increase (decrease) in liabilities / total liabilities
x28         increase (decrease) in cash flows / cash and cash equivalents

Table 11: Definitions of financial ratios.


Appendix Financial Ratio Variables 11-6

Findings

Härdle et al. (2009): Smooth SVM, overall mean of correct predictions ranging from 70% to 78% (misclassification: 22% to 30%)

Chen, Härdle and Moro (2011):

- Most of the models tested: AR between 43.50% and 60.51%
- SVM (grid-search optimization): percentage of correctly classified out-of-sample 71.85%
- Logit model: percentage of correctly classified out-of-sample 67.24%


Appendix Financial Ratio Variables 11-7

Findings Back

Zhang and Härdle (2010):

Performance measure      Logit (%)   CART (%)   BACT (%)
Overall misclass. rate   30.2        33.8       26.6
Type I misclass. rate    28.3        27.2       27.6
Type II misclass. rate   30.3        34.3       26.5
AR                       52.1        58.7       60.4

Table 12: Average value (over bootstrap samples) of performance measures
