20110809 SVM in PD Prediction


    Genetic Algorithm for Support Vector Machines Optimization
    in Probability of Default Prediction

    Wolfgang Härdle
    Dedy Dwi Prastyo

    Ladislaus von Bortkiewicz Chair of Statistics
    C.A.S.E. Center for Applied Statistics and Economics
    Humboldt-Universität zu Berlin
    http://lvb.wiwi.hu-berlin.de
    http://www.case.hu-berlin.de

    Introduction 1-1

    Classifier

    Figure 1: Linear classifier functions (1 and 2) and a non-linear one (3)


    Introduction 1-2

    Loss

    A (non-)linear classifier function f is taken from a function class F fixed a priori, e.g. the class of linear classifiers (hyperplanes).

    \[ L(x, y) = \frac{1}{2}\,|f(x) - y| =
    \begin{cases}
    0, & \text{if classification is correct,} \\
    1, & \text{if classification is wrong.}
    \end{cases} \]


    Introduction 1-3

    Expected and Empirical Risk

    Expected risk: the expected value of the loss under the true probability measure

    \[ R(f) = \frac{1}{2} \int |f(x) - y| \, dF(x, y) \]

    Empirical risk: the average value of the loss over the training set

    \[ \hat{R}(f) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{2} |f(x_i) - y_i| \]


    Introduction 1-4

    VC bound

    Vapnik-Chervonenkis (VC) bound: there is a function \Phi (monotone increasing in the VC dimension h) such that for all f ∈ F, with probability 1 − η,

    \[ R(f) \le \hat{R}(f) + \Phi\!\left(\frac{h}{n}, \frac{\log(\eta)}{n}\right) \]


    Outline

    1. Introduction

    2. Support Vector Machine (SVM)

    3. Feature Selection

    4. Application

    5. Conclusions


    Support Vector Machines 3-1

    SVM

    Classification

    Data D_n = {(x_1, y_1), ..., (x_n, y_n)} ∈ (X × Y)^n

    X ⊆ R^d and Y ∈ {−1, +1}

    Goal: predict Y for a new observation x ∈ X, based on the information in D_n


    Support Vector Machines 3-2

    Linearly (Non-)Separable Case


    Figure 2: Hyperplane and its margin in linearly (non-) separable case


    Support Vector Machines 3-3

    SVM Dual Problem

    \[ \max_{\alpha} L_D(\alpha) = \max_{\alpha} \left\{ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^\top x_j \right\}, \]
    \[ \text{s.t. } 0 \le \alpha_i \le C, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0 \]


    Support Vector Machines 3-4

    Data Space Feature Space

    Figure 3: Mapping the two-dimensional data space into a three-dimensional feature space, R^2 → R^3. The transformation Φ(x_1, x_2) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2) corresponds to K(x_i, x_j) = (x_i^\top x_j)^2
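    A quick numerical check of this identity; a minimal sketch in which the toy vectors and the helper name phi are illustrative, not part of the original slides:

    ```python
    import numpy as np

    def phi(x):
        """Explicit feature map (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2) from Figure 3."""
        x1, x2 = x
        return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

    xi = np.array([0.3, -1.2])
    xj = np.array([1.5, 0.7])

    lhs = phi(xi) @ phi(xj)   # inner product in the three-dimensional feature space
    rhs = (xi @ xj) ** 2      # polynomial kernel K(xi, xj) = (xi' xj)^2
    print(lhs, rhs)           # the two values agree up to rounding
    ```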


    Support Vector Machines 3-5

    Non-Linear SVM

    \[ \max_{\alpha} L_D(\alpha) = \max_{\alpha} \left\{ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \right\} \]
    \[ \text{s.t. } 0 \le \alpha_i \le C, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0 \]

    Gaussian RBF kernel: K(x_i, x_j) = \exp\left(-\frac{1}{\sigma^2} \lVert x_i - x_j \rVert^2\right)

    Polynomial kernel: K(x_i, x_j) = \left(x_i^\top x_j + 1\right)^p
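    A minimal sketch of fitting such a kernelized SVM, using scikit-learn's SVC as a stand-in for the R implementation cited later (Karatzoglou and Meyer, 2006); the toy data and parameter values are illustrative assumptions:

    ```python
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))                  # toy features
    y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)     # labels that are not linearly separable

    clf = SVC(kernel="rbf", C=1.0, gamma=0.5)      # gamma plays the role of 1/sigma^2 above
    clf.fit(X, y)                                  # solves the kernelized dual internally

    print("support vectors per class:", clf.n_support_)   # points with alpha_i > 0
    print("training accuracy:", clf.score(X, y))
    ```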


    Feature Selection 4-1

    Structural Risk Minimization (SRM)

    Search for the model structure S_h,

    \[ S_{h_1} \subset S_{h_2} \subset \ldots \subset S_h \subset \ldots \subset S_{h_k} = F \]

    such that f ∈ S_h minimises the expected risk bound, where F is the class of (linear) functions and h is the VC dimension, i.e.

    \[ \mathrm{SVM}(h_1) \subset \ldots \subset \mathrm{SVM}(h) \subset \ldots \subset \mathrm{SVM}(h_k) = F \]

    with h corresponding to the value of the SVM (kernel) parameter.


    Feature Selection 4-2

    Evolutionary Feature Selection GA

    Feature selection and SVM parameter optimization via evolutionary optimization: the Genetic Algorithm (GA)

    GA finds global optimum solution


    Feature Selection 4-3

    GA - SVM

    [Diagram: a population of candidate parameters feeds a mating pool through selection, crossover and mutation; each generated candidate is evaluated (fitness) by learning an SVM model on the data.]

    Figure 4: Iteration (generation) in GA-SVM
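    A compact sketch of one such iteration loop in Python; an assumption throughout: scikit-learn's SVC and cross-validated accuracy stand in for the original implementation, and the population size, rates and bounds are illustrative, not the settings reported later:

    ```python
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    rng = np.random.default_rng(1)
    X, y = make_classification(n_samples=300, n_features=6, random_state=1)  # toy data

    N_BITS, POP, GENS = 16, 10, 5     # bits per parameter, population size, generations
    LOWER, UPPER = 1e-3, 100.0        # assumed search range for C and the kernel parameter

    def decode(bits):
        """Map a binary chromosome segment to a real value in [LOWER, UPPER]."""
        value = int("".join(map(str, bits)), 2)
        return LOWER + (UPPER - LOWER) * value / 2 ** N_BITS   # decoding rule from the appendix

    def fitness(chrom):
        """Fitness = cross-validated accuracy of the SVM encoded by the chromosome."""
        C, gamma = decode(chrom[:N_BITS]), decode(chrom[N_BITS:])
        return cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=3).mean()

    pop = rng.integers(0, 2, size=(POP, 2 * N_BITS))            # initial binary population
    for gen in range(GENS):
        fit = np.array([fitness(ind) for ind in pop])           # evaluation by SVM learning
        parents = pop[rng.choice(POP, size=POP, p=fit / fit.sum())]   # roulette-wheel selection
        children = parents.copy()
        for k in range(0, POP - 1, 2):                          # one-point crossover
            cut = rng.integers(1, 2 * N_BITS)
            children[k, cut:], children[k + 1, cut:] = parents[k + 1, cut:], parents[k, cut:]
        flip = rng.random(children.shape) < 0.1                 # bit-flip mutation
        children[flip] = 1 - children[flip]
        children[0] = pop[np.argmax(fit)]                       # elitism: keep the best so far
        pop = children
        print(f"generation {gen}: best CV accuracy = {fit.max():.3f}")
    ```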


    Application 5-1

    Credit Scoring & Probability of Default

    Score (Sc) from the SVM method

    \[ \mathrm{Sc}(x) = \sum_{i=1}^{n} \alpha_i y_i K(x_i, x) \]

    Probability of Default (PD)

    \[ f(y = 1 \mid \mathrm{Sc}) = \frac{1}{1 + \exp(\beta_0 + \beta_1 \mathrm{Sc})} \]

    \beta_0 and \beta_1 are estimated by minimizing the negative log-likelihood function (Karatzoglou and Meyer, 2006)
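    A hedged sketch of this two-step construction, with scikit-learn standing in for the R tools cited above; the simulated data and the use of in-sample scores are illustrative assumptions:

    ```python
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC

    # toy "solvent vs. insolvent" data, heavily imbalanced like credit data
    X, y = make_classification(n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=0)

    svm = SVC(kernel="rbf", C=10.0, gamma=0.1).fit(X, y)
    score = svm.decision_function(X).reshape(-1, 1)   # SVM score Sc(x)

    # logistic link f(y = 1 | Sc), fitted by maximum likelihood on the scores
    link = LogisticRegression().fit(score, y)
    pd_hat = link.predict_proba(score)[:, 1]          # estimated probability of default
    print(np.round(pd_hat[:5], 3))
    ```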


    Application 5-2

    Validation of Scores

    Discriminatory power (of the score)

    Cumulative Accuracy Profile (CAP) curve

    Receiver Operating Characteristic (ROC) curve

    Accuracy, Specificity, Sensitivity


    Application 5-3

    [CAP curve: sample rank based on its score against the cumulative default rate, showing the perfect model, the random model, a model with zero predictive power, and the model being evaluated (actual). ROC curve: false alarm rate against hit rate, with the same reference models.]

    Figure 5: CAP curve (left) and ROC curve (right)


    Application 5-4

    Discriminatory power

    Cumulative Accuracy Profile (CAP) curve

    CAP / Power / Lorenz curve; Accuracy Ratio (AR); total sample vs. default sample

    Receiver Operating Characteristic (ROC) curve

    ROC curve; Area Under Curve (AUC); non-default sample vs. default sample

    Relationship: AR = 2 AUC - 1
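    A quick numerical illustration of this relationship on made-up scores (sklearn's roc_auc_score is an assumed convenience, not part of the original slides):

    ```python
    import numpy as np
    from sklearn.metrics import roc_auc_score

    y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])                       # 1 = default
    scores = np.array([0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.2, 0.1, 0.1, 0.05])  # higher = riskier

    auc = roc_auc_score(y_true, scores)   # area under the ROC curve
    ar = 2 * auc - 1                      # Accuracy Ratio via AR = 2 AUC - 1
    print(f"AUC = {auc:.3f}, AR = {ar:.3f}")
    ```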


    Application 5-5

    Discriminatory power (cont'd)

        sample            default (1)            non-default (-1)
        predicted (1)     True Positive (TP)     False Positive (FP)
        predicted (-1)    False Negative (FN)    True Negative (TN)
        total             P                      N

    Accuracy: P(\hat{Y} = Y) = (TP + TN) / (P + N)

    Specificity: P(\hat{Y} = -1 | Y = -1) = TN / N

    Sensitivity: P(\hat{Y} = 1 | Y = 1) = TP / P
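    A toy check of the three measures; the labels and predictions are invented, with 1 = default and -1 = non-default as in the table:

    ```python
    import numpy as np

    y_true = np.array([1, 1, 1, -1, -1, -1, -1, -1])
    y_pred = np.array([1, 1, -1, -1, -1, 1, -1, -1])

    tp = np.sum((y_pred == 1) & (y_true == 1))      # true positives
    tn = np.sum((y_pred == -1) & (y_true == -1))    # true negatives
    p, n = np.sum(y_true == 1), np.sum(y_true == -1)

    print("accuracy:   ", (tp + tn) / (p + n))
    print("specificity:", tn / n)
    print("sensitivity:", tp / p)
    ```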


    Application 5-6

    Example: Small Sample

    100 solvent and insolvent companies

    X3: Operating Income / Total Assets

    X24: Accounts Payable / Total Assets


    Application 5-7

    [Two SVM classification plots of X3 against X24 for the 100 companies: left panel with default parameters, right panel with GA-optimized parameters.]

    Figure 6: SVM plot, C = 1 and kernel parameter 1/2, training error 0.19 (left), and GA-SVM, C = 14.86 and kernel parameter 1/121.61, training error 0 (right).


    Application 5-8

    Credit reform data

        type                   solvent (%)      insolvent (%)    total (%)
        Manufacturing          27.37 (26.06)    25.70 (1.22)     27.29
        Construction           13.88 (13.22)    39.70 (1.89)     15.11
        Wholesale and retail   24.78 (23.60)    20.10 (0.96)     24.56
        Real estate            17.28 (16.46)     9.40 (0.45)     16.90
        total                  83.31 (79.34)    94.90 (4.52)     83.86
        others                 16.69 (15.90)     5.10 (0.24)     16.14
        #                      20,000           1,000            21,000

    Table 1: Credit reform data


    Application 5-9

    Pre-processing

        year     solvent # (%)    insolvent # (%)    total # (%)
        1997      872 ( 9.08)      86 (0.90)          958 ( 9.98)
        1998      928 ( 9.66)      92 (0.96)         1020 (10.62)
        1999     1005 (10.47)     112 (1.17)         1117 (11.63)
        2000     1379 (14.36)     102 (1.06)         1481 (15.42)
        2001     1989 (20.71)     111 (1.16)         2100 (21.87)
        2002     2791 (29.07)     135 (1.41)         2926 (30.47)
        total    8964 (93.36)     638 (6.64)         9602 (100)

    Table 2: Pre-processed credit reform data


    Application 5-10

    Scenario

        scenario      training set    testing set
        Scenario-1    1997            1998
        Scenario-2    1997-1998       1999
        Scenario-3    1997-1999       2000
        Scenario-4    1997-2000       2001
        Scenario-5    1997-2001       2002

    Table 3: Training and testing data set


    Application 5-11

    Full model: X1, ..., X28

    Predictors: 28 financial ratio variables

    Population (# solutions): 20

    Evolutionary iterations (generations): 100

    Elitism: 0.2 of the population

    Crossover rate: 0.5, mutation rate: 0.1

    Optimal SVM parameters: kernel parameter 1/178.75 and C = 63.44
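    These settings restated as a plain configuration dict, as could drive the GA-SVM sketch shown earlier; the key names are illustrative, not from the original implementation:

    ```python
    ga_settings = {
        "n_predictors": 28,          # financial ratio variables X1..X28
        "population_size": 20,       # candidate solutions per generation
        "n_generations": 100,        # evolutionary iterations
        "elitism_fraction": 0.2,     # share of the population carried over unchanged
        "crossover_rate": 0.5,
        "mutation_rate": 0.1,
    }
    print(ga_settings)
    ```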


    Application 5-12

    Quality of classification (1/2)

        Disc. power    training    testing
        AR             1           1
        AUC            1           1
        Accuracy       1           1
        Specificity    1           1
        Sensitivity    1           1

    Table 4: Discriminatory power of Scenario-1, 2, 3, 4, 5


    Application 5-13

    Quality of classification (2/2)

        training     TE (CV)     testing    TE (CV)
        1997         0 (8.98)    1998       0 ( 9.02)
        1997-1998    0 (8.99)    1999       0 (10.03)
        1997-1999    0 (9.37)    2000       0 ( 6.89)
        1997-2000    0 (8.57)    2001       0 ( 5.29)
        1997-2001    0 (4.55)    2002       0 ( 4.61)

    Table 5: Percentage of Training Error (TE) and Cross-Validation (CV, with group = 5)


    Conclusion 6-1

    Conclusion

    Optimal feature selection (via the Genetic Algorithm) leads to perfect classification

    Cross-validation overcomes the overfitting in training and testing error


    Genetic Algorithm for Support Vector Machines Optimization in Probability of Default Prediction

    Wolfgang Härdle
    Dedy Dwi Prastyo

    Ladislaus von Bortkiewicz Chair of Statistics
    C.A.S.E. Center for Applied Statistics and Economics
    Humboldt-Universität zu Berlin
    http://lvb.wiwi.hu-berlin.de
    http://www.case.hu-berlin.de


    References 7-1

    References

    Chen, S., Härdle, W. and Moro, R.
    Estimation of Default Probabilities with Support Vector Machines
    Quantitative Finance, 2011, 11, 135-154

    Holland, J. H.
    Adaptation in Natural and Artificial Systems
    University of Michigan Press, 1975


    References 7-2

    References

    Karatzoglou, A. and Meyer, D.
    Support Vector Machines in R
    Journal of Statistical Software, 2006, 15:9, 1-28

    Zhang, J. L. and Härdle, W.
    The Bayesian Additive Classification Tree Applied to Credit Risk Modelling
    Computational Statistics and Data Analysis, 2010, 54, 1197-1205


    Appendix Linearly Separable Case 9-1

    Linearly Separable Case

    Figure 7: Separating hyperplane and its margin in linearly separable case


    Appendix Linearly Separable Case 9-2

    Choose f ∈ F such that the margin (d_- + d_+) is maximal. No-error separation if all i = 1, 2, ..., n satisfy

    \[ x_i^\top w + b \ge +1 \quad \text{for } y_i = +1 \]
    \[ x_i^\top w + b \le -1 \quad \text{for } y_i = -1 \]

    Both constraints are combined into

    \[ y_i(x_i^\top w + b) - 1 \ge 0, \qquad i = 1, 2, \ldots, n \]


    Appendix Linearly Separable Case 9-3

    Distance between the margins and the separating hyperplane is d_+ = d_- = 1/\lVert w \rVert

    Maximizing the margin, d_+ + d_- = 2/\lVert w \rVert, can be attained by minimizing \lVert w \rVert or \lVert w \rVert^2

    Lagrangian for the primal problem

    \[ L_P(w, b) = \frac{1}{2}\lVert w \rVert^2 - \sum_{i=1}^{n} \alpha_i \{ y_i(x_i^\top w + b) - 1 \} \]


    Appendix Linearly Separable Case 9-4

    Karush-Kuhn-Tucker (KKT) first order optimality conditions

    \[ \frac{\partial L_P}{\partial w_k} = 0: \quad w_k - \sum_{i=1}^{n} \alpha_i y_i x_{ik} = 0, \qquad k = 1, \ldots, d \]
    \[ \frac{\partial L_P}{\partial b} = 0: \quad \sum_{i=1}^{n} \alpha_i y_i = 0 \]
    \[ y_i(x_i^\top w + b) - 1 \ge 0, \qquad i = 1, \ldots, n \]
    \[ \alpha_i \ge 0 \]
    \[ \alpha_i \{ y_i(x_i^\top w + b) - 1 \} = 0 \]


    Appendix Linearly Separable Case 9-5

    Solution w = \sum_{i=1}^{n} \alpha_i y_i x_i, therefore

    \[ \frac{1}{2}\lVert w \rVert^2 = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^\top x_j \]

    \[ -\sum_{i=1}^{n} \alpha_i \{ y_i(x_i^\top w + b) - 1 \} = -\sum_{i=1}^{n} \alpha_i y_i x_i^\top \sum_{j=1}^{n} \alpha_j y_j x_j + \sum_{i=1}^{n} \alpha_i = -\sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^\top x_j + \sum_{i=1}^{n} \alpha_i \]

    (the b-term vanishes because \sum_{i=1}^{n} \alpha_i y_i = 0)

    Lagrangian for the dual problem

    \[ L_D(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^\top x_j \]

    Appendix Linearly Separable Case 9-6

    Primal and dual problems

    \[ \min_{w, b} L_P(w, b) \]

    \[ \max_{\alpha} L_D(\alpha) \quad \text{s.t. } \alpha_i \ge 0, \quad \sum_{i=1}^{n} \alpha_i y_i = 0 \]

    The optimization problem is convex, therefore the dual and primal formulations give the same solution

    Support vector: a point x_i for which y_i(x_i^\top w + b) = 1 holds

    Appendix Linearly Non-separable Case 10-1

    Linearly Non-separable Case

    Figure 8: Hyperplane and its margin in linearly non-separable case

    Appendix Linearly Non-separable Case 10-2

    Slack variables \xi_i represent the violation of strict separation:

    \[ x_i^\top w + b \ge +1 - \xi_i \quad \text{for } y_i = +1, \]
    \[ x_i^\top w + b \le -1 + \xi_i \quad \text{for } y_i = -1, \]
    \[ \xi_i \ge 0 \]

    The constraints are combined into

    \[ y_i(x_i^\top w + b) \ge 1 - \xi_i \quad \text{and} \quad \xi_i \ge 0 \]

    Penalizing \xi_i > 0, the objective function is

    \[ \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i \]

    Appendix Linearly Non-separable Case 10-3

    Lagrange function for the primal problem

    \[ L_P(w, b, \xi) = \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \alpha_i \{ y_i(x_i^\top w + b) - 1 + \xi_i \} - \sum_{i=1}^{n} \mu_i \xi_i, \]

    where \alpha_i \ge 0 and \mu_i \ge 0 are Lagrange multipliers

    Primal problem

    \[ \min_{w, b, \xi} L_P(w, b, \xi) \]

    Appendix Linearly Non-separable Case 10-4

    First order conditions

    \[ \frac{\partial L_P}{\partial w_k} = 0: \quad w_k - \sum_{i=1}^{n} \alpha_i y_i x_{ik} = 0 \]
    \[ \frac{\partial L_P}{\partial b} = 0: \quad \sum_{i=1}^{n} \alpha_i y_i = 0 \]
    \[ \frac{\partial L_P}{\partial \xi_i} = 0: \quad C - \alpha_i - \mu_i = 0 \]
    \[ \text{s.t. } \alpha_i \ge 0, \quad \mu_i \ge 0, \quad \mu_i \xi_i = 0 \]
    \[ \alpha_i \{ y_i(x_i^\top w + b) - 1 + \xi_i \} = 0 \]

    Appendix Linearly Non-separable Case 10-5

    Note that \sum_{i=1}^{n} \alpha_i y_i b = 0. Translating the primal problem gives

    \[ L_D(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^\top x_j + \sum_{i=1}^{n} \xi_i (C - \alpha_i - \mu_i) \]

    The last term is 0, therefore the dual problem is

    \[ \max_{\alpha} L_D(\alpha) = \max_{\alpha} \left\{ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^\top x_j \right\}, \]
    \[ \text{s.t. } 0 \le \alpha_i \le C, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0 \]

    Appendix Genetic Algorithm 11-1

    What is a Genetic Algorithm?

    A genetic algorithm is a search and optimization technique based on Darwin's principle of natural selection (Holland, 1975)

    Appendix Genetic Algorithm 11-2

    GA Initialization

    [Diagram: a population of binary-encoded chromosomes is decoded into real-valued solutions, which are placed on the objective function f_obj with its global maximum and global minimum.]

    Figure 9: GA at first generation

    Appendix Genetic Algorithm 11-3

    GA Convergence

    Figure 10: Solutions at 1st generation (left) and rth generation (right)

    Appendix Genetic Algorithm 11-4

    GA Decoding

    Figure 11: Decoding

    \[ \theta = \mathrm{lower} + (\mathrm{upper} - \mathrm{lower}) \, \frac{\sum_{i=0}^{l-1} a_i 2^i}{2^l} \]

    where \theta is the decoded solution (i.e. the parameter C or the kernel parameter) and a_i is an allele
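    A direct implementation of this decoding rule; the bounds and the example chromosome are illustrative:

    ```python
    def decode(alleles, lower, upper):
        """Map a binary chromosome (list of 0/1 alleles a_0..a_{l-1}) to a real value."""
        l = len(alleles)
        integer = sum(a * 2**i for i, a in enumerate(alleles))   # sum_{i=0}^{l-1} a_i 2^i
        return lower + (upper - lower) * integer / 2**l

    # e.g. decode an 8-bit chromosome into a candidate value for C
    print(decode([1, 0, 1, 1, 0, 0, 1, 0], lower=0.1, upper=100.0))
    ```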

    Appendix Genetic Algorithm 11-5

    GA Fitness evaluation

    Calculate f(\theta_i), i = 1, ..., popsize

    Evaluate the fitness f_dp(\theta_i), where f_dp ∈ {AR, AUC, accuracy, specificity, sensitivity}

    Relative fitness:

    \[ p_i = \frac{f_{dp}(\theta_i)}{\sum_{k=1}^{\mathrm{popsize}} f_{dp}(\theta_k)} \]

    Figure 12: Proportion to be chosen in the next iteration (generation)

    Appendix Genetic Algorithm 11-6

    GA Roulette wheel

    rand ~ U(0, 1)

    Select the (k+1)-th chromosome if \sum_{i=1}^{k} p_i < rand < \sum_{i=1}^{k+1} p_i

    Repeat popsize times to get popsize new chromosomes
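    A small roulette-wheel selection sketch following this rule; the toy fitness values are invented:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    fitness = np.array([0.62, 0.71, 0.55, 0.80, 0.66])   # e.g. f_dp of five chromosomes
    p = fitness / fitness.sum()                          # relative fitness p_i
    cum = np.cumsum(p)                                   # cumulative sums of p_i

    def roulette_pick():
        """Return the 0-based index k with cum[k-1] < rand <= cum[k]."""
        rand = rng.uniform(0.0, 1.0)
        return int(np.searchsorted(cum, rand))

    new_population = [roulette_pick() for _ in range(len(fitness))]   # repeat popsize times
    print(new_population)
    ```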

    Appendix Genetic Algorithm 11-7

    GA Crossover

    Figure 13: Crossover in nature

    Figure 14: Randomly chosen one-point crossover (top) and two-point crossover (bottom)

    Appendix Genetic Algorithm 11-8

    GA Reproductive operator

    [Diagram: parents produce offspring by one-point crossover; an offspring of crossover becomes a mutant by bit-flip mutation.]

    Figure 15: One-point crossover (top) and bit-flip mutation (bottom)
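    A sketch of these two operators on binary chromosomes; the parent strings are invented:

    ```python
    import numpy as np

    rng = np.random.default_rng(2)

    def one_point_crossover(parent_a, parent_b):
        """Swap the tails of two parent chromosomes at a random cross point."""
        cut = rng.integers(1, len(parent_a))             # cross point
        child_a = np.concatenate([parent_a[:cut], parent_b[cut:]])
        child_b = np.concatenate([parent_b[:cut], parent_a[cut:]])
        return child_a, child_b

    def bit_flip_mutation(chromosome, rate=0.1):
        """Flip each allele independently with probability `rate`."""
        mask = rng.random(len(chromosome)) < rate
        return np.where(mask, 1 - chromosome, chromosome)

    a = np.array([1, 1, 1, 1, 0, 1, 0])
    b = np.array([0, 1, 1, 0, 1, 1, 0])
    c1, c2 = one_point_crossover(a, b)
    print(c1, c2, bit_flip_mutation(c1))
    ```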

    Appendix Genetic Algorithm 11-9

    GA Elitism

    The best solution of each iteration is maintained in a separate memory place

    When the new population replaces the old one, check whether the best solution is in the new population

    If not, replace any one member of the population with the best solution
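    A minimal elitism step matching these bullets; illustrative only, with the population held as a list of binary chromosomes:

    ```python
    import numpy as np

    def apply_elitism(new_population, best_solution):
        """Ensure the stored best chromosome survives into the new population."""
        if not any(np.array_equal(ind, best_solution) for ind in new_population):
            new_population[0] = best_solution      # replace any one member with the best
        return new_population

    population = [np.array([0, 1, 1, 0]), np.array([1, 0, 0, 1])]
    best = np.array([1, 1, 1, 0])
    print(apply_elitism(population, best))
    ```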

    Appendix Genetic Algorithm 11-10

    Nature to Computer Mapping

        Nature                    GA-SVM
        Population                Set of parameters
        Individual (phenotype)    Parameters
        Fitness                   Discriminatory power
        Chromosome (genotype)     Encoding of parameter
        Gene                      Binary encoding
        Reproduction              Crossover
        Generation                Iteration

    Table 6: Nature to GA-SVM mapping
