20110809 SVM in PD Prediction


    Genetic Algorithm for Support Vector Machines Optimization
    in Probability of Default Prediction

    Wolfgang Härdle
    Dedy Dwi Prastyo

    Ladislaus von Bortkiewicz Chair of Statistics
    C.A.S.E. Center for Applied Statistics and Economics
    Humboldt-Universität zu Berlin
    http://lvb.wiwi.hu-berlin.de
    http://www.case.hu-berlin.de

    Introduction 1-1

    Classifier

    Figure 1: Linear classifier functions (1 and 2) and a non-linear one (3)


    Introduction 1-2

    Loss

    A (non-)linear classifier function f is taken from a function class F fixed a priori, e.g. the class of linear classifiers (hyperplanes).

    \[ L(x, y) = \frac{1}{2}\,|f(x) - y| =
    \begin{cases}
    0, & \text{if classification is correct,} \\
    1, & \text{if classification is wrong.}
    \end{cases} \]


    Introduction 1-3

    Expected and Empirical Risk

    Expected risk: the expected value of the loss under the true probability measure

    \[ R(f) = \frac{1}{2} \int |f(x) - y| \, dF(x, y) \]

    Empirical risk: the average value of the loss over the training set

    \[ \hat{R}(f) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{2} |f(x_i) - y_i| \]


    Introduction 1-4

    VC bound

    Vapnik-Chervonenkis (VC) bound: there is a function \Phi (monotone increasing in the VC dimension h) such that for all f ∈ F, with probability 1 − η,

    \[ R(f) \le \hat{R}(f) + \Phi\!\left(\frac{h}{n}, \frac{\log(\eta)}{n}\right) \]


    Outline

    1. Introduction

    2. Support Vector Machine (SVM)

    3. Feature Selection

    4. Application

    5. Conclusions


    Support Vector Machines 3-1

    SVM

    Classification

    Data D_n = {(x_1, y_1), ..., (x_n, y_n)} ∈ (X × Y)^n

    X ⊆ R^d and Y ∈ {−1, +1}

    Goal: predict Y for a new observation x ∈ X, based on the information in D_n


    Support Vector Machines 3-2

    Linearly (Non-)Separable Case


    Figure 2: Hyperplane and its margin in linearly (non-) separable case


    Support Vector Machines 3-3

    SVM Dual Problem

    \[ \max_{\alpha} L_D(\alpha) = \max_{\alpha} \left\{ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^\top x_j \right\}, \]
    \[ \text{s.t. } 0 \le \alpha_i \le C, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0 \]


    Support Vector Machines 3-4

    Data Space Feature Space

    Figure 3: Mapping the two-dimensional data space into a three-dimensional feature space, R^2 → R^3. The transformation Φ(x_1, x_2) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2) corresponds to K(x_i, x_j) = (x_i^\top x_j)^2
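    A quick numerical check of this identity; a minimal sketch in which the toy vectors and the helper name phi are illustrative, not part of the original slides:

    ```python
    import numpy as np

    def phi(x):
        """Explicit feature map (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2) from Figure 3."""
        x1, x2 = x
        return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

    xi = np.array([0.3, -1.2])
    xj = np.array([1.5, 0.7])

    lhs = phi(xi) @ phi(xj)   # inner product in the three-dimensional feature space
    rhs = (xi @ xj) ** 2      # polynomial kernel K(xi, xj) = (xi' xj)^2
    print(lhs, rhs)           # the two values agree up to rounding
    ```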


    Support Vector Machines 3-5

    Non-Linear SVM

    \[ \max_{\alpha} L_D(\alpha) = \max_{\alpha} \left\{ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \right\} \]
    \[ \text{s.t. } 0 \le \alpha_i \le C, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0 \]

    Gaussian RBF kernel: K(x_i, x_j) = \exp\left(-\frac{1}{\sigma^2} \lVert x_i - x_j \rVert^2\right)

    Polynomial kernel: K(x_i, x_j) = \left(x_i^\top x_j + 1\right)^p
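    A minimal sketch of fitting such a kernelized SVM, using scikit-learn's SVC as a stand-in for the R implementation cited later (Karatzoglou and Meyer, 2006); the toy data and parameter values are illustrative assumptions:

    ```python
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))                  # toy features
    y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)     # labels that are not linearly separable

    clf = SVC(kernel="rbf", C=1.0, gamma=0.5)      # gamma plays the role of 1/sigma^2 above
    clf.fit(X, y)                                  # solves the kernelized dual internally

    print("support vectors per class:", clf.n_support_)   # points with alpha_i > 0
    print("training accuracy:", clf.score(X, y))
    ```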


    Feature Selection 4-1

    Structural Risk Minimization (SRM)

    Search for the model structure S_h,

    \[ S_{h_1} \subset S_{h_2} \subset \ldots \subset S_h \subset \ldots \subset S_{h_k} = F \]

    such that f ∈ S_h minimises the expected risk bound, where F is the class of (linear) functions and h is the VC dimension, i.e.

    \[ \mathrm{SVM}(h_1) \subset \ldots \subset \mathrm{SVM}(h) \subset \ldots \subset \mathrm{SVM}(h_k) = F \]

    with h corresponding to the value of the SVM (kernel) parameter.


    Feature Selection 4-2

    Evolutionary Feature Selection GA

    Feature selection and SVM parameter optimization via evolutionary optimization: the Genetic Algorithm (GA)

    GA finds global optimum solution


    Feature Selection 4-3

    GA - SVM

    [Diagram: a population of candidate parameters feeds a mating pool through selection, crossover and mutation; each generated candidate is evaluated (fitness) by learning an SVM model on the data.]

    Figure 4: Iteration (generation) in GA-SVM
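    A compact sketch of one such iteration loop in Python; an assumption throughout: scikit-learn's SVC and cross-validated accuracy stand in for the original implementation, and the population size, rates and bounds are illustrative, not the settings reported later:

    ```python
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    rng = np.random.default_rng(1)
    X, y = make_classification(n_samples=300, n_features=6, random_state=1)  # toy data

    N_BITS, POP, GENS = 16, 10, 5     # bits per parameter, population size, generations
    LOWER, UPPER = 1e-3, 100.0        # assumed search range for C and the kernel parameter

    def decode(bits):
        """Map a binary chromosome segment to a real value in [LOWER, UPPER]."""
        value = int("".join(map(str, bits)), 2)
        return LOWER + (UPPER - LOWER) * value / 2 ** N_BITS   # decoding rule from the appendix

    def fitness(chrom):
        """Fitness = cross-validated accuracy of the SVM encoded by the chromosome."""
        C, gamma = decode(chrom[:N_BITS]), decode(chrom[N_BITS:])
        return cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=3).mean()

    pop = rng.integers(0, 2, size=(POP, 2 * N_BITS))            # initial binary population
    for gen in range(GENS):
        fit = np.array([fitness(ind) for ind in pop])           # evaluation by SVM learning
        parents = pop[rng.choice(POP, size=POP, p=fit / fit.sum())]   # roulette-wheel selection
        children = parents.copy()
        for k in range(0, POP - 1, 2):                          # one-point crossover
            cut = rng.integers(1, 2 * N_BITS)
            children[k, cut:], children[k + 1, cut:] = parents[k + 1, cut:], parents[k, cut:]
        flip = rng.random(children.shape) < 0.1                 # bit-flip mutation
        children[flip] = 1 - children[flip]
        children[0] = pop[np.argmax(fit)]                       # elitism: keep the best so far
        pop = children
        print(f"generation {gen}: best CV accuracy = {fit.max():.3f}")
    ```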


    Application 5-1

    Credit Scoring & Probability of Default

    Score (Sc) from the SVM method

    \[ \mathrm{Sc}(x) = \sum_{i=1}^{n} \alpha_i y_i K(x_i, x) \]

    Probability of Default (PD)

    \[ f(y = 1 \mid \mathrm{Sc}) = \frac{1}{1 + \exp(\beta_0 + \beta_1 \mathrm{Sc})} \]

    \beta_0 and \beta_1 are estimated by minimizing the negative log-likelihood function (Karatzoglou and Meyer, 2006)
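    A hedged sketch of this two-step construction, with scikit-learn standing in for the R tools cited above; the simulated data and the use of in-sample scores are illustrative assumptions:

    ```python
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC

    # toy "solvent vs. insolvent" data, heavily imbalanced like credit data
    X, y = make_classification(n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=0)

    svm = SVC(kernel="rbf", C=10.0, gamma=0.1).fit(X, y)
    score = svm.decision_function(X).reshape(-1, 1)   # SVM score Sc(x)

    # logistic link f(y = 1 | Sc), fitted by maximum likelihood on the scores
    link = LogisticRegression().fit(score, y)
    pd_hat = link.predict_proba(score)[:, 1]          # estimated probability of default
    print(np.round(pd_hat[:5], 3))
    ```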


    Application 5-2

    Validation of Scores

    Discriminatory power (of the score)

    Cumulative Accuracy Profile (CAP) curve

    Receiver Operating Characteristic (ROC) curve

    Accuracy, Specificity, Sensitivity


    Application 5-3

    [CAP curve: sample rank based on its score against the cumulative default rate, showing the perfect model, the random model, a model with zero predictive power, and the model being evaluated (actual). ROC curve: false alarm rate against hit rate, with the same reference models.]

    Figure 5: CAP curve (left) and ROC curve (right)


    Application 5-4

    Discriminatory power

    Cumulative Accuracy Profile (CAP) curve

    CAP / Power / Lorenz curve; Accuracy Ratio (AR); total sample vs. default sample

    Receiver Operating Characteristic (ROC) curve

    ROC curve; Area Under Curve (AUC); non-default sample vs. default sample

    Relationship: AR = 2 AUC - 1
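    A quick numerical illustration of this relationship on made-up scores (sklearn's roc_auc_score is an assumed convenience, not part of the original slides):

    ```python
    import numpy as np
    from sklearn.metrics import roc_auc_score

    y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])                       # 1 = default
    scores = np.array([0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.2, 0.1, 0.1, 0.05])  # higher = riskier

    auc = roc_auc_score(y_true, scores)   # area under the ROC curve
    ar = 2 * auc - 1                      # Accuracy Ratio via AR = 2 AUC - 1
    print(f"AUC = {auc:.3f}, AR = {ar:.3f}")
    ```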


    Application 5-5

    Discriminatory power (cont'd)

        sample            default (1)            non-default (-1)
        predicted (1)     True Positive (TP)     False Positive (FP)
        predicted (-1)    False Negative (FN)    True Negative (TN)
        total             P                      N

    Accuracy: P(\hat{Y} = Y) = (TP + TN) / (P + N)

    Specificity: P(\hat{Y} = -1 | Y = -1) = TN / N

    Sensitivity: P(\hat{Y} = 1 | Y = 1) = TP / P
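    A toy check of the three measures; the labels and predictions are invented, with 1 = default and -1 = non-default as in the table:

    ```python
    import numpy as np

    y_true = np.array([1, 1, 1, -1, -1, -1, -1, -1])
    y_pred = np.array([1, 1, -1, -1, -1, 1, -1, -1])

    tp = np.sum((y_pred == 1) & (y_true == 1))      # true positives
    tn = np.sum((y_pred == -1) & (y_true == -1))    # true negatives
    p, n = np.sum(y_true == 1), np.sum(y_true == -1)

    print("accuracy:   ", (tp + tn) / (p + n))
    print("specificity:", tn / n)
    print("sensitivity:", tp / p)
    ```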


    Application 5-6

    Example: Small Sample

    100 solvent and insolvent companies

    X3: Operating Income / Total Assets

    X24: Accounts Payable / Total Assets


    Application 5-7

    [Two SVM classification plots of X3 against X24 for the 100 companies: left panel with default parameters, right panel with GA-optimized parameters.]

    Figure 6: SVM plot, C = 1 and kernel parameter 1/2, training error 0.19 (left), and GA-SVM, C = 14.86 and kernel parameter 1/121.61, training error 0 (right).


    Application 5-8

    Credit reform data

        type                   solvent (%)      insolvent (%)    total (%)
        Manufacturing          27.37 (26.06)    25.70 (1.22)     27.29
        Construction           13.88 (13.22)    39.70 (1.89)     15.11
        Wholesale and retail   24.78 (23.60)    20.10 (0.96)     24.56
        Real estate            17.28 (16.46)     9.40 (0.45)     16.90
        total                  83.31 (79.34)    94.90 (4.52)     83.86
        others                 16.69 (15.90)     5.10 (0.24)     16.14
        #                      20,000           1,000            21,000

    Table 1: Credit reform data


    Application 5-9

    Pre-processing

        year     solvent # (%)    insolvent # (%)    total # (%)
        1997      872 ( 9.08)      86 (0.90)          958 ( 9.98)
        1998      928 ( 9.66)      92 (0.96)         1020 (10.62)
        1999     1005 (10.47)     112 (1.17)         1117 (11.63)
        2000     1379 (14.36)     102 (1.06)         1481 (15.42)
        2001     1989 (20.71)     111 (1.16)         2100 (21.87)
        2002     2791 (29.07)     135 (1.41)         2926 (30.47)
        total    8964 (93.36)     638 (6.64)         9602 (100)

    Table 2: Pre-processed credit reform data


    Application 5-10

    Scenario

        scenario      training set    testing set
        Scenario-1    1997            1998
        Scenario-2    1997-1998       1999
        Scenario-3    1997-1999       2000
        Scenario-4    1997-2000       2001
        Scenario-5    1997-2001       2002

    Table 3: Training and testing data set


    Application 5-11

    Full model: X1, ..., X28

    Predictors: 28 financial ratio variables

    Population (# solutions): 20

    Evolutionary iterations (generations): 100

    Elitism: 0.2 of the population

    Crossover rate: 0.5, mutation rate: 0.1

    Optimal SVM parameters: kernel parameter 1/178.75 and C = 63.44
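    These settings restated as a plain configuration dict, as could drive the GA-SVM sketch shown earlier; the key names are illustrative, not from the original implementation:

    ```python
    ga_settings = {
        "n_predictors": 28,          # financial ratio variables X1..X28
        "population_size": 20,       # candidate solutions per generation
        "n_generations": 100,        # evolutionary iterations
        "elitism_fraction": 0.2,     # share of the population carried over unchanged
        "crossover_rate": 0.5,
        "mutation_rate": 0.1,
    }
    print(ga_settings)
    ```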


    Application 5-12

    Quality of classification (1/2)

        Disc. power    training    testing
        AR             1           1
        AUC            1           1
        Accuracy       1           1
        Specificity    1           1
        Sensitivity    1           1

    Table 4: Discriminatory power of Scenario-1, 2, 3, 4, 5


    Application 5-13

    Quality of classification (2/2)

        training     TE (CV)     testing    TE (CV)
        1997         0 (8.98)    1998       0 ( 9.02)
        1997-1998    0 (8.99)    1999       0 (10.03)
        1997-1999    0 (9.37)    2000       0 ( 6.89)
        1997-2000    0 (8.57)    2001       0 ( 5.29)
        1997-2001    0 (4.55)    2002       0 ( 4.61)

    Table 5: Percentage of Training Error (TE) and Cross-Validation (CV, with group = 5)


    Conclusion 6-1

    Conclusion

    Optimal feature selection (via the Genetic Algorithm) leads to perfect classification

    Cross-validation overcomes the overfitting in training and testing error


    Genetic Algorithm for Support Vector Machines Optimization in Probability of Default Prediction

    Wolfgang Härdle
    Dedy Dwi Prastyo

    Ladislaus von Bortkiewicz Chair of Statistics
    C.A.S.E. Center for Applied Statistics and Economics
    Humboldt-Universität zu Berlin
    http://lvb.wiwi.hu-berlin.de
    http://www.case.hu-berlin.de


    References 7-1

    References

    Chen, S., Härdle, W. and Moro, R.
    Estimation of Default Probabilities with Support Vector Machines
    Quantitative Finance, 2011, 11, 135-154

    Holland, J. H.
    Adaptation in Natural and Artificial Systems
    University of Michigan Press, 1975


    References 7-2

    References

    Karatzoglou, A. and Meyer, D.
    Support Vector Machines in R
    Journal of Statistical Software, 2006, 15:9, 1-28

    Zhang, J. L. and Härdle, W.
    The Bayesian Additive Classification Tree Applied to Credit Risk Modelling
    Computational Statistics and Data Analysis, 2010, 54, 1197-1205


    Appendix Linearly Separable Case 9-1

    Linearly Separable Case

    Figure 7: Separating hyperplane and its margin in linearly separable case


    Appendix Linearly Separable Case 9-2

    Choose f ∈ F such that the margin (d_- + d_+) is maximal. No-error separation if all i = 1, 2, ..., n satisfy

    \[ x_i^\top w + b \ge +1 \quad \text{for } y_i = +1 \]
    \[ x_i^\top w + b \le -1 \quad \text{for } y_i = -1 \]

    Both constraints are combined into

    \[ y_i(x_i^\top w + b) - 1 \ge 0, \qquad i = 1, 2, \ldots, n \]


    Appendix Linearly Separable Case 9-3

    Distance between the margins and the separating hyperplane is d_+ = d_- = 1/\lVert w \rVert

    Maximizing the margin, d_+ + d_- = 2/\lVert w \rVert, can be attained by minimizing \lVert w \rVert or \lVert w \rVert^2

    Lagrangian for the primal problem

    \[ L_P(w, b) = \frac{1}{2}\lVert w \rVert^2 - \sum_{i=1}^{n} \alpha_i \{ y_i(x_i^\top w + b) - 1 \} \]


    Appendix Linearly Separable Case 9-4

    Karush-Kuhn-Tucker (KKT) first order optimality conditions

    \[ \frac{\partial L_P}{\partial w_k} = 0: \quad w_k - \sum_{i=1}^{n} \alpha_i y_i x_{ik} = 0, \qquad k = 1, \ldots, d \]
    \[ \frac{\partial L_P}{\partial b} = 0: \quad \sum_{i=1}^{n} \alpha_i y_i = 0 \]
    \[ y_i(x_i^\top w + b) - 1 \ge 0, \qquad i = 1, \ldots, n \]
    \[ \alpha_i \ge 0 \]
    \[ \alpha_i \{ y_i(x_i^\top w + b) - 1 \} = 0 \]


    Appendix Linearly Separable Case 9-5

    Solution w = \sum_{i=1}^{n} \alpha_i y_i x_i, therefore

    \[ \frac{1}{2}\lVert w \rVert^2 = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^\top x_j \]

    \[ -\sum_{i=1}^{n} \alpha_i \{ y_i(x_i^\top w + b) - 1 \} = -\sum_{i=1}^{n} \alpha_i y_i x_i^\top \sum_{j=1}^{n} \alpha_j y_j x_j + \sum_{i=1}^{n} \alpha_i = -\sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^\top x_j + \sum_{i=1}^{n} \alpha_i \]

    (the b-term vanishes because \sum_{i=1}^{n} \alpha_i y_i = 0)

    Lagrangian for the dual problem

    \[ L_D(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^\top x_j \]

    Appendix Linearly Separable Case 9-6

    Primal and dual problems

    \[ \min_{w, b} L_P(w, b) \]

    \[ \max_{\alpha} L_D(\alpha) \quad \text{s.t. } \alpha_i \ge 0, \quad \sum_{i=1}^{n} \alpha_i y_i = 0 \]

    The optimization problem is convex, therefore the dual and primal formulations give the same solution

    Support vector: a point x_i for which y_i(x_i^\top w + b) = 1 holds

    Appendix Linearly Non-separable Case 10-1

    Linearly Non-separable Case

    Figure 8: Hyperplane and its margin in linearly non-separable case

    Appendix Linearly Non-separable Case 10-2

    Slack variables \xi_i represent the violation of strict separation:

    \[ x_i^\top w + b \ge +1 - \xi_i \quad \text{for } y_i = +1, \]
    \[ x_i^\top w + b \le -1 + \xi_i \quad \text{for } y_i = -1, \]
    \[ \xi_i \ge 0 \]

    The constraints are combined into

    \[ y_i(x_i^\top w + b) \ge 1 - \xi_i \quad \text{and} \quad \xi_i \ge 0 \]

    Penalizing \xi_i > 0, the objective function is

    \[ \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i \]

    Appendix Linearly Non-separable Case 10-3

    Lagrange function for the primal problem

    \[ L_P(w, b, \xi) = \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \alpha_i \{ y_i(x_i^\top w + b) - 1 + \xi_i \} - \sum_{i=1}^{n} \mu_i \xi_i, \]

    where \alpha_i \ge 0 and \mu_i \ge 0 are Lagrange multipliers

    Primal problem

    \[ \min_{w, b, \xi} L_P(w, b, \xi) \]

    Appendix Linearly Non-separable Case 10-4

    First order conditions

    \[ \frac{\partial L_P}{\partial w_k} = 0: \quad w_k - \sum_{i=1}^{n} \alpha_i y_i x_{ik} = 0 \]
    \[ \frac{\partial L_P}{\partial b} = 0: \quad \sum_{i=1}^{n} \alpha_i y_i = 0 \]
    \[ \frac{\partial L_P}{\partial \xi_i} = 0: \quad C - \alpha_i - \mu_i = 0 \]
    \[ \text{s.t. } \alpha_i \ge 0, \quad \mu_i \ge 0, \quad \mu_i \xi_i = 0 \]
    \[ \alpha_i \{ y_i(x_i^\top w + b) - 1 + \xi_i \} = 0 \]

    Appendix Linearly Non-separable Case 10-5

    Note that \sum_{i=1}^{n} \alpha_i y_i b = 0. Translating the primal problem gives

    \[ L_D(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^\top x_j + \sum_{i=1}^{n} \xi_i (C - \alpha_i - \mu_i) \]

    The last term is 0, therefore the dual problem is

    \[ \max_{\alpha} L_D(\alpha) = \max_{\alpha} \left\{ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^\top x_j \right\}, \]
    \[ \text{s.t. } 0 \le \alpha_i \le C, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0 \]

    Appendix Genetic Algorithm 11-1

    What is a Genetic Algorithm?

    A genetic algorithm is a search and optimization technique based on Darwin's principle of natural selection (Holland, 1975)

    Appendix Genetic Algorithm 11-2

    GA Initialization

    [Diagram: a population of binary-encoded chromosomes is decoded into real-valued solutions, which are placed on the objective function f_obj with its global maximum and global minimum.]

    Figure 9: GA at first generation

    Appendix Genetic Algorithm 11-3

    GA Convergence

    Figure 10: Solutions at 1st generation (left) and rth generation (right)

    Appendix Genetic Algorithm 11-4

    GA Decoding

    Figure 11: Decoding

    \[ \theta = \mathrm{lower} + (\mathrm{upper} - \mathrm{lower}) \, \frac{\sum_{i=0}^{l-1} a_i 2^i}{2^l} \]

    where \theta is the decoded solution (i.e. the parameter C or the kernel parameter) and a_i is an allele
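    A direct implementation of this decoding rule; the bounds and the example chromosome are illustrative:

    ```python
    def decode(alleles, lower, upper):
        """Map a binary chromosome (list of 0/1 alleles a_0..a_{l-1}) to a real value."""
        l = len(alleles)
        integer = sum(a * 2**i for i, a in enumerate(alleles))   # sum_{i=0}^{l-1} a_i 2^i
        return lower + (upper - lower) * integer / 2**l

    # e.g. decode an 8-bit chromosome into a candidate value for C
    print(decode([1, 0, 1, 1, 0, 0, 1, 0], lower=0.1, upper=100.0))
    ```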

    Appendix Genetic Algorithm 11-5

    GA Fitness evaluation

    Calculate f(\theta_i), i = 1, ..., popsize

    Evaluate the fitness f_dp(\theta_i), where f_dp ∈ {AR, AUC, accuracy, specificity, sensitivity}

    Relative fitness:

    \[ p_i = \frac{f_{dp}(\theta_i)}{\sum_{k=1}^{\mathrm{popsize}} f_{dp}(\theta_k)} \]

    Figure 12: Proportion to be chosen in the next iteration (generation)

    Appendix Genetic Algorithm 11-6

    GA Roulette wheel

    rand ~ U(0, 1)

    Select the (k+1)-th chromosome if \sum_{i=1}^{k} p_i < rand < \sum_{i=1}^{k+1} p_i

    Repeat popsize times to get popsize new chromosomes
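    A small roulette-wheel selection sketch following this rule; the toy fitness values are invented:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    fitness = np.array([0.62, 0.71, 0.55, 0.80, 0.66])   # e.g. f_dp of five chromosomes
    p = fitness / fitness.sum()                          # relative fitness p_i
    cum = np.cumsum(p)                                   # cumulative sums of p_i

    def roulette_pick():
        """Return the 0-based index k with cum[k-1] < rand <= cum[k]."""
        rand = rng.uniform(0.0, 1.0)
        return int(np.searchsorted(cum, rand))

    new_population = [roulette_pick() for _ in range(len(fitness))]   # repeat popsize times
    print(new_population)
    ```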

    Appendix Genetic Algorithm 11-7

    GA Crossover

    Figure 13: Crossover in nature

    Figure 14: Randomly chosen one-point crossover (top) and two-point crossover (bottom)

    Appendix Genetic Algorithm 11-8

    GA Reproductive operator

    [Diagram: parents produce offspring by one-point crossover; an offspring of crossover becomes a mutant by bit-flip mutation.]

    Figure 15: One-point crossover (top) and bit-flip mutation (bottom)
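    A sketch of these two operators on binary chromosomes; the parent strings are invented:

    ```python
    import numpy as np

    rng = np.random.default_rng(2)

    def one_point_crossover(parent_a, parent_b):
        """Swap the tails of two parent chromosomes at a random cross point."""
        cut = rng.integers(1, len(parent_a))             # cross point
        child_a = np.concatenate([parent_a[:cut], parent_b[cut:]])
        child_b = np.concatenate([parent_b[:cut], parent_a[cut:]])
        return child_a, child_b

    def bit_flip_mutation(chromosome, rate=0.1):
        """Flip each allele independently with probability `rate`."""
        mask = rng.random(len(chromosome)) < rate
        return np.where(mask, 1 - chromosome, chromosome)

    a = np.array([1, 1, 1, 1, 0, 1, 0])
    b = np.array([0, 1, 1, 0, 1, 1, 0])
    c1, c2 = one_point_crossover(a, b)
    print(c1, c2, bit_flip_mutation(c1))
    ```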

    Appendix Genetic Algorithm 11-9

    GA Elitism

    The best solution of each iteration is maintained in a separate memory place

    When the new population replaces the old one, check whether the best solution is in the new population

    If not, replace any one member of the population with the best solution
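    A minimal elitism step matching these bullets; illustrative only, with the population held as a list of binary chromosomes:

    ```python
    import numpy as np

    def apply_elitism(new_population, best_solution):
        """Ensure the stored best chromosome survives into the new population."""
        if not any(np.array_equal(ind, best_solution) for ind in new_population):
            new_population[0] = best_solution      # replace any one member with the best
        return new_population

    population = [np.array([0, 1, 1, 0]), np.array([1, 0, 0, 1])]
    best = np.array([1, 1, 1, 0])
    print(apply_elitism(population, best))
    ```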

    Appendix Genetic Algorithm 11-10

    Nature to Computer Mapping

        Nature                    GA-SVM
        Population                Set of parameters
        Individual (phenotype)    Parameters
        Fitness                   Discriminatory power
        Chromosome (genotype)     Encoding of parameter
        Gene                      Binary encoding
        Reproduction              Crossover
        Generation                Iteration

    Table 6: Nature to GA-SVM mapping
