Support Vector Machines: An Introduction
林宗勳 ([email protected])

Preface

Support Vector Machines (SVMs) are a classification algorithm, a machine learning method proposed by Vapnik and others on the basis of statistical learning theory. SVMs show distinctive advantages on small-sample, nonlinear, and high-dimensional pattern recognition problems. They have been applied to practical problems such as handwriting recognition, 3-D object recognition, face recognition, and text and image classification, where they outperform existing learning methods and generalize well: decision rules learned from a limited training set still achieve a small error on an independent test set.

The Concept of SVM

Simply put, SVM solves the following problem: find a hyperplane that separates two different sets of points. The term "hyperplane" is used because real data may be high-dimensional, and a hyperplane is the analogue of a plane in a high-dimensional space. In the two-dimensional example shown in the figure, we want a line that separates the black points from the white points, and we also want the line's distance to the boundary of each set (the margin) to be as large as possible, so that points can be classified unambiguously; otherwise numerical precision issues can easily introduce errors.



Formally, we are given training data $\{(x_i, y_i)\},\ i = 1, \dots, n$, where $x_i \in R^d$ and $y_i \in \{+1, -1\}$.

We look for a decision function of the form $f(x) = w^T x - b$ such that $f(x_i) \geq 0$ for the points with $y_i = +1$ and $f(x_i) < 0$ for the points with $y_i = -1$; the sign of $f(x)$ then classifies a new point $x$.

Any hyperplane that separates the two classes is called a separating hyperplane; among all separating hyperplanes, the one that maximizes the margin is called the optimal separating hyperplane (OSH).

Support hyperplanes: the two hyperplanes parallel to the separating hyperplane that pass through the points of each class closest to it are called support hyperplanes. The optimal separating hyperplane lies exactly halfway between its two support hyperplanes, which can be written as

$$w^T x = b + \delta, \qquad w^T x = b - \delta.$$

The parameters $w$, $b$, $\delta$ are over-parameterized (an over-parameterized problem): multiplying all three by a common factor describes the same pair of hyperplanes. We therefore fix the scale by setting $\delta = 1$. The optimal separating hyperplane is then the separating hyperplane whose margin, the distance between its two support hyperplanes, is maximal. Let $d$ be the distance from the separating hyperplane $w^T x = b$ to either support hyperplane:

$$d = \frac{|(b + 1) - b|}{\|w\|} = \frac{1}{\|w\|}, \qquad d = \frac{|b - (b - 1)|}{\|w\|} = \frac{1}{\|w\|}.$$

The margin of the separating hyperplane is therefore margin $= 2d = 2 / \|w\|$, so minimizing $\|w\|$ maximizes the margin.
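As a quick numeric illustration (a hypothetical example, not from the original text):

    $$w = (3, 4)^T \;\Rightarrow\; \|w\| = \sqrt{3^2 + 4^2} = 5 \;\Rightarrow\; \text{margin} = \frac{2}{\|w\|} = \frac{2}{5},$$

so among hyperplanes obeying the $\delta = 1$ normalization, a smaller $\|w\|$ means a wider margin.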

Requiring that no training point lies between the two support hyperplanes gives the constraints

$$w^T x_i - b \geq +1 \ \text{ for } y_i = +1, \qquad w^T x_i - b \leq -1 \ \text{ for } y_i = -1,$$

which combine into the single condition

$$y_i (w^T x_i - b) - 1 \geq 0.$$

The optimization problem is therefore

minimize $\ \frac{1}{2}\|w\|^2$
subject to $\ y_i (w^T x_i - b) - 1 \geq 0 \quad \forall i.$

This is the primal problem of the SVM. It can be solved with the Lagrange Multiplier Method: define $L(w, b, \alpha_i)$, where the $\alpha_i$ are the Lagrange multipliers:

$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{N} \alpha_i \left[ y_i (w^T x_i - b) - 1 \right].$$

At the optimum, points with $y_i (w^T x_i - b) - 1 = 0$ may have $\alpha_i \geq 0$, while points with $y_i (w^T x_i - b) - 1 > 0$ must have $\alpha_i = 0$. Readers unfamiliar with the Lagrange Multiplier Method may consult:
http://episte.math.ntu.edu.tw/entries/en_lagrange_mul/index.html
http://episte.math.ntu.edu.tw/entries/en_lagrange_mul/notes.html

See also the Tutorial on Lagrange Multipliers:
http://www.twocw.net/mit/NR/rdonlyres/Electrical-Engineering-and-Computer-Science/6-867Machine-LearningFall2002/CFAA1B28-850A-422E-AF84-4A1C1DB51D2C/0/lagrange.pdf

Minimizing $L$ with respect to $w$ and $b$ gives

$$\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w^* = \sum_{i=1}^{N} \alpha_i y_i x_i, \qquad \frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \alpha_i y_i = 0.$$

Substituting these back into $L$ yields the dual Lagrangian, which is maximized over the $\alpha_i$:

maximize $\ L_D = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^T x_j$
subject to $\ \alpha_i \geq 0, \quad \sum_{i=1}^{N} \alpha_i y_i = 0.$

Collecting the optimality conditions:

$$\frac{\partial L_p}{\partial w} = 0 \;\Rightarrow\; w - \sum_{i=1}^{N} \alpha_i y_i x_i = 0$$
$$\frac{\partial L_p}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \alpha_i y_i = 0$$
$$y_i (w^T x_i - b) - 1 \geq 0$$
$$\alpha_i \geq 0 \quad \text{(Lagrange multiplier condition)}$$
$$\alpha_i \left[ y_i (w^T x_i - b) - 1 \right] = 0 \quad \text{(complementary slackness)}$$

Together these are the KKT (Karush-Kuhn-Tucker) conditions. Support vectors: among the training data, the points that satisfy the KKT conditions with $\alpha_i > 0$ are called support vectors; they lie exactly on the support hyperplanes.
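To make the conditions concrete, here is a small worked example (my own, not in the original). Take two training points $x_1 = (1, 1)^T$ with $y_1 = +1$ and $x_2 = (-1, -1)^T$ with $y_2 = -1$. By symmetry the optimal separating hyperplane is $w^T x = 0$ with

    $$w = \left(\tfrac{1}{2}, \tfrac{1}{2}\right)^T, \quad b = 0, \quad \alpha_1 = \alpha_2 = \tfrac{1}{4}.$$

Both constraints hold with equality ($y_i (w^T x_i - b) - 1 = 0$), $\sum_i \alpha_i y_i = \tfrac{1}{4} - \tfrac{1}{4} = 0$, and $\sum_i \alpha_i y_i x_i = \tfrac{1}{4}(1,1)^T + \tfrac{1}{4}(1,1)^T = w$, so every KKT condition is satisfied and both points are support vectors.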

Once the $\alpha_i$ are known, only the support vectors contribute, and the decision function becomes

$$f(x) = w^{*T} x - b^* = \sum_{i} \alpha_i y_i\, x_i^T x - b^*,$$

where

$$b^* = \sum_{i} \alpha_i y_i\, x_i^T x_j - y_j, \quad \text{for any } j \text{ such that } x_j \text{ is a support vector.}$$
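A minimal sketch of evaluating this decision function (my own illustration, reusing the toy example above; the linear kernel and the variable names are assumptions):

    import numpy as np

    alpha = np.array([0.25, 0.25])             # multipliers alpha_i of the support vectors
    y = np.array([+1.0, -1.0])                 # labels y_i
    X = np.array([[1.0, 1.0], [-1.0, -1.0]])   # support vectors x_i, one per row

    b = np.sum(alpha * y * (X @ X[0])) - y[0]  # b* from any support vector (here j = 0)

    def f(x):
        # f(x) = sum_i alpha_i y_i x_i^T x - b*
        return np.sum(alpha * y * (X @ x)) - b

    print(f(np.array([2.0, 0.5])) > 0)    # True: classified as +1
    print(f(np.array([-1.0, -2.0])) > 0)  # False: classified as -1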

Kernel

When the data are not linearly separable in the input space, we can map each point $x$ by a function $\phi$ into a higher-dimensional feature space where the classes become separable. In the dual problem and the decision function, the data appear only through inner products $x_i^T x_j$, so we never need to compute $\phi(x)$ explicitly: it suffices to replace $x_i^T x_j$ by a kernel function $k(x_i, x_j) = \phi(x_i)^T \phi(x_j)$. A commonly used kernel is the radial basis function (RBF):

$$k(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|^2}{2\sigma^2} \right).$$
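A minimal sketch of this kernel (my own illustration). Note that libsvm writes the exponent as $-\gamma \|x_i - x_j\|^2$, i.e. $\gamma = 1 / (2\sigma^2)$; the sigma value below is an arbitrary choice:

    import numpy as np

    def rbf_kernel(xi, xj, sigma=1.0):
        d2 = np.sum((xi - xj) ** 2)              # squared Euclidean distance
        return np.exp(-d2 / (2.0 * sigma ** 2))  # k(xi, xj) in (0, 1]

    print(rbf_kernel(np.array([1.0, 0.0]), np.array([0.0, 0.0])))  # exp(-0.5) = 0.6065...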

The Non-Separable Case

Even with a kernel, the data may not be perfectly separable. In that case no hyperplane satisfies all the constraints above, so we relax them with slack variables $\xi_i$:

$$w^T x_i - b \geq +1 - \xi_i \ \text{ for } y_i = +1, \qquad w^T x_i - b \leq -1 + \xi_i \ \text{ for } y_i = -1, \qquad \xi_i \geq 0,$$

and charge a cost for the violations, weighted by a parameter $C$:

$$\text{Cost} = C \sum_{i} \xi_i^{k}.$$

Taking $k = 1$, the primal problem becomes

minimize $\ \frac{1}{2}\|w\|^2 + C \sum_{i} \xi_i$
subject to $\ y_i (w^T x_i - b) - 1 + \xi_i \geq 0, \quad \xi_i \geq 0.$

Applying the Lagrange Multiplier Method again, with multipliers $\alpha_i$ and $\mu_i$:

$$L(w, b, \xi, \alpha, \mu) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i - \sum_{i=1}^{N} \alpha_i \left[ y_i (w^T x_i - b) - 1 + \xi_i \right] - \sum_{i=1}^{N} \mu_i \xi_i.$$

The KKT conditions become:

$$\frac{\partial L_p}{\partial w} = 0 \;\Rightarrow\; w - \sum_{i=1}^{N} \alpha_i y_i x_i = 0$$
$$\frac{\partial L_p}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \alpha_i y_i = 0$$
$$\frac{\partial L_p}{\partial \xi_i} = 0 \;\Rightarrow\; C - \alpha_i - \mu_i = 0$$
$$y_i (w^T x_i - b) - 1 + \xi_i \geq 0, \qquad \xi_i \geq 0$$
$$\alpha_i \geq 0 \quad \text{(Lagrange multiplier condition)}$$
$$\mu_i \geq 0 \quad \text{(Lagrange multiplier condition)}$$
$$\alpha_i \left[ y_i (w^T x_i - b) - 1 + \xi_i \right] = 0 \quad \text{(complementary slackness)}$$
$$\mu_i \xi_i = 0 \quad \text{(complementary slackness)}$$

For a more detailed treatment of the SVM derivation, see:
http://www.twocw.net/mit/Electrical-Engineering-and-Computer-Science/6-867Machine-LearningFall2002/LectureNotes/index.htm

Multi-class SVM

A basic SVM is a binary classifier. To classify more than two classes with SVMs, two common strategies are:

a. one-against-rest

For k classes, train k SVMs: the m-th SVM is trained with class m on one side and all the remaining classes on the other.

b. one-against-one

Train one SVM for every pair of classes, k(k-1)/2 SVMs in total: the SVM for classes a and b decides between a and b, each pairwise SVM casts a vote, and the class with the most votes wins (a sketch of the voting appears below).
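A minimal sketch of one-against-one voting (my own illustration; 'classifiers' is a hypothetical dict where classifiers[(a, b)](x) is the pairwise SVM for classes a and b and returns either a or b):

    from collections import Counter
    from itertools import combinations

    def predict_one_against_one(x, classes, classifiers):
        votes = Counter()
        for a, b in combinations(sorted(classes), 2):  # all k(k-1)/2 class pairs
            votes[classifiers[(a, b)](x)] += 1         # each pairwise SVM casts one vote
        return votes.most_common(1)[0][0]              # the class with the most votes wins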

c. hierarchical grouping

The classes can also be separated in groups. For example, with classes 1, 3, 6, 7, one SVM first separates the group {1, 3} from the group {6, 7} (an SVM trained on classes 1 and 3 versus classes 6 and 7); depending on which side a point falls, a further SVM then separates 1 from 3, or 6 from 7.

Example: Face Recognition with libsvm

As a concrete example, we apply SVMs to face recognition using the libsvm library:
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
Further libsvm documentation is available at:
http://www.csie.ntu.edu.tw/~cjlin/libsvm/otherdocuments/

The experiment uses the Yale Face Database (http://cvc.yale.edu/projects/yalefaces/yalefaces.html), which contains 15 subjects with 11 images per subject under varying illumination.

For each subject, 8 images are used for training and 3 for testing. The procedure:

1. Normalize each face image to 20 x 20 pixels, so each image becomes a 20 x 20 = 400-dimensional vector.

2. Use PCA for feature extraction, i.e., compute the eigenfaces:

a. Compute the average face

$$\Psi = \frac{1}{M} \sum_{n=1}^{M} \Gamma_n$$

(where M is the number of training images).

b. Subtract the average face from each image vector:

$$\Phi_i = \Gamma_i - \Psi$$

c. Collect the vectors $\Phi_i$ as the columns of a matrix $A$.

d. Compute the covariance matrix of $A$:

$$C = \frac{1}{M} \sum_{n=1}^{M} \Phi_n \Phi_n^T = \frac{1}{M} A A^T = \begin{pmatrix} \operatorname{var}(p_1) & \cdots & \operatorname{cov}(p_1, p_N) \\ \vdots & \ddots & \vdots \\ \operatorname{cov}(p_N, p_1) & \cdots & \operatorname{var}(p_N) \end{pmatrix}$$

e. Compute the eigenvectors of the covariance matrix.

f. Keep the 14 eigenvectors with the largest eigenvalues (the eigenfaces).

g. Project each face vector onto these 14 eigenvectors, giving a 14-dimensional feature vector per image.
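A minimal sketch of steps a-g (my own illustration; it assumes the M training images are already cropped to 20 x 20 and flattened into 400-dimensional rows, and forms the covariance as A^T A / M because images are rows here rather than columns):

    import numpy as np

    def eigenface_features(images, k=14):
        G = np.asarray(images, dtype=float)        # M x 400, one image per row
        psi = G.mean(axis=0)                       # a. average face
        A = G - psi                                # b./c. subtract the average face
        C = A.T @ A / len(G)                       # d. covariance matrix (400 x 400)
        vals, vecs = np.linalg.eigh(C)             # e. eigenvectors (C is symmetric)
        top = vecs[:, np.argsort(vals)[::-1][:k]]  # f. the k largest-eigenvalue ones
        return A @ top                             # g. project: M x k feature vectors

Usage: features = eigenface_features(training_images) returns one 14-dimensional row per face.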

Training and Testing with libsvm

libsvm expects its input files in the following plain-text format, one sample per line:

<label> <index1>:<value1> <index2>:<value2> ...

Our TrainingSet.txt therefore looks like this (features 4 through 13 elided):

1 1:599.252014 2:287.783844 3:618.011169 ... 14:112.386940
1 1:142.808151 2:540.754883 3:236.278305 ... 14:-50.537468
1 1:527.042419 2:183.781281 3:586.409546 ... 14:109.996880
1 1:413.529144 2:150.624924 3:604.855286 ... 14:62.5474210
1 1:192.096954 2:824.704895 3:-345.25378 ... 14:-30.935461
1 1:396.304565 2:291.588989 3:848.980896 ... 14:43.3401150
1 1:492.790314 2:294.677399 3:823.742981 ... 14:96.3041920
1 1:-1109.4270 2:-385.86004 3:205.677658 ... 14:138.645569
2 1:557.744751 2:-123.60966 3:-210.88768 ... 14:127.031754
2 1:547.599426 2:-254.80453 3:496.689636 ... 14:-159.22918
2 1:404.558716 2:-173.10485 3:-214.58169 ... 14:148.403870
2 1:237.526169 2:-147.54145 3:-181.04240 ... 14:41.8436470

and so on for the remaining subjects; a line for subject 15 looks like

15 1:-916.79571 2:-569.626953 3:299.984253 ... 14:163.308319

Because the raw feature values span very different ranges, we first normalize them to [-1, +1] with libsvm's svmscale tool:

svmscale usage:
svmscale [-l lower] [-u upper] [-y y_lower y_upper] [-s save_filename] [-r restore_filename] filename
(default: lower = -1, upper = 1, no y scaling)

svmscale -l -1 -u 1 -s range TrainingSet.txt > TrainingSet.scale

After scaling, we train with svmtrain:

svmtrain TrainingSet.scale

which produces the model file TrainingSet.scale.model.
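Looking back at the file format, a minimal sketch (my own illustration) of writing such a file from Python; 'labels' is a sequence of class ids and 'features' an M x 14 array of values:

    def write_libsvm(path, labels, features):
        with open(path, "w") as f:
            for label, row in zip(labels, features):
                pairs = " ".join(f"{j + 1}:{v:.6f}" for j, v in enumerate(row))
                f.write(f"{label} {pairs}\n")  # e.g. "1 1:599.252014 2:287.783844 ..."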

TrainingSet.scale.model begins with the model parameters:

svm_type c_svc
kernel_type rbf
gamma 0.0714286
nr_class 15
total_sv 119
rho 0.0148653 0.0397785 -0.0480648 -0.128948 0.11245 0.239645 0.0185018
label 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
nr_sv 8 8 8 8 7 8 8 8 8 8 8 8 8 8 8
SV

followed by the list of support vectors. Prediction is done with svmpredict:

Usage: svm-predict [options] test_file model_file output_file
options:
-b probability_estimates: whether to predict probability estimates, 0 or 1 (default 0); for one-class SVM only 0 is supported

svmpredict TrainingSet.scale TrainingSet.scale.model out
Accuracy = 94.1667% (113/120) (classification)
Mean squared error = 3.95 (regression)
Squared correlation coefficient = 0.804937 (regression)

The 94.1667% here is the accuracy on the training set itself. To obtain parameters that generalize, we combine cross-validation with a grid search over the two SVM parameters:

1. cost: the constant C of the non-separable-case SVM above;
2. gamma: the parameter of the RBF kernel.

Different choices of cost and gamma can give very different accuracy, so we search for a good (cost, gamma) pair rather than guessing.

To evaluate a parameter setting, libsvm uses n-fold cross-validation: the training data are split into n parts, each part is held out once for validation while the remaining n-1 parts are used for training, and the n validation accuracies are averaged. Selecting parameters this way guards against over-fitting: parameters that merely memorize the training set score poorly on the held-out folds.
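A minimal sketch of n-fold cross-validation (my own illustration; 'train' and 'accuracy' are hypothetical stand-ins for an svmtrain/svmpredict round trip on index splits):

    import numpy as np

    def cross_validate(X, y, train, accuracy, n=5):
        folds = np.array_split(np.random.permutation(len(y)), n)
        rates = []
        for i in range(n):
            held_out = folds[i]                               # validation fold
            rest = np.concatenate(folds[:i] + folds[i + 1:])  # the other n-1 folds
            model = train(X[rest], y[rest])
            rates.append(accuracy(model, X[held_out], y[held_out]))
        return float(np.mean(rates))                          # averaged validation rate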

libsvm ships with a script, grid.py, that performs the search. grid.py tries parameter combinations over an exponentially spaced grid, by default

$$C = 2^{-5}, 2^{-3}, \dots, 2^{15}, \qquad \gamma = 2^{-15}, 2^{-13}, \dots, 2^{3},$$

and reports the cross-validation rate for each pair:

grid.py TrainingSet.scale
[local] 13 3 43.3333 (best c=128.0, g=0.0078125, rate=82.5)
[local] 13 -9 82.5 (best c=128.0, g=0.0078125, rate=82.5)

[local] 13 -3 80.8333 (best c=128.0, g=0.0078125, rate=82.5)

The best parameters found by grid.py are cost = 128.0 and gamma = 0.0078125, with a cross-validation rate of 82.5%; we then retrain the SVM with cost=128, gamma=0.0078125.
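A minimal sketch of the search grid.py performs (my own illustration; 'cv_rate(c, g)' is a hypothetical stand-in for the cross-validation accuracy obtained with cost c and gamma g):

    def grid_search(cv_rate):
        best = (None, None, -1.0)
        for log2c in range(-5, 16, 2):       # C = 2^-5, 2^-3, ..., 2^15
            for log2g in range(-15, 4, 2):   # gamma = 2^-15, 2^-13, ..., 2^3
                c, g = 2.0 ** log2c, 2.0 ** log2g
                rate = cv_rate(c, g)
                if rate > best[2]:
                    best = (c, g, rate)      # keep the best pair seen so far
        return best                          # (cost, gamma, cross-validation rate)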

Remember that the testing data must be scaled the same way (using the range file saved by svmscale) before prediction.