New SVMs and Kernels


  • Slide 1/76

    Support Vector Machines and Kernel Methods

    Geoff Gordon

    [email protected]

    June 15, 2004

  • Slide 2/76

    Support vector machines

    The SVM is a machine learning algorithm which

    solves classification problems uses a flexible representation of the class boundaries

    implements automatic complexity control to reduce overfitting has a single global minimum which can be found in polynomial

    time

    It is popular because

    it can be easy to use

    it often has good generalization performance

    the same algorithm solves a variety of problems with little tuning

  • Slide 3/76

    SVM concepts

    Perceptrons

    Convex programming and duality

    Using maximum margin to control complexity

    Representing nonlinear boundaries with feature expansion

    The kernel trick for efficient optimization

  • Slide 4/76

    Outline

    Classification problems
    Perceptrons and convex programs
    From perceptrons to SVMs

    Advanced topics

  • Slide 5/76

    Classification example: Fisher's irises

    [Scatter plot of petal measurements for the three species: setosa, versicolor, virginica]

  • Slide 6/76

    Iris data

    Three species of iris

    Measurements of petal length, width

    Iris setosa is linearly separable from I. versicolor and I. virginica

  • Slide 7/76

    Example: Boston housing data

    [Scatter plot; x-axis: Industry & Pollution, y-axis: % Middle & Upper Class]

  • Slide 8/76

    A good linear classifier

    [The same Boston housing scatter plot with a linear class boundary; x-axis: Industry & Pollution, y-axis: % Middle & Upper Class]

  • Slide 9/76

    Example: NIST digits

    [Several example digit images]

    28 × 28 = 784 features in [0, 1]

  • Slide 10/76

    Class means

    [Images of the class-mean digits]

  • Slide 11/76

  • Slide 12/76

    Sometimes a nonlinear classifier is better

    [Digit images]

  • Slide 13/76

    Sometimes a nonlinear classifier is better

    [2-D scatter plot of a dataset that no single line separates well]

  • Slide 14/76

    Classification problem

    [Figure: a data matrix X with a label vector y, illustrated on two example datasets]

    Data points X = [x_1; x_2; x_3; ...] with x_i ∈ R^n

    Labels y = [y_1; y_2; y_3; ...] with y_i ∈ {−1, +1}

    Solution is a subset of R^n, the classifier

    Often represented as a test f(x, learnable parameters) ≥ 0

    Define: decision surface, linear separator, linearly separable

  • Slide 15/76

    What is the goal?

    Classify new data with the fewest possible mistakes

    Proxy: minimize some loss function on the training data

        min_w Σ_i l(y_i f(x_i; w)) + l_0(w)

    That's l(f(x)) for positive examples and l(−f(x)) for negative ones

    [Plots of two loss functions: piecewise-linear loss and logistic loss]
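    The two losses plotted above can be written down directly. A minimal NumPy sketch (assuming the piecewise-linear loss is the perceptron penalty [−y f(x)]₊ used on slide 18; the margin values are illustrative):

```python
import numpy as np

def piecewise_linear_loss(margin):
    # Perceptron-style piecewise-linear loss: l(y f(x)) = max(0, -y f(x))
    return np.maximum(0.0, -margin)

def logistic_loss(margin):
    # Logistic loss: l(y f(x)) = log(1 + exp(-y f(x)))
    return np.log1p(np.exp(-margin))

margins = np.linspace(-3, 3, 7)   # values of y * f(x), as on the plots' x-axis
print(piecewise_linear_loss(margins))
print(logistic_loss(margins))
```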

  • Slide 16/76

    Getting fancy

    Text? Hyperlinks? Relational database records?

    difficult to featurize w/ reasonable number of features

    but what if we could handle large or infinite feature sets?

  • Slide 17/76

    Outline

    Classification problems
    Perceptrons and convex programs
    From perceptrons to SVMs

    Advanced topics

  • Slide 18/76

    Perceptrons

    [Plot of the perceptron penalty as a function of y f(x)]

    Weight vector w, bias c

    Classification rule: sign(f(x)) where f(x) = x·w + c

    Penalty for mispredicting: l(y f(x)) = [−y f(x)]₊

    This penalty is convex in w, so all minima are global

    Note: unit-length x vectors

  • Slide 19/76

    Training perceptrons

    Perceptron learning rule: on mistake,

    w += yx

    c += y

    That is, gradient descent on l(y f(x)), since

        ∇_w [−y(x·w + c)]₊ = −y x   if y(x·w + c) ≤ 0
                           = 0      otherwise
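    A minimal NumPy sketch of this update rule (the epoch count and toy data below are illustrative, not from the slides):

```python
import numpy as np

def train_perceptron(X, y, epochs=10):
    """Perceptron learning rule from the slide: on each mistake, w += y*x, c += y.

    X: (m, n) array of examples (the slide assumes unit-length rows);
    y: (m,) array of labels in {-1, +1}.
    """
    m, n = X.shape
    w = np.zeros(n)
    c = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + c) <= 0:      # mistake (or on the boundary)
                w += yi * xi                # gradient step on [-y(x.w + c)]_+
                c += yi
    return w, c

# Toy usage on a linearly separable 2-D problem
X = np.array([[1.0, 1.0], [2.0, 1.5], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w, c = train_perceptron(X, y)
print(np.sign(X @ w + c))   # should match y
```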

  • Slide 20/76

    Perceptron demo

    [Figures from the live perceptron demo]

  • Slide 21/76

    Perceptrons as linear inequalities

    Linear inequalities (for separable case):

        y(x·w + c) > 0

    That is:
    x·w + c > 0 for positive examples
    x·w + c < 0 for negative examples

  • Slide 22/76

    Version space

        x·w + c = 0

    As a function of x: a hyperplane with normal w at distance c/‖w‖ from the origin

    As a function of w: a hyperplane with normal x at distance c/‖x‖ from the origin

  • Slide 23/76

    Convex programs

    Convex program:

        min f(x)   subject to   g_i(x) ≤ 0,   i = 1 ... m

    where f and the g_i are convex functions

    The perceptron is almost a convex program (strict > vs. ≥)

    Trick: write

        y(x·w + c) ≥ 1

  • Slide 24/76

    Slack variables

    [Loss plot]

    If not linearly separable, add slack variables s ≥ 0:

        y(x·w + c) + s ≥ 1

    Then Σ_i s_i is the total amount by which the constraints are violated, so try to make Σ_i s_i as small as possible

  • Slide 25/76

    Perceptron as convex program

    The final convex program for the perceptron is:

        min Σ_i s_i   subject to
        (y_i x_i)·w + y_i c + s_i ≥ 1
        s_i ≥ 0

    We will try to understand this program using convex duality
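    Since this is a linear program, one way to see it concretely is to hand it to a generic LP solver. A hedged sketch using scipy.optimize.linprog with illustrative toy data (the slides themselves do not prescribe a solver):

```python
import numpy as np
from scipy.optimize import linprog

# Toy 2-D data: rows of X are examples, y in {-1, +1}
X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.0], [-0.5, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m, n = X.shape
Z = y[:, None] * X                       # rows z_i = y_i x_i

# Variables stacked as [w (n), c (1), s (m)]; minimize sum(s)
cost = np.concatenate([np.zeros(n + 1), np.ones(m)])
# Constraint (y_i x_i).w + y_i c + s_i >= 1  rewritten as  -(Z w + y c + s) <= -1
A_ub = np.hstack([-Z, -y[:, None], -np.eye(m)])
b_ub = -np.ones(m)
bounds = [(None, None)] * (n + 1) + [(0, None)] * m   # w, c free; s >= 0

res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
w, c, s = res.x[:n], res.x[n], res.x[n + 1:]
print("w =", w, "c =", c, "total slack =", s.sum())
```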

  • Slide 26/76

    Duality

    To every convex program corresponds a dual

    Solving original (primal) is equivalent to solving dual

    Dual often provides insight

    Can derive the dual by using Lagrange multipliers to eliminate constraints

  • Slide 27/76

    Lagrange Multipliers

    Way to phrase constrained optimization problem as a game

        max_x f(x)   subject to   g(x) ≥ 0

    (assume f, g are convex downward, i.e. concave)

        max_x min_{a≥0} f(x) + a·g(x)

    If x plays a point with g(x) < 0, the a-player can drive the payoff to −∞, so the inner min enforces the constraint

  • Slide 28/76

    Lagrange Multipliers: the picture

    [Figure: level sets of f(x, y) and the circular constraint boundary for the example on the next slide]

  • Slide 29/76

    Lagrange Multipliers: the caption

    Problem: maximize

    f(x, y) = 6x + 8y

    subject to

    g(x, y) = 1 − x² − y² ≥ 0

    Using a Lagrange multiplier a,

        max_{x,y} min_{a≥0} f(x, y) + a·g(x, y)

    At the optimum,

        0 = ∇f(x, y) + a·∇g(x, y) = (6, 8) − 2a·(x, y)
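    Completing the example numerically (this completion is not on the slide; it assumes the sign convention reconstructed above and uses scipy only as a check):

```python
import numpy as np
from scipy.optimize import minimize

# Maximize f(x, y) = 6x + 8y subject to g(x, y) = 1 - x^2 - y^2 >= 0.
# Analytically: 0 = grad f + a * grad g  =>  (6, 8) = 2a (x, y)  =>  (x, y) = (3/a, 4/a);
# the active constraint x^2 + y^2 = 1 then gives a = 5, so (x, y) = (0.6, 0.8) and f = 10.
res = minimize(lambda p: -(6 * p[0] + 8 * p[1]),
               x0=[0.0, 0.0],
               constraints=[{"type": "ineq", "fun": lambda p: 1 - p[0]**2 - p[1]**2}],
               method="SLSQP")
print(res.x, -res.fun)   # approximately [0.6, 0.8] and 10
```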

  • Slide 30/76

    Duality for the perceptron

    Notation: z_i = y_i x_i and Z = [z_1; z_2; ...], so that:

        min_{s,w,c} Σ_i s_i   subject to   Zw + cy + s ≥ 1,   s ≥ 0

    Using a Lagrange multiplier vector a ≥ 0,

        min_{s,w,c} max_a Σ_i s_i − aᵀ(Zw + cy + s − 1)

        subject to   s ≥ 0,   a ≥ 0

  • Slide 31/76

    Duality, cont'd

    From the last slide:

        min_{s,w,c} max_a Σ_i s_i − aᵀ(Zw + cy + s − 1)
        subject to   s ≥ 0,   a ≥ 0

    Minimize with respect to w, c explicitly by setting the gradient to 0:

        0 = aᵀZ
        0 = aᵀy

    Minimizing with respect to s yields an inequality:

        0 ≤ 1 − a

  • Slide 32/76

    Duality, cont'd

    Final form of the dual program for the perceptron:

        max_a 1ᵀa   subject to
        0 = aᵀZ
        0 = aᵀy
        0 ≤ a ≤ 1

  • Slide 33/76

    Problems with perceptrons

    [2-D scatter plot]

    Vulnerable to overfitting when there are many input features

    Not very expressive (e.g., cannot represent XOR)

  • Slide 34/76

    Outline

    Classification problems

    Perceptrons and convex programs
    From perceptrons to SVMs

    Advanced topics

  • Slide 35/76

    Modernizing the perceptron

    Three extensions:

    Margins
    Feature expansion

    Kernel trick

    Result is called a Support Vector Machine (reason given below)

  • Slide 36/76

    Margins

    The margin is the signed distance from an example to the decision boundary

    Positive margin: correctly classified; negative margin: error

  • Slide 37/76

    SVMs are maximum margin

    Maximize minimum distance from data to separator

    Ball center of version space (caveats)

    Other centers: analytic center, center of mass, Bayes point

    Note: if not linearly separable, must trade margin vs. errors

  • Slide 38/76

    Why do margins help?

    If our hypothesis is near the boundary of decision space, we don't necessarily learn much from our mistakes

    If we're far away from any boundary, a mistake has to eliminate a large volume from version space

  • Slide 39/76

    Why margins help, explanation 2

    Occam's razor: simple classifiers are likely to do better in practice

    Why? There are fewer simple classifiers than complicated ones, so we are less likely to be able to fool ourselves by finding a really good fit by accident.

    What does "simple" mean? Anything, as long as you tell me before you see the data.

  • Slide 40/76

    Explanation 2, cont'd

    Simple can mean:

    Low-dimensional

    Large margin

    Short description length

    For this lecture we are interested in large margins and compact descriptions

    By contrast, many classical complexity-control methods (AIC, BIC) rely on low dimensionality alone

  • Slide 41/76

    Why margins help, explanation 3

    [Loss plots]

    Margin loss is an upper bound on the number of mistakes

  • Slide 42/76

    Why margins help, explanation 4

  • Slide 43/76

    Optimizing the margin

    Most common method: convex quadratic program

    Efficient algorithms exist (essentially the same as some interior-point LP algorithms)

    Because the QP is strictly convex, it has a unique global optimum (if you ignore the intercept term)

    Next few slides derive the QP. Notation: assume w.l.o.g. ‖x_i‖ = 1; ignore slack variables for now (i.e., assume linearly separable)

  • Slide 44/76

    Optimizing the margin, cont'd

    The margin M is the distance to the decision surface: for a positive example, (x − M w/‖w‖)·w + c = 0, i.e.

        x·w + c = M w·w/‖w‖ = M‖w‖

    The SVM maximizes M > 0 such that all margins are ≥ M:

        max_{M>0, w, c} M   subject to   (y_i x_i)·w + y_i c ≥ M‖w‖

    Notation: z_i = y_i x_i and Z = [z_1; z_2; ...], so that:

        Zw + yc ≥ M‖w‖

    Note: (w, c) is a solution exactly when any rescaling (λw, λc), λ > 0, is

  • Slide 45/76

    Optimizing the margin, cont'd

    Divide by M‖w‖ to get (Zw + yc)/(M‖w‖) ≥ 1

    Define v = w/(M‖w‖) and d = c/(M‖w‖), so that ‖v‖ = ‖w‖/(M‖w‖) = 1/M

        max_{v,d} 1/‖v‖   subject to   Zv + yd ≥ 1

    Maximizing 1/‖v‖ is minimizing ‖v‖, which is minimizing ‖v‖²:

        min_{v,d} ‖v‖²   subject to   Zv + yd ≥ 1

    Add slack variables to handle the non-separable case:

        min_{s≥0, v, d} ‖v‖² + C Σ_i s_i   subject to   Zv + yd + s ≥ 1
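    A hedged sketch of this final QP using the cvxpy modeling library (the toy data and the value of C are illustrative; the slides do not specify a solver):

```python
import numpy as np
import cvxpy as cp

# Toy data: rows of X are examples, y in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.5, -1.0], [-2.0, -2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
Z = y[:, None] * X                      # rows z_i = y_i x_i
m, n = X.shape
C = 1.0                                 # slack penalty

v = cp.Variable(n)
d = cp.Variable()
s = cp.Variable(m, nonneg=True)
constraints = [Z @ v + d * y + s >= 1]
objective = cp.Minimize(cp.sum_squares(v) + C * cp.sum(s))
cp.Problem(objective, constraints).solve()

print("v =", v.value, "d =", d.value)
print("margin M = 1/||v|| =", 1.0 / np.linalg.norm(v.value))
```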

  • Slide 46/76

    Modernizing the perceptron

    Three extensions:

    Margins
    Feature expansion
    Kernel trick

  • Slide 47/76

    Feature expansion

    Given an example x = [a b ...]

    Could add new features like a², ab, a⁷b³, sin(b), ...

    Same optimization as before, but with longer x vectors and so a longer w vector

    Classifier: is 3a + 2b + a² + 3ab − a⁷b³ + 4 sin(b) ≥ 2.6?

    This classifier is nonlinear in the original features, but linear in the expanded feature space

    We have replaced x by φ(x) for some nonlinear φ, so the decision boundary is the nonlinear surface w·φ(x) + c = 0
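    A tiny illustration of an explicit feature expansion φ and a classifier that is linear in φ(x) but nonlinear in the original features (the particular features and weights below are hypothetical, not the slide's):

```python
import numpy as np

def phi(x):
    # A hypothetical explicit feature expansion of x = [a, b]
    a, b = x
    return np.array([a, b, a**2, a*b, np.sin(b)])

# A linear classifier in the expanded space is nonlinear in (a, b)
w = np.array([3.0, 2.0, 1.0, 3.0, 4.0])   # weights on the expanded features
c = -2.6
x = np.array([0.5, 1.2])
print(np.sign(phi(x) @ w + c))
```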

  • Slide 48/76

    Feature expansion example

    [Figure: the same dataset plotted with features (x, y) and with expanded features (x, y, xy)]

  • Slide 49/76

    Some popular feature sets

    Polynomials of degree k

        1, a, a², b, b², ab, ...

    Neural nets (sigmoids)

        tanh(3a + 2b − 1), tanh(5a − 4), ...

    RBFs of radius σ

        exp(−((a − a₀)² + (b − b₀)²) / (2σ²))

  • Slide 50/76

    Feature expansion problems

    Feature expansion techniques yield lots of features

    E.g., a polynomial of degree k on n original features yields O(n^k) expanded features

    E.g. RBFs yield infinitely many expanded features!

    Inefficient (for i = 1 to infinity do ...)

    Overfitting (VC-dimension argument)

  • Slide 51/76

    How to fix feature expansion

    We have already shown we can handle the overfitting problem: even if we have lots of parameters, large margins make simple classifiers

    All that's left is efficiency

    Solution: kernel trick

  • Slide 52/76

    Modernizing the perceptron

    Three extensions:

    Margins
    Feature expansion

    Kernel trick

  • Slide 53/76

    Kernel trick

    Way to make optimization efficient when there are lots of features

    Compute one Lagrange multiplier per training example instead of one weight per feature (part I)

    Use a kernel function to avoid ever representing w (part II)

    Will mean we can handle infinitely many features!

  • Slide 54/76

    Kernel trick, part I

        min_{w,c} ‖w‖²/2   subject to   Zw + yc ≥ 1

        min_{w,c} max_{a≥0} wᵀw/2 + a·(1 − Zw − yc)

    Minimize with respect to w, c by setting derivatives to 0:

        0 = w − Zᵀa        0 = a·y

    Substitute back in for w, c:

        max_{a≥0} a·1 − aᵀZZᵀa/2   subject to   a·y = 0

    Note: to allow slacks, add an upper bound a ≤ C

  • Slide 55/76

    What did we just do?

        max_{0≤a≤C} a·1 − aᵀZZᵀa/2   subject to   a·y = 0

    Now we have a QP in a instead of in w, c

    Once we solve for a, we can find w = Zᵀa to use for classification

    We also need c, which we can get from complementarity:

        y_i x_i·w + y_i c = 1   whenever a_i > 0

    or as the dual variable for the constraint a·y = 0

  • Slide 56/76

    Representation of w

    The optimal w = Zᵀa is a linear combination of the rows of Z

    I.e., w is a linear combination of (signed) training examples

    I.e., w has a finite representation even if there are infinitely many features

  • Slide 57/76

    Support vectors

    Examples with a_i > 0 are called support vectors

    Support vector machine = learning algorithm ("machine") based on support vectors

    There are often many fewer support vectors than training examples (a is sparse)

    This is the "short description" of an SVM mentioned above

  • Slide 58/76

    Intuition for support vectors

    [Scatter plot illustrating which examples become support vectors]

  • Slide 59/76

  • Slide 60/76

    At the end of optimization

    The gradient with respect to a_i is 1 − y_i(x_i·w + c)

    Increase a_i if the (scaled) margin is < 1

    Stable iff (a_i = 0 AND margin ≥ 1) OR margin = 1

  • Slide 61/76

    How to avoid writing down weights

    Suppose number of features is really big or even infinite?

    Can't write down X, so how do we solve the QP?

    Can't even write down w, so how do we classify new examples?

  • Slide 62/76

  • Slide 63/76

    Kernel trick, part II

    Yes, we can compute G directly, sometimes!

    Recall that x_i was the result of applying a nonlinear feature expansion function φ to some shorter vector (say q_i)

    Define K(q_i, q_j) = φ(q_i)·φ(q_j)

  • Slide 64/76

  • Slide 65/76

    Example kernels

    Polynomial (a typical component of φ might be 17 q_1² q_2³ q_4)

        K(q, q′) = (1 + q·q′)^k

    Sigmoid (typical component tanh(q_1 + 3q_2))

        K(q, q′) = tanh(a q·q′ + b)

    Gaussian RBF (typical component exp(−(q_1 − 5)²/(2σ²)))

        K(q, q′) = exp(−‖q − q′‖²/(2σ²))
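    Sketches of these three kernels as plain Python functions (the hyperparameter names k, a, b, sigma follow the slide's notation; the default values and test vectors are illustrative):

```python
import numpy as np

def poly_kernel(q, qp, k=2):
    return (1.0 + q @ qp) ** k

def sigmoid_kernel(q, qp, a=1.0, b=0.0):
    return np.tanh(a * (q @ qp) + b)

def rbf_kernel(q, qp, sigma=1.0):
    return np.exp(-np.sum((q - qp) ** 2) / (2.0 * sigma ** 2))

q, qp = np.array([1.0, 2.0, 0.5]), np.array([0.0, 1.0, -1.0])
print(poly_kernel(q, qp), sigmoid_kernel(q, qp), rbf_kernel(q, qp))
```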

  • Slide 66/76

    Detail: polynomial kernel

    Suppose x = φ(q) = (1, √2 q, q²)

    Then x·x′ = 1 + 2qq′ + q²(q′)²

    From the previous slide,

        K(q, q′) = (1 + qq′)² = 1 + 2qq′ + q²(q′)²

    Dot product + addition + exponentiation vs. O(n^k) terms
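    A quick numeric check of this identity for a scalar q (the values are illustrative):

```python
import numpy as np

def phi(q):
    # Explicit degree-2 feature map for scalar q, as on the slide
    return np.array([1.0, np.sqrt(2.0) * q, q ** 2])

def K(q, qp):
    # Degree-2 polynomial kernel
    return (1.0 + q * qp) ** 2

q, qp = 0.7, -1.3
print(phi(q) @ phi(qp))   # explicit feature expansion
print(K(q, qp))           # kernel shortcut: same value
```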

  • Slide 67/76

    The new decision rule

    Recall the original decision rule: sign(x·w + c)

    Use the representation in terms of support vectors:

        sign(x·Zᵀa + c) = sign(Σ_i x·x_i y_i a_i + c) = sign(Σ_i K(q, q_i) y_i a_i + c)

    Since there are usually not too many support vectors, this is a reasonably fast calculation

  • Slide 68/76

    Summary of SVM algorithm

    Training:

    Compute the Gram matrix G_ij = y_i y_j K(q_i, q_j)
    Solve the QP to get a
    Compute the intercept c by using complementarity or duality

    Classification:

    Compute k_i = K(q, q_i) for the support vectors q_i
    Compute f = c + Σ_i a_i k_i y_i
    Test sign(f)
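    An end-to-end sketch of these training and classification steps, using the dual QP from slides 54-55, an RBF kernel, and cvxpy as the QP solver (the toy data, C, and sigma are illustrative; this is one possible implementation, not the slides' own code):

```python
import numpy as np
import cvxpy as cp

def rbf(qi, qj, sigma=1.0):
    return np.exp(-np.sum((qi - qj) ** 2) / (2.0 * sigma ** 2))

# Illustrative toy training set (XOR-like, so a linear classifier would fail)
Q = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m, C = len(y), 10.0

# Training step 1: kernel matrix and Gram matrix G_ij = y_i y_j K(q_i, q_j)
K = np.array([[rbf(qi, qj) for qj in Q] for qi in Q])
G = np.outer(y, y) * K

# Training step 2: solve the dual QP  max_a 1'a - a'Ga/2  s.t.  0 <= a <= C, a'y = 0.
# The quadratic a'Ga is written as ||L'(y*a)||^2 with K ~ L L' (Cholesky plus a small
# jitter), which keeps the problem in a form cvxpy accepts without PSD warnings.
L = np.linalg.cholesky(K + 1e-9 * np.eye(m))
a = cp.Variable(m)
obj = cp.Maximize(cp.sum(a) - 0.5 * cp.sum_squares(L.T @ cp.multiply(y, a)))
cp.Problem(obj, [a >= 0, a <= C, a @ y == 0]).solve()
alpha = a.value

# Training step 3: intercept c from complementarity (use an i with 0 < a_i < C)
i = int(np.argmax((alpha > 1e-6) & (alpha < C - 1e-6)))
c = y[i] - np.sum(alpha * y * K[i])

# Classification: f(q) = c + sum_i a_i K(q, q_i) y_i, then take the sign
def classify(q):
    k = np.array([rbf(q, qi) for qi in Q])
    return np.sign(c + np.sum(alpha * k * y))

print([classify(q) for q in Q])   # should reproduce the training labels
```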

  • Slide 69/76

    Outline

    Classification problems

    Perceptrons and convex programs
    From perceptrons to SVMs

    Advanced topics

  • Slide 70/76

    Advanced kernels

    All problems so far: each example is a list of numbers

    What about text, relational DBs, . . . ?

    Insight: K(x, y) can be defined when x and y are not fixed length

    Examples:

    String kernels
    Path kernels
    Tree kernels

    Graph kernels

  • Slide 71/76

    String kernels

    Pick λ ∈ (0, 1)

    cat → c, a, t, λ ca, λ at, λ² ct, λ² cat

    Strings are similar if they share lots of nearly-contiguous substrings

    Works for words in phrases too: "man bites dog" is similar to "man bites hot dog", less similar to "dog bites man"

    There is an efficient dynamic-programming algorithm to evaluate this kernel (Lodhi et al., 2002)
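    For intuition, a brute-force version of this feature map and kernel (assuming the λ^(span−1) weighting reconstructed in the "cat" example above; it is exponential in the string length, so it is for illustration only, and the dynamic program of Lodhi et al. is the practical route):

```python
from itertools import combinations
from collections import defaultdict

def subsequence_features(s, lam=0.5):
    """Brute-force feature map: every (possibly gappy) subsequence u of s,
    weighted by lam**(span - 1), where span is the stretch of s it covers."""
    feats = defaultdict(float)
    for r in range(1, len(s) + 1):
        for idx in combinations(range(len(s)), r):
            u = "".join(s[i] for i in idx)
            span = idx[-1] - idx[0] + 1
            feats[u] += lam ** (span - 1)
    return feats

def string_kernel(s, t, lam=0.5):
    fs, ft = subsequence_features(s, lam), subsequence_features(t, lam)
    return sum(v * ft[u] for u, v in fs.items() if u in ft)

print(sorted(subsequence_features("cat").items()))   # matches the slide's example
print(string_kernel("man bites dog", "dog bites man"))
```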

  • Slide 72/76

    Combining kernels

    Suppose K(x, y) and K′(x, y) are kernels

    Then so are

        K + K′   and   αK for α > 0

    Given a set of kernels K_1, K_2, ..., we can search for the best

        K = α_1 K_1 + α_2 K_2 + ...

    using cross-validation, etc.
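    A small sketch of this closure property as code (it reuses the hypothetical kernel functions sketched after slide 65; the weights are arbitrary):

```python
def combine_kernels(kernels, weights):
    """A nonnegative weighted sum of kernel functions is again a kernel."""
    def K(x, y):
        return sum(w * k(x, y) for k, w in zip(kernels, weights))
    return K

# e.g. mix the RBF and degree-2 polynomial kernels defined earlier:
# K_mix = combine_kernels([rbf_kernel, poly_kernel], [0.7, 0.3])
```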

  • Slide 73/76

    Kernel X

    The kernel trick isn't limited to SVMs

    Works whenever we can express an algorithm using only sums and dot products of training examples

    Examples:

    kernel Fisher discriminant
    kernel logistic regression
    kernel linear and ridge regression
    kernel SVD or PCA
    1-class learning / density estimation

  • Slide 74/76

    Summary

    Perceptrons are a simple, popular way to learn a classifier

    They suffer from inefficient use of data, overfitting, and lack of expressiveness

    SVMs fix these problems using margins and feature expansion

    In order to make feature expansion computationally feasible, we need the kernel trick

    The kernel trick avoids writing out high-dimensional feature vectors by use of Lagrange multipliers and the representer theorem

    SVMs are popular classifiers because they usually achieve good error rates and can handle unusual types of data

  • Slide 75/76

    References

    http://www.cs.cmu.edu/~ggordon/SVMs

    these slides, together with code

    http://svm.research.bell-labs.com/SVMdoc.html

    Burges' SVM tutorial

    http://citeseer.nj.nec.com/burges98tutorial.html
    Burges' paper "A Tutorial on Support Vector Machines for Pattern Recognition" (1998)

  • Slide 76/76

    References

    Huma Lodhi, Craig Saunders, Nello Cristianini, John Shawe-Taylor, Chris Watkins. Text Classification using String Kernels. 2002.

    http://www.cs.rhbnc.ac.uk/research/compint/areas/comp_learn/sv/pub/slide1.ps
    Slides by Stitson & Weston

    http://oregonstate.edu/dept/math/CalculusQuestStudyGuides/vcalc/lagrang/lagrang.html
    Lagrange Multipliers

    http://svm.research.bell-labs.com/SVT/SVMsvt.html

    on-line SVM applet