Linear Learning Machines and Support Vector Machines
for classification and regression

Giovanni Maria Farinella
[email protected]


Page 1: Linear Learning Machines and Support Vector Machines

for classification and regression

Giovanni Maria Farinella
[email protected]

Page 2: Outline of this talk

• Machine Learning and Pattern Recognition
  – Linear Learning Machine for Regression
  – Support Vector Machines
    • Classification
    • Regression

Please ask questions!

Page 3: What is ML and PR?

• Interdisciplinary field focusing on both the mathematical foundations and practical applications of systems that learn, reason and act.

Page 4: What is ML and PR useful for?

• Object and gesture recognition
• Retrieval
• Categorisation
• Clustering
• Relations between pages
• Robotics
• Bioinformatics
• Medical diagnosis
• Automatic speech recognition
• Challenge: to improve the accuracy of movie preference predictions ($1m, 2006)

Page 5: Learning Approaches

• Imagine an organism/machine which experiences a series of sensory inputs: x1, x2, ..., xn.

• Supervised learning: the organism/machine is also given desired outputs y1, y2, ..., yn, and its goal is to learn to produce the correct output given a new input.

• Unsupervised learning: the goal of the organism/machine is to build a model of x that can be used for reasoning, decision making, predicting things, etc.

• Reinforcement learning: the organism/machine can also produce actions a1, a2, ... which affect the state of the world, and receives rewards (or punishments) r1, r2, .... Its goal is to learn to act in a way that maximises rewards in the long term.

Page 6: Typical Supervised problems

• Classification/Recognition: the aim is to assign each input vector to one of a finite number of categories.

• Regression: the desired output consists of one or more continuous variables.

Page 7: Regression

• Regression is the determination of a general function y = f(x) fitting a set of points S = {(x_i, y_i), i = 1, ..., n, y_i ∈ R}.

• The function depends on a number of unknown parameters.

• Interpretations:
  – Deterministic: the samples (x_i, y_i) are viewed as pairs of known numbers. These numbers might be the result of measurements involving random error; however, this is not used in solving the problem, and the goodness of fit is not interpreted in a statistical sense.
  – Statistical: the x_i are known numbers, but y_i is the observed value of a random variable Y_i with expected value E(y_i) (e.g. E(y_i) = x_{i1} w_1 + x_{i2} w_2 + ... + x_{im} w_m + b).
  – Prediction: the pairs (x_i, y_i) are samples of two random variables X and Y, and the objective is to find the best predictor (in terms of statistical properties) Y* = f(X) of Y in terms of X.

Page 8: Regression framework

Learning phase:
• Given a dataset S = {(x_i, y_i), i = 1, ..., n, y_i ∈ R}
• Given some hypotheses regarding f(x)
• Estimate the parameters of f(x) by using S

Recognition phase:
• Use the estimated parameters to determine f(x) at new x

Page 9: Linear Learning Machine for regression

Page 10: Linear Learning Machine

• Learning machines using hypotheses that form linear combinations of the input variables.

• In particular, the problem of linear regression consists in finding a linear function

  y = f(x) = ⟨w, x⟩ + b

  that best fits a given set X of training points labelled from Y ⊆ R.

• Interpretation: deterministic

Page 11: Linear Learning Machines and - unict.it › ~gfarinella › LLM_SVM.pdf · Linear Learning Machine • Learning machines using hypotheses that form linear combination of input variables

y

f(x)=y

(xi, yi)

xGMF 2007

Page 12: Linear Learning Machines and - unict.it › ~gfarinella › LLM_SVM.pdf · Linear Learning Machine • Learning machines using hypotheses that form linear combination of input variables

“fitting”fitting

b• Errors are deviations of yi from  b+⋅ ixwy

*vi=yi – (wxi+b)

x

• The error can, of course, be defined in others ways.ways.

GMF 2007

Page 13: Least Squares

First used by Gauss in the 18th century for astronomical problems.

• Choose the line that minimises the sum of the squares of the distances from the training points:

  L(w, b) = Σ_{x_i ∈ S} (y_i − ⟨w, x_i⟩ − b)² = Σ_{x_i ∈ S} v_i²

[Figure: fitted line with intercept (0, b) and residuals v_i]

Page 14: Minimise the square loss function

• We can minimise L by differentiating with respect to the parameters (w, b) and setting the resulting linear expressions to zero.

  L(w, b) = Σ_{x_i ∈ S} (y_i − ⟨w, x_i⟩ − b)² = Σ_{x_i ∈ S} v_i²

• Matrix notation: let ŵ = (w, b), x̂_i = (x_i, 1), let X̂ be the matrix whose rows are the x̂_i', and let ŷ = (y_1, ..., y_n)'. The residual vector is v̂ = ŷ − X̂ŵ, with components v_i = y_i − ⟨w, x_i⟩ − b. Then

  L(ŵ) = (ŷ − X̂ŵ)'(ŷ − X̂ŵ) = v̂'v̂

  ∂L/∂ŵ = −2 X̂'ŷ + 2 X̂'X̂ŵ = 0

  X̂'X̂ŵ = X̂'ŷ     (normal equations)

• If det(X̂'X̂) ≠ 0, the inverse of X̂'X̂ exists, and the solution of the least squares problem is

  ŵ = (X̂'X̂)⁻¹ X̂'ŷ

• If X̂'X̂ is singular, the pseudo-inverse (obtained through SVD) can be used, or the technique of ridge regression can be applied.
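A minimal NumPy sketch of the least-squares solution above, following the slide's augmented notation X̂ = (x', 1), ŵ = (w, b). It is an illustration, not part of the original slides; np.linalg.lstsq is used instead of an explicit inverse for numerical stability, and the toy data at the end is an assumed example.

```python
import numpy as np

def fit_least_squares(X, y):
    """Solve the normal equations X'X w_hat = X'y for w_hat = (w, b).

    X: (n, d) matrix of training inputs, y: (n,) vector of targets.
    A column of ones is appended so the bias b is learned jointly with w.
    """
    X_hat = np.hstack([X, np.ones((X.shape[0], 1))])   # augmented data matrix
    w_hat, *_ = np.linalg.lstsq(X_hat, y, rcond=None)  # least-squares solution
    return w_hat[:-1], w_hat[-1]                       # (w, b)

# assumed usage example: recover a noisy linear relation y = 2*x1 - x2 + 0.5
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + 0.5 + 0.01 * rng.normal(size=100)
w, b = fit_least_squares(X, y)
print(w, b)
```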

Page 15: Ridge Regression

• The problem:
  – There is not enough data to ensure that the matrix X̂'X̂ is invertible (ill-conditioned problem).
  – Noise in the data: it is unwise to try to match the target output exactly.

• Regularization approach:
  – Restrict the choice of functions in some way.
  – Simplest regularizer: favour functions that have small norm → Ridge Regression.
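A minimal sketch of the ridge-regularized solution w = (X'X + λI)⁻¹ X'y. The regularization strength `lam` is an assumed free parameter (the slide does not fix one), and the bias term is omitted for brevity.

```python
import numpy as np

def fit_ridge(X, y, lam=1e-2):
    """Ridge regression: minimise ||y - X w||^2 + lam * ||w||^2.

    Adding lam * I makes X'X + lam*I invertible even when X'X is
    ill-conditioned, and shrinks w towards small norm.
    """
    d = X.shape[1]
    A = X.T @ X + lam * np.eye(d)
    return np.linalg.solve(A, X.T @ y)
```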

Page 16: Linear Regression: Dual Representation

If the inverse of X̂'X̂ exists,

  ŵ = (X̂'X̂)⁻¹ X̂'ŷ
     = (X̂'X̂)(X̂'X̂)⁻² X̂'ŷ
     = X̂' [ X̂(X̂'X̂)⁻² X̂'ŷ ]
     = X̂'α          with  α = X̂(X̂'X̂)⁻² X̂'ŷ

so that  ŵ = Σ_{i=1}^{n} α_i x̂_i.

The solution is a linear combination of the training points.

Page 17: Linear Regression: Dual Representation

  α = X̂ (X̂'X̂)⁻² X̂'ŷ
    = (X̂X̂')⁻¹ X̂ (X̂'X̂)⁻¹ X̂'ŷ
    = (X̂X̂')⁻¹ (X̂X̂')⁻¹ X̂X̂' ŷ
    = (X̂X̂')⁻¹ ŷ
    = G⁻¹ ŷ        with  G = X̂X̂'

Page 18: The Gram Matrix

  G = X̂X̂',   G_ij = ⟨x̂_i, x̂_j⟩

The dot product is a type of similarity measure that is of particular mathematical appeal.

Geometrical interpretation: it computes the cosine of the angle between the vectors x_i and x_j, provided they are normalized to length 1. Moreover, it allows the computation of the length of a vector x as sqrt(⟨x, x⟩), and of the distance between two vectors as the length of the difference vector. Therefore all geometrical constructions that can be formulated in terms of angles, lengths and distances can be carried out.
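An illustrative sketch of the geometrical interpretation above: length, distance and angle expressed purely through dot products. The function name is an assumption made for the example.

```python
import numpy as np

def geometry_from_dots(x, z):
    """Length, distance and cosine of the angle, all from dot products only."""
    length_x = np.sqrt(x @ x)                        # ||x|| = sqrt(<x, x>)
    dist = np.sqrt(x @ x - 2 * (x @ z) + z @ z)      # ||x - z||
    cos_angle = (x @ z) / (np.sqrt(x @ x) * np.sqrt(z @ z))
    return length_x, dist, cos_angle
```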

Page 19: Linear Regression: Dual Representation

  α = G⁻¹ ŷ

• Solving for α involves solving n linear equations in n unknowns: complexity O(n³) rather than O(d³).

• The dual representation depends on the Gram matrix of inner products of the training examples.

• Prediction at a novel example x:

  f(x) = ⟨ŵ, x̂⟩ = ⟨ Σ_{i=1}^{n} α_i x̂_i, x̂ ⟩ = Σ_{i=1}^{n} α_i ⟨x̂_i, x̂⟩ = Σ_{i=1}^{n} α_i k_i,   with  k_i = ⟨x̂_i, x̂⟩

  The information about a novel example x required by the predictive function is just the inner products between the training points and the new example. Evaluation complexity O(nd) rather than O(d).

• If the dimension of the feature space is larger than the number of training examples, solving the dual representation is more efficient.
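A minimal sketch of the dual solution α = G⁻¹ŷ and the prediction f(x) = Σ_i α_i ⟨x_i, x⟩ from these slides. Assumptions: the bias/augmentation is omitted for brevity, and a tiny ridge term is added to G so the linear solve stays well defined.

```python
import numpy as np

def fit_dual(X, y, lam=1e-8):
    """Dual linear regression: alpha = (G + lam*I)^{-1} y with G = X X'."""
    G = X @ X.T                                  # n x n Gram matrix of inner products
    return np.linalg.solve(G + lam * np.eye(len(y)), y)

def predict_dual(X_train, alpha, x_new):
    """f(x) = sum_i alpha_i <x_i, x>: only inner products with training points are needed."""
    k = X_train @ x_new                          # k_i = <x_i, x>
    return alpha @ k
```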

Page 20: Nonlinear feature spaces

• Linear regression addresses the problem of identifying relations between one selected variable and the input variables, where the relation is assumed linear.

• Often, however, the relations that are sought are nonlinear.

Page 21: Nonlinear feature spaces

• Key idea: map the input space into a new feature space in such a way that the sought relations can be represented in a linear form, and hence the linear regression algorithm described above will be able to detect them.

  Φ: x ∈ R^d → F ⊆ R^D,   D >> d

  Map the data to a higher dimensional space (the feature space) and perform linear regression in the embedded space.

• The choice of the map φ aims to convert the nonlinear relations into linear ones. Hence the map reflects our expectations about the relation y = f(x).

[Figure: input points x mapped through φ(·) into the feature space]

Page 22: Polynomial feature space

  Φ(x) = [x, x², x³]

• Linear fit in input space: f(x) = w₁x + b

• Linear fit in the feature space: f(x) = ⟨w, φ(x)⟩ + b = w₁x + w₂x² + w₃x³ + b
  (in the example shown: w₁ = 1/4, w₂ = 1/4, w₃ = −1/2, b = −2)
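A sketch of linear regression in the explicit cubic feature space Φ(x) = (x, x², x³) shown above. The fitted coefficients depend on the data and need not match the slide's example values; the helper names are assumptions.

```python
import numpy as np

def phi(x):
    """Explicit cubic feature map Phi(x) = (x, x^2, x^3) for scalar inputs."""
    x = np.asarray(x, dtype=float)
    return np.stack([x, x**2, x**3], axis=1)

def fit_poly(x, y):
    """Linear regression in the embedded space: f(x) = <w, Phi(x)> + b."""
    F = np.hstack([phi(x), np.ones((len(x), 1))])
    w_hat, *_ = np.linalg.lstsq(F, y, rcond=None)
    return w_hat[:-1], w_hat[-1]    # (w1, w2, w3), b
```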

Page 23: The effect of φ

• Recode our dataset S = {(x_i, y_i), i = 1, ..., n, y_i ∈ R} as S_φ = {(φ(x_i), y_i), i = 1, ..., n, y_i ∈ R}.

  G = X̂X̂',   G_ij = ⟨Φ(x_i), Φ(x_j)⟩,   k_i = ⟨Φ(x_i), Φ(x)⟩

• Although the primal form could be used, problems will arise if d (the dimension of the feature space) is very large, making the solution of the d×d system very expensive: O(d³) for training, O(d) for evaluation.

• If, on the other hand, we consider the dual solution, we have shown that all the information the algorithm needs is the inner products between data points in the feature space: O(n³ + n²d) for training, O(nd) for evaluation.

Page 24: Kernel-Induced Feature Spaces

• The inner products can sometimes be computed more efficiently as a direct function of the input variables, without explicitly computing the mapping Φ. In other words, the mapping step can be by-passed.

• A function that performs this direct computation is known as a kernel function.

Page 25: Kernel function

• A kernel is a function k that for all x, z of the input space satisfies

  k(x, z) = ⟨φ(x), φ(z)⟩

  where φ is a mapping from the input space to an (inner product) feature space F.

• Note: kernels make possible the use of feature spaces with an exponential or even infinite number of dimensions.

Page 26: An Example for φ(.) and K(.,.)

• Suppose φ(.) is given as follows [equation on the original slide].

• An inner product in the feature space is [equation on the original slide].

• So, if we define the kernel function as follows [equation on the original slide], there is no need to carry out φ(.) explicitly.

• This use of a kernel function to avoid carrying out φ(.) explicitly is known as the kernel trick.
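The slide's own equations did not survive extraction, so the following is a standard textbook instance of the kernel trick, offered as an assumed illustration rather than the slide's original example: the degree-2 polynomial map for 2-D inputs.

```python
import numpy as np

def phi(x):
    """One common choice of phi (an assumption, not necessarily the slide's):
    phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2) for x = (x1, x2)."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def k(x, z):
    """Kernel trick: <phi(x), phi(z)> equals (<x, z>)^2, so phi is never computed."""
    return (x @ z) ** 2

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert np.isclose(phi(x) @ phi(z), k(x, z))   # same inner product, no explicit mapping
```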

Page 27: Some Kernel Functions

• Polynomial: k(x_i, x_j) = (⟨x_i, x_j⟩ + 1)^d

• Radial basis functions: k(x_i, x_j) = exp(−||x_i − x_j||² / (2σ²))
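A small sketch of the two kernels listed above. The polynomial offset and degree, and the RBF width σ, are free parameters; the "+1" offset is one common parametrisation and is an assumption here.

```python
import numpy as np

def poly_kernel(xi, xj, d=2):
    """Polynomial kernel: k(xi, xj) = (<xi, xj> + 1)^d (one common parametrisation)."""
    return (xi @ xj + 1.0) ** d

def rbf_kernel(xi, xj, sigma=1.0):
    """Radial basis function kernel: k(xi, xj) = exp(-||xi - xj||^2 / (2 sigma^2))."""
    diff = xi - xj
    return np.exp(-(diff @ diff) / (2.0 * sigma ** 2))
```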

Page 28: Can be an inner product in an infinite dimensional space

Page 29: Mercer's Theorem

• The kernel matrix (Gram matrix) is symmetric positive definite (x'Gx ≥ 0).

• Any symmetric positive definite matrix can be regarded as a kernel matrix, that is, as an inner product matrix in some space.

• Given a function k, it is possible to verify that it is a kernel.

• More formally: every (semi) positive definite, symmetric function is a kernel, i.e. there exists a mapping Φ such that it is possible to write k(x₁, x₂) = ⟨Φ(x₁), Φ(x₂)⟩.
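A practical check suggested by the theorem, offered as an illustration (not a proof): on a finite sample, a valid kernel must yield a symmetric Gram matrix whose eigenvalues are all non-negative. The tolerance is an assumed numerical safeguard.

```python
import numpy as np

def is_psd_gram(K, tol=1e-10):
    """Empirical Mercer check on a finite Gram matrix:
    K must be symmetric and satisfy x'Kx >= 0 (all eigenvalues non-negative)."""
    if not np.allclose(K, K.T):
        return False
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))
```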

Page 30: Kernel Methods Framework

Take home message!

[Figure: the kernel methods framework diagram on the original slide]

Page 31: Support Vector Machines for classification and regression

Page 32: Let me start with the two-class classification problem...

  f(sheep_features) = y

• sheep_features is a vector in a dot product feature space
• y ∈ { −1, +1 }

Page 33: Hyperplane classifier

• Theorem: in the separable case there exists at least one choice of parameters w, b such that

  f(x) = ⟨w, x⟩ + b

  satisfies y_i f(x_i) > 0, i = 1...n.

• Distance from the hyperplane: r = f(x)/||w||. Decision function: sgn(f(x)). Decision boundary: f(x) = 0.

Page 34: How to choose the hyperplane? (Separable case)

[Figure: several separating hyperplanes w₁, ..., w₅ consistent with the training data]

Page 35: The simplest hyperplane classifier (separable case)

• Compute the mean of the two classes in the feature space.

• Assign a new point x to the class whose mean is closest to it.

[Figure: the two class means and the resulting decision hyperplane]


Page 38: The simplest hyperplane classifier (separable case)

This classifier is quite close to the type of learning machine that we are interested in. It is linear in the feature space, while in the input domain it is represented by a kernel expansion. It is example-based, in the sense that the kernels are centered on the training samples.

The main point where the more sophisticated techniques to be discussed later will deviate from the above classifier is in the selection of the examples that the kernels are centered on, and in the weight that is put on the individual kernels in the decision function.

In the feature space representation, this statement corresponds to saying that we will study all normal vectors w of the decision hyperplanes that can be represented as linear combinations of the training examples. For instance, we want to remove the influence of patterns that are very far from the decision boundary, either because we expect that they will not improve the generalization error of the decision function, or because we would like to reduce the computational cost of evaluating the decision function. The hyperplane will then only depend on a subset of training examples, called support vectors.
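A minimal sketch of the nearest-mean classifier described on the preceding slides: compute the two class means and assign a new point to the closer one, which amounts to a linear decision in feature space. The derivation of w and b from the means is an assumption spelled out in the comments.

```python
import numpy as np

def fit_nearest_mean(X_pos, X_neg):
    """Means of the two classes; the decision boundary is the set of points
    equidistant from them (a hyperplane in the feature space)."""
    return X_pos.mean(axis=0), X_neg.mean(axis=0)

def predict_nearest_mean(mu_pos, mu_neg, x):
    """Assign x to the class with the closest mean. Expanding the squared
    distances gives sign(<w, x> + b) with w = mu_pos - mu_neg and
    b = (||mu_neg||^2 - ||mu_pos||^2) / 2."""
    w = mu_pos - mu_neg
    b = 0.5 * (mu_neg @ mu_neg - mu_pos @ mu_pos)
    return np.sign(w @ x + b)
```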

Page 39: The simplest hyperplane classifier (non-separable case)

What happens if the classes are inseparable?

[Figure: overlapping classes that no hyperplane separates exactly]


Page 43: Special Case

Page 44: From the simplest hyperplane classifier...

[Figure: training examples labelled "Sheep Vector"]

Page 45: ... to the Support Vector Machine

Page 46: Nonlinear separating plane

• Linear classifiers: not very flexible/powerful?

• Can we do better?

Page 47: Nonlinear separating plane

[Figure on the original slide]

Page 48: Nonlinear separating plane

• The data has been mapped to a new, higher dimensional space.

• An alternative way to think about this: the data still lives in the original space, but the definition of distance or inner product has been changed (it is possible to use kernels to do this).

• Suddenly linear methods become non-linear.

Page 49: Multi-class classification: one-against-the-rest

• k classes

• Train k binary classifiers:
  – 1st class vs. classes 2...k
  – 2nd class vs. classes 1, 3...k
  – ...

• k decision functions:
  f₁(x) = ⟨w₁, x⟩ + b₁
  ...
  f_k(x) = ⟨w_k, x⟩ + b_k

Page 50: Multi-class classification: one-against-the-rest

• Prediction: argmax_j (⟨w_j, x⟩ + b_j)

• Reason: if x belongs to the 1st class, then we should have
  f₁(x) ≥ +1
  f₂(x) ≤ −1
  ...
  f_k(x) ≤ −1
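A minimal sketch of one-against-the-rest prediction: evaluate the k linear decision functions and take the argmax. The stacked weight matrix `W` and biases `b` are assumed to come from k separately trained binary classifiers.

```python
import numpy as np

def ovr_predict(W, b, x):
    """One-against-the-rest: return the class j maximising f_j(x) = <w_j, x> + b_j.

    W: (k, d) stacked weight vectors, b: (k,) biases from k binary classifiers.
    """
    scores = W @ x + b
    return int(np.argmax(scores))
```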

Page 51: Multi-class classification: one-against-one

• Train k(k−1)/2 binary classifiers: (1,2), (1,3), (1,4), ..., (1,k), (2,3), (2,4), ..., (k−1,k)

• If there are 4 classes → 6 binary classifiers

Page 52: Multi-class classification: one-against-one

• For a test point, evaluate all binary classifiers.

• Select the class with the largest number of votes.

• Decision values may be used as well.
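A minimal sketch of one-against-one prediction by majority vote over the k(k−1)/2 pairwise classifiers. The `pairwise` dictionary interface is an assumption made for the example.

```python
import numpy as np

def ovo_predict(pairwise, n_classes, x):
    """One-against-one: each pairwise classifier votes for one of its two classes;
    the class with the largest vote count wins.

    pairwise[(i, j)] is a function returning a positive value for class i and a
    negative value for class j (an assumed interface for the trained classifiers).
    """
    votes = np.zeros(n_classes, dtype=int)
    for (i, j), f in pairwise.items():
        votes[i if f(x) > 0 else j] += 1
    return int(np.argmax(votes))
```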

Page 53: Multi-class classification: "Decision Tree"

[Figure on the original slide]

Page 54: How to choose the hyperplane? (Separable case)

Back to the main two-class classification problem...

[Figure: candidate separating hyperplanes w₁, ..., w₅]

Page 55: How to choose the hyperplane? (Separable case)

[Figure: remaining candidate hyperplanes w₁, w₂, w₃]

Page 56: Maximal Margin Classifier (separable case)

• There may exist many solutions that separate the classes exactly.

• We should try to find the one that will give the smallest generalization error (the expectation value of the loss function).

• SVM approaches this problem through the concept of margin: the smallest distance between the decision boundary and any of the samples.

• In SVM the decision boundary is chosen to be the one for which the margin is maximized.

• The maximum margin solution can be motivated using statistical learning theory (the theoretical bound on the generalization error is minimized):

  R(w) = ∫ Loss(y, f(x; w)) p(x, y) dx dy ≤ R_emp(w) + VC_confidence

Page 57: Maximal Margin Classifier through SVM (separable case)

• The margin is defined as the perpendicular distance between the decision boundary and the closest of the data points (left figure). Maximizing the margin leads to a particular choice of decision boundary (right figure). The location of this boundary is determined by a subset of the data points, known as support vectors.

• We want the hyperplane such that

  argmax_{w,b} min{ ||x − x_i|| : x ∈ R^d, ⟨w, x⟩ + b = 0, i = 1...n }

Page 58: Maximal Margin Classifier through SVM (separable case)

  argmax_{w,b} min{ ||x − x_i|| : x ∈ R^d, ⟨w, x⟩ + b = 0, i = 1...n }

• Recall: the perpendicular distance of a point x from a hyperplane defined by f(x) = 0 is given by |f(x)|/||w||. The distance of a training point to the decision surface is therefore y_i f(x_i)/||w|| = y_i(⟨w, x_i⟩ + b)/||w||.

• The margin is given by the perpendicular distance to the closest point x_i from the training set, and we wish to optimize the parameters w and b in order to maximize this distance.

Page 59: Maximal Margin Classifier through SVM (separable case)

• Thus the maximum margin solution is found by solving

  argmax_{w,b} { (1/||w||) min_i [ y_i (⟨w, x_i⟩ + b) ] }

• Constraints: we are only interested in solutions for which all data points of the training set are correctly classified, so that y_i f(x_i) > 0, i = 1...n.

Page 60: Maximal Margin Classifier through SVM (separable case)

• Direct solution of this optimization problem would be very complex → convert it into an equivalent problem that is much easier to solve.

• Observation: if we rescale w → kw and b → kb, the distance from any point x_i to the decision surface, given by y_i f(x_i)/||w||, is unchanged.

• Rescaling w and b such that the point(s) closest to the decision surface satisfy y_i(⟨w, x_i⟩ + b) = 1, we obtain the canonical form of the decision hyperplane. In this case all data points of the training set will satisfy y_i(⟨w, x_i⟩ + b) ≥ 1.

• Margin = 1/||w||

Page 61: The Maximal Margin optimization problem (separable case)

• We want w, b such that

  ||w||²/2 = ⟨w, w⟩/2   is minimized

• Subject to the constraints

  y_i(⟨w, x_i⟩ + b) ≥ 1,   i = 1...n

• This is a quadratic programming problem, in which we are trying to minimize a quadratic function subject to a set of linear inequality constraints; this constrained optimization problem is dealt with by introducing Lagrange multipliers α_i ≥ 0.

• Lagrangian rule: for constraints of the form c_i ≥ 0, the constraint equations are multiplied by α_i ≥ 0 and subtracted from the objective function.

Page 62: The Maximal Margin optimization problem: primal form (separable case)

  L_P(w, b, α) = ||w||²/2 − Σ_{i=1}^{n} α_i [ y_i(⟨w, x_i⟩ + b) − 1 ]

The Lagrangian L_P(w, b, α) has to be minimized with respect to the primal variables w and b and maximized with respect to the dual variables α_i.

1) We must minimize L_P with respect to w and b and simultaneously require that the derivatives of L_P with respect to all the α_i vanish, all subject to the constraint α_i ≥ 0.

2) We can equivalently solve the following dual problem: maximize L_P subject to the constraints that the gradient of L_P with respect to w and b vanishes, and subject also to the constraints that α_i ≥ 0.

This particular dual formulation (called the Wolfe dual) has the property that the maximum of L_P subject to constraints 2) occurs at the same values of w, b and α as the minimum of L_P subject to constraints 1).

Page 63: The Maximal Margin optimization problem: dual form (separable case)

  L_P(w, b, α) = ||w||²/2 − Σ_{i=1}^{n} α_i [ y_i(⟨w, x_i⟩ + b) − 1 ]

• Requiring that the gradient of L_P with respect to w and b vanishes gives the conditions

  w = Σ_i α_i y_i x_i,   Σ_i α_i y_i = 0,   α_i ≥ 0,   i = 1...n

• Since these are equality constraints in the dual formulation, we can substitute them into L_P to give

  L_D(α) = Σ_{i=1}^{n} α_i − (1/2) Σ_{i,j=1}^{n} α_i α_j y_i y_j ⟨x_i, x_j⟩

Page 64: The Maximal Margin optimization problem: dual form (separable case)

The final SVM optimization problem, in the kernel-induced feature space, will be:

• Choose a kernel function k(x, x').

• Maximize

  L_D(α) = Σ_{i=1}^{n} α_i − (1/2) Σ_{i,j=1}^{n} α_i α_j y_i y_j k(x_i, x_j)

• Subject to the constraints

  Σ_i α_i y_i = 0,   α_i ≥ 0,   i = 1...n

Page 65: The decision function (separable case)

• We can use the parameters α* that solve the previous optimization problem to find the weight vector that realises the maximal margin hyperplane:

  w* = Σ_i α*_i y_i x_i        (or, in feature space, w* = Σ_i α*_i y_i Φ(x_i))

• The solution vector thus has an expansion in terms of a subset of the training patterns, namely those patterns whose α_i is non-zero, called Support Vectors. The support vectors lie on the margin.

  D(x) = sgn(f(x)) = sgn(⟨w*, x⟩ + b) = sgn( Σ_{i=1}^{n} α_i y_i ⟨x_i, x⟩ + b ) = sgn( Σ_{s ∈ Supports} α_s y_s ⟨x_s, x⟩ + b )

  D(x) = sgn( Σ_{s ∈ Supports} α_s y_s k(x_s, x) + b )

Page 66: What is the bias b? (Separable case)

• The value of b does not appear in the dual problem.

• Having solved the quadratic programming problem and found the values α*, we can determine the value of b by noting that any support vector x_i satisfies y_i f(x_i) = 1:

  f(x_i) = Σ_{s ∈ Supports} α_s y_s ⟨x_s, x_i⟩ + b

  y_i ( Σ_{s ∈ Supports} α_s y_s ⟨x_s, x_i⟩ + b ) = 1

  b = y_i − Σ_{s ∈ Supports} α_s y_s ⟨x_s, x_i⟩

• A numerically more stable solution is

  b = (1/|S|) Σ_{i ∈ S} ( y_i − Σ_{s ∈ S} α_s y_s ⟨x_s, x_i⟩ )
    = (1/|S|) Σ_{i ∈ S} ( y_i − Σ_{s ∈ S} α_s y_s k(x_i, x_s) )
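A minimal sketch that turns solved dual variables into predictions, following the two slides above: the bias uses the numerically stable average, and the decision function is sgn(Σ_s α_s y_s k(x_s, x) + b). The arguments (alpha, support indices, kernel matrix/function) are assumed to come from whatever optimizer solved the dual problem.

```python
import numpy as np

def compute_bias(alpha, y, K, support):
    """Stable bias: b = (1/|S|) * sum_{i in S} ( y_i - sum_{s in S} alpha_s y_s k(x_i, x_s) ).

    K is the kernel (Gram) matrix on the training set; `support` indexes
    the points with alpha_i > 0.
    """
    S = np.asarray(support)
    pred = K[np.ix_(S, S)] @ (alpha[S] * y[S])
    return float(np.mean(y[S] - pred))

def decision(alpha, y, X, support, kernel, b, x):
    """D(x) = sgn( sum_{s in S} alpha_s y_s k(x_s, x) + b )."""
    S = np.asarray(support)
    ks = np.array([kernel(X[i], x) for i in S])
    return np.sign(np.sum(alpha[S] * y[S] * ks) + b)
```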

Page 67: SVM toy examples (LIBSVM, separable case)

[Figure: decision boundaries from LIBSVM toy examples]

Page 68: Important Observations

• It is a quadratic optimization problem: convex, no local minima.

• Solvable in polynomial time.

Page 69: Soft Margin: slack variables (non-separable case)

• Relax the constraints: allowing the margin constraints to be violated → allow some of the training data points to be misclassified.

  y_i(⟨w, x_i⟩ + b) ≥ 1 − ξ_i,   ξ_i ≥ 0,   i = 1...n

[Figure: correctly classified points, points that lie inside the margin but on the correct side of the decision boundary, and misclassified points]

Page 70: Soft Margin optimization problem (non-separable case)

• We want w, b such that

  ⟨w, w⟩/2 + C Σ_i ξ_i   is minimized

• Subject to the constraints

  y_i(⟨w, x_i⟩ + b) ≥ 1 − ξ_i,   i = 1...n

• Observations:
  – The solution can be obtained through the dual form.
  – Kernels can be used.
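A hedged usage example of a soft-margin SVM with scikit-learn, whose SVC class wraps LIBSVM (the library behind the toy examples on the neighbouring slides). The slides themselves do not prescribe a library; the data, C and gamma values here are assumptions. C trades margin size against the slack penalties Σ_i ξ_i.

```python
import numpy as np
from sklearn.svm import SVC

# toy non-separable data (assumed example): two overlapping Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(+1.0, 1.0, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

# soft-margin SVM with an RBF kernel; larger C penalises slack more heavily
clf = SVC(C=1.0, kernel="rbf", gamma=0.5).fit(X, y)
print(len(clf.support_), clf.score(X, y))   # number of support vectors, training accuracy
```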

Page 71: SVM toy example (non-separable case)

[Figure on the original slide]

Page 72: SVM for Regression: error function

• Linear regression error function vs. ε-insensitive error function:

  E_ε(x, y) = 0                  if |f(x) − y| < ε
  E_ε(x, y) = |f(x) − y| − ε     otherwise

• Regularized objective:

  (λ/2)⟨w, w⟩ + C Σ_i E(z_i)

[Figure: the quadratic error of linear regression compared with the ε-insensitive error]

Page 73: SVM for Regression: ε-tube

• ξ > 0 corresponds to a point for which y > f(x) + ε; ξ̂ > 0 corresponds to a point for which y < f(x) − ε.

• Corresponding constraints:

  f(x_i) − ε − ξ̂_i ≤ y_i ≤ f(x_i) + ε + ξ_i

• Objective function for the SVM:

  ⟨w, w⟩/2 + C Σ_i (ξ_i + ξ̂_i)

[Figure: the ε-tube around the regression function, with slack variables for points outside the tube]

Page 74: SVM for Regression: ε-tube

[Figure: predicted regression curve with the ε-tube around it]

Page 75: Learning a Support Vector Machine

• The naive solution is a simple on-line algorithm for gradient ascent: evaluate the gradient for just one pattern at a time, and hence update a single component α_i by the increment η ∂L(α^t)/∂α_i:

  α_i^{t+1} = α_i^t + η ( 1 − y_i Σ_j α_j^t y_j k(x_i, x_j) )
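A minimal sketch of the naive on-line gradient ascent update above, applied one pattern at a time on the kernel matrix K. The learning rate η and the clipping at zero (added so the constraint α_i ≥ 0 keeps holding) are assumptions not fixed by the slide.

```python
import numpy as np

def gradient_ascent_step(alpha, y, K, i, eta=0.1):
    """One update of the dual objective for pattern i:
    alpha_i <- alpha_i + eta * (1 - y_i * sum_j alpha_j y_j k(x_i, x_j)),
    then clip at zero (an added projection) to respect alpha_i >= 0."""
    grad_i = 1.0 - y[i] * np.sum(alpha * y * K[i])
    alpha[i] = max(0.0, alpha[i] + eta * grad_i)
    return alpha

def train(y, K, eta=0.1, epochs=100):
    """Sweep repeatedly over the training patterns, updating one alpha at a time."""
    alpha = np.zeros(len(y))
    for _ in range(epochs):
        for i in range(len(y)):
            alpha = gradient_ascent_step(alpha, y, K, i, eta)
    return alpha
```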

Page 76: References

• Papers:
  – C. J. C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 2, 121-167, 1998
  – Bernhard Schölkopf, Statistical Learning and Kernel Methods, Technical Report MSR-TR-2000-23

• Books:
  – N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, 2000
  – John Shawe-Taylor, Nello Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, 2004
  – Bernhard Schölkopf, Alex Smola, Learning with Kernels, MIT Press, Cambridge, MA, 2002
  – A. Papoulis, Probability and Statistics, Prentice Hall, 1990
  – Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006
  – Duda, Hart, Stork, Pattern Classification, Wiley Interscience, 2001

• Tutorials:
  – Chih-Jen Lin, Support Vector Machines, talk at the Machine Learning Summer School, 2006
  – Nello Cristianini, Support Vector Machines and Kernel Methods, ICML 2001
  – Colin Campbell, Support Vector Machines and Kernel Methods, The Analysis of Patterns, 2006

• Slides of courses:
  – Zoubin Ghahramani, Machine Learning Course (4F13), Department of Engineering, University of Cambridge, 2006