
Non-linear Support Vector Machines

Andrea Passerini (passerini@disi.unitn.it)

Machine Learning


Non-linear Support Vector Machines

Non-linearly separable problems
- Hard-margin SVM can address linearly separable problems
- Soft-margin SVM can address linearly separable problems with outliers
- Non-linearly separable problems need a higher expressive power (i.e. more complex feature combinations)
- We do not want to lose the advantages of linear separators (i.e. large margin, theoretical guarantees)

Solution
- Map input examples into a higher dimensional feature space
- Perform linear classification in this higher dimensional space


Non-linear Support Vector Machines

feature map

Φ : X → H

- Φ is a function mapping each example to a higher dimensional space H
- Examples x are replaced with their feature mapping Φ(x)

- The feature mapping should increase the expressive power of the representation (e.g. introducing features which are combinations of input features)
- Examples should be (approximately) linearly separable in the mapped space
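For instance (a minimal illustration, not from the slides), the XOR-like dataset below is not linearly separable in its two input features, but becomes perfectly separable once a single product feature x1·x2 is appended:

```python
import numpy as np

# XOR-like data: no line in the (x1, x2) plane separates the two classes.
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

# Feature map Phi(x) = (x1, x2, x1*x2): append one product feature.
Phi = np.hstack([X, (X[:, 0] * X[:, 1]).reshape(-1, 1)])

# In the mapped space the hyperplane w = (0, 0, -1), w0 = 0 separates perfectly.
w, w0 = np.array([0.0, 0.0, -1.0]), 0.0
print(np.sign(Phi @ w + w0))   # [-1.  1.  1. -1.] == y
```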


Feature map

Homogeneous (d = 2):

Φ( (x1, x2) ) = (x1^2, x1 x2, x2^2)

Inhomogeneous (d = 2):

Φ( (x1, x2) ) = (x1, x2, x1^2, x1 x2, x2^2)

Polynomial mapping

Maps features to all possible conjunctions (i.e. products) of features:

1. of a certain degree d (homogeneous mapping)
2. up to a certain degree d (inhomogeneous mapping)
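A small sketch of these two degree-2 maps for 2-dimensional inputs (function names are mine, purely illustrative):

```python
import numpy as np

def phi_homogeneous_d2(x):
    """Homogeneous degree-2 map: all products of exactly two input features."""
    x1, x2 = x
    return np.array([x1 ** 2, x1 * x2, x2 ** 2])

def phi_inhomogeneous_d2(x):
    """Inhomogeneous degree-2 map: all products of up to two input features."""
    x1, x2 = x
    return np.array([x1, x2, x1 ** 2, x1 * x2, x2 ** 2])

print(phi_homogeneous_d2(np.array([2.0, 3.0])))     # [4. 6. 9.]
print(phi_inhomogeneous_d2(np.array([2.0, 3.0])))   # [2. 3. 4. 6. 9.]
```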


Feature map

[Figure: illustration of the feature map Φ from input space to feature space]


Non-linear Support Vector Machines

Linear separation in feature space

The SVM algorithm is applied by simply replacing x with Φ(x):

f(x) = w^T Φ(x) + w0

A linear separation (i.e. a hyperplane) in feature space corresponds to a non-linear separation in input space, e.g.:

f( (x1, x2) ) = sgn(w1 x1^2 + w2 x1 x2 + w3 x2^2 + w0)
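As a concrete illustration (not from the slides, and assuming scikit-learn is available), a linear SVM trained on the homogeneous degree-2 features separates two concentric circles, a dataset that no hyperplane can separate in the original input space:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import LinearSVC

# Two concentric rings: not linearly separable in (x1, x2).
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Homogeneous degree-2 feature map: (x1^2, x1*x2, x2^2).
Phi = np.c_[X[:, 0] ** 2, X[:, 0] * X[:, 1], X[:, 1] ** 2]

# A linear separator in feature space ...
clf = LinearSVC(C=1.0).fit(Phi, y)
print("training accuracy:", clf.score(Phi, y))   # ~1.0

# ... is a quadratic (here roughly circular) boundary back in input space:
# f(x) = sgn(w1*x1^2 + w2*x1*x2 + w3*x2^2 + w0)
w, w0 = clf.coef_[0], clf.intercept_[0]
x_new = np.array([0.1, 0.0])                     # point near the centre
print(np.sign(w @ [x_new[0] ** 2, x_new[0] * x_new[1], x_new[1] ** 2] + w0))
```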


Support Vector Regression

Rationale
- Retain the combination of regularization and data fitting
- Regularization means smoothness (i.e. smaller weights, lower complexity) of the learned function
- Use a sparsifying loss to have few support vectors


Support Vector Regression

ε-insensitive loss

ℓ(f(x), y) = |y − f(x)|_ε =
  { 0                 if |y − f(x)| ≤ ε
  { |y − f(x)| − ε    otherwise

- Tolerate small (up to ε) deviations from the true value (i.e. no penalty)
- Defines an ε-tube of insensitiveness around the true values
- This also allows trading off function complexity with data fitting (by tuning the ε value)
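A direct sketch of this loss (illustrative only):

```python
import numpy as np

def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    """|y - f(x)|_eps: zero inside the eps-tube, linear outside it."""
    return np.maximum(np.abs(y_true - y_pred) - eps, 0.0)

print(eps_insensitive_loss(np.array([1.0, 1.0, 1.0]),
                           np.array([1.05, 1.30, 0.80]), eps=0.1))
# [0.  0.2 0.1]
```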


Support Vector Regression

Optimization problem

min_{w∈X, w0∈ℝ, ξ,ξ*∈ℝ^m}   1/2 ||w||^2 + C Σ_{i=1}^m (ξi + ξi*)

subject to   w^T Φ(xi) + w0 − yi ≤ ε + ξi
             yi − (w^T Φ(xi) + w0) ≤ ε + ξi*
             ξi, ξi* ≥ 0

Note
- Two constraints for each example, for the upper and lower sides of the tube
- Slack variables ξi, ξi* penalize predictions out of the ε-insensitive tube
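In practice this primal is rarely solved by hand. As a usage sketch (assuming scikit-learn is available), sklearn.svm.SVR exposes the same C and ε trade-off and internally solves a dual of the form derived on the following slides:

```python
import numpy as np
from sklearn.svm import SVR

# Noisy samples of a smooth 1-D function.
rng = np.random.default_rng(0)
X = np.linspace(0, 4, 40).reshape(-1, 1)
y = np.sin(X).ravel() + 0.05 * rng.standard_normal(40)

# C controls the cost of leaving the eps-tube, epsilon is the tube half-width.
model = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print("support vectors:", len(model.support_))   # only points on or outside the tube border
print(model.predict([[1.5]]))
```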


Support Vector Regression

Lagrangian
We include constraints in the minimization function using Lagrange multipliers (αi, αi*, βi, βi* ≥ 0):

L = 1/2 ||w||^2 + C Σ_{i=1}^m (ξi + ξi*) − Σ_{i=1}^m (βi ξi + βi* ξi*)
    − Σ_{i=1}^m αi (ε + ξi + yi − w^T Φ(xi) − w0)
    − Σ_{i=1}^m αi* (ε + ξi* − yi + w^T Φ(xi) + w0)


Support Vector Regression

Dual formulation
Setting the derivatives with respect to the primal variables to zero we obtain:

∂L/∂w = w − Σ_{i=1}^m (αi* − αi) Φ(xi) = 0   →   w = Σ_{i=1}^m (αi* − αi) Φ(xi)

∂L/∂w0 = Σ_{i=1}^m (αi − αi*) = 0

∂L/∂ξi = C − αi − βi = 0   →   αi ∈ [0, C]

∂L/∂ξi* = C − αi* − βi* = 0   →   αi* ∈ [0, C]


Support Vector Regression

Dual formulation

Substituting in the Lagrangian we get:

1/2 Σ_{i,j=1}^m (αi* − αi)(αj* − αj) Φ(xi)^T Φ(xj)   [the double sum is ||w||^2]

+ Σ_{i=1}^m ξi (C − βi − αi)   [= 0]   + Σ_{i=1}^m ξi* (C − βi* − αi*)   [= 0]

− ε Σ_{i=1}^m (αi + αi*) + Σ_{i=1}^m yi (αi* − αi) + w0 Σ_{i=1}^m (αi − αi*)   [= 0]

− Σ_{i,j=1}^m (αi* − αi)(αj* − αj) Φ(xi)^T Φ(xj)


Support Vector Regression

Dual formulation

max_{α∈ℝ^m}   − 1/2 Σ_{i,j=1}^m (αi* − αi)(αj* − αj) Φ(xi)^T Φ(xj)
              − ε Σ_{i=1}^m (αi* + αi) + Σ_{i=1}^m yi (αi* − αi)

subject to   Σ_{i=1}^m (αi − αi*) = 0
             αi, αi* ∈ [0, C]   ∀ i ∈ [1, m]

Regression function

f(x) = w^T Φ(x) + w0 = Σ_{i=1}^m (αi* − αi) Φ(xi)^T Φ(x) + w0
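A minimal numerical sketch of this dual (illustrative only): a linear kernel on a tiny 1-D dataset, solved with a general-purpose SciPy optimizer rather than a dedicated SVM solver; the bias w0 is recovered from the KKT conditions stated on the next slide, and all names are mine:

```python
import numpy as np
from scipy.optimize import minimize

# Tiny 1-D dataset, linear kernel: Phi(x) = x, so Phi(xi)^T Phi(xj) = xi*xj.
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * X - 1.0 + np.array([0.05, -0.03, 0.02, -0.04, 0.01])
m, C, eps = len(X), 10.0, 0.1
K = np.outer(X, X)

def neg_dual(z):
    a, a_star = z[:m], z[m:]            # alpha_i, alpha*_i
    d = a_star - a
    return 0.5 * d @ K @ d + eps * np.sum(a + a_star) - y @ d

res = minimize(neg_dual, np.zeros(2 * m),
               bounds=[(0.0, C)] * (2 * m),
               constraints={"type": "eq",
                            "fun": lambda z: np.sum(z[:m] - z[m:])})
a, a_star = res.x[:m], res.x[m:]
d = a_star - a

# Recover w0 from the KKT conditions on an (approximately) unbound support vector.
unb = (a > 1e-6) & (a < C - 1e-6)
unb_star = (a_star > 1e-6) & (a_star < C - 1e-6)
if unb.any():
    i = int(np.argmax(unb));      w0 = y[i] + eps - d @ K[:, i]
else:
    i = int(np.argmax(unb_star)); w0 = y[i] - eps - d @ K[:, i]

f = lambda x: d @ (X * x) + w0          # f(x) = sum_i (alpha*_i - alpha_i) xi*x + w0
print(f(2.5))                           # close to 2*2.5 - 1 = 4
```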


Support Vector Regression

Karush-Kuhn-Tucker conditions (KKT)
At the saddle point it holds that for all i:

αi (ε + ξi + yi − w^T Φ(xi) − w0) = 0
αi* (ε + ξi* − yi + w^T Φ(xi) + w0) = 0
(C − αi) ξi = 0
(C − αi*) ξi* = 0


Support Vector Regression

Support Vectors

- All patterns within the ε-tube, for which |f(xi) − yi| < ε, have αi, αi* = 0 and thus do not contribute to the estimated function f.
- Patterns for which either 0 < αi < C or 0 < αi* < C are on the border of the ε-tube, that is |f(xi) − yi| = ε. They are the unbound support vectors.
- The remaining training patterns are margin errors (either ξi > 0 or ξi* > 0), and reside outside of the ε-insensitive region. They are bound support vectors, with corresponding αi = C or αi* = C.
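This classification of the training patterns can be read directly off the dual variables; a small sketch (function name and tolerance are mine):

```python
import numpy as np

def classify_patterns(alpha, alpha_star, C, tol=1e-6):
    """Bucket SVR training patterns by their dual variables."""
    in_tube = (alpha < tol) & (alpha_star < tol)             # strictly inside the eps-tube
    bound   = (alpha > C - tol) | (alpha_star > C - tol)     # margin errors, outside the tube
    unbound = ~in_tube & ~bound                              # exactly on the tube border
    return in_tube, unbound, bound

alpha      = np.array([0.0, 3.2, 10.0, 0.0])
alpha_star = np.array([0.0, 0.0, 0.0, 10.0])
print(classify_patterns(alpha, alpha_star, C=10.0))
```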


Support Vectors


Support Vector Regression: example for decreasing ε


Smallest Enclosing Hypersphere

Rationale
- Characterize a set of examples by defining boundaries enclosing them
- Find the smallest hypersphere in feature space enclosing the data points
- Account for outliers by paying a cost for leaving examples out of the sphere

Usage

- One-class classification: model a class when no negative examples exist
- Anomaly/novelty detection: detect test data falling outside of the sphere and return them as novel/anomalous (e.g. intrusion detection systems, monitoring of Alzheimer's patients)


Smallest Enclosing Hypersphere

Optimization problem

min_{R∈ℝ, o∈H, ξ∈ℝ^m}   R^2 + C Σ_{i=1}^m ξi

subject to   ||Φ(xi) − o||^2 ≤ R^2 + ξi   i = 1, . . . , m
             ξi ≥ 0                        i = 1, . . . , m

Note
- o is the center of the sphere
- R is the radius, which is minimized
- slack variables ξi gather the costs for outliers


Smallest Enclosing Hypersphere

Lagrangian (αi, βi ≥ 0)

L = R^2 + C Σ_{i=1}^m ξi − Σ_{i=1}^m αi (R^2 + ξi − ||Φ(xi) − o||^2) − Σ_{i=1}^m βi ξi

Setting the derivatives with respect to the primal variables to zero:

∂L/∂R = 2R (1 − Σ_{i=1}^m αi) = 0   →   Σ_{i=1}^m αi = 1

∂L/∂o = 2 Σ_{i=1}^m αi (Φ(xi) − o)(−1) = 0   →   o Σ_{i=1}^m αi = Σ_{i=1}^m αi Φ(xi)   →   o = Σ_{i=1}^m αi Φ(xi)
        (using Σ_{i=1}^m αi = 1)

∂L/∂ξi = C − αi − βi = 0   →   αi ∈ [0, C]


Smallest Enclosing Hypersphere

Dual formulation

Substituting o = Σ_{i=1}^m αi Φ(xi) into the Lagrangian we get:

R^2 (1 − Σ_{i=1}^m αi)   [= 0]

+ Σ_{i=1}^m ξi (C − αi − βi)   [= 0]

+ Σ_{i=1}^m αi (Φ(xi) − Σ_{j=1}^m αj Φ(xj))^T (Φ(xi) − Σ_{h=1}^m αh Φ(xh))

where both inner sums are the center o.


Smallest Enclosing Hypersphere

Dual formulation

Σ_{i=1}^m αi (Φ(xi) − Σ_{j=1}^m αj Φ(xj))^T (Φ(xi) − Σ_{h=1}^m αh Φ(xh)) =

= Σ_{i=1}^m αi Φ(xi)^T Φ(xi) − Σ_{i=1}^m αi Φ(xi)^T Σ_{h=1}^m αh Φ(xh)

  − Σ_{i=1}^m αi Σ_{j=1}^m αj Φ(xj)^T Φ(xi) + (Σ_{i=1}^m αi) Σ_{j=1}^m αj Φ(xj)^T Σ_{h=1}^m αh Φ(xh)   [Σ_{i=1}^m αi = 1]

= Σ_{i=1}^m αi Φ(xi)^T Φ(xi) − Σ_{i=1}^m αi Φ(xi)^T Σ_{j=1}^m αj Φ(xj)


Smallest Enclosing Hypersphere

Dual formulation

max_{α∈ℝ^m}   Σ_{i=1}^m αi Φ(xi)^T Φ(xi) − Σ_{i,j=1}^m αi αj Φ(xi)^T Φ(xj)

subject to   Σ_{i=1}^m αi = 1,   0 ≤ αi ≤ C,   i = 1, . . . , m.

Distance function
The distance of a point from the center o is:

R^2(x) = ||Φ(x) − o||^2

       = (Φ(x) − Σ_{i=1}^m αi Φ(xi))^T (Φ(x) − Σ_{j=1}^m αj Φ(xj))

       = Φ(x)^T Φ(x) − 2 Σ_{i=1}^m αi Φ(xi)^T Φ(x) + Σ_{i,j=1}^m αi αj Φ(xi)^T Φ(xj)
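A minimal numerical sketch of this dual and of R^2(x) (illustrative only): a linear kernel on a small 2-D dataset with one outlier, solved with a general-purpose SciPy optimizer; all names are mine:

```python
import numpy as np
from scipy.optimize import minimize

# Small 2-D cloud with one far-away outlier; linear kernel K(x, z) = x^T z.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(30, 2)), [[4.0, 4.0]]])
m, C = len(X), 0.1
K = X @ X.T

def neg_dual(a):                      # maximize sum_i a_i*K_ii - a^T K a
    return -(a @ np.diag(K) - a @ K @ a)

res = minimize(neg_dual, np.full(m, 1.0 / m),
               bounds=[(0.0, C)] * m,
               constraints={"type": "eq", "fun": lambda a: np.sum(a) - 1.0})
alpha = res.x

def R2(x):                            # squared distance of Phi(x) from the center o
    k_x = X @ x
    return x @ x - 2.0 * alpha @ k_x + alpha @ K @ alpha

print(R2(np.array([0.0, 0.0])), R2(np.array([4.0, 4.0])))   # small vs. large
```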


Smallest Enclosing Hypersphere

Karush-Kuhn-Tucker conditions (KKT)
At the saddle point it holds that for all i:

βi ξi = 0
αi (R^2 + ξi − ||Φ(xi) − o||^2) = 0

Support vectors
- Unbound support vectors (0 < αi < C), whose images lie on the surface of the enclosing sphere.
- Bound support vectors (αi = C), whose images lie outside of the enclosing sphere; these correspond to outliers.
- All other points (αi = 0), whose images lie inside the enclosing sphere.


Smallest Enclosing Hypersphere


Smallest Enclosing Hypersphere

Decision function
The radius R* of the enclosing sphere can be computed using the distance function on any unbound support vector x:

R^2(x) = Φ(x)^T Φ(x) − 2 Σ_{i=1}^m αi Φ(xi)^T Φ(x) + Σ_{i,j=1}^m αi αj Φ(xi)^T Φ(xj)

A decision function for novelty detection could be:

f(x) = sgn( R^2(x) − (R*)^2 )

i.e. positive if the example lies outside of the sphere and negative otherwise
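A short sketch of this decision rule (illustrative only; the kernel, the uniform α values and the choice of support vector are mine, not a worked solution of the dual):

```python
import numpy as np

def squared_distance(x, X, alpha, kernel=np.dot):
    """R^2(x) = ||Phi(x) - o||^2, expressed through kernel evaluations only."""
    k_x = np.array([kernel(xi, x) for xi in X])
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    return kernel(x, x) - 2.0 * alpha @ k_x + alpha @ K @ alpha

def novelty_decision(x, X, alpha, r_star_sq, kernel=np.dot):
    """+1 if Phi(x) lies outside the sphere of squared radius r_star_sq, -1 otherwise."""
    return np.sign(squared_distance(x, X, alpha, kernel) - r_star_sq)

# Toy usage: three training points with uniform alphas; the radius is read off one
# point treated as an unbound support vector.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
alpha = np.full(3, 1.0 / 3.0)
r_star_sq = squared_distance(X[1], X, alpha)
print(novelty_decision(np.array([3.0, 3.0]), X, alpha, r_star_sq))   #  1.0 (novel)
print(novelty_decision(np.array([0.3, 0.3]), X, alpha, r_star_sq))   # -1.0 (inside)
```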


Support Vector Ranking

Rationale
- Order examples by relevance (e.g. email urgency, movie rating)
- Learn a scoring function predicting the quality of an example
- Constrain the function to score xi higher than xj if xi is more relevant (pairwise comparisons for training)
- Easily formalized as a support vector classification task


Support Vector Ranking

Optimization problem

min_{w∈X, w0∈ℝ, ξij∈ℝ}   1/2 ||w||^2 + C Σ_{i,j} ξij

subject to   w^T Φ(xi) − w^T Φ(xj) ≥ 1 − ξij
             ξij ≥ 0
             ∀ i, j : xi ≺ xj

Note
- There is one constraint for each pair of examples having ordering information (xi ≺ xj means the former comes first in the ranking)
- Examples should be correctly ordered with a large margin


Support Vector Ranking

Support vector classification on pairs

min_{w∈X, w0∈ℝ, ξij∈ℝ}   1/2 ||w||^2 + C Σ_{i,j} ξij

subject to   yij w^T (Φ(xi) − Φ(xj)) ≥ 1 − ξij
             ξij ≥ 0
             ∀ i, j : xi ≺ xj

where Φ(xi) − Φ(xj) plays the role of the pair representation Φ(xij) and the labels are always positive: yij = 1
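A minimal sketch of this reduction (illustrative only, assuming scikit-learn). One common practical trick, not stated on the slide, is to also add each reversed pair with label −1, which is equivalent and lets an off-the-shelf binary SVM be trained; the resulting decision function is then used as the ranking score:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy items with 2 features; relevance encodes the desired ordering.
X = np.array([[0.2, 1.0], [0.5, 0.4], [0.9, 0.1], [0.4, 0.8]])
relevance = np.array([3, 2, 1, 3])

# Build difference vectors Phi(xi) - Phi(xj) for every ordered pair xi ≺ xj
# (here Phi is the identity); add each pair in both directions with labels +1 / -1.
pairs, labels = [], []
for i in range(len(X)):
    for j in range(len(X)):
        if relevance[i] > relevance[j]:
            pairs.append(X[i] - X[j]); labels.append(1)
            pairs.append(X[j] - X[i]); labels.append(-1)

# Unbiased linear SVM on the pair differences (the ranking problem has no bias term).
clf = LinearSVC(C=1.0, fit_intercept=False).fit(np.array(pairs), np.array(labels))

# The decision function f(x) = w^T x is the ranking score.
scores = clf.decision_function(X)
print(np.argsort(-scores))   # indices from most to least relevant
```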


Support Vector Ranking

Decision function

f(x) = w^T Φ(x)

- Standard support vector classification function (unbiased)
- Represents the score of an example, used to rank it

