TRANSCRIPT
Learning Kernels - Tutorial. Part I: Introduction to Kernel Methods.
Corinna Cortes, Google Research
Mehryar Mohri, Courant Institute & Google Research
Afshin Rostami, UC Berkeley
Outline
Part I: Introduction to kernel methods.
Part II: Learning kernel algorithms.
Part III: Theoretical guarantees.
Part IV: Software tools.
Binary Classification Problem
Training data: sample drawn i.i.d. from set $X \subseteq \mathbb{R}^N$ according to some distribution $D$,
$$S = ((x_1, y_1), \ldots, (x_m, y_m)) \in X \times \{-1, +1\}.$$
Problem: find hypothesis $h\colon X \to \{-1, +1\}$ in $H$ (classifier) with small generalization error $R_D(h)$.
Linear classification:
• Hypotheses based on hyperplanes.
• Linear separation in high-dimensional space.
Linear Separation
Classifiers: $H = \{x \mapsto \operatorname{sgn}(w \cdot x + b) : w \in \mathbb{R}^N, b \in \mathbb{R}\}$.
[Figure: two samples of points separated by hyperplanes $w \cdot x + b = 0$.]
Optimal Hyperplane: Max. Margin
Canonical hyperplane: $w$ and $b$ chosen such that $|w \cdot x + b| = 1$ for the closest points (Vapnik and Chervonenkis, 1964).
Margin: $\rho = \min_{x \in S} \frac{|w \cdot x + b|}{\|w\|} = \frac{1}{\|w\|}.$
[Figure: maximum-margin hyperplane $w \cdot x + b = 0$ with marginal hyperplanes $w \cdot x + b = +1$ and $w \cdot x + b = -1$; the margin is the distance from the closest points to the separating hyperplane.]
Optimization Problem
Constrained optimization:
$$\min_{w, b} \frac{1}{2}\|w\|^2 \quad \text{subject to } y_i(w \cdot x_i + b) \geq 1, \; i \in [1, m].$$
Properties:
• Convex optimization (strictly convex).
• Unique solution for linearly separable sample.
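To make the formulation concrete, here is a minimal sketch (not part of the tutorial) that hands the primal problem above to the generic convex solver cvxpy; the toy sample is hypothetical.

```python
import numpy as np
import cvxpy as cp

# Hypothetical linearly separable toy sample.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# min (1/2)||w||^2  subject to  y_i (w . x_i + b) >= 1.
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, " b =", b.value)
print("margin rho = 1/||w|| =", 1 / np.linalg.norm(w.value))
```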
Support Vector Machines
Problem: data often not linearly separable in practice. For any hyperplane, there exists $x_i$ such that
$$y_i [w \cdot x_i + b] \not\geq 1.$$
Idea: relax the constraints using slack variables $\xi_i \geq 0$:
$$y_i [w \cdot x_i + b] \geq 1 - \xi_i.$$
(Cortes and Vapnik, 1995)
Soft-Margin Hyperplanes
Soft margin: $\rho = 1/\|w\|$.
Support vectors: points along the margin or outliers.
[Figure: hyperplane $w \cdot x + b = 0$ with marginal hyperplanes $w \cdot x + b = +1$ and $w \cdot x + b = -1$; slack variables $\xi_i$, $\xi_j$ measure the margin violations of two points.]
Optimization Problem
Constrained optimization:
$$\min_{w, b, \xi} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i \quad \text{subject to } y_i(w \cdot x_i + b) \geq 1 - \xi_i \wedge \xi_i \geq 0, \; i \in [1, m].$$
Properties:
• $C \geq 0$: trade-off parameter.
• Convex optimization (strictly convex).
• Unique solution.
(Cortes and Vapnik, 1995)
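The same solver handles the soft-margin version; in this sketch (again an illustration, with a hypothetical mislabeled point added to the sample) the slack values identify the margin violators.

```python
import numpy as np
import cvxpy as cp

# Toy sample with one mislabeled point, so it is not linearly separable.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0], [2.5, 2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0, -1.0])
C = 1.0  # trade-off parameter

m = len(y)
w, b, xi = cp.Variable(2), cp.Variable(), cp.Variable(m)

# min (1/2)||w||^2 + C sum_i xi_i  s.t.  y_i (w . x_i + b) >= 1 - xi_i, xi_i >= 0.
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()

print("w =", w.value, " b =", b.value)
print("slacks =", np.round(xi.value, 3))  # nonzero only for margin violators
```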
Dual Optimization Problem
Constrained optimization:
$$\max_{\alpha} \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \quad \text{subject to: } \alpha_i \geq 0 \wedge \sum_{i=1}^{m} \alpha_i y_i = 0, \; i \in [1, m].$$
Solution:
$$h(x) = \operatorname{sgn}\Big(\sum_{i=1}^{m} \alpha_i y_i (x_i \cdot x) + b\Big), \quad \text{with } b = y_i - \sum_{j=1}^{m} \alpha_j y_j (x_j \cdot x_i) \text{ for any SV } x_i.$$
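For illustration, the dual is a quadratic program that any general-purpose constrained optimizer can handle; this sketch (not from the tutorial) uses scipy's SLSQP on a hypothetical separable sample and recovers $w$ and $b$ from the dual variables.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical linearly separable toy sample.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)

G = (y[:, None] * X) @ (y[:, None] * X).T    # G_ij = y_i y_j (x_i . x_j)

def neg_dual(alpha):                          # negate to maximize via a minimizer
    return 0.5 * alpha @ G @ alpha - alpha.sum()

res = minimize(neg_dual, np.zeros(m),
               bounds=[(0, None)] * m,                              # alpha_i >= 0
               constraints={"type": "eq", "fun": lambda a: a @ y})  # sum_i alpha_i y_i = 0
alpha = res.x

w = (alpha * y) @ X                           # w = sum_i alpha_i y_i x_i
sv = int(np.argmax(alpha))                    # index of a support vector
b = y[sv] - w @ X[sv]
print("alpha =", np.round(alpha, 3), " w =", w, " b =", b)
```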
Kernel Methods
Idea:
• Define $K\colon X \times X \to \mathbb{R}$, called a kernel, such that: $\Phi(x) \cdot \Phi(y) = K(x, y)$.
• $K$ often interpreted as a similarity measure.
Benefits:
• Efficiency: $K$ is often more efficient to compute than $\Phi$ and the dot product.
• Flexibility: $K$ can be chosen arbitrarily so long as the existence of $\Phi$ is guaranteed (Mercer's condition).
Example - Polynomial Kernels
Definition:
$$\forall x, y \in \mathbb{R}^N, \quad K(x, y) = (x \cdot y + c)^d, \quad c > 0.$$
Example: for $N = 2$ and $d = 2$,
$$K(x, y) = (x_1 y_1 + x_2 y_2 + c)^2 = \begin{bmatrix} x_1^2 \\ x_2^2 \\ \sqrt{2}\, x_1 x_2 \\ \sqrt{2c}\, x_1 \\ \sqrt{2c}\, x_2 \\ c \end{bmatrix} \cdot \begin{bmatrix} y_1^2 \\ y_2^2 \\ \sqrt{2}\, y_1 y_2 \\ \sqrt{2c}\, y_1 \\ \sqrt{2c}\, y_2 \\ c \end{bmatrix}.$$
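A quick numerical check (not from the tutorial) that the explicit six-dimensional feature map above does reproduce the kernel value:

```python
import numpy as np

def poly_kernel(x, y, c=1.0, d=2):
    """K(x, y) = (x . y + c)^d, computed directly in the input space."""
    return (np.dot(x, y) + c) ** d

def phi(x, c=1.0):
    """Explicit feature map for N = 2, d = 2 (the six components above)."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1, np.sqrt(2 * c) * x2, c])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(poly_kernel(x, y))   # 4.0, computed in R^2
print(phi(x) @ phi(y))     # 4.0, computed in R^6
```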
XOR Problem
Use second-degree polynomial kernel with $c = 1$:
[Figure: left, the four points $(1, 1)$, $(-1, -1)$, $(-1, 1)$, $(1, -1)$ in the $(x_1, x_2)$ plane: linearly non-separable. Right, their images $(1, 1, \pm\sqrt{2}, \pm\sqrt{2}, \pm\sqrt{2}, 1)$ plotted in the coordinates $(\sqrt{2}\, x_1 x_2, \sqrt{2}\, x_1)$: linearly separable by $x_1 x_2 = 0$.]
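As a sanity check, an off-the-shelf SVM reproduces this separation; scikit-learn's polynomial kernel is $(\gamma\, x \cdot y + c_0)^d$, so degree 2 with $\gamma = 1$ and $c_0 = 1$ matches the kernel used here. An illustrative sketch, not part of the tutorial:

```python
import numpy as np
from sklearn.svm import SVC

# The four XOR points: not linearly separable in the input space.
X = np.array([[1, 1], [-1, -1], [-1, 1], [1, -1]], dtype=float)
y = np.array([1, 1, -1, -1])

# (x . y + 1)^2 kernel; large C approximates the hard-margin solution.
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=1e6)
clf.fit(X, y)
print(clf.predict(X))   # [ 1  1 -1 -1]: perfectly separated
```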
Other Standard PDS Kernels
Gaussian kernels:
$$K(x, y) = \exp\Big(-\frac{\|x - y\|^2}{2\sigma^2}\Big), \quad \sigma \neq 0.$$
Sigmoid kernels:
$$K(x, y) = \tanh(a(x \cdot y) + b), \quad a, b \geq 0.$$
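Both kernels are one-liners in numpy; a minimal sketch (not from the tutorial), keeping the parameterizations above:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 sigma^2)), sigma != 0."""
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(x, y, a=1.0, b=0.0):
    """K(x, y) = tanh(a (x . y) + b), a, b >= 0."""
    return np.tanh(a * np.dot(x, y) + b)
```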
Consequence: SVMs with PDS Kernels
Constrained optimization:
$$\max_{\alpha} \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{subject to: } 0 \leq \alpha_i \leq C \wedge \sum_{i=1}^{m} \alpha_i y_i = 0, \; i \in [1, m].$$
Solution:
$$h(x) = \operatorname{sgn}\Big(\sum_{i=1}^{m} \alpha_i y_i K(x_i, x) + b\Big), \quad \text{with } b = y_i - \sum_{j=1}^{m} \alpha_j y_j K(x_j, x_i) \text{ for any } x_i \text{ with } 0 < \alpha_i < C.$$
(Boser, Guyon, and Vapnik, 1992)
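In practice, any PDS kernel can be plugged into an off-the-shelf SVM by supplying the Gram matrix directly. A sketch using scikit-learn's precomputed-kernel interface; the data and kernel parameters are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(40, 2)
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)   # XOR-like labels

def gram(A, B, c=1.0, d=2):
    """Polynomial PDS kernel K(x, y) = (x . y + c)^d, as a Gram matrix."""
    return (A @ B.T + c) ** d

clf = SVC(kernel="precomputed", C=1.0)
clf.fit(gram(X, X), y)                       # training needs K(x_i, x_j)

X_test = rng.randn(5, 2)
print(clf.predict(gram(X_test, X)))          # prediction needs K(x, x_i)
```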
SVMs with PDS Kernels
Constrained optimization (matrix form, where $\mathbf{Y} = \operatorname{diag}(y)$):
$$\max_{\alpha} 2\, \alpha^\top \mathbf{1} - \alpha^\top \mathbf{Y}^\top \mathbf{K} \mathbf{Y} \alpha \quad \text{subject to: } 0 \leq \alpha \leq C \wedge \alpha^\top y = 0.$$
Solution:
$$h = \operatorname{sgn}\Big(\sum_{i=1}^{m} \alpha_i y_i K(x_i, \cdot) + b\Big), \quad \text{with } b = y_i - (\alpha \circ y)^\top \mathbf{K} e_i \text{ for any } x_i \text{ with } 0 < \alpha_i < C.$$
Regression Problem
Training data: sample drawn i.i.d. from set $X$ according to some distribution $D$,
$$S = ((x_1, y_1), \ldots, (x_m, y_m)) \in X \times Y,$$
with $Y \subseteq \mathbb{R}$ a measurable subset.
Loss function: $L\colon Y \times Y \to \mathbb{R}_+$ a measure of closeness, typically $L(y, y') = (y' - y)^2$ or $L(y, y') = |y' - y|^p$ for some $p \geq 1$.
Problem: find hypothesis $h\colon X \to \mathbb{R}$ in $H$ with small generalization error with respect to target $f$:
$$R_D(h) = \mathbb{E}_{x \sim D}\big[L\big(h(x), f(x)\big)\big].$$
Kernel Ridge Regression
Optimization problem:
$$\max_{\alpha \in \mathbb{R}^m} -\lambda\, \alpha^\top \alpha + 2\, \alpha^\top y - \alpha^\top \mathbf{K} \alpha, \quad \text{or} \quad \max_{\alpha \in \mathbb{R}^m} -\alpha^\top (\mathbf{K} + \lambda \mathbf{I}) \alpha + 2\, \alpha^\top y.$$
Solution:
$$h(x) = \sum_{i=1}^{m} \alpha_i K(x_i, x), \quad \text{with } \alpha = (\mathbf{K} + \lambda \mathbf{I})^{-1} y.$$
(Saunders et al., 1998)
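The closed-form solution $\alpha = (\mathbf{K} + \lambda \mathbf{I})^{-1} y$ is a few lines of numpy; a sketch assuming a Gaussian kernel and synthetic data (both illustrative choices):

```python
import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    """Gram matrix of the Gaussian kernel between rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def krr_fit(K, y, lam):
    """Kernel ridge regression: alpha = (K + lambda I)^{-1} y."""
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), y)

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X).ravel() + 0.1 * rng.randn(50)

alpha = krr_fit(gaussian_gram(X, X), y, lam=0.1)
X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
y_pred = gaussian_gram(X_test, X) @ alpha    # h(x) = sum_i alpha_i K(x_i, x)
print(np.round(y_pred, 3))
```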
Questions
How should the user choose the kernel?
• Problem similar to that of selecting features for other learning algorithms.
• Poor choice: learning made very difficult.
• Good choice: even poor learners could succeed.
This requirement on the user is thus critical.
• Can this requirement be lessened?
• Is a more automatic selection of features possible?