
Support Vector Machines and Kernel Methods

Machine Learning, March 25, 2010

Last Time

• Recap of Support Vector Machines

Kernel Methods

• Points that are not linearly separable in 2 dimensions may be linearly separable in 3.

Kernel Methods

• We will look at a way to add dimensionality to the data in order to make it linearly separable.

• In the extreme, we can construct a dimension for each data point.

• This may lead to overfitting.


Remember the Dual?
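The primal and dual equations on this slide were not transcribed; the standard (hard-margin) SVM formulations they refer to are:

Primal:  minimize (1/2)||w||²  subject to  yi(w·xi + b) ≥ 1 for all i

Dual:  maximize  Σi αi − (1/2) Σi Σj αi αj yi yj (xi·xj)  subject to  αi ≥ 0 and Σi αi yi = 0

Note that the data appear in the dual only through the dot products xi·xj.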


Basis of Kernel Methods

• The decision process doesn’t depend on the dimensionality of the data.
• We can map the data into a higher-dimensional space.

• Note: the data points only appear within a dot product.
• The objective function is based on dot products of data points, not on the data points themselves.


Basis of Kernel Methods

• Since data points only appear within a dot product, we can map them into another space by replacing that dot product.

• The objective function is based on the dot product of data points, not the data points themselves.
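The replacement referred to here (the equation was not transcribed) is presumably the standard kernel substitution: every dot product xi·xj in the dual is replaced by ϕ(xi)·ϕ(xj) = K(xi, xj), giving

maximize  Σi αi − (1/2) Σi Σj αi αj yi yj K(xi, xj)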

Kernels

• The objective function is based on a dot product of data points, rather than the data points themselves.

• We can represent this dot product as a kernel.
  – Kernel function, kernel matrix

• The dimensionality of the feature space behind K(xi,xj) may be large (but finite) and is unrelated to the dimensionality of x.

Kernels

• A kernel corresponds to a mapping ϕ into a feature space: K(xi, xj) = ϕ(xi)·ϕ(xj).

Kernels

• Gram matrix: the matrix of all pairwise kernel values on the training data, Kij = K(xi, xj) = ϕ(xi)·ϕ(xj).

Consider the following kernel:
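The specific kernel on this slide was not transcribed; a common textbook example is the quadratic kernel K(x, z) = (x·z)² in two dimensions, which expands as

K(x, z) = (x1z1 + x2z2)² = x1²z1² + 2x1x2z1z2 + x2²z2² = ϕ(x)·ϕ(z),  with  ϕ(x) = (x1², √2·x1x2, x2²)

so computing K never requires forming ϕ explicitly.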

Kernels

• In general we don’t need to know the form of ϕ.

• Just specifying the kernel function is sufficient.
• A good kernel: computing K(xi, xj) is cheaper than computing ϕ(xi) explicitly.

Kernels

• Valid kernels:
  – Symmetric
  – Must be decomposable into ϕ functions
    • Harder to show directly.
    • Equivalently: the Gram matrix is positive semi-definite (psd).
  – Determining psd:
    • all eigenvalues are non-negative
    • a sufficient condition: each diagonal entry is at least as large as the sum of the absolute values of the off-diagonal entries in its row (diagonal dominance)
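As a concrete illustration of the psd check, here is a minimal sketch assuming NumPy; the quadratic kernel is just an example choice, not taken from the slides:

    import numpy as np

    def quadratic_kernel(x, z):
        # example kernel: K(x, z) = (x . z)^2
        return np.dot(x, z) ** 2

    def gram_matrix(X, kernel):
        # K[i, j] = kernel(x_i, x_j)
        n = len(X)
        return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

    X = np.array([[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]])
    K = gram_matrix(X, quadratic_kernel)
    eigvals = np.linalg.eigvalsh(K)          # K is symmetric, so eigvalsh applies
    print("psd:", bool(np.all(eigvals >= -1e-10)))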

Kernels

• Given valid kernels K(x,z) and K’(x,z), more kernels can be made from them:
  – cK(x,z), for c > 0
  – K(x,z) + K’(x,z)
  – K(x,z)K’(x,z)
  – exp(K(x,z))
  – …and more

Incorporating Kernels in SVMs

• Optimize the αi’s and the bias with respect to the kernel.
• Decision function:
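The decision function on this slide was not transcribed; the standard kernelized SVM decision rule is

f(x) = sign( Σi αi yi K(xi, x) + b )

where the sum effectively runs only over the support vectors (the points with αi > 0).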

Some popular kernels

• Polynomial kernel
• Radial basis functions
• String kernels
• Graph kernels


Polynomial Kernels

• The dot product in the feature space is a polynomial function of the original dot product.

• If c is large, the kernel emphasizes the lower-order (linear) terms.
• If c is small, it emphasizes the higher-order terms.
• Very fast to calculate.
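The polynomial kernel formula itself was not transcribed; the usual form is

K(x, z) = (x·z + c)^d

and expanding the power shows the degree-k terms weighted by c^(d−k), which is why a large c emphasizes the low-order (near-linear) terms and a small c emphasizes the high-order terms.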


Radial Basis Functions

• The inner product of two points is related to the distance in space between the two points.

• Equivalent to placing a bump (a Gaussian) centered on each data point.
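The formula was not transcribed; the standard radial basis function (Gaussian) kernel is

K(x, z) = exp( −||x − z||² / (2σ²) )

which depends only on the distance between x and z and equals 1 when the points coincide.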


String kernels

• Not a Gaussian, but still a legitimate kernel:
  – K(s,s’) = difference in length
  – K(s,s’) = count of different letters
  – K(s,s’) = minimum edit distance

• Kernels allow for infinite-dimensional inputs.
  – The kernel is a FUNCTION defined over the input space; we don’t need to specify the input space exactly.

• We don’t need to manually encode the input.
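As a sketch of the “count of letters” idea above, here is a bag-of-characters kernel, which is a dot product of letter-count vectors and therefore a valid kernel; this exact form is an illustration, not taken from the slides:

    from collections import Counter

    def bag_of_chars_kernel(s, t):
        # K(s, t) = sum over characters c of count_s(c) * count_t(c)
        # This is the dot product of the letter-count vectors of s and t.
        cs, ct = Counter(s), Counter(t)
        return sum(cs[c] * ct[c] for c in cs if c in ct)

    print(bag_of_chars_kernel("kernel", "learner"))   # prints 8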


Graph Kernels

• Define the kernel function based on graph properties.
  – These properties must be computable in poly-time:
    • Walks of length < k
    • Paths
    • Spanning trees
    • Cycles

• Kernels allow us to incorporate knowledge about the input without direct “feature extraction”.
  – Just similarity in some space.

Where else can we apply Kernels?

• Anywhere that dot products of data points are used in an optimization.

• Perceptron:
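The perceptron update on this slide was not transcribed; a minimal sketch of the standard kernel (dual) perceptron, with illustrative names, looks like this:

    import numpy as np

    def kernel_perceptron(X, y, kernel, epochs=10):
        # Dual perceptron: alpha[i] counts how often example i was misclassified.
        # Training and prediction use only kernel evaluations, never an explicit w.
        n = len(X)
        alpha = np.zeros(n)
        K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
        for _ in range(epochs):
            for i in range(n):
                if np.sign(np.sum(alpha * y * K[:, i])) != y[i]:
                    alpha[i] += 1
        return alpha

    def predict(x, X, y, alpha, kernel):
        # f(x) = sign( sum_i alpha_i * y_i * K(x_i, x) )
        return np.sign(sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y, X)))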

Kernels in Clustering

• In clustering, it’s very common to define cluster similarity by the distance between points.
  – k-nearest neighbors, k-means

• This distance can be replaced by a kernel (see the identity below).

• We’ll return to this in the section on unsupervised techniques.
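The kernelized distance follows from expanding the dot products (a standard identity, not transcribed from the slides):

||ϕ(x) − ϕ(z)||² = K(x, x) − 2K(x, z) + K(z, z)

so any distance-based clustering step can be carried out using kernel evaluations alone.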

Bye

• Next time:
  – Logistic Regression