An Overview of Kernel-Based Learning Methods
Yan Liu
Nov 18, 2003
Outline Introduction Theory Basis:
Reproducing Kernel Hilbert Space (RKHS), Mercer’s theorem, Representer theorem, regularization
Kernel-based learning algorithms Supervised learning: support vector machines (SVMs),
kernel Fisher discriminant (KFD) Unsupervised learning: one-class SVM, kernel PCA
Kernel design Standard kernels Making kernels from kernels Application-oriented kernels: Fisher kernel
Introduction Example Idea: map the problem
into a higher-dimensional space.
Let F be a potentially much higher-dimensional feature space, and let f : X -> F, x -> f(x) be the feature map.
The learning problem now works with the mapped samples (f(x_1), y_1), . . . , (f(x_N), y_N) in F × Y.
Key: can this mapped problem be classified in a “simple” (e.g., linear) way?
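A minimal numerical sketch of this idea (the data, kernel, and feature map below are mine, not from the slides): for the polynomial kernel k(x, z) = (x · z)^2 in two dimensions, evaluating the kernel in input space gives exactly the inner product of the explicitly mapped points.

    import numpy as np

    def explicit_quadratic_map(x):
        # Explicit feature map for k(x, z) = (x . z)^2 when x is 2-D:
        # f(x) = (x1^2, x2^2, sqrt(2) * x1 * x2)
        return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

    def poly_kernel(x, z):
        # Same quantity computed in the original 2-D input space
        return np.dot(x, z) ** 2

    x = np.array([1.0, 2.0])
    z = np.array([3.0, -1.0])
    print(np.dot(explicit_quadratic_map(x), explicit_quadratic_map(z)))  # 1.0
    print(poly_kernel(x, z))                                             # 1.0

The kernel never forms f(x) explicitly, which is what makes very high-dimensional (even infinite-dimensional) feature spaces usable.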
Exploring Theory: Roadmap
Reproducing Kernel Hilbert Space -1
Inner product space: a vector space equipped with an inner product ⟨·,·⟩ that is symmetric, linear in each argument, and positive definite.
Hilbert space: a Hilbert space is a complete inner product space,
i.e., every Cauchy sequence converges in the norm induced by the inner product.
Reproducing Kernel Hilbert Space -2
Reproducing Kernel Hilbert Space (RKHS) Gram matrix
Given a kernel k(x, y) and points x_1, . . . , x_N, define the Gram matrix to be K_ij = k(x_i, x_j)
We say the kernel is positive definite when every such Gram matrix is positive (semi-)definite
Definition of RKHS: a Hilbert space H of functions on X that has a reproducing kernel k, i.e., k(x, ·) ∈ H for every x and f(x) = ⟨f, k(x, ·)⟩ for every f ∈ H
Reproducing Kernel Hilbert Space -3
Reproducing properties: ⟨k(x, ·), f⟩ = f(x) and, in particular, ⟨k(x, ·), k(y, ·)⟩ = k(x, y)
Comment An RKHS is a “bounded” Hilbert space (point evaluation f -> f(x) is a bounded linear functional) and a “smoothed” Hilbert space (functions with small RKHS norm cannot vary too wildly)
Mercer’s Theorem-1 Mercer’s Theorem
For the discrete case, let A be the Gram matrix. If A is positive (semi-)definite, then A can be diagonalized as A = U Λ U^T with non-negative eigenvalues λ_m, so k(x_i, x_j) = ⟨Φ(x_i), Φ(x_j)⟩ for the feature map Φ(x_i) = (sqrt(λ_m) u_m(i))_m
Mercer’s Theorem-2 Comment
Mercer’s theorem provides a concrete way to construct a basis for an RKHS
Mercer’s condition is the only constraint on a kernel: a symmetric function is a valid kernel if and only if every Gram matrix it generates is positive (semi-)definite
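A hedged numerical illustration of this condition (kernel choice and data are mine): build a Gram matrix for the Gaussian RBF kernel and check that its eigenvalues are non-negative.

    import numpy as np

    def rbf_kernel(x, z, gamma=0.5):
        # Gaussian RBF kernel k(x, z) = exp(-gamma * ||x - z||^2)
        return np.exp(-gamma * np.sum((x - z) ** 2))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 3))   # 20 random points in R^3

    # Gram matrix K_ij = k(x_i, x_j)
    K = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])

    # Mercer's condition (discrete case): all eigenvalues should be >= 0
    print(np.linalg.eigvalsh(K).min() >= -1e-10)   # True, up to numerical error

Any function failing this check on some set of points cannot be a valid kernel.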
Representer Theorem-1
Representer Theorem-2 Comment
The representer theorem is a powerful result. It shows that although we search for the optimal solution in a possibly infinite-dimensional feature space, adding the regularization term reduces the problem to a finite-dimensional one, parameterized by the training examples (see the expansion below)
In this sense, regularization and the RKHS framework belong together: it is the RKHS-norm penalty that forces the solution into the finite span of the training data
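In its standard form, the theorem states that any minimizer of a regularized empirical risk

    min over f in H of  Σ_{i=1..N} L(y_i, f(x_i)) + λ ||f||_H^2

can be written as a finite kernel expansion over the training points:

    f*(x) = Σ_{i=1..N} a_i k(x_i, x)

so only the N coefficients a_1, . . . , a_N need to be learned.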
Exploring Theory: Roadmap
Outline Introduction Theory Basis:
Reproducing Kernel Hilbert Space (RKHS), Mercer’s theorem, Representer theorem, regularization
Kernel-based learning algorithms Supervised learning: support vector machines (SVMs),
kernel Fisher discriminant (KFD) Unsupervised learning: one-class SVM, kernel PCA
Kernel design Standard kernels Making kernels from kernels Application-oriented kernels: Fisher kernel
Support Vector Machines-1 Quick overview
Support Vector Machines-2 Quick overview (continued)
Support Vector Machines-3 Parameter Sparsity
Most a_i are zero: only the support vectors have nonzero coefficients
C: regularization constant; ξ_i: slack variables
Support Vector Machines-4 Optimization techniques
Chunking: each step solves the subproblem containing all non-zero
a_i plus some of the a_i that violate the KKT conditions
Decomposition methods: SVM_light The size of the subproblem is fixed; one sample is added and one removed
in each iteration
Sequential minimal optimization (SMO) Each iteration solves a quadratic subproblem of size two, which can be solved analytically (see the sketch below)
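As a practical note (library and data choices are mine, not from the slides), the libsvm solver behind scikit-learn’s SVC uses an SMO-style decomposition, and the fitted model exposes the sparsity of the a_i directly:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    # Toy binary classification problem
    X, y = make_classification(n_samples=200, n_features=5, random_state=0)

    # SVC is backed by libsvm, which trains with an SMO-style solver
    clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

    # Sparsity: only the support vectors carry nonzero dual coefficients
    print("training points:", len(X))
    print("support vectors:", len(clf.support_))   # typically far fewer than 200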
Kernel Fisher Discriminant-1 Overview of LDA Fisher’s discriminant (or LDA): find the linear projection
with the most discriminative direction by maximizing the Rayleigh coefficient (see the expression below),
where S_W is the within-class scatter matrix and S_B is the between-class scatter matrix.
Comparison with PCA
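For reference, the Rayleigh coefficient referred to above has the standard form

    J(w) = (w^T S_B w) / (w^T S_W w)

and its maximizer is the leading generalized eigenvector of S_B w = λ S_W w. PCA, by contrast, maximizes the projected variance w^T S w without using the class labels.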
Kernel Fisher Discriminant-2 KFD: solves the problem of Fisher’s linear discriminant
in the feature space F, which yields a nonlinear discriminant in the input space. One can express w in terms of the mapped training
patterns: w = Σ_i a_i Φ(x_i)
The optimization problem for the KFD can then be written as a Rayleigh quotient in the expansion coefficients a:
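A sketch of the standard formulation (notation mine): with the expansion coefficients a,

    J(a) = (a^T M a) / (a^T N a)

where M = (m_1 - m_2)(m_1 - m_2)^T with (m_c)_j = (1/ℓ_c) Σ_{x in class c} k(x_j, x), and N = Σ_{c=1,2} K_c (I - 1_{ℓ_c}) K_c^T, with K_c the kernel matrix between all training points and the points of class c and 1_{ℓ_c} the matrix with all entries 1/ℓ_c. Maximizing J(a) is again a generalized eigenvalue problem, now of size equal to the number of training points.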
Kernel PCA -1 The basic idea of PCA: find a set of orthogonal directions
that capture most of the variance in the data.
However, linear directions are sometimes not enough, e.g., when the data form more clusters than there are input dimensions.
Kernel PCA maps the data into a higher-dimensional feature space and performs standard PCA there. Using the kernel trick, all calculations reduce to operations on the Gram matrix of the training samples, so we never work in the high-dimensional space explicitly.
Kernel PCA -2 Covariance matrix in feature space (assuming the mapped data are centered): C = (1/N) Σ_j Φ(x_j) Φ(x_j)^T
By definition, a principal direction v satisfies λ v = C v, and every solution with λ ≠ 0 lies in the span of the mapped data, so v = Σ_i a_i Φ(x_i)
Then we have, taking inner products with each Φ(x_j): λ Σ_i a_i k(x_j, x_i) = (1/N) Σ_i a_i Σ_m k(x_j, x_m) k(x_m, x_i)
Define the Gram matrix K_ij = k(x_i, x_j)
At last we have: N λ K a = K^2 a, which for nonzero eigenvalues reduces to N λ a = K a
Therefore we simply have to solve an eigenvalue problem on the Gram matrix.
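A minimal numerical sketch of this procedure (data, kernel, and parameter choices are mine): center the Gram matrix, solve its eigenvalue problem, and project onto the leading components.

    import numpy as np

    def rbf_gram(X, gamma=1.0):
        # Gram matrix K_ij = exp(-gamma * ||x_i - x_j||^2)
        sq = np.sum(X**2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
        return np.exp(-gamma * d2)

    def kernel_pca(X, n_components=2, gamma=1.0):
        N = len(X)
        K = rbf_gram(X, gamma)

        # Center the mapped data in feature space (double-centering of K)
        one = np.ones((N, N)) / N
        Kc = K - one @ K - K @ one + one @ K @ one

        # Eigenvalue problem on the centered Gram matrix
        eigvals, eigvecs = np.linalg.eigh(Kc)
        order = np.argsort(eigvals)[::-1][:n_components]
        eigvals, eigvecs = eigvals[order], eigvecs[:, order]

        # Scale coefficients so each feature-space eigenvector has unit norm,
        # then project the training points onto the principal directions
        alphas = eigvecs / np.sqrt(np.maximum(eigvals, 1e-12))
        return Kc @ alphas

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))
    print(kernel_pca(X).shape)   # (50, 2)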
Outline Introduction Theory Basis:
Reproducing Kernel Hilbert Space (RKHS), Mercer’s theorem, Representer theorem, regularization
Kernel-based learning algorithms Supervised learning: support vector machines (SVMs),
kernel Fisher discriminant (KFD) Unsupervised learning: one-class SVM, kernel PCA
Kernel design Standard kernels Making kernels from kernels Application-oriented kernels: Fisher kernel
Standard Kernels
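The kernels usually presented as the standard examples are (the parameter names c, d, sigma, kappa, theta are mine):

    linear:       k(x, z) = x · z
    polynomial:   k(x, z) = (x · z + c)^d
    Gaussian RBF: k(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    sigmoid:      k(x, z) = tanh(kappa (x · z) + theta)   (a valid Mercer kernel only for some parameter values)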
Making kernels out of kernels Theorem: if K1, K2, K3 are kernels, a > 0, f is any real-valued function, and Φ is any map into the domain of K3, then each of the following is also a kernel:
K(x, z) = K1(x, z) + K2(x, z)
K(x, z) = a K1(x, z)
K(x, z) = K1(x, z) * K2(x, z)
K(x, z) = f(x) f(z)
K(x, z) = K3(Φ(x), Φ(z))
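A hedged numerical check of these closure rules (kernels and data are mine): build new Gram matrices from base ones and confirm they stay positive semi-definite.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(15, 2))

    def gram(kernel):
        return np.array([[kernel(x, z) for z in X] for x in X])

    k1 = lambda x, z: float(np.dot(x, z))                    # linear kernel
    k2 = lambda x, z: float(np.exp(-np.sum((x - z) ** 2)))   # RBF kernel
    f  = lambda x: float(np.sum(x) ** 2)                     # arbitrary real-valued function

    K1, K2 = gram(k1), gram(k2)
    candidates = {
        "K1 + K2":   K1 + K2,
        "3 * K1":    3.0 * K1,
        "K1 * K2":   K1 * K2,                                # elementwise product
        "f(x)f(z)":  gram(lambda x, z: f(x) * f(z)),
    }

    for name, K in candidates.items():
        # All should be positive semi-definite up to numerical error
        print(name, np.linalg.eigvalsh(K).min() >= -1e-8)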
Kernel selection
Fisher kernel Jaakkola and Haussler proposed using a generative
model to define a kernel for a discriminative (non-probabilistic) kernel classifier.
Build an HMM for each protein family Compute the Fisher scores for each parameter of the
HMM Use the scores as features and predict with an SVM with an RBF
kernel Good performance for protein family classification
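For reference, the Fisher score of an example x under a generative model P(x | θ) is the gradient of its log-likelihood, and the Fisher kernel compares examples through these scores:

    U_x = ∇_θ log P(x | θ)
    K(x, y) = U_x^T I^{-1} U_y,   where I is the Fisher information matrix

In practice I is often approximated by the identity, or, as above, the scores U_x are simply used as feature vectors inside a standard kernel such as the RBF kernel.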