An Overview of Kernel-Based Learning Methods
Yan Liu
Nov 18, 2003
Outline Introduction Theory Basis:
Reproducing Kernel Hilbert Space (RKHS), Mercer’s theorem, Representer theorem, regularization
Kernel-based learning algorithms Supervised learning: support vector machines (SVMs),
kernel Fisher discriminant (KFD) Unsupervised learning: one-class SVM, kernel PCA
Kernel design Standard kernels Making kernels from kernels Application-oriented kernels: Fisher kernel
Introduction Example Idea: map the problem
into a higher-dimensional space.
Let F be a potentially much higher-dimensional feature space, and let f : X -> F, x -> f(x) be the feature map.
The learning problem now works with the mapped samples (f(x_1), y_1), . . . , (f(x_N), y_N) in F × Y.
Key: can this mapped problem be classified in a “simple” (e.g., linear) way?
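A minimal numerical sketch of this idea (the data, kernel, and feature map below are mine, not from the slides): for the polynomial kernel k(x, z) = (x · z)^2 in two dimensions, evaluating the kernel in input space gives exactly the inner product of the explicitly mapped points.

    import numpy as np

    def explicit_quadratic_map(x):
        # Explicit feature map for k(x, z) = (x . z)^2 when x is 2-D:
        # f(x) = (x1^2, x2^2, sqrt(2) * x1 * x2)
        return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

    def poly_kernel(x, z):
        # Same quantity computed in the original 2-D input space
        return np.dot(x, z) ** 2

    x = np.array([1.0, 2.0])
    z = np.array([3.0, -1.0])
    print(np.dot(explicit_quadratic_map(x), explicit_quadratic_map(z)))  # 1.0
    print(poly_kernel(x, z))                                             # 1.0

The kernel never forms f(x) explicitly, which is what makes very high-dimensional (even infinite-dimensional) feature spaces usable.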
Exploring Theory: Roadmap
Reproducing Kernel Hilbert Space -1
Inner product space: a vector space equipped with an inner product ⟨·,·⟩ that is symmetric, linear in each argument, and positive definite.
Hilbert space: a Hilbert space is a complete inner product space,
i.e., every Cauchy sequence converges in the norm induced by the inner product.
Reproducing Kernel Hilbert Space -2
Reproducing Kernel Hilbert Space (RKHS) Gram matrix
Given a kernel k(x, y) and points x_1, . . . , x_N, define the Gram matrix to be K_ij = k(x_i, x_j)
We say the kernel is positive definite when every such Gram matrix is positive (semi-)definite
Definition of RKHS: a Hilbert space H of functions on X that has a reproducing kernel k, i.e., k(x, ·) ∈ H for every x and f(x) = ⟨f, k(x, ·)⟩ for every f ∈ H
Reproducing Kernel Hilbert Space -3
Reproducing properties: ⟨k(x, ·), f⟩ = f(x) and, in particular, ⟨k(x, ·), k(y, ·)⟩ = k(x, y)
Comment An RKHS is a “bounded” Hilbert space (point evaluation f -> f(x) is a bounded linear functional) and a “smoothed” Hilbert space (functions with small RKHS norm cannot vary too wildly)
Mercer’s Theorem-1 Mercer’s Theorem
For the discrete case, let A be the Gram matrix. If A is positive (semi-)definite, then A can be diagonalized as A = U Λ U^T with non-negative eigenvalues λ_m, so k(x_i, x_j) = ⟨Φ(x_i), Φ(x_j)⟩ for the feature map Φ(x_i) = (sqrt(λ_m) u_m(i))_m
Mercer’s Theorem-2 Comment
Mercer’s theorem provides a concrete way to construct a basis for an RKHS
Mercer’s condition is the only constraint on a kernel: a symmetric function is a valid kernel if and only if every Gram matrix it generates is positive (semi-)definite
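A hedged numerical illustration of this condition (kernel choice and data are mine): build a Gram matrix for the Gaussian RBF kernel and check that its eigenvalues are non-negative.

    import numpy as np

    def rbf_kernel(x, z, gamma=0.5):
        # Gaussian RBF kernel k(x, z) = exp(-gamma * ||x - z||^2)
        return np.exp(-gamma * np.sum((x - z) ** 2))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 3))   # 20 random points in R^3

    # Gram matrix K_ij = k(x_i, x_j)
    K = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])

    # Mercer's condition (discrete case): all eigenvalues should be >= 0
    print(np.linalg.eigvalsh(K).min() >= -1e-10)   # True, up to numerical error

Any function failing this check on some set of points cannot be a valid kernel.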
Representer Theorem-1
Representer Theorem-2 Comment
The representer theorem is a powerful result. It shows that although we search for the optimal solution in a possibly infinite-dimensional feature space, adding the regularization term reduces the problem to a finite-dimensional one, parameterized by the training examples (see the expansion below)
In this sense, regularization and the RKHS framework belong together: it is the RKHS-norm penalty that forces the solution into the finite span of the training data
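In its standard form, the theorem states that any minimizer of a regularized empirical risk

    min over f in H of  Σ_{i=1..N} L(y_i, f(x_i)) + λ ||f||_H^2

can be written as a finite kernel expansion over the training points:

    f*(x) = Σ_{i=1..N} a_i k(x_i, x)

so only the N coefficients a_1, . . . , a_N need to be learned.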
Exploring Theory: Roadmap
Outline Introduction Theory Basis:
Reproducing Kernel Hilbert Space (RKHS), Mercer’s theorem, Representer theorem, regularization
Kernel-based learning algorithms Supervised learning: support vector machines (SVMs),
kernel Fisher discriminant (KFD) Unsupervised learning: one-class SVM, kernel PCA
Kernel design Standard kernels Making kernels from kernels Application-oriented kernels: Fisher kernel
Support Vector Machines-1 Quick overview
Support Vector Machines-2 Quick overview (continued)
Support Vector Machines-3 Parameter Sparsity
Most a_i are zero: only the support vectors have nonzero coefficients
C: regularization constant; ξ_i: slack variables
Support Vector Machines-4 Optimization techniques
Chunking: each step solves the subproblem containing all non-zero
a_i plus some of the a_i that violate the KKT conditions
Decomposition methods: SVM_light The size of the subproblem is fixed; one sample is added and one removed
in each iteration
Sequential minimal optimization (SMO) Each iteration solves a quadratic subproblem of size two, which can be solved analytically (see the sketch below)
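As a practical note (library and data choices are mine, not from the slides), the libsvm solver behind scikit-learn’s SVC uses an SMO-style decomposition, and the fitted model exposes the sparsity of the a_i directly:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    # Toy binary classification problem
    X, y = make_classification(n_samples=200, n_features=5, random_state=0)

    # SVC is backed by libsvm, which trains with an SMO-style solver
    clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

    # Sparsity: only the support vectors carry nonzero dual coefficients
    print("training points:", len(X))
    print("support vectors:", len(clf.support_))   # typically far fewer than 200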
Kernel Fisher Discriminant-1 Overview of LDA Fisher’s discriminant (or LDA): find the linear projection
with the most discriminative direction by maximizing the Rayleigh coefficient (see the expression below),
where S_W is the within-class scatter matrix and S_B is the between-class scatter matrix.
Comparison with PCA
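For reference, the Rayleigh coefficient referred to above has the standard form

    J(w) = (w^T S_B w) / (w^T S_W w)

and its maximizer is the leading generalized eigenvector of S_B w = λ S_W w. PCA, by contrast, maximizes the projected variance w^T S w without using the class labels.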
Kernel Fisher Discriminant-2 KFD: solves the problem of Fisher’s linear discriminant
in the feature space F, which yields a nonlinear discriminant in the input space. One can express w in terms of the mapped training
patterns: w = Σ_i a_i Φ(x_i)
The optimization problem for the KFD can then be written as a Rayleigh quotient in the expansion coefficients a:
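A sketch of the standard formulation (notation mine): with the expansion coefficients a,

    J(a) = (a^T M a) / (a^T N a)

where M = (m_1 - m_2)(m_1 - m_2)^T with (m_c)_j = (1/ℓ_c) Σ_{x in class c} k(x_j, x), and N = Σ_{c=1,2} K_c (I - 1_{ℓ_c}) K_c^T, with K_c the kernel matrix between all training points and the points of class c and 1_{ℓ_c} the matrix with all entries 1/ℓ_c. Maximizing J(a) is again a generalized eigenvalue problem, now of size equal to the number of training points.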
Kernel PCA -1 The basic idea of PCA: find a set of orthogonal directions
that capture most of the variance in the data.
However, linear directions are sometimes not enough, e.g., when the data form more clusters than there are input dimensions.
Kernel PCA maps the data into a higher-dimensional feature space and performs standard PCA there. Using the kernel trick, all calculations reduce to operations on the Gram matrix of the training samples, so we never work in the high-dimensional space explicitly.
Kernel PCA -2 Covariance matrix in feature space (assuming the mapped data are centered): C = (1/N) Σ_j Φ(x_j) Φ(x_j)^T
By definition, a principal direction v satisfies λ v = C v, and every solution with λ ≠ 0 lies in the span of the mapped data, so v = Σ_i a_i Φ(x_i)
Then we have, taking inner products with each Φ(x_j): λ Σ_i a_i k(x_j, x_i) = (1/N) Σ_i a_i Σ_m k(x_j, x_m) k(x_m, x_i)
Define the Gram matrix K_ij = k(x_i, x_j)
At last we have: N λ K a = K^2 a, which for nonzero eigenvalues reduces to N λ a = K a
Therefore we simply have to solve an eigenvalue problem on the Gram matrix.
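A minimal numerical sketch of this procedure (data, kernel, and parameter choices are mine): center the Gram matrix, solve its eigenvalue problem, and project onto the leading components.

    import numpy as np

    def rbf_gram(X, gamma=1.0):
        # Gram matrix K_ij = exp(-gamma * ||x_i - x_j||^2)
        sq = np.sum(X**2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
        return np.exp(-gamma * d2)

    def kernel_pca(X, n_components=2, gamma=1.0):
        N = len(X)
        K = rbf_gram(X, gamma)

        # Center the mapped data in feature space (double-centering of K)
        one = np.ones((N, N)) / N
        Kc = K - one @ K - K @ one + one @ K @ one

        # Eigenvalue problem on the centered Gram matrix
        eigvals, eigvecs = np.linalg.eigh(Kc)
        order = np.argsort(eigvals)[::-1][:n_components]
        eigvals, eigvecs = eigvals[order], eigvecs[:, order]

        # Scale coefficients so each feature-space eigenvector has unit norm,
        # then project the training points onto the principal directions
        alphas = eigvecs / np.sqrt(np.maximum(eigvals, 1e-12))
        return Kc @ alphas

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))
    print(kernel_pca(X).shape)   # (50, 2)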
Outline Introduction Theory Basis:
Reproducing Kernel Hilbert Space (RKHS), Mercer’s theorem, Representer theorem, regularization
Kernel-based learning algorithms Supervised learning: support vector machines (SVMs),
kernel Fisher discriminant (KFD) Unsupervised learning: one-class SVM, kernel PCA
Kernel design Standard kernels Making kernels from kernels Application-oriented kernels: Fisher kernel
Standard Kernels
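The kernels usually presented as the standard examples are (the parameter names c, d, sigma, kappa, theta are mine):

    linear:       k(x, z) = x · z
    polynomial:   k(x, z) = (x · z + c)^d
    Gaussian RBF: k(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    sigmoid:      k(x, z) = tanh(kappa (x · z) + theta)   (a valid Mercer kernel only for some parameter values)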
Making kernels out of kernels Theorem: if K1, K2, K3 are kernels, a > 0, f is any real-valued function, and Φ is any map into the domain of K3, then each of the following is also a kernel:
K(x, z) = K1(x, z) + K2(x, z)
K(x, z) = a K1(x, z)
K(x, z) = K1(x, z) * K2(x, z)
K(x, z) = f(x) f(z)
K(x, z) = K3(Φ(x), Φ(z))
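A hedged numerical check of these closure rules (kernels and data are mine): build new Gram matrices from base ones and confirm they stay positive semi-definite.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(15, 2))

    def gram(kernel):
        return np.array([[kernel(x, z) for z in X] for x in X])

    k1 = lambda x, z: float(np.dot(x, z))                    # linear kernel
    k2 = lambda x, z: float(np.exp(-np.sum((x - z) ** 2)))   # RBF kernel
    f  = lambda x: float(np.sum(x) ** 2)                     # arbitrary real-valued function

    K1, K2 = gram(k1), gram(k2)
    candidates = {
        "K1 + K2":   K1 + K2,
        "3 * K1":    3.0 * K1,
        "K1 * K2":   K1 * K2,                                # elementwise product
        "f(x)f(z)":  gram(lambda x, z: f(x) * f(z)),
    }

    for name, K in candidates.items():
        # All should be positive semi-definite up to numerical error
        print(name, np.linalg.eigvalsh(K).min() >= -1e-8)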
Kernel selection
Fisher kernel Jaakkola and Haussler proposed using a generative
model to define a kernel for a discriminative (non-probabilistic) kernel classifier.
Build an HMM for each protein family Compute the Fisher scores for each parameter of the
HMM Use the scores as features and predict with an SVM with an RBF
kernel Good performance for protein family classification
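For reference, the Fisher score of an example x under a generative model P(x | θ) is the gradient of its log-likelihood, and the Fisher kernel compares examples through these scores:

    U_x = ∇_θ log P(x | θ)
    K(x, y) = U_x^T I^{-1} U_y,   where I is the Fisher information matrix

In practice I is often approximated by the identity, or, as above, the scores U_x are simply used as feature vectors inside a standard kernel such as the RBF kernel.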