Learning a Kernel Matrix for Nonlinear Dimensionality Reduction
By K. Weinberger, F. Sha, and L. Saul
Presented by Michael Barnathan



  • The Problem:
    Data lies on or near a manifold.
    Lower dimensionality than the overall space.
    Locally Euclidean.
    Example: data on a 2D surface in R3, or a flat area on a sphere.

    Goal: Learn a kernel that will let us work in the lower-dimensional space.
    Unfold the manifold. First we need to know what it is! Its dimensionality. How it can vary.
    [Figure: a 2D manifold on a sphere (Wikipedia).]

  • Background Assumptions:
    Kernel trick.
    Mercer's Theorem: continuous, symmetric, positive semi-definite kernel functions can be represented as dot (inner) products in a high-dimensional space (Wikipedia; implied in the paper).
    So we replace the dot product with a kernel function, or Gram matrix: K_nm = Φ(x_n)^T Φ(x_m) = k(x_n, x_m).
    The kernel provides a mapping into a high-dimensional space.
    Consequence of Cover's theorem: the nonlinear problem then becomes linear.
    Example: SVMs: x_i^T x_j -> Φ(x_i)^T Φ(x_j) = k(x_i, x_j).
    Linear dimensionality reduction techniques: SVD and derived techniques (PCA, ICA, etc.) remove linear correlations. This reduces the dimensionality.
    Now combine these: kernel PCA for nonlinear dimensionality reduction! Map the input to a higher dimension using a kernel, then use PCA.
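    The kernel-PCA pipeline described above can be sketched in a few lines of NumPy. This is an illustration, not the paper's code; the Gaussian kernel and its width are assumptions here:

    ```python
    import numpy as np

    def kernel_pca(X, n_components=2, sigma=1.0):
        """Kernel PCA with a Gaussian kernel: implicitly map inputs into a
        high-dimensional feature space, then take the top principal
        components there via the eigenvectors of the centered Gram matrix."""
        n = len(X)
        sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        K = np.exp(-sq / (2 * sigma ** 2))          # Gram matrix K_nm = k(x_n, x_m)
        # center in feature space so eigenvalues measure variance of the PCs
        J = np.eye(n) - np.ones((n, n)) / n
        Kc = J @ K @ J
        vals, vecs = np.linalg.eigh(Kc)             # eigh returns ascending order
        vals, vecs = vals[::-1], vecs[:, ::-1]      # reorder to descending
        # embedding: eigenvectors scaled by the square roots of the eigenvalues
        return vecs[:, :n_components] * np.sqrt(np.clip(vals[:n_components], 0, None))
    ```

    With the right kernel matrix (which is exactly what this paper learns), these same top eigenvectors recover the manifold coordinates.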

  • The (More Specific) Problem:
    Data is described by a manifold.
    Using kernel PCA, discover that manifold.

    There's only one detail missing: how do we find the appropriate kernel?

    This question forms the basis of the paper's approach; it is also a motivation for the paper.

  • Motivation:
    Exploits properties of the data, not just of its space.
    Relates kernel discovery to manifold learning: with the right kernel, kernel PCA will allow us to discover the manifold. So it has implications for both fields.
    Another paper by the same authors focuses on applicability to manifold learning; this paper focuses on kernel learning.
    Unlike previous methods, this approach is unsupervised; the kernel is learned automatically.
    Not specific to PCA; it can learn any kernel.

  • Methodology Idea:
    Semidefinite programming (optimization).
    Look for a locally isometric mapping from the space to the manifold: one that preserves distances and angles between points, i.e., a rotation and translation on each neighborhood.
    Fix the distances and angles between a point and its k nearest neighbors.
    Intuition: represent the points as a lattice of steel balls, with neighborhoods connected by rigid rods that fix angles and distances (the local isometry constraint).
    Now pull the balls as far apart as possible (the objective function). The lattice flattens -> lower dimensionality!
    The balls and rods represent the manifold... if the data is well-sampled (Wikipedia). Shouldn't be a problem in practice.

  • Optimization Constraints:
    Isometry: preserve distances and angles between each point x_i and its neighbors x_j, x_k.

    The constraint applies whenever x_j and x_k are neighbors of each other or of another common point.

    Let the Gram matrices be K_ij = Φ(x_i)^T Φ(x_j) (feature space) and G_ij = x_i^T x_j (input space). We then have K_ii + K_jj - K_ij - K_ji = G_ii + G_jj - G_ij - G_ji.
    Positive semidefiniteness (required for the kernel trick): no negative eigenvalues.
    Centered on the origin (Σ_ij K_ij = 0), so the eigenvalues measure the variance of the PCs.
    The dataset can be centered if not already.
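    The isometry equality above just restates that squared distances computed from a Gram matrix match squared Euclidean distances. A quick numerical check (the data and indices here are illustrative, not from the paper):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((10, 3))
    G = X @ X.T                                  # Gram matrix G_ij = x_i . x_j
    i, j = 2, 5
    lhs = G[i, i] + G[j, j] - G[i, j] - G[j, i]  # distance written in Gram-matrix form
    rhs = ((X[i] - X[j]) ** 2).sum()             # squared Euclidean distance
    assert np.isclose(lhs, rhs)
    ```

    The SDE constraints equate exactly these quantities between the learned kernel K and the input Gram matrix G on neighboring pairs.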

  • Objective Function:
    We want to maximize pairwise distances; this is an inversion of SSE/MSE!
    So we have T = (1/2N) Σ_ij ||Φ(x_i) - Φ(x_j)||², which is just Tr(K)!
    Proof: not given in the paper.
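    A short proof sketch of the Tr(K) identity, reconstructed here from the centering constraint (it is not given in the paper):

    ```latex
    \frac{1}{2N}\sum_{ij}\left\|\Phi(x_i)-\Phi(x_j)\right\|^2
      = \frac{1}{2N}\sum_{ij}\left(K_{ii}+K_{jj}-2K_{ij}\right)
      = \frac{1}{2N}\left(2N\,\mathrm{Tr}(K) - 2\sum_{ij}K_{ij}\right)
      = \mathrm{Tr}(K),
    ```

    where the last step uses the centering constraint Σ_ij K_ij = 0.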

  • Semidefinite Embedding (SDE):
    Maximize Tr(K) subject to:
    K ⪰ 0 (positive semidefinite);
    Σ_ij K_ij = 0 (centering);
    K_ii + K_jj - K_ij - K_ji = G_ii + G_jj - G_ij - G_ji for all i, j that are neighbors of each other or of a common point.
    This optimization is convex, and thus has a unique solution.
    Use semidefinite programming to perform the optimization (no SDP details in the paper).
    Once we have the optimal kernel, perform kernel PCA.
    This technique (SDE) is the paper's contribution.
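    The constraint set is the part of SDE that needs care: pairs are constrained if they are k-nearest neighbors or share a common neighbor. A sketch of assembling that set with brute-force kNN (function name and symmetrization are assumptions, not the paper's code; a real SDP solver would then optimize Tr(K) over these pairs):

    ```python
    import numpy as np

    def sde_constraint_pairs(X, k=4):
        """Return the index pairs (i, j) whose distances SDE must preserve:
        points that are k-nearest neighbors, or that share a common neighbor."""
        n = len(X)
        D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
        nbrs = [set(np.argsort(D[i])[1:k + 1]) for i in range(n)]
        # symmetrize: i and j are "neighbors" if either lists the other
        adj = [set() for _ in range(n)]
        for i in range(n):
            for j in nbrs[i]:
                adj[i].add(j)
                adj[j].add(i)
        pairs = set()
        for i in range(n):
            for j in adj[i]:
                pairs.add((min(i, j), max(i, j)))            # direct neighbors
                for l in adj[i]:
                    if l != j:
                        pairs.add((min(j, l), max(j, l)))    # share the common point i
        return pairs
    ```

    Constraining pairs with a common neighbor is what fixes the angles within each neighborhood, not just the rod lengths.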

  • Experimental Setup:
    Four kernels: SDE (proposed), linear, polynomial, Gaussian.
    Swiss Roll dataset: 23 dimensions, 3 meaningful (top right) and 20 filled with small noise (not shown); 800 inputs; k = 4, p = 4, σ = 1.45 (σ of 4-neighborhoods).
    Teapot dataset: the same teapot, rotated 0 ≤ i < 360 degrees; 23,028 dimensions (76 x 101 x 3); only one degree of freedom (the angle of rotation); 400 inputs; k = 4, p = 4, σ = 1541.
    Handwriting dataset: no dimensionality or parameters specified (16x16x1 = 256D?); 953 images; no images or kernel matrix shown.
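    A generic Swiss-roll generator matching the setup described above (3 meaningful dimensions padded with small-noise dimensions); the parameter ranges are assumptions, not the paper's exact dataset:

    ```python
    import numpy as np

    def swiss_roll(n=800, noise_dims=20, noise=0.01, seed=0):
        """Sample a 2D sheet rolled up in R^3, padded with small-noise
        dimensions, giving n points in 3 + noise_dims total dimensions."""
        rng = np.random.default_rng(seed)
        t = rng.uniform(1.5 * np.pi, 4.5 * np.pi, n)   # roll parameter (assumed range)
        h = rng.uniform(0, 10, n)                       # height along the roll axis
        X3 = np.column_stack([t * np.cos(t), h, t * np.sin(t)])
        pad = noise * rng.standard_normal((n, noise_dims))
        return np.hstack([X3, pad])
    ```

    The underlying manifold is 2D (parameterized by t and h), which is what SDE's eigenspectrum should reveal.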

  • Results: Dimensionality Reduction
    Two measures: the learned kernels and their eigenspectra.
    Learned kernels (SDE): [kernel matrices shown in the slides; not reproduced here.]
    Eigenspectra: the variance captured by individual eigenvalues, normalized by the trace (the sum of the eigenvalues). Seems to indicate manifold dimensionality.
    [Figures: eigenspectra for the Swiss Roll, Teapot, and Digits datasets.]
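    The normalized eigenspectrum used in these plots is straightforward to compute; a minimal sketch:

    ```python
    import numpy as np

    def eigenspectrum(K):
        """Eigenvalues of a kernel matrix, sorted descending and normalized
        by the trace, so they sum to 1 and read as fractions of variance."""
        vals = np.linalg.eigvalsh(K)[::-1]   # eigvalsh returns ascending; reverse
        return vals / vals.sum()             # trace = sum of eigenvalues
    ```

    For a kernel that has unfolded a d-dimensional manifold, roughly the top d entries carry nearly all of the normalized variance.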

  • Results: Large Margin Classification
    Used SDE kernels with SVMs.
    Results were very poor: lowering dimensionality can impair separability.
    Error rates: 90/10 training/test split, mean of 10 experiments.
    The classes are no longer linearly separable after unfolding.

  • Strengths and Weaknesses:
    Strengths:
    Unsupervised convex kernel optimization.
    Generalizes well in theory.
    Relates manifold learning and kernel learning.
    Easy to implement: just solve the optimization.
    Intuitive (stretching a string).
    Weaknesses:
    May not generalize well in practice (SVMs).
    Implicit assumption that lower dimensionality is better; not always the case (as with SVMs, due to separability in higher dimensions).
    Robustness: what if a neighborhood contains an outlier?
    Offline algorithm: the entire Gram matrix is required. Only a problem if N is large.
    The paper doesn't mention SDP details: no algorithm analysis, complexity, etc., though the complexity is relatively high.
    In fact, no proof of convergence (according to the authors' other 2004 paper). Isomap, LLE, et al. already have such proofs.

  • Possible Improvements:
    Introduce slack variables for robustness: rods are not rigid, but are punished for bending. This would introduce a C parameter, as in SVMs.
    Incrementally accept minors of K for large values of N; use incremental kernel PCA.
    Convolve the SDE kernel with others for SVMs? SDE unfolds the manifold; the other kernel makes the problem linearly separable again. Only makes sense if SDE simplifies the problem.
    Analyze the complexity of the SDP.

  • Conclusions:
    Using SDP, SDE can learn kernel matrices to unfold data embedded in manifolds, without requiring parameters.
    Kernel PCA then reduces the dimensionality.
    Excellent for nonlinear dimensionality reduction / manifold learning; dramatic results when the difference in dimensionalities is high.
    Poorly suited for SVM classification.