
Algorithms for manifold learning

Lawrence Cayton
lcayton@cs.ucsd.edu

June 15, 2005

Abstract

Manifold learning is a popular recent approach to nonlinear dimensionality reduction. Algorithms for this task are based on the idea that the dimensionality of many data sets is only artificially high; though each data point consists of perhaps thousands of features, it may be described as a function of only a few underlying parameters. That is, the data points are actually samples from a low-dimensional manifold that is embedded in a high-dimensional space. Manifold learning algorithms attempt to uncover these parameters in order to find a low-dimensional representation of the data. In this paper, we discuss the motivation, background, and algorithms proposed for manifold learning. Isomap, Locally Linear Embedding, Laplacian Eigenmaps, Semidefinite Embedding, and a host of variants of these algorithms are examined.

1 Introduction

Many recent applications of machine learning – in data mining, computer vision, and elsewhere – require deriving a classifier or function estimate from an extremely large data set. Modern data sets often consist of a large number of examples, each of which is made up of many features. Though access to an abundance of examples is purely beneficial to an algorithm attempting to generalize from the data, managing a large number of features – some of which may be irrelevant or even misleading – is typically a burden to the algorithm. Overwhelmingly complex feature sets will slow the algorithm down and make finding global optima difficult. To lessen this burden on standard machine learning algorithms (e.g. classifiers, function estimators), a number of techniques have been developed to vastly reduce the quantity of features in a data set —i.e. to reduce the dimensionality of data.

Dimensionality reduction has other, related uses in addition to simplifying data so that it can be efficiently processed. Perhaps the most obvious is visualization; if data lies in a 100-dimensional space, one cannot get an intuitive feel for what the data looks like. However, if a meaningful two- or three-dimensional representation of the data can be found, then it is possible to “eyeball” it. Though this may seem like a trivial point, many statistical and machine learning algorithms have very poor optimality guarantees, so the ability to actually see the data and the output of an algorithm is of great practical interest.

Beyond visualization, a dimensionality reduction procedure may help reveal the underlying forces governing a data set. For example, suppose we are to classify an email as spam or not spam. A typical approach to this problem would be to represent an email as a vector of counts of the words appearing in the email. The dimensionality of this data can easily be in the hundreds, yet an effective dimensionality reduction technique may reveal that there are only a few exceptionally telling features, such as the word “Viagra.”

There are many approaches to dimensionality reduction based on a variety of assumptions and used in a variety of contexts. We will focus on an approach initiated recently based on the observation that high-dimensional data is often much simpler than the dimensionality would indicate. In particular, a given high-dimensional data set may contain many features that are all measurements of the same underlying cause, and so are closely related. This type of phenomenon is common, for example, when taking video footage of a single object from multiple angles simultaneously. The features of such a data set contain much overlapping information; it would be helpful to somehow get a simplified, non-overlapping representation of the data whose features are identifiable with the underlying parameters that govern the data. This intuition is formalized using the notion of a manifold: the data set lies along a low-dimensional manifold embedded in a high-dimensional space, where the low-dimensional space reflects the underlying parameters and the high-dimensional space is the feature space. Attempting to uncover this manifold structure in a data set is referred to as manifold learning.

1.1 Organization

We begin by looking at the intuitions, basic mathematics, and assumptions of manifold learning in section 2. In section 3, we detail several algorithms proposed for learning manifolds: Isomap, Locally Linear Embedding, Laplacian Eigenmaps, Semidefinite Embedding, and some variants. These algorithms will be compared and contrasted with one another in section 4. We conclude with future directions for research in section 5.

1.2 Notation and Conventions

Throughout, we will be interested in the problem of reducing the dimensionality of a given data set consisting of high-dimensional points in Euclidean space.

• The high-dimensional input points will be referred to as x1, x2, . . . , xn. At times it will be convenient to work with these points as a single matrix X, where the ith row of X is xi.

• The low-dimensional representations that the dimensionality reduction algorithms find will be referred to as y1, y2, . . . , yn. Y is the matrix of these points.

• n is the number of points in the input.

• D is the dimensionality of the input (i.e. xi ∈ RD).

• d is the dimensionality of the manifold that the input is assumed to lie on and, accordingly, the dimensionality of the output (i.e. yi ∈ Rd).

• k is the number of nearest neighbors used by a particular algorithm.

• N(i) denotes the set of the k-nearest neighbors of xi.

• Throughout, we assume that the (eigenvector, eigenvalue) pairs are ordered by the eigenvalues. That is, if the (eigenvector, eigenvalue) pairs are (vi, λi), for i = 1, . . . , n, then λ1 ≥ λ2 ≥ · · · ≥ λn. We refer to v1, . . . , vd as the top d eigenvectors, and vn−d+1, . . . , vn as the bottom d eigenvectors.

2 Preliminaries

In this section, we consider the motivations for manifold learning and introduce the basic concepts at work. We have already discussed the importance of dimensionality reduction for machine learning; we now look at a classical procedure for this task and then turn to more recent efforts.

2.1 Linear Dimensionality Reduction

Perhaps the most popular algorithm for dimensionality reduction is Principal Components Analysis (PCA). Given a data set, PCA finds the directions (vectors) along which the data has maximum variance, in addition to the relative importance of these directions. For example, suppose that we feed a set of three-dimensional points that all lie on a two-dimensional plane to PCA. PCA will return two vectors that span the plane along with a third vector that is orthogonal to the plane (so that all of R3 is spanned by these vectors). The two vectors that span the plane will be given a positive weight, but the third vector will have a weight of zero, since the data does not vary along that direction. PCA is most useful when the data lies on or close to a linear subspace. Given this type of data, PCA will find a basis for the linear subspace and allow one to disregard the irrelevant features.

The manifold learning algorithms may be viewed as non-linear analogs to PCA, so it is worth detailing PCA somewhat. Suppose that X ∈ Rn×D is a matrix whose rows are D-dimensional data points. We are looking for the d-dimensional linear subspace of RD along which the data has maximum variance. If in fact the data lies perfectly along a subspace of RD, PCA will reveal that subspace; otherwise, PCA will introduce some error. The objective function it optimizes is

max_V var(XV),

where V is an orthogonal D × d matrix. If d = 1, then V is simply a unit-length vector which gives the direction of maximum variance. In general, the columns of V give the d dimensions that we project the original data onto. This cost function admits an eigenvalue-based solution, which we now detail for d = 1.


The direction of maximum variation is given by a unit vector v onto which the data points are projected, and it is found as follows:

max_{‖v‖=1} var(Xv) = max_{‖v‖=1} E(Xv)² − (E(Xv))²       (1)
                    = max_{‖v‖=1} E(Xv)²                   (2)
                    = max_{‖v‖=1} Σi (xi⊤v)²               (3)
                    = max_{‖v‖=1} Σi (v⊤xi)(xi⊤v)          (4)
                    = max_{‖v‖=1} v⊤(Σi xi xi⊤)v           (5)
                    = max_{‖v‖=1} v⊤X⊤Xv.                  (6)

Equality (2) follows from assuming the data is mean-centered, without loss of generality. The last line is a form of Rayleigh's quotient and so is maximized by setting v to the eigenvector with the largest corresponding eigenvalue. In general, the d-dimensional PCA solution is XV, where V is the D × d matrix whose columns are the top d eigenvectors (i.e. (vi, λi) are (eigenvector, eigenvalue) pairs with λ1 ≥ λ2 ≥ · · · ≥ λd).
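To make this recipe concrete, here is a minimal NumPy sketch (not part of the original paper) that computes the d-dimensional PCA solution XV by taking the top d eigenvectors of X⊤X after mean-centering:

import numpy as np

def pca(X, d):
    """Project the rows of X onto the top d principal directions.

    X : (n, D) data matrix, one point per row.
    Returns the (n, d) projected data and the top d eigenvalues.
    """
    Xc = X - X.mean(axis=0)                    # mean-center, as assumed in equality (2)
    evals, evecs = np.linalg.eigh(Xc.T @ Xc)   # eigh returns eigenvalues in ascending order
    order = np.argsort(evals)[::-1]            # reorder so lambda_1 >= lambda_2 >= ...
    V = evecs[:, order[:d]]                    # top d eigenvectors, a D x d matrix
    return Xc @ V, evals[order[:d]]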

Despite PCA's popularity, it has a number of limitations. Perhaps the most blatant drawback is the requirement that the data lie on a linear subspace. Returning to the previous example, what if the plane were curled as it is in figure 1? Though the data is still intuitively two-dimensional, PCA will not correctly extract this two-dimensional structure. The problem is that this data set, commonly referred to as the “swiss roll,” is a two-dimensional manifold, not a two-dimensional subspace. Manifold learning algorithms essentially attempt to duplicate the behavior of PCA, but on manifolds instead of linear subspaces.

We now briefly review the concept of a manifold and formalize the manifold learning problem.

2.2 Manifolds

Consider the curve shown in figure 2. Note that the curve is in R3, yet it has zero volume, and in fact zero area. The extrinsic dimensionality – three – is somewhat misleading since the curve can be parameterized by a single variable. One way of formalizing this intuition is via the idea of a manifold: the curve is a one-dimensional manifold because it locally “looks like” a copy of R1.

Figure 1: A curled plane: the swiss roll.

Figure 2: A one-dimensional manifold embedded in three dimensions.


Let us quickly review some basic terminology from geometry and topology in order to crystallize this notion of dimensionality.

Definition 1. A homeomorphism is a continuous function whose inverse is also a continuous function.

Definition 2. A d-dimensional manifold M is a set that is locally homeomorphic with Rd. That is, for each x ∈ M, there is an open neighborhood around x, Nx, and a homeomorphism f : Nx → Rd. These neighborhoods are referred to as coordinate patches, and the maps are referred to as coordinate charts. The image of the coordinate charts is referred to as the parameter space.

Manifolds are terrifically well-studied in mathematics and the above definition is extremely general. We will be interested only in the case where M is a subset of RD, where D is typically much larger than d. In other words, the manifold will lie in a high-dimensional space (RD), but will be homeomorphic with a low-dimensional space (Rd, with d < D).

Additionally, all of the algorithms we will look at impose some smoothness requirements that further constrain the class of manifolds considered.

Definition 3. A smooth (or differentiable) manifold is a manifold such that each coordinate chart is differentiable with a differentiable inverse (i.e., each coordinate chart is a diffeomorphism).

We will be using the term embedding in a number of places, though our usage will be somewhat loose and inconsistent. An embedding of a manifold M into RD is a smooth homeomorphism from M to a subset of RD. The algorithms discussed in this paper find embeddings of discrete point-sets, by which we simply mean a mapping of the point set into another space (typically lower-dimensional Euclidean space). An embedding of a dissimilarity matrix into Rd, as discussed in section 3.1.2, is a configuration of points whose interpoint distances match those given in the matrix.

With this background in mind, we proceed to the main topic of this paper.

2.3 Manifold Learning

We have discussed the importance of dimensionality reduction and provided some intuition concerning the inadequacy of PCA for data sets lying along some non-flat surface; we now turn to the manifold learning approach to dimensionality reduction.

We are given data x1, x2, . . . , xn ∈ RD and we wish to reduce the dimensionality of this data. PCA is appropriate if the data lies on a low-dimensional subspace of RD. Instead, we assume only that the data lies on a d-dimensional manifold embedded into RD, where d < D. Moreover, we assume that the manifold is given by a single coordinate chart.1 We can now describe the problem formally.

Problem: Given points x1, . . . , xn ∈ RD that lie on a d-dimensional manifold M that can be described by a single coordinate chart f : M → Rd, find y1, . . . , yn ∈ Rd, where yi := f(xi).

Solving this problem is referred to as manifold learning, since we are trying to “learn” a manifold from a set of points. A simple and oft-used example in the manifold learning literature is the swiss roll, a two-dimensional manifold embedded in R3. Figure 3 shows the swiss roll and a “learned” two-dimensional embedding of the manifold found using Isomap, an algorithm discussed later. Notice that points nearby in the original data set neighbor one another in the two-dimensional data set; this effect occurs because the chart f between these two graphs is a homeomorphism (as are all charts).

Though the extent to which “real-world” data sets exhibit low-dimensional manifold structure is still being explored, a number of positive examples have been found. Perhaps the most notable occurrence is in video and image data sets. For example, suppose we have a collection of frames taken from a video of a person rotating his or her head. The dimensionality of the data set is equal to the number of pixels in a frame, which is generally very large. However, the images are actually governed by only a couple of degrees of freedom (such as the angle of rotation). Manifold learning techniques have been successfully applied to a number of similar video and image data sets [Ple03].

3 Algorithms

In this section, we survey a number of algorithms proposed for manifold learning. Our treatment goes roughly in chronological order: Isomap and LLE were the first algorithms developed, followed by Laplacian Eigenmaps.

1 All smooth, compact manifolds can be described by a single coordinate chart, so this assumption is not overly restrictive.


Figure 3: The swiss roll (top) and the learned two-dimensional representation (bottom).

Semidefinite Embedding and the variants of LLE were developed most recently.

Before beginning, we note that every algorithm that we will discuss requires a neighborhood-size parameter k. Every algorithm assumes that within each neighborhood the manifold is approximately flat. Typically, one simply runs an algorithm over a variety of choices of neighborhood size and compares the outputs to pick the appropriate k. Other neighborhood schemes, such as picking a neighborhood radius ε, are sometimes used in place of the k-nearest neighbor scheme.

3.1 Isomap

Isomap [TdL00] – short for isometric feature mapping – was one of the first algorithms introduced for manifold learning. The algorithm is perhaps the best known and most applied among the multitude of procedures now available for the problem at hand. It may be viewed as an extension to Multidimensional Scaling (MDS), a classical method for embedding dissimilarity information into Euclidean space. Isomap consists of two main steps:

1. Estimate the geodesic distances (distances along a manifold) between points in the input using shortest-path distances on the data set's k-nearest neighbor graph.

2. Use MDS to find points in low-dimensional Euclidean space whose interpoint distances match the distances found in step 1.

We consider each of these steps in turn.

3.1.1 Estimating geodesic distances

Again, we are assuming that the data is drawn from a d-dimensional manifold embedded in RD and we hope to find a chart back into Rd. Isomap assumes further that there is an isometric chart —i.e. a chart that preserves the distances between points. That is, if xi, xj are points on the manifold and G(xi, xj) is the geodesic distance between them, then there is a chart f : M → Rd such that

‖f(xi)− f(xj)‖ = G(xi, xj).

Furthermore, it is assumed that the manifold is smooth enough that the geodesic distance between nearby points is approximately linear. Thus, the Euclidean distance between nearby points in the high-dimensional data space is assumed to be a good approximation to the geodesic distances between these points.

For points that are distant in the high-dimensional space, the Euclidean distance between them is not a good estimate of the geodesic distance. The problem is that though the manifold is locally linear (at least approximately), this approximation breaks down as the distance between points increases. Estimating distances between distant points, thus, requires a different technique. To perform this estimation, the Isomap algorithm first constructs a k-nearest neighbor graph that is weighted by the Euclidean distances. Then, the algorithm runs a shortest-path algorithm (Dijkstra's or Floyd's) and uses its output as the estimates for the remainder of the geodesic distances.
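A minimal sketch of this estimation step, assuming NumPy/SciPy/scikit-learn as the numerical tools (the paper itself prescribes no particular implementation):

import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def estimate_geodesics(X, k):
    """Approximate geodesic distances by shortest paths on the k-NN graph.

    X : (n, D) array of input points; returns an (n, n) matrix of
    estimated geodesic distances (np.inf where the graph is disconnected).
    """
    # k-nearest neighbor graph with edges weighted by Euclidean distances
    G = kneighbors_graph(X, n_neighbors=k, mode='distance')
    # Dijkstra's algorithm from every source; treat edges as undirected
    return shortest_path(G, method='D', directed=False)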

3.1.2 Multidimensional Scaling

Once these geodesic distances are calculated, the Isomap algorithm finds points whose Euclidean distances equal these geodesic distances. Since the manifold is isometrically embedded, such points exist, and in fact, they are unique up to translation and rotation. Multidimensional Scaling [CC01] is a classical technique that may be used to find such points.


The basic problem it solves is this: given a matrix D ∈ Rn×n of dissimilarities, construct a set of points whose interpoint Euclidean distances match those in D closely. There are numerous cost functions for this task and a variety of algorithms for minimizing these cost functions. We focus on classical MDS (cMDS), since that is what is used by Isomap.

The cMDS algorithm, shown in figure 4, is based on the following theorem, which establishes a connection between the space of Euclidean distance matrices and the space of Gram matrices (inner-product matrices).

Theorem 1. A nonnegative symmetric matrix D ∈ Rn×n, with zeros on the diagonal, is a Euclidean distance matrix if and only if B := −(1/2)HDH, where H := I − (1/n)11⊤, is positive semidefinite. Furthermore, this B will be the Gram matrix for a mean-centered configuration with interpoint distances given by D.

A proof of this theorem is given in the appendix. This theorem gives a simple technique to convert a Euclidean distance matrix D into an appropriate Gram matrix B. Let X be the configuration that we are after. Since B is its Gram matrix, B = XX⊤. To get X from B, we spectrally decompose B into UΛU⊤. Then, X = UΛ^{1/2}.

What if D is not Euclidean? In the application of Isomap, for example, D will generally not be perfectly Euclidean since we are only able to approximate the geodesic distances. In this case, B, as described in the above theorem, will not be positive semidefinite and thus will not be a Gram matrix. To handle this case, cMDS projects B onto the cone of positive semidefinite matrices by setting its negative eigenvalues to 0 (step 3).

Finally, users of cMDS often want a low-dimensional embedding. In general, X := UΛ+^{1/2} (step 4) will be an n-dimensional configuration. This dimensionality is reduced to d in step 5 of the algorithm by simply removing the n − d trailing columns of X. Doing so is equivalent to projecting X onto its top d principal components, as we now demonstrate. Let X := UΛ+^{1/2} be the n-dimensional configuration found in step 4. Then, UΛ+U⊤ is the spectral decomposition of XX⊤. Suppose that we use PCA to project X onto d dimensions. Then, PCA returns XV, where V ∈ Rn×d and the columns of V are given by the eigenvectors of X⊤X: v1, . . . , vd. Let us compute these eigenvectors.

X⊤X vi = ξi vi
(UΛ+^{1/2})⊤(UΛ+^{1/2}) vi = ξi vi
(Λ+^{1/2} U⊤U Λ+^{1/2}) vi = ξi vi
Λ+ vi = ξi vi.

Thus, vi = ei, where ei is the ith standard basis vector for Rn, and ξi = [Λ+]ii. Then, XV = X[e1 · · · ed] = [X]n×d, so the truncation performed in step 5 is indeed equivalent to projecting X onto its first d principal components.

classical Multidimensional Scaling
input: D ∈ Rn×n (Dii = 0, Dij ≥ 0), d ∈ {1, . . . , n}

1. Set B := −(1/2)HDH, where H = I − (1/n)11⊤ is the centering matrix.

2. Compute the spectral decomposition of B: B = UΛU⊤.

3. Form Λ+ by setting [Λ+]ij := max{Λij, 0}.

4. Set X := UΛ+^{1/2}.

5. Return [X]n×d.

Figure 4: Classical MDS.

This equivalence is useful because if a set of dissimilarities can be realized in d dimensions, then all but the first d columns of X will be zeros; moreover, Λ+ reveals the relative importance of each dimension, since these eigenvalues are the same as those found by PCA. In other words, classical MDS automatically finds the dimensionality of the configuration associated with a set of dissimilarities, where dimensionality refers to the dimensionality of the subspace along which X lies. Within the context of Isomap, classical MDS automatically finds the dimensionality of the parameter space of the manifold, which is the dimensionality of the manifold.

Classical MDS finds the optimal d-dimensional configuration of points for the cost function ‖HDH − HD*H‖²_F, where D* runs over all Euclidean distance matrices for d-dimensional configurations. Unfortunately, this cost function makes cMDS tremendously sensitive to noise and outliers; moreover, the problem is NP-hard for a wide variety of more robust cost functions [CD05].
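For concreteness, the steps of figure 4 can be written out as the following sketch (taking squared dissimilarities as input, as in step 2 of figure 5; an illustration rather than a reference implementation):

import numpy as np

def classical_mds(D_sq, d):
    """Classical MDS (figure 4). D_sq holds squared dissimilarities."""
    n = D_sq.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix H = I - (1/n) 11^T
    B = -0.5 * H @ D_sq @ H                  # candidate Gram matrix
    evals, U = np.linalg.eigh(B)             # spectral decomposition (ascending order)
    order = np.argsort(evals)[::-1]          # reorder so lambda_1 >= lambda_2 >= ...
    evals, U = evals[order], U[:, order]
    lam_plus = np.maximum(evals, 0.0)        # project onto the cone of PSD matrices
    X = U * np.sqrt(lam_plus)                # X = U Lambda_+^{1/2}
    return X[:, :d], evals                   # keep the first d columns; evals show dimensionality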

3.1.3 Putting it together

The Isomap algorithm consists of estimating geodesic distances using shortest-path distances and then finding an embedding of these distances in Euclidean space using cMDS. Figure 5 summarizes the algorithm.

One particularly helpful feature of Isomap – not found in some of the other algorithms – is that it automatically provides an estimate of the dimensionality of the underlying manifold. In particular, the number of non-zero eigenvalues found by cMDS gives the underlying dimensionality. Generally, however, there will be some noise and uncertainty in the estimation process, so one looks for a tell-tale gap in the spectrum to determine the dimensionality.

Unlike many of the algorithms to be discussed, Isomap has an optimality guarantee [BdSLT00]; roughly, Isomap is guaranteed to recover the parameterization of a manifold under the following assumptions.

1. The manifold is isometrically embedded into RD.

2. The underlying parameter space is convex —i.e. the distance between any two points in the parameter space is given by a geodesic that lies entirely within the parameter space. Intuitively, the parameter space of the manifold from which we are sampling cannot contain any holes.

3. The manifold is well sampled everywhere.

4. The manifold is compact.

The proof of this optimality begins by showing that, with enough samples, the graph distances approach the underlying geodesic distances. In the limit, these distances will be perfectly Euclidean and cMDS will return the correct low-dimensional points. Moreover, cMDS is continuous: small changes in the input will result in small changes in the output. Because of this robustness to perturbations, the solution of cMDS will converge to the correct parameterization in the limit.

Isomap has been extended in two major directions: to handle conformal mappings and to handle very large data sets [dST03]. The former extension is referred to as C-Isomap and addresses the issue that many embeddings preserve angles, but not distances. C-Isomap has a theoretical optimality guarantee similar to that of Isomap, but requires that the manifold be uniformly sampled.

The second extension, Landmark Isomap, is based on using landmark points to speed up the calculation of interpoint distances and cMDS. Both of these steps require O(n³) timesteps2, which is prohibitive for large data sets. The basic idea is to select a small set of l landmark points, where l > d but l ≪ n. Next, the approximate geodesic distances from each point to the landmark points are computed (in contrast to computing all n × n interpoint distances, only n × l are computed). Then, cMDS is run on the l × l distance matrix for the landmark points. Finally, the remainder of the points are located using a simple formula based on their distances to the landmarks. This technique brings the time complexity of the two computational bottlenecks of Isomap down considerably: the time complexity of L-Isomap is O(dln log n + l³) instead of O(n³).

2 The distance calculation can be brought down to O(kn² log n) using Fibonacci heaps.

Isomap
input: x1, . . . , xn ∈ RD, k

1. Form the k-nearest neighbor graph with edge weights Wij := ‖xi − xj‖ for neighboring points xi, xj.

2. Compute the shortest path distances between all pairs of points using Dijkstra's or Floyd's algorithm. Store the squares of these distances in D.

3. Return Y := cMDS(D).

Figure 5: Isomap.
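Putting the pieces together, the two-step structure of figure 5 reads as follows, reusing the estimate_geodesics and classical_mds sketches given earlier (helper names of our own, not from the paper):

def isomap(X, k, d):
    """Isomap (figure 5): geodesic estimation followed by classical MDS.

    Assumes the k-NN graph is connected (no np.inf entries in G).
    """
    G = estimate_geodesics(X, k)              # shortest-path distance estimates
    Y, evals = classical_mds(G ** 2, d)       # cMDS on squared distances, per step 2
    return Y, evals                           # a gap in evals hints at the manifold's dimensionality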

3.2 Locally Linear Embedding

Locally Linear Embedding (LLE) [SR03] was introduced at about the same time as Isomap, but is based on a substantially different intuition. The idea comes from visualizing a manifold as a collection of overlapping coordinate patches. If the neighborhood sizes are small and the manifold is sufficiently smooth, then these patches will be approximately linear. Moreover, the chart from the manifold to Rd will be roughly linear on these small patches. The idea, then, is to identify these linear patches, characterize their geometry, and find a mapping to Rd that preserves this local geometry and is nearly linear. It is assumed that these local patches will overlap with one another so that the local reconstructions will combine into a global one.

As in all of these methods, the number of neighbors chosen is a parameter of the algorithm. Let N(i) be the set of the k-nearest neighbors of point xi. The first step of the algorithm models the manifold as a collection of linear patches and attempts to characterize the geometry of these linear patches. To do so, it attempts to represent xi as a weighted, convex combination of its nearest neighbors. The weights are chosen to minimize the following squared error for each i:

‖xi − Σ_{j∈N(i)} Wij xj‖².

The weight matrix W is used as a surrogate for the local geometry of the patches. Intuitively, Wi reveals the layout of the points around xi. There are a couple of constraints on the weights: each row must sum to one (equivalently, each point is represented as a convex combination of its neighbors) and Wij = 0 if j ∉ N(i). The second constraint reflects that LLE is a local method; the first makes the weights invariant to global translations: if each xi is shifted by α then

‖xi + α − Σ_{j∈N(i)} Wij(xj + α)‖² = ‖xi − Σ_{j∈N(i)} Wij xj‖²,

so W is unaffected. Moreover, W is invariant to global rotations and scalings.

Fortunately, there is a closed-form solution for W, which may be derived using Lagrange multipliers. In particular, the reconstruction weights for each point xi are given by

W̃ij = (Σk [C^{-1}]jk) / (Σlm [C^{-1}]lm),

where C is the local covariance matrix with entries Cjk := (xi − ηj)⊤(xi − ηk) and ηj, ηk are neighbors of xi. W̃ ∈ Rn×k is then transformed into the sparse n × n matrix W; Wij = W̃il if xj is the lth neighbor of xi, and is 0 if j ∉ N(i).

Wi is a characterization of the local geometry of the manifold around xi. Since it is invariant to rotations, scalings, and translations, Wi will also characterize the local geometry around xi for any linear mapping of the neighborhood patch surrounding xi. We expect that the chart from this neighborhood patch to the parameter space should be approximately linear since the manifold is smooth. Thus, Wi should capture the local geometry of the parameter space as well. The second step of the algorithm finds a configuration in d dimensions (the dimensionality of the parameter space) whose local geometry is characterized well by W. Unlike with Isomap, d must be known a priori or estimated. The d-dimensional configuration is found by minimizing

Σi ‖yi − Σj Wij yj‖²    (7)

with respect to y1, . . . , yn ∈ Rd. Note that this expression has the same form as the error minimized in the first step of the algorithm, but that we are now minimizing with respect to y1, . . . , yn rather than W. The cost function (7) may be rewritten as

tr(Y⊤MY),

where

Mij := δij − Wij − Wji + Σk Wki Wkj,

and δij := 1 if i = j and 0 otherwise. M may be written more concisely as (I − W)⊤(I − W). There are a couple of constraints on Y. First, Y⊤Y = I, which forces the solution to be of rank d. Second, Σi Yi = 0; this constraint centers the embedding on the origin. This cost function is a form of Rayleigh's quotient, so it is minimized by setting the columns of Y to the bottom d eigenvectors of M. However, the bottom (eigenvector, eigenvalue) pair is (1, 0), so the final coordinate of each point is identical. To avoid this degenerate dimension, the d-dimensional embedding is given by the bottom d non-constant eigenvectors.

Locally Linear Embedding
input: x1, . . . , xn ∈ RD, d, k

1. Compute reconstruction weights. For each point xi, set

   W̃ij := (Σk [C^{-1}]jk) / (Σlm [C^{-1}]lm)

   and form the sparse n × n weight matrix W from W̃ as described above.

2. Compute the low-dimensional embedding.

   • Let U be the matrix whose columns are the eigenvectors of (I − W)⊤(I − W) with nonzero accompanying eigenvalues.

   • Return Y := [U]n×d.

Figure 6: Locally Linear Embedding.

Conformal Eigenmaps [SS05] is a recent technique that may be used in conjunction with LLE to provide an automatic estimate of d, the dimensionality of the manifold. LLE is summarized in figure 6.
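A compact sketch of both LLE steps follows; the small regularization term added to the local covariance C is an implementation convenience for numerical stability and is not part of the algorithm as described above:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def lle(X, d, k, reg=1e-3):
    """Locally Linear Embedding (figure 6), sketched with dense matrices."""
    n = X.shape[0]
    # k nearest neighbors of each point (the first neighbor returned is the point itself)
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nbrs.kneighbors(X)
    W = np.zeros((n, n))
    for i in range(n):
        neighbors = idx[i, 1:]                    # drop xi itself
        Z = X[neighbors] - X[i]                   # shift the neighborhood to the origin
        C = Z @ Z.T                               # local covariance matrix (k x k)
        C += reg * np.trace(C) * np.eye(k)        # regularize for numerical stability
        w = np.linalg.solve(C, np.ones(k))        # closed-form reconstruction weights
        W[i, neighbors] = w / w.sum()             # each row sums to one
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    evals, evecs = np.linalg.eigh(M)              # ascending eigenvalues
    # discard the bottom (constant) eigenvector; keep the next d
    return evecs[:, 1:d + 1]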


3.3 Laplacian Eigenmaps

The Laplacian Eigenmaps [BN01] algorithm is based on ideas from spectral graph theory. Given a graph and a matrix of edge weights, W, the graph Laplacian is defined as L := D − W, where D is the diagonal matrix with elements Dii = Σj Wij.3 The eigenvalues and eigenvectors of the Laplacian reveal a wealth of information about the graph, such as whether it is complete or connected. Here, the Laplacian will be exploited to capture local information about the manifold.

We define a “local similarity” matrix W which reflects the degree to which points are near to one another. There are two choices for W:

1. Wij = 1 if xj is one of the k-nearest neighbors of xi, and 0 otherwise.

2. Wij = e^{−‖xi−xj‖²/2σ²} for neighboring nodes, and 0 otherwise. This is the Gaussian heat kernel, which has an interesting theoretical justification given in [BN03]. On the other hand, using the heat kernel requires manually setting σ, so it is considerably less convenient than the simple-minded weight scheme.

We use this similarity matrix W to find points y1, . . . , yn ∈ Rd that are the low-dimensional analogs of x1, . . . , xn. Like with LLE, d is a parameter of the algorithm, but it can be automatically estimated by using Conformal Eigenmaps as a post-processor. For simplicity, we first derive the algorithm for d = 1 —i.e. each yi is a scalar.

Intuitively, if xi and xj have a high degree of similarity, as measured by W, then they lie very near to one another on the manifold. Thus, yi and yj should be near to one another. This intuition leads to the following cost function, which we wish to minimize:

Σij Wij (yi − yj)².    (8)

This function is minimized by setting y1 = · · · = yn = 0; we avoid this trivial solution by enforcing the constraint

y⊤Dy = 1.    (9)

This constraint effectively fixes the scale at which the yi's are placed. Without it, setting y′i = αyi, for any constant α < 1, would give a configuration with lower cost than the configuration y1, . . . , yn. Note that we could achieve the same result with the constraint y⊤y = 1; however, using (9) leads to a simple eigenvalue solution.

3 L is referred to as the unnormalized Laplacian in many places. The normalized version is D^{-1/2} L D^{-1/2}.

Laplacian Eigenmaps
input: x1, . . . , xn ∈ RD, d, k, σ

1. Set Wij := e^{−‖xi−xj‖²/2σ²} if xj ∈ N(i), and 0 otherwise.

2. Let U be the matrix whose columns are the eigenvectors of Ly = λDy with nonzero accompanying eigenvalues.

3. Return Y := [U]n×d.

Figure 7: Laplacian Eigenmaps.

Using Lagrange multipliers, it can be shown that the bottom eigenvector of the generalized eigenvalue problem Ly = λDy minimizes (8). However, the bottom (eigenvalue, eigenvector) pair is (0, 1), which is yet another trivial solution (it maps each point to 1). Avoiding this solution, the embedding is given by the eigenvector with the smallest non-zero eigenvalue.

We now turn to the more general case, where we are looking for d-dimensional representatives y1, . . . , yn of the data points. We will find an n × d matrix Y whose rows are the embedded points. As before, we wish to minimize the distance between similar points; our cost function is thus

Σij Wij ‖yi − yj‖² = tr(Y⊤LY).    (10)

In the one-dimensional case, we added a constraint to avoid the trivial solution of all zeros. Analogously, we must constrain Y so that we do not end up with a solution of dimensionality less than d. The constraint Y⊤DY = I forces Y to be of full dimensionality; we could use Y⊤Y = I instead, but Y⊤DY = I is preferable because it leads to a closed-form eigenvalue solution. Again, we may solve this problem via the generalized eigenvalue problem Ly = λDy. This time, however, we need the bottom d non-zero eigenvalues and their eigenvectors. Y is then given by the matrix whose columns are these d eigenvectors.

The algorithm is summarized in figure 7.
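A sketch of figure 7 using the heat-kernel weights; symmetrizing the k-NN graph and the use of SciPy's generalized symmetric eigensolver are implementation choices of ours, not prescribed by the paper:

import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.linalg import eigh

def laplacian_eigenmaps(X, d, k, sigma=1.0):
    """Laplacian Eigenmaps (figure 7) with Gaussian heat-kernel weights."""
    # Euclidean distances on the k-NN graph, symmetrized so that W is symmetric
    A = kneighbors_graph(X, n_neighbors=k, mode='distance').toarray()
    A = np.maximum(A, A.T)
    W = np.where(A > 0, np.exp(-A ** 2 / (2 * sigma ** 2)), 0.0)
    D = np.diag(W.sum(axis=1))
    L = D - W                                  # unnormalized graph Laplacian
    # generalized eigenproblem L y = lambda D y; eigenvalues come back ascending
    evals, evecs = eigh(L, D)
    # skip the trivial constant eigenvector (eigenvalue 0); keep the next d
    return evecs[:, 1:d + 1]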

3.3.1 Connection to LLE

Though LLE is based on a different intuition than Laplacian Eigenmaps, Belkin and Niyogi [BN03] were able to show that the algorithms are equivalent in some situations. The equivalence rests on another motivation for Laplacian Eigenmaps, not discussed here, that frames the algorithm as a discretized version of finding the eigenfunctions4 of the Laplace-Beltrami operator, L, on a manifold. The authors demonstrate that LLE, under certain assumptions, is finding the eigenfunctions of (1/2)L², which are the same as the eigenfunctions of L. In the proof, they assume that the neighborhood patches selected by LLE are perfectly locally linear (i.e., the neighbors of a point x actually lie on the tangent plane at x). Furthermore, the equivalence is only shown in expectation over locally uniform measures on the manifold.

The connection is an intriguing one, but has not been exploited or even demonstrated on any real data sets. It is not clear how reasonable the assumptions are. These two algorithms are still considered distinct, with different characteristics.

3.4 Semidefinite Embedding

Semidefinite embedding (SDE) [WSS04], yet another algorithm for manifold learning, is based on a physical intuition. Imagine that each point is connected to its nearest neighbors with a rigid rod. Now, we take this structure and pull it as far apart as possible —i.e. we attempt to maximize the distance between points that are not neighbors. If the manifold looks like a curved version of the parameter space, then hopefully this procedure will unravel it properly.

To make this intuition a little more concrete, let us look at a simple example. Suppose we sample some points from a sine curve in R2. The sine curve is a one-dimensional manifold embedded into R2, so let us apply the SDE idea to it to try to find a one-dimensional representation. We first attach each point to its two nearest neighbors. We then pull apart the resulting structure as far as possible. Figure 8 shows the steps of the procedure. As the figure shows, we have successfully found a one-dimensional representation of the data set. On the other hand, if the neighbors were chosen a bit differently, the procedure would have failed.

With this simple example in mind, we proceed to describe the algorithm more precisely. The SDE algorithm uses a semidefinite program [VB96], which is essentially a linear program with additional constraints that force the variables to form a positive semidefinite matrix.

4 In the algorithm presented, we are looking for eigenvectors. The eigenvectors are the eigenfunctions evaluated at the data points.

Figure 8: An illustration of the idea behind SDE. The top figure is a few points sampled from a sine curve. The middle figure shows the result of connecting each point to its two neighbors. The bottom shows the result of the “pulling” step.

Recall that a matrix is a Gram matrix (an inner-product matrix) if and only if it is positive semidefinite. The SDE algorithm uses a semidefinite program to find an appropriate Gram matrix, then extracts a configuration from the Gram matrix using the same trick employed in classical MDS.

The primary constraint of the program is that distances between neighboring points remain the same —i.e. if xi and xj are neighbors, then ‖yi − yj‖ = ‖xi − xj‖, where yi and yj are the low-dimensional representatives of xi and xj. However, we must express this constraint in terms of the Gram matrix B, since this is the variable the semidefinite program works with (and not Y). Translating the constraint is simple enough, though, since

‖yi − yj‖² = ‖yi‖² + ‖yj‖² − 2⟨yi, yj⟩
           = ⟨yi, yi⟩ + ⟨yj, yj⟩ − 2⟨yi, yj⟩
           = Bii + Bjj − 2Bij.

The constraint, then, is

Bii + Bjj − 2Bij = ‖xi − xj‖²

for all neighboring points xi, xj. We wish to “pull” the remainder of the points as far apart as possible, so we maximize the interpoint distances of our configuration subject to the above constraint. The objective function is:

max Σ_{i,j} ‖yi − yj‖² = max Σ_{i,j} (Bii + Bjj − 2Bij)
                       = max Σ_{i,j} (Bii + Bjj)
                       = max tr(B).

The second equality follows from a regularization constraint: since the interpoint distances are invariant to translation, there is an extra degree of freedom in the problem, which we remove by enforcing

Σij Bij = 0.

Semidefinite Embedding
input: x1, . . . , xn ∈ RD, k

1. Solve the following semidefinite program:
      maximize tr(B)
      subject to:  Σij Bij = 0;
                   Bii + Bjj − 2Bij = ‖xi − xj‖²  (for all neighboring xi, xj);
                   B ⪰ 0.

2. Compute the spectral decomposition of B: B = UΛU⊤.

3. Form Λ+ by setting [Λ+]ij := max{Λij, 0}.

4. Return Y := UΛ+^{1/2}.

Figure 9: Semidefinite Embedding.

The semidefinite program may be solved using any of a wide variety of software packages. The second part of the algorithm is identical to the second part of classical MDS: the eigenvalue decomposition UΛU⊤ is computed for B and the new coordinates are given by UΛ^{1/2}. Additionally, the eigenvalues reveal the underlying dimensionality of the manifold. The algorithm is summarized in figure 9.
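A sketch of figure 9 using the cvxpy modelling package as the semidefinite-program solver (an assumed choice; any SDP solver would do, and, as discussed next, only small instances are practical):

import numpy as np
import cvxpy as cp
from sklearn.neighbors import kneighbors_graph

def sde(X, k):
    """Semidefinite Embedding (figure 9); practical only for small n."""
    n = X.shape[0]
    neighbors = kneighbors_graph(X, n_neighbors=k).toarray() > 0
    B = cp.Variable((n, n), PSD=True)              # Gram matrix variable, B >= 0
    constraints = [cp.sum(B) == 0]                 # centering: sum_ij B_ij = 0
    for i in range(n):
        for j in range(n):
            if neighbors[i, j]:
                dij2 = np.sum((X[i] - X[j]) ** 2)
                # preserve distances between neighboring points
                constraints.append(B[i, i] + B[j, j] - 2 * B[i, j] == dij2)
    cp.Problem(cp.Maximize(cp.trace(B)), constraints).solve()
    evals, U = np.linalg.eigh(B.value)             # extract a configuration, as in cMDS
    order = np.argsort(evals)[::-1]
    evals, U = evals[order], U[:, order]
    return U * np.sqrt(np.maximum(evals, 0.0)), evals   # Y = U Lambda_+^{1/2}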

One major gripe with SDE is that solving a semidefinite program requires tremendous computational resources. State-of-the-art semidefinite programming packages cannot handle an instance with n ≥ 2000 or so on a powerful desktop machine. This is a rather severe restriction since real data sets often contain many more than 2000 points. One possible approach to solving larger instances would be to develop a projected subgradient algorithm [Ber99]. This technique resembles gradient descent, but can be used to handle non-differentiable convex functions. This approach has been exploited for a similar problem: finding robust embeddings of dissimilarity information [CD05]. This avenue has not been pursued for SDE.

Instead, there is an approximation known as ℓSDE [WPS05], which uses matrix factorization to keep the size of the semidefinite program down. The idea is to factor the Gram matrix B into QLQ⊤, where L is an l × l Gram matrix of landmark points and l ≪ n. The semidefinite program only finds L, and then the rest of the points are located with respect to the landmarks. This method was a full two orders of magnitude faster than SDE in some experiments [WPS05]. This approximation to SDE appears to produce high-quality results, though more investigation is needed.

3.5 Other algorithms

We have looked at four algorithms for manifold learning, but there are certainly more. In this section, we briefly discuss two other algorithms.

As discussed above, both the LLE and Laplacian Eigenmaps algorithms are Laplacian-based algorithms. Hessian LLE (hLLE) [DG03] is closely related to these algorithms, but replaces the graph Laplacian with a Hessian5 estimator. Like the LLE algorithm, hLLE looks at locally linear patches of the manifold and attempts to map them to a low-dimensional space. LLE models these patches by representing each xi as a weighted combination of its neighbors. In place of this technique, hLLE runs principal components analysis on the neighborhood set to get an estimate of the tangent space at xi (i.e. the linear patch surrounding xi). The second step of LLE maps these linear patches to linear patches in Rd using the weights found in the first step to guide the process. Similarly, hLLE maps the tangent space estimates to linear patches in Rd, warping the patches as little as possible. The “warping” factor is the Frobenius norm of the Hessian matrix. The mapping with minimum norm is found by solving an eigenvalue problem.

The hLLE algorithm has a fairly strong optimality guarantee – stronger, in some sense, than that of Isomap. Whereas Isomap requires that the parameter space be convex and the embedded manifold be globally isometric to the parameter space, hLLE requires only that the parameter space be connected and locally isometric to the manifold. In these cases hLLE is guaranteed to asymptotically recover the parameterization. Additionally, some empirical evidence supporting this theoretical claim has been given: Donoho and Grimes [DG03] demonstrate that hLLE outperforms Isomap on a variant of the swiss roll data set where a hole has been punched out of the surface. Though its theoretical guarantee and some limited empirical evidence make the algorithm a strong contender, hLLE has yet to be widely used, perhaps partly because it is a bit more sophisticated and difficult to implement than Isomap or LLE. It is also worth pointing out that the theoretical guarantees are of a slightly different nature than those of Isomap: whereas Isomap has some finite-sample error bounds, the theorems pertaining to hLLE hold only in the continuum limit. Even so, the hLLE algorithm is a viable alternative to many of the algorithms discussed.

5 Recall that the Hessian of a function is the matrix of its partial second derivatives.

Another recent algorithm proposed for manifold learning is Local Tangent Space Alignment (LTSA) [ZZ04]. Similar to hLLE, the LTSA algorithm estimates the tangent space at each point by performing PCA on its neighborhood set. Since the data is assumed to lie on a d-dimensional manifold and to be locally linear, these tangent spaces will admit a d-dimensional basis. These tangent spaces are aligned to find global coordinates for the points. The alignment is computed by minimizing a cost function that allows any linear transformation of each local coordinate space. LTSA is a relatively recent algorithm that has not found widespread popularity yet; as such, its relative strengths and weaknesses are still under investigation.
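Both hLLE and LTSA begin by estimating the tangent space at each point with a local PCA; the following sketch shows only that shared first step (the Hessian and alignment machinery of the full algorithms is omitted):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_tangent_spaces(X, d, k):
    """Estimate a d-dimensional tangent basis at each point via local PCA."""
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nbrs.kneighbors(X)
    bases = []
    for i in range(X.shape[0]):
        Z = X[idx[i, 1:]] - X[idx[i, 1:]].mean(axis=0)   # centered neighborhood of xi
        # the leading right singular vectors give the principal directions of the patch
        _, _, Vt = np.linalg.svd(Z, full_matrices=False)
        bases.append(Vt[:d])                             # d x D tangent basis at xi
    return np.array(bases)                               # array of shape (n, d, D)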

There are a number of other algorithms in addition to the ones described here. We have touched on those algorithms that have had the greatest impact on developments in the field thus far.

4 Comparisons

We have discussed a myriad of different algorithms, all of which are for roughly the same task. Choosing between these algorithms is difficult; there is no clear “best” one. In this section, we discuss the trade-offs associated with each algorithm.

4.1 Embedding type

One major distinction between these algorithms is the type of embedding that they compute. Isomap, Hessian LLE, and Semidefinite Embedding all look for isometric embeddings —i.e. they all assume that there is a coordinate chart from the parameter space to the high-dimensional space that preserves interpoint distances, and they attempt to uncover this chart. In contrast, LLE (and the C-Isomap variant of Isomap) look for conformal mappings – mappings which preserve local angles, but not necessarily distances. It remains open whether the LLE algorithm actually can recover such mappings correctly. It is unclear exactly what type of embedding the Laplacian Eigenmaps algorithm computes [BN03]; it has been dubbed “proximity preserving” in places [WS04]. This breakdown is summarized in the following table.

Algorithm                   Mapping
C-Isomap                    conformal
Isomap                      isometric
Hessian LLE                 isometric
Laplacian Eigenmaps         unknown
Locally Linear Embedding    conformal
Semidefinite Embedding      isometric

This comparison is not particularly practical, since the form of embedding for a particular data set is usually unknown; however, if, say, an isometric method fails, then perhaps a conformal method would be a reasonable alternative.

4.2 Local versus global

Manifold learning algorithms are commonly split into two camps: local methods and global methods. Isomap is considered a global method because it constructs an embedding derived from the geodesic distances between all pairs of points. LLE is considered a local method because the cost function that it uses to construct an embedding only considers the placement of each point with respect to its neighbors. Similarly, Laplacian Eigenmaps and the derivatives of LLE are local methods. Semidefinite embedding, on the other hand, straddles the local/global division: it is a local method in that intra-neighborhood distances are equated with geodesic distances, but a global method in that its objective function considers distances between all pairs of points. Figure 10 summarizes the breakdown of these algorithms.

The local/global split reveals some distinguishing characteristics between algorithms in the two categories. Local methods tend to characterize the local geometry of manifolds accurately, but break down at the global level. For instance, the points in the embedding found by local methods tend to be well placed with respect to their neighbors, but non-neighbors may be much nearer to one another than they are on the manifold. The reason for these aberrations is simple: the constraints on the local methods do not pertain to non-local points.


Figure 10: The local/global split (LLE and Laplacian Eigenmaps on the local side, Isomap on the global side, and SDE straddling the two).

The behavior for Isomap is just the opposite: the intra-neighborhood distances for points in the embedding tend to be inaccurate, whereas the large interpoint distances tend to be correct. Classical MDS creates this behavior; its cost function contains fourth-order terms of the distances, which exacerbates the differences in scale between small and large distances. As a result, the emphasis of the cost function is on the large interpoint distances, and the local topology of points is often corrupted in the embedding process. SDE behaves more like the local methods in that the distances between neighboring points are guaranteed to be the same before and after the embedding.

The local methods also seem to handle sharp curvature in manifolds better than Isomap. Additionally, shortcut edges – edges between points that should not actually be classified as neighbors – can have a potentially disastrous effect in Isomap. A few shortcut edges can badly damage all of the geodesic distance approximations. In contrast, this type of error does not seem to propagate through the entire embedding with the local methods.

4.3 Time complexity

The primary computational bottleneck for all of these algorithms – except for semidefinite embedding – is a spectral decomposition. Computing this decomposition for a dense n × n matrix requires O(n³) time steps. For sparse matrices, however, a spectral decomposition can be performed much more quickly. The global/local split discussed above turns out to be a dense/sparse split as well.

First, we look at the local methods. LLE builds a matrix of weights W ∈ Rn×n such that Wij ≠ 0 only if j is one of i's nearest neighbors. Thus, W is very sparse: each row contains only k nonzero elements, and k is usually small compared to the number of points. LLE performs a sparse spectral decomposition of (I − W)⊤(I − W). Moreover, this matrix contains some additional structure that may be used to speed up the spectral decomposition [SR03]. Similarly, the Laplacian Eigenmaps algorithm finds the eigenvectors of the generalized eigenvalue problem Ly = λDy. Recall that L := D − W, where W is a weight matrix and D is the diagonal matrix of row-sums of W. All three of these matrices are sparse: W only contains k nonzero entries per row, D is diagonal, and L is their difference.

The two LLE variants described, Hessian LLE and Local Tangent Space Alignment, both require one spectral decomposition per point; luckily, the matrices to be decomposed only contain n × k entries, so these decompositions require only O(k³) time steps. Both algorithms also require a spectral decomposition of an n × n matrix, but these matrices are sparse.

For all of these local methods, the spectral decompositions scale, pessimistically, with O((d + k)n²), which is a substantial improvement over O(n³).

The spectral decomposition employed by Isomap, on the other hand, is not sparse. The actual decomposition, which occurs during the cMDS phase of the algorithm, is of B := −(1/2)HDH, where D is the matrix of approximate interpoint squared geodesic distances, and H is a constant matrix. The matrix B ∈ Rn×n only has rank approximately equal to the dimensionality of the manifold, but is nevertheless dense since D has zeros only on the diagonal. Thus the decomposition requires a full O(n³) timesteps. This complexity is prohibitive for large data sets. To manage large data sets, Landmark Isomap was created. The overall time complexity of Landmark Isomap is the much more manageable O(dln log n + l³), but the solution is only an approximate one.

Semidefinite embedding is by far the most computationally demanding of these methods. It also requires a spectral decomposition of a dense n × n matrix, but the real bottleneck is the semidefinite program. Semidefinite programming methods are still relatively new and, perhaps as a result, quite slow. The theoretical time complexity of these procedures varies, but standard contemporary packages cannot handle an instance of the SDE program if the number of points is greater than 2000 or so. Smaller instances can be handled, but often require an enormous amount of time. The ℓSDE approximation is a fast alternative to SDE that has been shown to yield a speedup of two orders of magnitude in some experiments [WPS05]. There has been no work yet relating the time requirements of ℓSDE back to the other manifold learning algorithms.
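As a rough illustration of why SDE is so demanding, the following sketch poses a distance-preserving, trace-maximizing semidefinite program of the kind SDE solves, using the cvxpy modeling package; restricting the constraints to neighboring pairs and relying on a generic solver are simplifying assumptions rather than the exact formulation of [WSS04].

    import numpy as np
    import cvxpy as cp

    def sde_kernel(X, neighbor_pairs):
        # Learn a centered positive semidefinite kernel K of maximal trace that
        # preserves squared distances between neighboring points; solving this
        # SDP is the bottleneck. neighbor_pairs is a list of (i, j) index pairs.
        n = X.shape[0]
        K = cp.Variable((n, n), PSD=True)
        constraints = [cp.sum(K) == 0]           # centering constraint
        for i, j in neighbor_pairs:
            d_ij = float(np.sum((X[i] - X[j]) ** 2))
            constraints.append(K[i, i] + K[j, j] - 2 * K[i, j] == d_ij)
        problem = cp.Problem(cp.Maximize(cp.trace(K)), constraints)
        problem.solve()                          # generic SDP solver
        return K.value                           # embed via top eigenvectors of K

Even for a few hundred points, the program has on the order of n² scalar variables, which is why generic interior-point solvers struggle once n reaches a couple of thousand.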

To summarize: the local methods – LLE, Laplacian Eigenmaps, and variants – scale well up to very large data sets; Isomap can handle medium-size data sets, and the Landmark approximation may be used for large data sets; SDE can handle only fairly small instances, though its approximation can tackle medium to large instances.

4.4 Other issues

One shortcoming of Isomap, first pointed out as a theoretical issue in [BdSLT00] and later examined empirically, is that it requires the underlying parameter space to be convex. Without convexity, the geodesic estimates will be flawed. Many manifolds do not have convex parameter spaces; one simple example is given by taking the swiss roll and punching a hole out of it. More importantly, [DG02] found that many image manifolds do not have this property. None of the other algorithms seem to have this shortcoming.
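To see what a non-convex parameter space looks like, here is a small sketch (ours; the hole boundaries are arbitrary) that samples a swiss roll and rejects points falling in a rectangle of the (angle, height) parameter space:

    import numpy as np

    def swiss_roll_with_hole(n=2000, seed=0):
        # Sample a swiss roll, rejecting points whose (angle, height) parameters
        # fall inside a rectangular hole; the result is a non-convex parameter space.
        rng = np.random.default_rng(seed)
        theta = rng.uniform(1.5 * np.pi, 4.5 * np.pi, size=4 * n)
        height = rng.uniform(0.0, 20.0, size=4 * n)
        in_hole = ((theta > 2.5 * np.pi) & (theta < 3.5 * np.pi) &
                   (height > 8.0) & (height < 12.0))
        theta, height = theta[~in_hole][:n], height[~in_hole][:n]
        X = np.column_stack([theta * np.cos(theta), height, theta * np.sin(theta)])
        return X, np.column_stack([theta, height])   # 3-d data, true 2-d parameters

Geodesics that would cross the hole must detour around it, so the graph-based geodesic estimates no longer match distances in the flat parameter space, which is exactly the failure mode described above.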

The only algorithms with theoretical optimality guarantees are Hessian LLE and Isomap. Both of these guarantees are asymptotic ones, though, so it is questionable how useful they are. On the other hand, these guarantees at least demonstrate that hLLE and Isomap are conceptually correct.

Isomap and SDE are the only two algorithms that automatically provide an estimate of the dimensionality of the manifold. The other algorithms require the target dimensionality to be known a priori and often produce substantially different results for different choices of dimensionality. However, the recent Conformal Eigenmaps algorithm provides a patch with which to automatically estimate the dimensionality when using Laplacian Eigenmaps, LLE, or their variants.
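As an illustration only (the criteria actually used by Isomap and SDE differ in detail), one simple heuristic counts how many eigenvalues of the learned Gram or kernel matrix are needed to account for most of its trace:

    import numpy as np

    def estimate_dimensionality(gram_eigenvalues, energy=0.95):
        # Number of the largest eigenvalues of the learned Gram/kernel matrix
        # needed to capture a given fraction of the total trace.
        vals = np.sort(np.clip(gram_eigenvalues, 0.0, None))[::-1]
        cumulative = np.cumsum(vals) / vals.sum()
        return int(np.searchsorted(cumulative, energy) + 1)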

5 What’s left to do?

Though manifold learning has become an increasingly popular area of study within machine learning and related fields, much is still unknown. One issue is that all of these algorithms require a neighborhood-size parameter k. The solutions found are often very different depending on the choice of k, yet there is little work on actually choosing k. One notable exception is [WZZ04], but these techniques are discussed only for LTSA.

Perhaps the most curious aspect of the field is that there are already so many algorithms for the same task, yet new ones continue to be developed. In this paper, we have discussed a total of nine algorithms, but our coverage was certainly not exhaustive. Why are there so many?

The primary reason for the abundance of algorithms is that evaluating these algorithms is difficult. Though there are a couple of theoretical performance results for these algorithms, the primary evaluation methodology has been to run the algorithm on some (often artificial) data sets and see whether the results are intuitively pleasing. More rigorous evaluation methodologies remain elusive. Performing an experimental evaluation requires knowing the underlying manifolds behind the data and then determining whether the manifold learning algorithm is preserving useful information. But determining whether a real-world data set actually lies on a manifold is probably as hard as learning the manifold, and using an artificial data set may not give realistic results. Additionally, there is little agreement on what constitutes "useful information" – it is probably heavily problem-dependent. This uncertainty stems from a certain slack in the problem: though we want to find a chart into low-dimensional space, there may be many such charts, some of which are more useful than others. Another problem with experimentation is that one must somehow select a realistic cross-section of manifolds.

Instead of an experimental evaluation, one might establish the strength of an algorithm theoretically. In many cases, however, algorithms used in machine learning do not have strong theoretical guarantees associated with them. For example, Expectation-Maximization can perform arbitrarily poorly on a data set, yet it is a tremendously useful and popular algorithm. A good manifold learning algorithm may well have similar behavior. Additionally, the type of theoretical results that are of greatest utility are finite-sample guarantees. In contrast, the algorithmic guarantees for hLLE and Isomap⁶, as well as those for a number of other algorithms from machine learning, are asymptotic results.

⁶ The theoretical guarantees for Isomap are a bit more useful, as they are able to bound the error in the finite-sample case.


Devising strong evaluation techniques looks to be a difficult task, yet is absolutely necessary for progress in this area. Perhaps even more fundamental is determining whether real-world data sets actually exhibit a manifold structure. Some video and image data sets have been found to feature this structure, but much more investigation needs to be done. It may well turn out that most data sets do not contain embedded manifolds, or that the points lie along a manifold only very roughly. If the fundamental assumption of manifold learning turns out to be unrealistic for data mining, then it needs to be determined if these techniques can be adapted so that they are still useful for dimensionality reduction.

The manifold learning approach to dimensionality reduction shows a lot of promise and has had some successes. But the most fundamental questions are still basically open:

Is the assumption of a manifold structure reasonable?

If it is, then:

How successful are these algorithms at uncovering this structure?

Acknowledgments

Thanks to Sanjoy Dasgupta for valuable suggestions. Thanks also to Henrik Wann Jensen and Virginia de Sa for being on my committee. Figures 1 and 3 were generated using modified versions of code downloaded from the Isomap website (http://isomap.stanford.edu) and the Hessian LLE website (http://basis.stanford.edu/HLLE).

References

[BdSLT00] Mira Bernstein, Vin de Silva, John Langford, and Joshua Tenenbaum. Graph approximations to geodesics on embedded manifolds. Manuscript, December 2000.

[Ber99] Dimitri P. Bertsekas. Nonlinear Programming. Athena Scientific, 2nd edition, 1999.

[BN01] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems, 2001.

[BN03] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, June 2003.

[Bur05] Christopher J.C. Burges. Geometric methods for feature extraction and dimensional reduction. In L. Rokach and O. Maimon, editors, Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers. Kluwer Academic Publishers, 2005.

[CC01] T.F. Cox and M.A.A. Cox. Multidimensional Scaling. Chapman and Hall/CRC, 2nd edition, 2001.

[CD05] Lawrence Cayton and Sanjoy Dasgupta. Cost functions for Euclidean embedding. Unpublished manuscript, 2005.

[DG02] David L. Donoho and Carrie Grimes. When does Isomap recover the natural parametrization of families of articulated images? Technical report, Department of Statistics, Stanford University, 2002.

[DG03] David L. Donoho and Carrie Grimes. Hessian eigenmaps: new locally linear embedding techniques for high-dimensional data. Proceedings of the National Academy of Sciences, 100:5591–5596, 2003.

[dST03] Vin de Silva and Joshua Tenenbaum. Global versus local methods in nonlinear dimensionality reduction. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems, 2003.

[dST04] Vin de Silva and Joshua B. Tenenbaum. Sparse multidimensional scaling using landmark points. Manuscript, June 2004.

[Mil65] John W. Milnor. Topology from the Differentiable Viewpoint. The University Press of Virginia, 1965.

[Ple03] Robert Pless. Using Isomap to explore video sequences. In Proc. International Conference on Computer Vision, 2003.


[SR03] L. Saul and S. Roweis. Think globally, fit locally: unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research, 4:119–155, 2003.

[SS05] F. Sha and L.K. Saul. Analysis and extension of spectral methods for nonlinear dimensionality reduction. Submitted to ICML, 2005.

[TdL00] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323, 2000.

[VB96] Lieven Vandenberghe and Stephen Boyd. Semidefinite programming. SIAM Review, 1996.

[WPS05] K. Weinberger, B. Packer, and L. Saul. Nonlinear dimensionality reduction by semidefinite programming and kernel matrix factorization. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, 2005.

[WS04] K. Q. Weinberger and L. K. Saul. Unsupervised learning of image manifolds by semidefinite programming. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2004.

[WSS04] K.Q. Weinberger, F. Sha, and L.K. Saul. Learning a kernel matrix for nonlinear dimensionality reduction. In Proceedings of the International Conference on Machine Learning, 2004.

[WZZ04] Jing Wang, Zhenyue Zhang, and Hongyuan Zha. Adaptive manifold learning. In Advances in Neural Information Processing Systems, 2004.

[ZZ03] Hongyuan Zha and Zhenyue Zhang. Isometric embedding and continuum Isomap. In Proceedings of the Twentieth International Conference on Machine Learning, 2003.

[ZZ04] Zhenyue Zhang and Hongyuan Zha. Principal manifolds and nonlinear dimension reduction via local tangent space alignment. SIAM Journal on Scientific Computing, 26(1):313–338, 2004.

A Proof of theorem 1

First, we restate the theorem.

Theorem. A nonnegative symmetric matrix $D \in \mathbb{R}^{n \times n}$, with zeros on the diagonal, is a Euclidean distance matrix if and only if $B \overset{\text{def}}{=} -\frac{1}{2}HDH$, where $H \overset{\text{def}}{=} I - \frac{1}{n}\mathbf{1}\mathbf{1}^{\top}$, is positive semidefinite. Furthermore, this $B$ will be the Gram matrix for a mean-centered configuration with interpoint distances given by $D$.

Proof: Suppose that $D$ is a Euclidean distance matrix. Then there exist $x_1, \dots, x_n$ such that $\|x_i - x_j\|^2 = D_{ij}$ for all $i, j$. We assume without loss of generality that these points are mean-centered, i.e. $\sum_i x_{ij} = 0$ for all $j$. We expand the norm:
\[
D_{ij} \;=\; \|x_i - x_j\|^2 \;=\; \langle x_i, x_i \rangle + \langle x_j, x_j \rangle - 2\langle x_i, x_j \rangle, \tag{11}
\]
so
\[
\langle x_i, x_j \rangle \;=\; -\frac{1}{2}\left( D_{ij} - \langle x_i, x_i \rangle - \langle x_j, x_j \rangle \right). \tag{12}
\]

Let $B_{ij} \overset{\text{def}}{=} \langle x_i, x_j \rangle$. We proceed to rewrite $B_{ij}$ in terms of $D$. From (11),
\[
\sum_i D_{ij} \;=\; \sum_i B_{ii} + \sum_i B_{jj} - 2\sum_i B_{ij} \;=\; \sum_i \left( B_{ii} + B_{jj} \right).
\]
The second equality follows because the configuration is mean-centered. It follows that
\[
\frac{1}{n}\sum_i D_{ij} \;=\; \frac{1}{n}\sum_i B_{ii} + B_{jj}. \tag{13}
\]

Similarly,
\[
\frac{1}{n}\sum_j D_{ij} \;=\; \frac{1}{n}\sum_j B_{jj} + B_{ii}, \tag{14}
\]
and
\[
\frac{1}{n^2}\sum_{ij} D_{ij} \;=\; \frac{2}{n}\sum_i B_{ii}. \tag{15}
\]

Subtracting (15) from the sum of (14) and (13) leaves $B_{jj} + B_{ii}$ on the right-hand side, so we may rewrite (12) as
\[
B_{ij} \;=\; -\frac{1}{2}\left( D_{ij} - \frac{1}{n}\sum_i D_{ij} - \frac{1}{n}\sum_j D_{ij} + \frac{1}{n^2}\sum_{ij} D_{ij} \right) \;=\; -\frac{1}{2}\,[HDH]_{ij}.
\]


Thus, $-\frac{1}{2}HDH$ is a Gram matrix for $x_1, \dots, x_n$.

Conversely, suppose that $B \overset{\text{def}}{=} -\frac{1}{2}HDH$ is positive semidefinite. Then there is a matrix $X$ such that $XX^{\top} = B$. We show that the points given by the rows of $X$ have squared interpoint distances given by $D$.

First, we expand $B$ using its definition:
\[
B_{ij} \;=\; -\frac{1}{2}[HDH]_{ij}
\;=\; -\frac{1}{2}\left( D_{ij} - \frac{1}{n}\sum_i D_{ij} - \frac{1}{n}\sum_j D_{ij} + \frac{1}{n^2}\sum_{ij} D_{ij} \right)
\;=\; -\frac{1}{2}\left( d^2_{ij} - d^2_{i\cdot} - d^2_{\cdot j} + d^2_{\cdot\cdot} \right),
\]
where $d^2_{ij} = D_{ij}$ and a dot in a subscript denotes an average over the corresponding index.

Let $x_i$ be the $i$th row of $X$. Then
\[
\|x_i - x_j\|^2 \;=\; \|x_i\|^2 + \|x_j\|^2 - 2\langle x_i, x_j \rangle \;=\; B_{ii} + B_{jj} - 2B_{ij} \;=\; D_{ij}.
\]

The final equality follows from expanding $B_{ij}$, $B_{ii}$, and $B_{jj}$ as written above. Thus, we have shown that the configuration $x_1, \dots, x_n$ has squared interpoint distances given by $D$, so $D$ is a Euclidean distance matrix. $\square$
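A short numerical check of the theorem (illustrative only, not part of the proof): starting from a random configuration, double-centering the squared-distance matrix yields a positive semidefinite $B$ whose induced distances reproduce $D$.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 10, 3
    X = rng.standard_normal((n, d))
    D = np.square(X[:, None, :] - X[None, :, :]).sum(axis=2)   # squared distances

    H = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * H @ D @ H

    assert np.linalg.eigvalsh(B).min() > -1e-9                 # B is PSD

    # Factor B and verify the recovered configuration reproduces D.
    vals, vecs = np.linalg.eigh(B)
    Y = vecs * np.sqrt(np.clip(vals, 0.0, None))
    D_rec = np.square(Y[:, None, :] - Y[None, :, :]).sum(axis=2)
    assert np.allclose(D_rec, D)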
