Tony Jebara, Columbia University
Advanced Machine Learning & Perception
Instructor: Tony Jebara
Topic 13
• Manifolds Continued and Spectral Clustering
• Convex Invariance Learning (CoIL)
• Kernel PCA (KPCA)
• Spectral Clustering & N-Cuts
Manifolds Continued
• PCA: linear manifold
• MDS: get inter-point distances, find 2D data with the same distances
• LLE: mimic neighborhoods using low-dimensional vectors
• GTM: fit a grid of Gaussians to data via a nonlinear warp
• Linear PCA after nonlinear normalization/invariance of the data
• Manifold as linear PCA in Hilbert space (kernels)
• Spectral clustering in Hilbert space
Convex Invariance Learning
• PCA is appropriate for finding a linear manifold
• Variation in the data is only modeled linearly
• But many problems are nonlinear
• However, the nonlinear variations may be irrelevant:
  Images: morph, rotate, translate, zoom…
  Audio: pitch changes, ambient acoustics…
  Video: motion, camera view, angles…
  Genomics: proteins fold, insertions, deletions…
  Databases: fields swapped, formats, scaled…
• Imagine a "Gremlin" is corrupting your data by multiplying each input vector X_t by a matrix A_t (of a certain type) to give A_t X_t
• Idea: remove the nonlinear, irrelevant variations before PCA
• But make this part of the PCA optimization, not pre-processing
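As a concrete illustration of this Gremlin corruption (a hypothetical sketch; the array names and sizes below are made up), each datum can be hit by its own random permutation matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: T vectors of dimension D (e.g. rasterized images).
T, D = 5, 9
X = rng.normal(size=(T, D))

def random_permutation_matrix(d, rng):
    # A d x d permutation matrix: exactly one 1 in each row and column.
    P = np.zeros((d, d))
    P[np.arange(d), rng.permutation(d)] = 1.0
    return P

# The "Gremlin": each datum X_t is hit by its own matrix A_t,
# so we only ever observe A_t X_t, never the clean X_t.
A = [random_permutation_matrix(D, rng) for _ in range(T)]
X_corrupted = np.stack([A_t @ X_t for A_t, X_t in zip(A, X)])

# Each corrupted vector contains exactly the original entries, just reordered.
assert np.allclose(np.sort(X_corrupted, axis=1), np.sort(X, axis=1))
```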
Convex Invariance Learning
• Example of irrelevant variation in our data: permutation in image data. Each image X_t is multiplied by a permutation matrix A_t by the Gremlin; we must clean it up.
• When we convert images to a vector, we are assuming an arbitrary, meaningless ordering (like the Gremlin mixing the order)
• This arbitrary ordering causes wild nonlinearities (manifold)
• We should not trust the ordering; assume the Gremlin has permuted it with an arbitrary permutation matrix:

$X_t = \begin{bmatrix} 127 & 12 & 54 & 1 & 3 & 85 & 1 & 4 & 84 & \cdots \end{bmatrix}^T$
Permutation Invariance
• Permutation is an irrelevant variation in our data
• The Gremlin is permuting fields in our input vectors
• So, view a datum as a "Bag of Vectors" instead of a single vector, i.e. a grayscale image = set of vectors, or "Bag of Pixels": N pixels, each a D=3 (X, Y, I) tuple
• Treat each input as a permutable "Bag of Pixels"
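A minimal sketch (assuming nothing beyond the N-pixel, D=3 XYI description above; the function name and the tiny example image are illustrative) of turning a grayscale image into such a bag of pixels:

```python
import numpy as np

def image_to_bag_of_pixels(img):
    """Turn an HxW grayscale image into an (H*W, 3) set of (x, y, intensity) tuples.

    The row order of the returned array is arbitrary by design: the whole point
    of the bag-of-pixels view is that we do not trust any particular ordering.
    """
    H, W = img.shape
    ys, xs = np.mgrid[0:H, 0:W]
    return np.stack([xs.ravel(), ys.ravel(), img.ravel()], axis=1).astype(float)

# Tiny 2x2 example image.
img = np.array([[0.1, 0.9],
                [0.4, 0.7]])
print(image_to_bag_of_pixels(img))  # 4 rows, one (x, y, I) tuple per pixel
```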
Optimal Permutation
$\vec{x}_1 = \{(X_1^1, Y_1^1, I_1^1), (X_1^2, Y_1^2, I_1^2), \ldots, (X_1^N, Y_1^N, I_1^N)\}$

$\vec{x}_1 = \{(X_1^5, Y_1^5, I_1^5), (X_1^8, Y_1^8, I_1^8), \ldots, (X_1^2, Y_1^2, I_1^2)\}$

$\vec{x}_2 = \{(X_2^1, Y_2^1, I_2^1), (X_2^2, Y_2^2, I_2^2), \ldots, (X_2^N, Y_2^N, I_2^N)\}$

$\vec{x}_2 = \{(X_2^3, Y_2^3, I_2^3), (X_2^4, Y_2^4, I_2^4), \ldots, (X_2^9, Y_2^9, I_2^9)\}$
•Vectorization / Rasterization: uses index in image to sort pixels into large vector.
•If we knew “optimal” correspondence: could fix sorting pixels in bag into large vector more appropriately
… we don’t know it, must learn it…
PCA on Permuted Data
• In non-permuted vector images, linear changes & eigenvectors are additions & deletions of intensities (bad!). Translating, raising eyebrows, etc. = erasing & redrawing
• In the bag of pixels (vectorized only after knowing the optimal permutation), linear changes & eigenvectors are morphings and warpings: joint spatial & intensity changes
Permutation as a Manifold
• Assume the order is unknown: a "Set of Vectors" or "Bag of Pixels" gives permutational invariance (order doesn't matter)
• Can't represent the invariance by a single 'X' vector point in DxN space, since we don't know the ordering
• Get permutation invariance by letting 'X' span all possible reorderings: multiply X by an unknown matrix A (permutation or doubly-stochastic)
[Figure: the same vector shown under several reorderings of its entries, e.g. (x(1), x(2), x(3), x(4)), (x(2), x(1), x(3), x(4)), (x(4), x(1), x(3), x(2)), (x(4), x(1), x(2), x(3)), (x(1), x(2), x(3), x(4))]
Invariant Paths as Matrix Ops
• Move a vector along the manifold by multiplying it by a matrix: $X \rightarrow AX$
• Restrict A to be a permutation matrix (operator); the resulting manifold of configurations is an "orbit" if the A's form a group
• Or, for a smooth manifold, make A a doubly-stochastic matrix
• Endow each image in the dataset with its own transformation matrix $A_t$. Each datum is now a bag or manifold: $\{A_1 X_1, \ldots, A_T X_T\}$
A Dataset of Invariant Manifolds
• E.g. assume the model is PCA: learn a 2D subspace of 3D data
• Permutation lets points move independently along their paths
• Find PCA after moving the points to form a 'tight' 2D subspace
• More generally, move along the manifolds to improve the fit of any model (PCA, SVM, probability density, etc.)
Optimizing the Permutations
• Optimize: a modeling cost & linear constraints on the matrices
• Estimate transformation parameters and model parameters (PCA, Gaussian, SVM)
• The cost on the A matrices emerges from the modeling criterion
• Typically, get a convex cost with a convex hull of constraints (unique!)
• Since the A matrices are soft permutation matrices (doubly-stochastic), we have:
$\sum_i A_t^{ij} = 1, \qquad \sum_j A_t^{ij} = 1, \qquad A_t^{ij} \geq 0$

$A = \{A_1, \ldots, A_T\}$

$\min_A C(A_1, \ldots, A_T) \quad \text{subject to: } \sum_{ij} A_t^{ij} Q_{td}^{ij} + b_{td} \geq 0 \;\; \forall t, d$
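For intuition, a doubly-stochastic matrix can be produced from any positive matrix by Sinkhorn row/column scaling; this is a standard construction shown purely for illustration, not necessarily the method used in the lecture:

```python
import numpy as np

def sinkhorn_doubly_stochastic(M, n_iters=100):
    """Alternately normalize rows and columns of a positive matrix.

    After convergence every row and every column sums to 1 and all entries
    stay non-negative, i.e. the matrix satisfies the constraints above.
    """
    A = np.asarray(M, dtype=float).copy()
    for _ in range(n_iters):
        A /= A.sum(axis=1, keepdims=True)  # rows sum to 1
        A /= A.sum(axis=0, keepdims=True)  # columns sum to 1
    return A

rng = np.random.default_rng(0)
A = sinkhorn_doubly_stochastic(rng.random((4, 4)))
print(A.sum(axis=0), A.sum(axis=1))  # both approximately [1, 1, 1, 1]
```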
Example Cost: Gaussian Mean
• Maximum likelihood Gaussian mean model:

$l(A, \mu) = \sum_t \log \mathcal{N}(A_t X_t; \mu, I), \qquad \hat{\mu} = \tfrac{1}{T}\sum_t A_t X_t$

$l(A, \hat{\mu}) = \sum_t -\tfrac{D}{2}\log 2\pi - \tfrac{1}{2}\left\| A_t X_t - \hat{\mu} \right\|^2$

$C(A) = -l(A, \hat{\mu}) = \mathrm{trace}\left(\mathrm{Cov}(AX)\right)$

• Theorem 1: C(A) is convex in A (convex program)
• Can solve via a quadratic program on the A matrices
• Minimizing the trace of a covariance tries to pull the data spherically towards a common mean
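A small sketch (illustrative only; the Gaussian constants are dropped and all names are arbitrary) of evaluating this trace-of-covariance cost for a given set of transformation matrices:

```python
import numpy as np

def gaussian_mean_cost(A_list, X_list):
    """Trace-of-covariance cost from the slide: C(A) = trace(Cov(A_t X_t)).

    A_list holds one (soft) permutation matrix per datum, X_list the data
    vectors. A smaller value means the transformed points sit tighter around
    their common mean.
    """
    Z = np.stack([A @ x for A, x in zip(A_list, X_list)])  # transformed data
    Zc = Z - Z.mean(axis=0)                                # subtract common mean
    return np.trace(Zc.T @ Zc / len(Z))

rng = np.random.default_rng(0)
X = [rng.normal(size=4) for _ in range(6)]
A_id = [np.eye(4)] * 6
print(gaussian_mean_cost(A_id, X))  # cost with identity transforms (no re-ordering)
```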
Example Cost: Gaussian Cov
• Maximum likelihood Gaussian mean & covariance model:

$l(A, \mu, \Sigma) = \sum_t \log \mathcal{N}(A_t X_t; \mu, \Sigma)$

$\hat{\mu} = \tfrac{1}{T}\sum_t A_t X_t, \qquad \hat{\Sigma} = \tfrac{1}{T}\sum_t (A_t X_t - \hat{\mu})(A_t X_t - \hat{\mu})^T$

$l(A, \hat{\mu}, \hat{\Sigma}) = \sum_t -\tfrac{D}{2}\log 2\pi - \tfrac{1}{2}\log\left|\hat{\Sigma}\right| - \tfrac{1}{2}(A_t X_t - \hat{\mu})^T \hat{\Sigma}^{-1} (A_t X_t - \hat{\mu})$

$C(A) = -l(A, \hat{\mu}, \hat{\Sigma}) = \left|\mathrm{Cov}(AX)\right|$

• Theorem 2: the regularized log-determinant of the covariance is convex. Equivalently, minimize
  $\log\left|\mathrm{Cov}(AX) + \epsilon I\right| + \epsilon\,\mathrm{trace}\left(\mathrm{Cov}(AX)\right)$
• Theorem 3: the cost is non-quadratic but upper-boundable by a quadratic. Iteratively solve a QP using the variational bound
  $\log\left|S\right| \leq \log\left|S_0\right| + \mathrm{trace}\left(S_0^{-1} S\right) - \mathrm{trace}\left(S_0^{-1} S_0\right)$
• Minimizing the determinant flattens the data into a low-volume pancake
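A quick numeric check, purely for illustration, that the variational log-determinant bound above holds and is tight at S = S_0:

```python
import numpy as np

def logdet(S):
    # log|S| for a symmetric positive-definite matrix, via a stable slogdet.
    sign, val = np.linalg.slogdet(S)
    assert sign > 0
    return val

def logdet_upper_bound(S, S0):
    # Variational (tangent) bound from the slide:
    # log|S| <= log|S0| + trace(S0^{-1} S) - trace(S0^{-1} S0).
    S0_inv = np.linalg.inv(S0)
    return logdet(S0) + np.trace(S0_inv @ S) - np.trace(S0_inv @ S0)

rng = np.random.default_rng(0)
make_spd = lambda d: (lambda B: B @ B.T + d * np.eye(d))(rng.normal(size=(d, d)))

S, S0 = make_spd(5), make_spd(5)
print(logdet(S), "<=", logdet_upper_bound(S, S0))    # the bound holds
print(logdet(S0), "==", logdet_upper_bound(S0, S0))  # and is tight at S = S0
```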
Example Cost: Fisher Discriminant
• Find the linear Fisher discriminant model w that maximizes the ratio of between-class to within-class scatter:

$\max_w \frac{w^T U w}{w^T S w}, \qquad S = \Sigma_+ + \Sigma_-, \qquad U = (\mu_+ - \mu_-)(\mu_+ - \mu_-)^T$

• For discriminative invariance, the transformation matrices should increase the between-class scatter (numerator) and reduce the within-class scatter (denominator):

$C(A) = \left| \Sigma_+ + \Sigma_- - \lambda\, (\mu_+ - \mu_-)(\mu_+ - \mu_-)^T \right|$

• Minimizing the above permutes the data to make classification easy

[Figure: scatter of two classes of 'x' points]
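For reference, the Fisher ratio above has the well-known closed-form maximizer w proportional to (Σ_+ + Σ_-)^{-1}(μ_+ - μ_-); a small sketch of this standard result (not specific to these slides):

```python
import numpy as np

def fisher_direction(X_pos, X_neg):
    """Closed-form maximizer of the Fisher ratio w^T U w / w^T S w.

    With U the between-class outer product and S the summed within-class
    covariances, the optimal direction is w proportional to S^{-1}(mu_+ - mu_-).
    """
    mu_p, mu_n = X_pos.mean(axis=0), X_neg.mean(axis=0)
    S = np.cov(X_pos, rowvar=False) + np.cov(X_neg, rowvar=False)
    w = np.linalg.solve(S, mu_p - mu_n)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[+1.0, 0.0], scale=0.5, size=(50, 2))
X_neg = rng.normal(loc=[-1.0, 0.0], scale=0.5, size=(50, 2))
print(fisher_direction(X_pos, X_neg))  # roughly the [1, 0] direction
```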
Interpreting C(A)
• Maximum likelihood mean: permute data towards a common mean
• Maximum likelihood mean & covariance: permute data towards a flat subspace; pushes energy into few eigenvectors; great as pre-processing before PCA
• Fisher discriminant: permute data towards two flat subspaces while repelling away from each other's means
SMO Optimization of QP
• Quadratic programming is used for all the C(A) costs since:
  Gaussian mean: quadratic
  Gaussian covariance: upper-boundable by a quadratic
  Fisher discriminant: upper-boundable by a quadratic

$\mathrm{trace}\left(M\hat{\Sigma}\right) = \tfrac{1}{T}\sum_{mnpq}\sum_i A_i^{mn} A_i^{pq} X_i^q M^{pm} X_i^n - \tfrac{1}{T^2}\sum_{mnpq}\sum_{ij} A_i^{mn} A_j^{pq} X_j^q M^{pm} X_i^n$

• Use Sequential Minimal Optimization: axis-parallel optimization; pick axes to update, ensuring the constraints are not violated
• With a soft permutation matrix: 4 constraints, or 4 entries updated at a time:

$A_t^{mn}, \; A_t^{mq}, \; A_t^{pn}, \; A_t^{pq}$
XY Digits Permuted PCA
[Figure: Original vs. PCA vs. Permuted PCA reconstructions]
• 20 images of '3' and '9'; each is 70 (x, y) dots, with no order on the 'dots'
• PCA compression with the same number of eigenvectors
• The convex program first estimates the permutation, giving better reconstruction
Interpolation
• Intermediate images are smooth morphs
• Points are nicely corresponded
• Spatial morphing versus 'redrawing'
• No ghosting
XYI Faces Permuted PCA
[Figure: Original vs. PCA vs. Permuted Bag-of-XYI-Pixels PCA reconstructions]
• 2000 XYI pixels, compressed to 20 dimensions
• Improves the squared error of PCA by almost 3 orders of magnitude (x10^3)
XYI Multi-Faces Permuted PCA
[Figure: +/- scaling along each of the top 5 eigenvectors]
• All are just linear variations in the bag of XYI pixels
• Vectorization is nonlinear and needs a huge number of eigenvectors
XYI Multi-Faces Permuted PCA
[Figure: +/- scaling along each of the next 5 eigenvectors]
Kernel PCA
• Replace all dot-products in PCA with kernel evaluations
• Recall we could do PCA on the DxD covariance matrix of the data, $\bar{C} = \frac{1}{N}\sum_{i=1}^{N} \vec{x}_i \vec{x}_i^T$, or on the NxN Gram matrix of the data, $K_{ij} = \vec{x}_i^T \vec{x}_j$
• For nonlinearity, do PCA on feature expansions: $\bar{C} = \frac{1}{N}\sum_{i=1}^{N} \phi(x_i)\phi(x_i)^T$, whose eigenvalues & eigenvectors satisfy $\lambda \vec{v} = \bar{C}\vec{v}$
• Instead of doing an explicit feature expansion, use a kernel, e.g. a d-th order polynomial: $K_{ij} = k(x_i, x_j) = \phi(x_i)^T\phi(x_j) = (x_i^T x_j)^d$
• As usual, the kernel must satisfy Mercer's theorem
• Assume, for simplicity, all feature data is zero-mean: $\sum_{i=1}^{N} \phi(x_i) = 0$
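A minimal sketch of building the NxN Gram matrix for the d-th order polynomial kernel above (the data matrix layout and names are illustrative):

```python
import numpy as np

def poly_gram_matrix(X, d=2):
    """NxN Gram matrix K_ij = (x_i^T x_j)^d for the d-th order polynomial kernel.

    X is assumed to be an NxD data matrix (one row per point); d = 1 gives back
    the plain linear Gram matrix used by ordinary PCA.
    """
    return (X @ X.T) ** d

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))        # 6 points in 3 dimensions
K = poly_gram_matrix(X, d=3)
print(K.shape)                     # (6, 6): symmetric, positive semi-definite
```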
Kernel PCA
• Efficiently find & use the eigenvectors of C-bar: $\lambda \vec{v} = \bar{C}\vec{v}$
• Can dot either side of the above equation with a feature vector: $\lambda\, \phi(x_i)^T \vec{v} = \phi(x_i)^T \bar{C}\, \vec{v}$
• The eigenvectors are in the span of the feature vectors: $\vec{v} = \sum_{j=1}^{N} \alpha_j\, \phi(x_j)$
• Combine the equations:

$\lambda\, \phi(x_i)^T \vec{v} = \phi(x_i)^T \bar{C}\, \vec{v}$

$\lambda \sum_{j=1}^{N} \alpha_j\, \phi(x_i)^T \phi(x_j) = \phi(x_i)^T \bar{C} \sum_{j=1}^{N} \alpha_j\, \phi(x_j)$

$\lambda \sum_{j=1}^{N} \alpha_j\, \phi(x_i)^T \phi(x_j) = \frac{1}{N} \sum_{j=1}^{N} \sum_{k=1}^{N} \alpha_j\, \phi(x_i)^T \phi(x_k)\, \phi(x_k)^T \phi(x_j)$

$\lambda \sum_j K_{ij}\, \alpha_j = \frac{1}{N} \sum_{jk} K_{ik} K_{kj}\, \alpha_j$

$N \lambda\, K \alpha = K K \alpha$

$N \lambda\, \alpha = K \alpha$
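A quick numeric check of where this derivation lands, using the linear kernel (d = 1) and zero-mean data: the nonzero eigenvalues of K are N times the eigenvalues of C-bar (illustrative sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 8, 3
X = rng.normal(size=(N, D))
X -= X.mean(axis=0)            # zero-mean data, as assumed on the slides

C_bar = X.T @ X / N            # DxD covariance C-bar = (1/N) sum_i x_i x_i^T
K = X @ X.T                    # NxN Gram matrix for the linear kernel (d = 1)

# The derivation concludes N*lambda*alpha = K*alpha, i.e. the nonzero
# eigenvalues of K are N times the eigenvalues lambda of C-bar.
eig_C = np.sort(np.linalg.eigvalsh(C_bar))[::-1]
eig_K = np.sort(np.linalg.eigvalsh(K))[::-1]
print(np.allclose(eig_K[:D], N * eig_C))   # True
```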
Kernel PCA
• From before we had $\lambda\, \phi(x_i)^T \vec{v} = \phi(x_i)^T \bar{C}\, \vec{v}$, giving $N\lambda\, \alpha = K\alpha$: this is an eigenvalue equation!
• Get the eigenvectors and eigenvalues of K
• The eigenvalues of K are N times $\lambda$
• For each eigenvector $\alpha^k$ of K there is an eigenvector $\vec{v}^k$ of $\bar{C}$
• Want the eigenvectors v to be normalized:

$1 = \vec{v}^{kT}\vec{v}^{k} = \left(\sum_{i=1}^{N}\alpha_i^k\, \phi(x_i)\right)^T\left(\sum_{j=1}^{N}\alpha_j^k\, \phi(x_j)\right) = \alpha^{kT} K \alpha^{k} = \alpha^{kT}\left(N\lambda_k\right)\alpha^{k}$

$\Rightarrow \quad \alpha^{kT}\alpha^{k} = \frac{1}{N\lambda_k}$

• Can now use the alphas only for doing PCA projection & reconstruction!
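A sketch of extracting and scaling the alpha vectors from K so that the implicit eigenvectors v^k have unit norm (illustrative; assumes a Gram matrix of zero-mean feature data):

```python
import numpy as np

def kpca_alphas(K, q):
    """Top-q KPCA coefficient vectors alpha^k, scaled so the implicit
    Hilbert-space eigenvectors v^k have unit norm, i.e.
    alpha^k . alpha^k = 1 / ell_k, where ell_k = N * lambda_k is the
    k-th eigenvalue of K."""
    ell, U = np.linalg.eigh(K)                  # ascending order
    ell, U = ell[::-1][:q], U[:, ::-1][:, :q]   # keep the top q
    return U / np.sqrt(ell), ell                # column k is alpha^k

# Check the normalization: alpha^k . alpha^k == 1 / ell_k.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))
K = X @ X.T
alphas, ell = kpca_alphas(K, q=2)
print(np.allclose((alphas ** 2).sum(axis=0), 1.0 / ell))   # True
```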
Kernel PCA
• To compute the k'th projection coefficient of a new point x:

$c^k(x) = \vec{v}^{kT}\phi(x) = \sum_{i=1}^{N}\alpha_i^k\, \phi(x_i)^T\phi(x) = \sum_{i=1}^{N}\alpha_i^k\, k(x_i, x)$

• Reconstruction*:

$\tilde{\phi}(x) = \sum_{k=1}^{K} c^k\, \vec{v}^k = \sum_{k=1}^{K}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i^k \alpha_j^k\, k(x_i, x)\, \phi(x_j)$

*Pre-image problem: the linear combination in Hilbert space generally goes outside the image of the input space
• Can now do nonlinear PCA and do PCA on non-vectors
• Nonlinear KPCA eigenvectors satisfy the same properties as usual PCA, but in Hilbert space. These eigenvectors:
  1) Top q have maximum variance
  2) Top q give the reconstruction with minimum mean-squared error
  3) Are uncorrelated/orthogonal
  4) Top q have maximum mutual information with the inputs
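A minimal end-to-end sketch of the projection formula c^k(x) = Σ_i α_i^k k(x_i, x) with a polynomial kernel (centering from the next slide is ignored here for brevity; all names are illustrative):

```python
import numpy as np

# Project a new point onto the top KPCA components:
# c^k(x) = sum_i alpha_i^k * k(x_i, x).
poly = lambda a, b, d=3: (a @ b) ** d            # d-th order polynomial kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 2))                     # training points
K = np.array([[poly(xi, xj) for xj in X] for xi in X])

ell, U = np.linalg.eigh(K)
ell, U = ell[::-1][:3], U[:, ::-1][:, :3]        # top 3 components of K
alphas = U / np.sqrt(ell)                        # normalization from the previous slide

x_new = rng.normal(size=2)
k_vec = np.array([poly(xi, x_new) for xi in X])  # k(x_i, x) for all training points
coeffs = alphas.T @ k_vec                        # c^1(x), c^2(x), c^3(x)
print(coeffs)
```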
Centering Kernel PCA
• So far, we had assumed the feature data was zero-mean: $\sum_{i=1}^{N}\phi(x_i) = 0$
• We want this: $\tilde{\phi}(x_j) = \phi(x_j) - \frac{1}{N}\sum_{i=1}^{N}\phi(x_i)$
• How to do it without touching feature space? Use kernels…

$\tilde{K}_{ij} = \tilde{\phi}(x_i)^T \tilde{\phi}(x_j) = \left(\phi(x_i) - \tfrac{1}{N}\textstyle\sum_k \phi(x_k)\right)^T \left(\phi(x_j) - \tfrac{1}{N}\textstyle\sum_k \phi(x_k)\right)$

$= \phi(x_i)^T\phi(x_j) - \tfrac{1}{N}\textstyle\sum_k \phi(x_k)^T\phi(x_j) - \tfrac{1}{N}\textstyle\sum_k \phi(x_i)^T\phi(x_k) + \tfrac{1}{N^2}\textstyle\sum_{k}\sum_{l} \phi(x_k)^T\phi(x_l)$

$= K_{ij} - \tfrac{1}{N}\textstyle\sum_k K_{kj} - \tfrac{1}{N}\textstyle\sum_k K_{ik} + \tfrac{1}{N^2}\textstyle\sum_{kl} K_{kl}$

• Can get the alpha eigenvectors from K-tilde by adjusting the old K
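The same centering can be written directly on the Gram matrix; a sketch (illustrative names), with a sanity check against explicit centering under the linear kernel:

```python
import numpy as np

def center_gram(K):
    """Centered Gram matrix K-tilde = K - 1_N K - K 1_N + 1_N K 1_N,
    where 1_N is the NxN matrix whose entries are all 1/N. This is the
    matrix form of the double-sum expression on the slide and never
    touches feature space."""
    N = K.shape[0]
    one_N = np.full((N, N), 1.0 / N)
    return K - one_N @ K - K @ one_N + one_N @ K @ one_N

# Sanity check with the linear kernel: centering K equals the Gram
# matrix of explicitly centered inputs.
rng = np.random.default_rng(0)
X = rng.normal(size=(7, 3))
Xc = X - X.mean(axis=0)
print(np.allclose(center_gram(X @ X.T), Xc @ Xc.T))   # True
```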
2D KPCA
• KPCA on a 2D dataset
• Left-to-right: kernel polynomial order goes from 1 to 3 (1 = linear = PCA)
• Top-to-bottom: from the top eigenvector to weaker eigenvectors
Kernel PCA Results
• Use the coefficients of the KPCA for training a linear SVM classifier to recognize chairs from their images
• Use various polynomial kernel degrees, where 1 = linear, as in regular PCA
Kernel PCA Results
• Use the coefficients of the KPCA for training a linear SVM classifier to recognize characters from their images
• Use various polynomial kernel degrees, where 1 = linear, as in regular PCA (the worst case in these experiments)
• Inferior performance to nonlinear SVMs (why??)
Spectral Clustering
• Typically, use EM or k-means to cluster N data points
• Can imagine clustering the data points only from an NxN matrix capturing their proximity information
• This is spectral clustering
• Again compute the Gram matrix using, e.g., an RBF kernel:

$K_{ij} = k(x_i, x_j) = \phi(x_i)^T \phi(x_j) = \exp\left( -\tfrac{1}{2\sigma^2} \left\| x_i - x_j \right\|^2 \right)$

• Example: have N pixels from an image, each x = [x-coord, y-coord, intensity] of a pixel
• The eigenvectors of the K matrix (or a slight variant) seem to capture some segmentation or clustering of the data points!
• Nonparametric form of clustering, since we didn't assume a Gaussian distribution…
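A crude illustration (not the stabilized algorithm discussed on the following slides) that the eigenvectors of the RBF Gram matrix carry cluster structure; the assignment rule here, picking whichever of the top-2 eigenvectors a point has the larger magnitude in, is just a simple heuristic for well-separated blobs:

```python
import numpy as np

def rbf_gram(X, sigma=1.0):
    # K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)), as in the slide.
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def crude_spectral_clusters(X, sigma=1.0):
    """Assign each point to whichever of the top-2 eigenvectors of K it has
    the larger magnitude in (illustrative two-cluster heuristic only)."""
    K = rbf_gram(X, sigma)
    _, vecs = np.linalg.eigh(K)           # eigenvectors, ascending eigenvalues
    top2 = vecs[:, -2:]                   # N x 2 spectral embedding
    return np.argmax(np.abs(top2), axis=1)

# Two well-separated blobs come out as two clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3.0, 0.5, size=(20, 2)),
               rng.normal(+3.0, 0.5, size=(20, 2))])
print(crude_spectral_clusters(X, sigma=1.0))
```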
Stability in Spectral Clustering
• A standard problem when computing & using eigenvectors: small changes in the data can cause the eigenvectors to change wildly
• Ensure the eigenvectors we keep are distinct & stable: look at the eigengap…
• Some algorithms ensure the eigenvectors will have a safe eigengap: adjust or process the Gram matrix to ensure the eigenvectors remain stable
[Figure: eigenvalue spectra where keeping 3 eigenvectors is unsafe (small gap) versus safe (large gap)]
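A tiny helper (illustrative, not from the slides) for inspecting the eigengap after the top q eigenvalues:

```python
import numpy as np

def eigengap_after(K, q):
    """Gap between the q-th and (q+1)-th largest eigenvalues of the Gram matrix.

    A large gap suggests the top-q eigenvectors are distinct and stable under
    small perturbations of the data (the 'safe' case in the figure); a tiny gap
    is the 'unsafe' case."""
    vals = np.sort(np.linalg.eigvalsh(K))[::-1]
    return vals[q - 1] - vals[q]

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
K = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1) / 2.0)
print(eigengap_after(K, q=3))
```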
Stabilized Spectral Clustering
• Stabilized spectral clustering algorithm:
Stabilized Spectral Clustering
• Example results compared to other clustering algorithms (traditional k-means, unstable spectral clustering, connected components)