Lecture 14: Principal Components Analysis

Introduction to Learning and Analysis of Big Data

Kontorovich and Sabato (BGU)
Unsupervised Learning
- Until now: supervised
  - receive labeled data (x, y)
  - try to learn a mapping x ↦ y
- Current topic: unsupervised
- Receive unlabeled data only
- What's a reasonable "learning" goal?
  - discover some structure in the data
  - does it clump nicely in space? (clustering)
  - is it well-represented by a small set of points? (sample compression/condensing)
  - does it have a sparse representation in some basis? (compressed sensing)
  - does it lie on a low-dimensional manifold? (dimensionality reduction)
Discovering structure in Data: Motivation
- Computational: "simpler" data is faster to search/store/process
- Statistical: "simpler" data allows for better generalization
- Dimensionality reduction has a denoising effect
- Example: human faces
  - 16x16 images, grayscale only: a 256-dimensional space
  - curse of dimensionality: sample size exponential in d
  - is learning faces hopeless?
  - not if they have structure
  - low-dimensional manifold [next slide]
- Common theme: learning of a high-dimensional signal is possible when its intrinsic structure is low-dimensional
Faces manifold
[credit: N. Vasconcelos and A. Lippman]
http://www.svcl.ucsd.edu/projects/manifolds
Principal Components Analysis (PCA)
- Dimensionality reduction technique for data in R^d
- Problem statement [draw on board]:
  - given m points x_1, ..., x_m in R^d
  - and a target dimension k < d
  - find the "best" k-dimensional subspace approximating the data
- Formally: find matrices U ∈ R^{d×k} and V ∈ R^{k×d} that minimize

    f(U, V) = Σ_{i=1}^m ‖x_i − U V x_i‖₂²

- V : R^d → R^k is the "compressor"; U : R^k → R^d is the "decompressor"
- Is f : R^{d×k} × R^{k×d} → R convex?
- No! f(U, ·) and f(·, V) are both convex, but f(·, ·) is not
- Claim: an optimal solution is achieved at U = V^T with U^T U = I (columns of U are orthonormal)
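The non-convexity claim can be checked numerically. A minimal NumPy sketch (my own illustration, not from the lecture; the data and seed are arbitrary): f is unchanged under (U, V) ↦ (−U, −V), so the midpoint of (W, W^T) and (−W, −W^T) is (0, 0), where the objective is Σ‖x_i‖² — larger than at either endpoint, violating joint convexity.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, k = 100, 5, 2
X = rng.normal(size=(m, d))                   # rows are the points x_1, ..., x_m

def f(U, V):
    """f(U, V) = sum_i ||x_i - U V x_i||_2^2."""
    return float(((X - X @ (U @ V).T) ** 2).sum())

W, _ = np.linalg.qr(rng.normal(size=(d, k)))  # orthonormal columns: W^T W = I

mid = f(0 * W, 0 * W.T)                       # midpoint of (W, W^T) and (-W, -W^T)
avg = (f(W, W.T) + f(-W, -W.T)) / 2           # average of the endpoint values
# Joint convexity would require mid <= avg, but mid = sum_i ||x_i||^2 > f(W, W^T).
```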
Principal Components Analysis: auxiliary claim

- Optimization problem: minimize_{U ∈ R^{d×k}, V ∈ R^{k×d}} Σ_{i=1}^m ‖x_i − U V x_i‖₂²
- Claim: an optimal solution is achieved at U = V^T with U^T U = I
- Proof:
  - For any U, V, the linear map x ↦ U V x has range R of dimension at most k.
  - Let w_1, ..., w_k be orthonormal vectors whose span contains R (pad with additional orthonormal vectors if dim(R) < k); arrange them as the columns of W.
  - Hence, for each x_i there is a z_i ∈ R^k such that U V x_i = W z_i.
  - Note: W^T W = I.
  - Which z minimizes f(x_i, z) := ‖x_i − W z‖₂²?
  - For all x ∈ R^d, z ∈ R^k,
    f(x, z) = ‖x‖₂² + z^T W^T W z − 2 z^T W^T x = ‖x‖₂² + ‖z‖₂² − 2 z^T W^T x.
  - Minimize w.r.t. z: ∇_z f = 2z − 2 W^T x = 0 ⟹ z = W^T x.
  - Therefore
    Σ_{i=1}^m ‖x_i − U V x_i‖₂² = Σ_{i=1}^m ‖x_i − W z_i‖₂² ≥ Σ_{i=1}^m ‖x_i − W W^T x_i‖₂².
  - Since U, V are optimal, Σ_{i=1}^m ‖x_i − U V x_i‖₂² = Σ_{i=1}^m ‖x_i − W W^T x_i‖₂².
  - So instead of U, V we can take W, W^T.
- W W^T x is the orthogonal projection of x onto span(W). [board]
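The two steps of the proof can be verified numerically. A short NumPy sketch (assumed setup with random data, not from the lecture): replacing an arbitrary (U, V) by (W, W^T), with W an orthonormal basis whose span contains range(UV), can only decrease the objective; and z = W^T x beats other choices of z for a fixed x.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, k = 100, 6, 2
X = rng.normal(size=(m, d))           # rows are x_1, ..., x_m

U = rng.normal(size=(d, k))
V = rng.normal(size=(k, d))
W, _ = np.linalg.qr(U)                # orthonormal columns; range(UV) ⊆ range(U) = span(W)

obj_UV = float(((X - X @ (U @ V).T) ** 2).sum())
obj_W = float(((X - X @ W @ W.T) ** 2).sum())   # uses the optimal z_i = W^T x_i

# z = W^T x minimizes ||x - W z||^2: compare against random alternatives.
x = X[0]
best = float(((x - W @ (W.T @ x)) ** 2).sum())
others = [float(((x - W @ z) ** 2).sum()) for z in rng.normal(size=(20, k))]
```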
PCA: reformulated
- Optimization problem:

    minimize_{U ∈ R^{d×k} : U^T U = I} Σ_{i=1}^m ‖x_i − U U^T x_i‖₂²   (*)

- For every x ∈ R^d and U ∈ R^{d×k} with U^T U = I,

    ‖x − U U^T x‖² = ‖x‖² − 2 x^T U U^T x + x^T U U^T U U^T x
                   = ‖x‖² − x^T U U^T x
                   = ‖x‖² − trace(x^T U U^T x)
                   = ‖x‖² − trace(U^T x x^T U).

- Recall the trace operator:
  - defined for any square matrix A: the sum of A's diagonal entries
  - for any A ∈ R^{p×q}, B ∈ R^{q×p}, we have trace(AB) = trace(BA)
  - trace : R^{p×p} → R is a linear map
- Reformulating (*):

    maximize_{U ∈ R^{d×k} : U^T U = I} trace(U^T Σ_{i=1}^m x_i x_i^T U).
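A quick numerical check of the identity above (my own sketch; a random orthonormal U and point x are assumed): for U with U^T U = I, ‖x − U U^T x‖² = ‖x‖² − trace(U^T x x^T U).

```python
import numpy as np

rng = np.random.default_rng(2)
d, k = 7, 3
U, _ = np.linalg.qr(rng.normal(size=(d, k)))   # orthonormal columns: U^T U = I
x = rng.normal(size=d)

lhs = float(((x - U @ (U.T @ x)) ** 2).sum())              # ||x - U U^T x||^2
rhs = float(x @ x - np.trace(U.T @ np.outer(x, x) @ U))    # ||x||^2 - trace(U^T x x^T U)
```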
Spectral theorem refresher
- Suppose A ∈ R^{n×n} is symmetric, A = A^T, and A v = λ v for some nonzero v ∈ R^n and λ ∈ R.
- We say that v is an eigenvector of A with eigenvalue λ.
- Fact: for symmetric A, eigenvectors of distinct eigenvalues are orthogonal. Proof:
  - if A v = λ v, A u = µ u, and A = A^T,
  - then u^T A v = λ⟨u, v⟩, and also u^T A v = (A u)^T v = µ⟨u, v⟩,
  - hence λ ≠ µ ⟹ ⟨u, v⟩ = 0.
- Theorem: for every symmetric A ∈ R^{n×n} there is an orthogonal V ∈ R^{n×n} (i.e., V^T = V^{-1}) s.t. V^T A V = Λ = diag(λ_1, ..., λ_n).
- V diagonalizes A.
- V = [v_1, ..., v_n], and v_i is the eigenvector of A corresponding to λ_i.
- The v_i's are orthonormal (V^T V = I) and span R^n.
- Matrices of the form X^T X or X X^T are positive semidefinite: Λ ≥ 0.
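In NumPy, the spectral theorem is realized by `numpy.linalg.eigh` for symmetric matrices. A minimal sketch (the matrix here is an arbitrary X^T X, so it is also PSD):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
M = rng.normal(size=(8, n))
A = M.T @ M                        # symmetric, and PSD since it has the form X^T X

lam, V = np.linalg.eigh(A)         # eigenvalues in ascending order, eigenvectors as columns
D = V.T @ A @ V                    # should equal diag(lam)
```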
Maximizing the trace by k top eigenvalues
- We want to

    maximize_{U ∈ R^{d×k} : U^T U = I} trace(U^T Σ_{i=1}^m x_i x_i^T U).

- Put A = Σ_{i=1}^m x_i x_i^T.
- Diagonalize: A = V Λ V^T
  - V^T V = V V^T = I
  - Λ = diag(λ_1, ..., λ_d), λ_1 ≥ λ_2 ≥ ... ≥ λ_d ≥ 0
- So the goal is:

    maximize_{U ∈ R^{d×k} : U^T U = I} trace(U^T V Λ V^T U).
Maximizing the trace by k top eigenvalues

- Goal:

    maximize_{U ∈ R^{d×k} : U^T U = I} trace(U^T V Λ V^T U).

- We now show an upper bound on this value.
- Fix any U ∈ R^{d×k} with U^T U = I and put B = V^T U.
- Then U^T V Λ V^T U = B^T Λ B.
- Hence, the goal equals trace(Λ B B^T) = Σ_{j=1}^d λ_j Σ_{i=1}^k (B_{ji})².
- Define β_j = Σ_{i=1}^k (B_{ji})².
- B^T B = U^T V V^T U = U^T U = I. Therefore
  ‖β‖₁ = Σ_{j=1}^d β_j = trace(B B^T) = trace(B^T B) = trace(I) = k.
- Claim: β_j ≤ 1. Proof:
  - complete B to an orthogonal Q ∈ R^{d×d} whose first k columns are B (Q[:, 1:k] = B), so Q^T Q = I and hence Q^T = Q^{-1}.
  - So Q Q^T = I. Hence for all j ∈ [d], Σ_{i=1}^d (Q_{ji})² = 1 ⟹ Σ_{i=1}^k (B_{ji})² ≤ 1.
- Hence:

    trace(U^T V Λ V^T U) ≤ max_{β ∈ [0,1]^d : ‖β‖₁ ≤ k} Σ_{j=1}^d λ_j β_j = Σ_{j=1}^k λ_j.
PCA: solution
- We have:
  - A = Σ_{i=1}^m x_i x_i^T
  - A = V Λ V^T
  - V^T V = V V^T = I
  - Λ = diag(λ_1, ..., λ_d), λ_1 ≥ λ_2 ≥ ... ≥ λ_d ≥ 0
  - U ∈ R^{d×k} with U^T U = I
- Under the above, we showed trace(U^T A U) ≤ Σ_{j=1}^k λ_j.
- Achieve this with equality: set U = [u_1, u_2, ..., u_k] where A u_i = λ_i u_i.
- The i-th column of U is A's eigenvector corresponding to λ_i.
- Then trace(U^T A U) = Σ_{j=1}^k λ_j.
- So this U is an optimal solution.
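The whole solution fits in a few lines of NumPy (a sketch on random data; the names and seed are illustrative): build A, take its top-k eigenvectors, and observe that trace(U^T A U) equals the sum of the k top eigenvalues and dominates any other orthonormal choice.

```python
import numpy as np

rng = np.random.default_rng(4)
m, d, k = 300, 6, 2
X = rng.normal(size=(m, d))                    # rows are x_1, ..., x_m
A = X.T @ X                                    # = sum_i x_i x_i^T

lam, V = np.linalg.eigh(A)                     # ascending order
lam, V = lam[::-1], V[:, ::-1]                 # reorder so lambda_1 >= ... >= lambda_d
U = V[:, :k]                                   # top-k eigenvectors of A

top = float(np.trace(U.T @ A @ U))             # should equal lambda_1 + ... + lambda_k
Uq, _ = np.linalg.qr(rng.normal(size=(d, k)))  # some other orthonormal U
other = float(np.trace(Uq.T @ A @ Uq))         # can be at most `top`
```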
PCA: solution value
- Recall our original goal: minimize_{U ∈ R^{d×k} : U^T U = I} Σ_{i=1}^m ‖x_i − U U^T x_i‖₂².
- We showed, for A = Σ_{i=1}^m x_i x_i^T,

    argmin_{U ∈ R^{d×k} : U^T U = I} Σ_{i=1}^m ‖x_i − U U^T x_i‖₂² = argmax_{U ∈ R^{d×k} : U^T U = I} trace(U^T A U).

- We just showed which U maximizes trace(U^T A U).
- Value of the solution:
  - for U = [u_1, u_2, ..., u_k] (the k top eigenvectors), trace(U^T A U) = Σ_{j=1}^k λ_j.
  - Σ_{i=1}^m ‖x_i − U U^T x_i‖₂² = Σ_{i=1}^m ‖x_i‖₂² − trace(U^T A U)
  - Σ_{i=1}^m ‖x_i‖₂² = trace(A) = trace(V Λ V^T) = trace(V^T V Λ) = trace(Λ) = Σ_{i=1}^d λ_i.
  - Optimal objective value: Σ_{i=1}^m ‖x_i − U U^T x_i‖₂² = Σ_{i=k+1}^d λ_i.
- This is the "distortion": how much of the data we throw away.
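The solution value can be checked numerically. A short sketch (random data assumed): the achieved objective Σ_i ‖x_i − U U^T x_i‖₂² matches the sum of the discarded eigenvalues λ_{k+1} + ... + λ_d.

```python
import numpy as np

rng = np.random.default_rng(5)
m, d, k = 200, 5, 2
X = rng.normal(size=(m, d))
A = X.T @ X                                  # = sum_i x_i x_i^T

lam, V = np.linalg.eigh(A)
lam, V = lam[::-1], V[:, ::-1]               # lambda_1 >= ... >= lambda_d
U = V[:, :k]                                 # optimal U: top-k eigenvectors

distortion = float(((X - X @ U @ U.T) ** 2).sum())   # sum_i ||x_i - U U^T x_i||^2
discarded = float(lam[k:].sum())                     # lambda_{k+1} + ... + lambda_d
```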
PCA: the variance view
- Suppose the data S = {x_i : i ≤ m} is centered: (1/m) Σ_{i=1}^m x_i = 0.
- Let's project it onto the 1-dimensional subspace with maximal variance [board].
- A 1-dim projection operator: u u^T, for u ∈ R^d with u^T u = 1.
- Var_{X∼S}[u^T X] = (1/m) Σ_{i=1}^m (u^T x_i)² = (1/m) Σ_{i=1}^m u^T (x_i x_i^T) u.
- Optimization problem: maximize_{u ∈ R^d : u^T u = 1} Σ_{i=1}^m u^T (x_i x_i^T) u.
- Looks familiar? PCA!
- Maximized by the eigenvector u_1 of A = Σ_{i=1}^m x_i x_i^T corresponding to the top eigenvalue λ_1.
- What about u_2? Subtract off the data component spanned by u_1 and repeat.
- PCA ≡ choosing the directions that maximize the sample variance.
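The variance view, sketched in NumPy (my own illustration on synthetic data with well-separated scales): after centering, the top eigenvector of A attains maximal projected variance, and deflating the data by the u_1 component and repeating recovers the second eigenvector.

```python
import numpy as np

rng = np.random.default_rng(6)
m, d = 500, 4
X = rng.normal(size=(m, d)) * np.array([3.0, 2.0, 1.0, 0.5])  # well-separated scales
X = X - X.mean(axis=0)                        # center the data
A = X.T @ X

lam, V = np.linalg.eigh(A)
lam, V = lam[::-1], V[:, ::-1]
u1 = V[:, 0]                                  # top principal direction

var_u1 = float(((X @ u1) ** 2).sum()) / m     # sample variance along u1
v = rng.normal(size=d)
v /= np.linalg.norm(v)
var_v = float(((X @ v) ** 2).sum()) / m       # variance along a random unit direction

X2 = X - np.outer(X @ u1, u1)                 # deflate: remove the u1 component
u2 = np.linalg.eigh(X2.T @ X2)[1][:, -1]      # top direction of the deflated data
```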
PCA computational complexity
- Computing A = Σ_{i=1}^m x_i x_i^T costs O(m d²) time.
- Diagonalizing A = V Λ V^T costs O(d³) time.
- What if d ≫ m?
- Write A = X^T X, where X = [x_1^T; x_2^T; ...; x_m^T] ∈ R^{m×d}.
- Put B = X X^T ∈ R^{m×m}; then B_{ij} = ⟨x_i, x_j⟩.
- Diagonalize B instead of A: get B's eigenvalues and eigenvectors.
- Suppose B u = λ u for some u ∈ R^m, λ ∈ R.
- Then A (X^T u) = X^T X X^T u = λ X^T u.
- Hence, X^T u is an eigenvector of A with eigenvalue λ.
- Conclusion: we can diagonalize B instead of A.
- Pay O(m² d) time to compute B and O(m³) time to diagonalize it.
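The d ≫ m trick, sketched in NumPy (dimensions and seed are illustrative): diagonalize the small m×m Gram matrix B = X X^T, then map each eigenvector u of B to X^T u, which (after normalization) is an eigenvector of A = X^T X with the same eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(7)
m, d = 20, 500                                # many more features than samples
X = rng.normal(size=(m, d))                   # rows are x_1, ..., x_m
A = X.T @ X                                   # d x d: the expensive route
B = X @ X.T                                   # m x m Gram matrix, B_ij = <x_i, x_j>

mu, Ub = np.linalg.eigh(B)                    # eigenpairs of B (ascending)
u = Ub[:, -1]                                 # top eigenvector of B
v = X.T @ u                                   # candidate eigenvector of A
v /= np.linalg.norm(v)                        # normalize to unit length

resid = float(np.linalg.norm(A @ v - mu[-1] * v))   # should be ~0: A v = mu_top v
```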
PCA and generalization
- PCA is often used as a pre-processing step for supervised learning (e.g., SVM)
- Q: How do we choose the target dimension k? (At least in the supervised setting.)

Theorem. Suppose S = {(x_i, y_i) ∈ R^d × {−1, 1} : i ≤ m} is drawn from D^m. Consider the hypothesis space H consisting of all hyperplanes w ∈ R^d, and define the hinge loss ℓ(y, y') = max{0, 1 − y y'}. If for some k ≤ d, running PCA on the sample gives distortion η := Σ_{i=k+1}^d λ_i, then with high probability, for all w ∈ H,

    risk_ℓ(w, D) ≤ risk_ℓ(w, S) + O(√(k/m) + √(η/m)).

- The bound does not depend on the full dimension d!
- The bound holds even without running PCA (it suffices that such w, k, η exist)
- The result suggests reasonable values for k
- Other analyses use random matrix theory
PCA summary
- Unsupervised learning technique
- Often a pre-processing step for supervised learning; has a denoising effect
- Performs dimensionality reduction from dimension d to dimension k < d
- Optimization problem: find the rank-k orthogonal projection U U^T that minimizes the squared distortion on the data, Σ_{i=1}^m ‖x_i − U U^T x_i‖₂²
- Solution: define the data matrix A = Σ_{i=1}^m x_i x_i^T, diagonalize it, and choose U = [u_1, ..., u_k] to be the top k eigenvectors of A
- Equivalently: find the k orthogonal directions that capture most of the data variance
- Computational cost:
  - O(m d² + d³) if m ≫ d
  - O(m² d + m³) if d ≫ m
- Generalization (in some cases) depends only on k and the distortion, not on d