Post on 18-Jan-2016
• Reduces time complexity: less computation
• Reduces space complexity: fewer parameters
• Simpler models are more robust on small datasets
• More interpretable; simpler explanation
• Data visualization (beyond 2 attributes, it gets complicated)
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
Why Reduce Dimensionality?
Feature Selection vs Extraction
Feature selection: choose k < d important features, ignore the remaining d - k
  Data snooping; genetic algorithms
Feature extraction: project the original d attributes onto a new k < d dimensional feature space
  Principal components analysis (PCA), linear discriminant analysis (LDA), factor analysis (FA), auto-association ANN
Principal Components Analysis (PCA)
Assume that the attributes in the dataset are drawn from a multivariate normal distribution: p(x) = N(μ, Σ)
Mean: E[x] = μ = [μ1, ..., μd]^T

Covariance: Σ = E[(x - μ)(x - μ)^T], where (x - μ) is d×1, (x - μ)^T is 1×d, and Σ is d×d:

    Σ = | σ1²  σ12  ...  σ1d |
        | σ21  σ2²  ...  σ2d |
        | ...  ...  ...  ... |
        | σd1  σd2  ...  σd² |
Variance generalizes to a matrix called the "covariance" matrix.
Diagonal elements are the variances σ² of the individual attributes.
Off-diagonal elements describe how fluctuations in one attribute affect fluctuations in another.
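These definitions can be sketched numerically; a minimal NumPy example with made-up data (not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
# Draw 500 samples of d = 3 correlated attributes from N(mu, Sigma)
mu = [0.0, 0.0, 0.0]
Sigma = [[4.0, 1.5, 0.0],
         [1.5, 2.0, 0.5],
         [0.0, 0.5, 1.0]]
X = rng.multivariate_normal(mu, Sigma, size=500)

S = np.cov(X, rowvar=False)   # d x d sample covariance matrix
variances = np.diag(S)        # diagonal: sigma^2 of individual attributes
# off-diagonal S[i, j]: how fluctuations in attribute i track attribute j
```

`np.cov` with `rowvar=False` treats rows as samples and columns as attributes, matching the d×d layout above.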
Dividing each off-diagonal element by the product of the corresponding standard deviations gives the "correlation coefficients":

    Corr(xi, xj) = ρij = σij / (σi σj),  with |ρij| ≤ 1

Correlation among attributes makes it difficult to say how any one attribute contributes to an effect.
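As a quick numerical illustration (hypothetical data), NumPy's `corrcoef` performs exactly this normalization:

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=1000)
x2 = 0.8 * x1 + 0.2 * rng.normal(size=1000)   # x2 mostly driven by x1

# Corr(xi, xj) = sigma_ij / (sigma_i * sigma_j)
R = np.corrcoef(np.vstack([x1, x2]))
```

The diagonal of R is 1 (each attribute is perfectly correlated with itself), and the off-diagonal entry is close to 1 because x2 is mostly a copy of x1.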
Consider a linear transformation of the attributes, z = Mx, where M is a d×d matrix. The d features z will also be normally distributed (proof later).

A choice of M that results in a diagonal covariance matrix in feature space has the following advantages:
1. Interpretation of uncorrelated features is easier
2. Total variance of the features is the sum of the diagonal elements
Diagonalization of the covariance matrix:

The transformation z = Mx that leads to a diagonal feature-space covariance has M = W^T, where the columns of W are the eigenvectors of the covariance matrix Σ.

The collection of eigenvalue equations Σwk = λk wk can be written as ΣW = WD, where D = diag(λ1, ..., λd) and W is formed by the column vectors [w1 ... wd].

Since W is orthogonal, W^T = W^-1, so W^T Σ W = W^-1 W D = D.
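The identity W^T Σ W = D is easy to verify numerically; a minimal sketch using NumPy's `eigh` (which handles symmetric matrices and returns orthonormal eigenvectors):

```python
import numpy as np

S = np.array([[4.0, 1.5],
              [1.5, 2.0]])            # a symmetric covariance matrix
lam, W = np.linalg.eigh(S)            # eigenvalues (ascending) and eigenvectors
D = W.T @ S @ W                       # diagonalization: W^T Sigma W
```

D comes out diagonal with the eigenvalues on the diagonal, and W^T equals W^-1 because the eigenvectors of a symmetric matrix are orthonormal.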
If we arrange the eigenvectors so that the eigenvalues λ1, ..., λd are in decreasing order of magnitude, then zi = wi^T x, i = 1...k < d, are the "principal components".
Proportion of variance (PoV) explained by k principal components (λi sorted in descending order):

    PoV = (λ1 + λ2 + ... + λk) / (λ1 + λ2 + ... + λd)
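Since the eigenvalues are sorted in descending order, PoV is a cumulative ratio; a minimal sketch with made-up eigenvalues:

```python
import numpy as np

lam = np.array([5.0, 3.0, 1.2, 0.5, 0.3])   # descending eigenvalues (made up)
pov = np.cumsum(lam) / lam.sum()            # PoV for k = 1, ..., d
k90 = int(np.argmax(pov >= 0.90)) + 1       # smallest k with PoV >= 90%
```

Here the first three eigenvalues account for 9.2 of the total 10.0, so k90 = 3.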
A plot of PoV vs. k shows how many eigenvalues are required to capture a given part of the total variance.

How many principal components?
Proof that if the attributes x are normally distributed with mean μ and covariance Σ, then z = w^T x is normally distributed with mean w^T μ and variance w^T Σ w:

    Var(z) = Var(w^T x) = E[(w^T x - w^T μ)²] = E[(w^T x - w^T μ)(x^T w - μ^T w)]
           = E[w^T (x - μ)(x - μ)^T w] = w^T E[(x - μ)(x - μ)^T] w = w^T Σ w

The objective of PCA is to maximize Var(z) = w^T Σ w. This must be done subject to the constraint ||w1||² = w1^T w1 = 1.
Review: constrained optimization by Lagrange multipliers
Find the stationary point of f(x1, x2) = 1 - x1² - x2² subject to the constraint g(x1, x2) = x1 + x2 = 1.
Constrained optimization
Form the Lagrangian:

    L(x, λ) = f(x1, x2) + λ(g(x1, x2) - c)
    L(x, λ) = 1 - x1² - x2² + λ(x1 + x2 - 1)

Set the partial derivatives of L with respect to x1, x2, and λ equal to zero:

    -2x1 + λ = 0
    -2x2 + λ = 0
    x1 + x2 - 1 = 0

Solve for x1 and x2.
In this case it is not necessary to find λ; λ is sometimes called an "undetermined multiplier".

The solution is x1* = x2* = ½.
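The three stationarity conditions are linear in (x1, x2, λ), so the worked example can be checked by solving a 3×3 linear system; a sketch with NumPy:

```python
import numpy as np

# Unknowns (x1, x2, lambda); rows encode the stationarity conditions:
#   -2*x1         + lambda = 0
#          -2*x2  + lambda = 0
#    x1  +  x2             = 1
A = np.array([[-2.0,  0.0, 1.0],
              [ 0.0, -2.0, 1.0],
              [ 1.0,  1.0, 0.0]])
b = np.array([0.0, 0.0, 1.0])
x1, x2, lam = np.linalg.solve(A, b)
```

The solver recovers x1 = x2 = ½ with λ = 1, confirming the solution above.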
Application of Lagrange multipliers in PCA
Find w1 such that w1^T Σ w1 is maximum subject to the constraint ||w1||² = w1^T w1 = 1.

Maximize L = w1^T Σ w1 + c(w1^T w1 - 1)

Setting the gradient of L to zero: 2Σw1 + 2c w1 = 0, so Σw1 = -c w1

Therefore w1 is an eigenvector of the covariance matrix. Let c = -λ1; then λ1 is the eigenvalue associated with w1.
Prove that λ1 is the variance of principal component 1:

    z1 = w1^T x
    Σw1 = λ1 w1
    Var(z1) = w1^T Σ w1 = λ1 w1^T w1 = λ1

To maximize Var(z1), choose λ1 as the largest eigenvalue.
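The identity Var(z1) = λ1 also holds exactly for the sample covariance; a sketch with synthetic NumPy data:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.multivariate_normal([0.0, 0.0],
                            [[3.0, 1.0],
                             [1.0, 2.0]], size=1000)
S = np.cov(X, rowvar=False)           # sample covariance (ddof = 1)
lam, W = np.linalg.eigh(S)            # eigenvalues in ascending order
w1 = W[:, -1]                         # eigenvector of the largest eigenvalue
z1 = X @ w1                           # first principal component scores
var_z1 = z1.var(ddof=1)               # equals w1^T S w1 = lambda_1
```

Note `ddof=1` is used in both places so the sample variance of z1 matches w1^T S w1 exactly, not just in expectation.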
More principal components:
If Σ has 2 distinct eigenvalues, define the 2nd principal component by maximizing Var(z2), subject to ||w2|| = 1 and w2 orthogonal to w1.

Introduce Lagrange multipliers α and β:

    L = w2^T Σ w2 - α(w2^T w2 - 1) - β(w2^T w1 - 0)

Set the gradient of L with respect to w2 to zero:

    2Σw2 - 2αw2 - βw1 = 0

Premultiplying by w1^T and using w1^T w2 = 0 shows β = 0; with α = λ2 we get Σw2 = λ2 w2.

To maximize Var(z2), choose λ2 as the second largest eigenvalue.
For any d×d matrix M, z = M^T x is a linear transformation of the attributes x that defines features z.

If the attributes x are normally distributed with mean μ and covariance Σ, then z is normally distributed with mean M^T μ and covariance M^T Σ M (proof on slide 8).

If M = W, a matrix whose columns are the normalized eigenvectors of Σ, then the covariance of z is diagonal with elements equal to the eigenvalues of Σ (proof on slide 6).

Arrange the eigenvalues in decreasing order of magnitude and find λ1, ..., λk that account for most (e.g. 90%) of the total variance; then zi = wi^T x are the "principal components".
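The whole recipe — center, compute Σ, take the top-k eigenvectors, project — fits in a short sketch (NumPy; synthetic data, and `pca_project` is my own name, not from the lecture):

```python
import numpy as np

def pca_project(X, k):
    """Project rows of X onto the first k principal components."""
    Xc = X - X.mean(axis=0)                 # center the attributes
    S = np.cov(Xc, rowvar=False)            # d x d covariance matrix
    lam, V = np.linalg.eigh(S)              # eigenvalues in ascending order
    order = np.argsort(lam)[::-1]           # decreasing order of magnitude
    W = V[:, order[:k]]                     # top-k eigenvectors as columns
    return Xc @ W, lam[order]

rng = np.random.default_rng(3)
X = rng.multivariate_normal([0, 0, 0],
                            [[5.0, 2.0, 0.0],
                             [2.0, 3.0, 1.0],
                             [0.0, 1.0, 1.0]], size=1000)
Z, lam = pca_project(X, 2)
```

The projected features Z are uncorrelated: their covariance is diag(λ1, λ2), as the diagonalization argument above predicts.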
Review
MATLAB's [V,D] = eig(A) returns both eigenvectors (columns of V) and eigenvalues (diagonal of D) in increasing order. Invert the order and construct W.
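NumPy behaves the same way for symmetric matrices — `np.linalg.eigh` also returns eigenvalues in increasing order — so the equivalent order-inversion looks like:

```python
import numpy as np

A = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.3],
              [0.0, 0.3, 0.5]])
lam, V = np.linalg.eigh(A)       # increasing order, eigenvectors in columns
order = np.argsort(lam)[::-1]    # invert to decreasing order
lam_desc = lam[order]
W = V[:, order]                  # columns now match descending eigenvalues
```

After reordering, column i of W still satisfies A w_i = λ_i w_i for the i-th largest eigenvalue.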
More review
    PoV = (λ1 + λ2 + ... + λk) / (λ1 + λ2 + ... + λd)

Choose k so that it captures the desired amount of the total variance.
Example: cancer diagnostics
• Metabonomics data
• 94 samples
• 35 metabolites in each sample = d
• 60 control samples
• 34 diseased samples
    PoV = (λ1 + λ2 + ... + λk) / (λ1 + λ2 + ... + λd)

Ranked eigenvalues: 73.6809, 18.7491, 2.8856, 1.9068, 0.7278, 0.5444, 0.4238, 0.3501, 0.1631

[Figure: ranked eigenvalues and proportion-of-variance plot]
3 PCs capture > 95% of the total variance.

[Figure: scatter plot of PCs 1 and 2 — samples 1-34 are cancer patients, samples > 35 are controls; samples from cancer patients cluster]
Assignment 5 due 10-30-15
Find the accuracy of a model that classifies all 6 types of beer bottles in glassdata.csv by multivariate linear regression. Find the eigenvalues and eigenvectors of the covariance matrix for the full beer-bottle data set. How many eigenvalues are required to capture more than 90% of the variance? Transform the attribute data by the eigenvectors of the 3 largest eigenvalues. What is the accuracy of a linear model that uses these features?

Plot the accuracy when you successively extend the linear model by including z1², z2², z3², z1z2, z1z3, and z2z3.
PCA code for glass data

[Code screenshot: eigenvalues indexed by decreasing magnitude; PoV]

Extend MLR with PCA features

[Code screenshot: linear model extended with squared and cross terms (x1², x2², ..., x1x2, x1x3, x2x3)]