PCA UNIT 4
TRANSCRIPT
-
N.MAHESWARI (10UCS29)
C.MALARVIZHI (10UCS30)
-
Reductionist
Principal component analysis is a variable-reduction procedure. It is useful when we have obtained data on a large number of variables and believe that there is some redundancy in those variables.
In this case, redundancy means that some of the variables are correlated with one another, possibly because they are measuring the same construct.
Because of this redundancy, we believe that it should be possible to reduce the observed variables into a smaller number of principal components (artificial variables) that will account for most of the variance in the observed variables.
-
What is Principal Component Analysis?
Principal component analysis (PCA)
Reduces the dimensionality of a data set by finding a new set of variables, smaller than the original set of variables.
Retains most of the sample's information.
Useful for the compression and classification of data.
By information we mean the variation present in the sample, given by the
correlations between the original variables.
The new variables, called principal components (PCs), are
uncorrelated, and are ordered by the amount of the total information each retains.
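As a minimal illustration (not part of the original slides; the Iris data and the choice of two components are assumptions), the sketch below applies scikit-learn's PCA and prints the share of total variation each PC retains:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                  # 150 samples, 4 original variables
pca = PCA(n_components=2)             # keep 2 uncorrelated PCs
scores = pca.fit_transform(X)         # the new variables, shape (150, 2)

# PCs are ordered by the amount of total information each retains.
print(pca.explained_variance_ratio_)  # roughly [0.92, 0.05]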
-
Why Dimensionality Reduction?
It is so easy and convenient to collect data, e.g. in an experiment.
Data is not collected only for data mining.
Data accumulates at an unprecedented speed.
Data preprocessing is an important part of effective machine learning and data mining.
Dimensionality reduction is an effective approach to downsizing data.
-
Why PCA?
Most machine learning and data mining techniques may not be effective for high-dimensional data.
Curse of Dimensionality
Classification accuracy and efficiency degrade rapidly as the dimension
increases.
The intrinsic dimension may be small (see the sketch after this slide).
For example, the number of genes responsible for a
certain type of disease may be small.
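As a hedged sketch (the synthetic data and the 99% threshold are assumptions), the code below builds 50-dimensional data driven by only 3 latent factors and uses PCA's cumulative explained variance to recover that small intrinsic dimension:

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical high-dimensional data with a small intrinsic dimension:
# 200 samples in 50 dimensions generated from only 3 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 50)) + rng.normal(scale=0.01, size=(200, 50))

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(np.argmax(cumulative >= 0.99) + 1)   # prints 3: few PCs suffice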
-
Visualization: projection of high-dimensional data onto 2D or 3D.
Data compression: efficient storage and retrieval.
Feature extraction: extract useful features.
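To illustrate the visualization use, here is a minimal sketch (the Iris dataset and plotting choices are assumptions) that projects 4-dimensional data onto its first two PCs for a 2D scatter plot:

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
pcs = PCA(n_components=2).fit_transform(iris.data)   # 4D -> 2D projection

plt.scatter(pcs[:, 0], pcs[:, 1], c=iris.target)     # color by class
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()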
-
PRINCIPAL COMPONENT ANALYSIS
Reduces the number of predictors by finding the weighted
linear combinations of predictors that retain most of the
variance in the data set.
These are called principal components. PCA works only with continuous variables.
-
DIMENSIONALITY REDUCTION
A prerequisite for dimensionality reduction is understanding the data, using e.g. data summaries (min, max, mean, median, stdev) and visualization.
Domain knowledge should always be applied first to remove predictors known to be inapplicable (e.g. height for predicting client income).
Common techniques include correlation analysis, principal component analysis, and binning.
-
CORRELATION ANALYSIS
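As a concrete illustration (not from the original slides; the data is made up), the sketch below computes pairwise correlations to spot redundant predictors; values close to +/-1 flag pairs where one variable can be dropped or combined with the other:

import numpy as np
import pandas as pd

# Hypothetical data with a redundant column.
rng = np.random.default_rng(1)
a = rng.normal(size=100)
df = pd.DataFrame({
    "x1": a,
    "x2": 2 * a + rng.normal(scale=0.1, size=100),  # nearly a copy of x1
    "x3": rng.normal(size=100),                     # independent
})

print(df.corr().round(2))   # |correlation| near 1 signals redundancy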
-
PCA is used in several scientific fields, such as psychometrics, telecommunications, electroencephalography, stock markets and others. Typical applications include:
Customer relationship management
Text mining
Image retrieval
Microarray data analysis
Protein classification
Face recognition
Handwritten digit recognition
Intrusion detection
-
Feature Selection
Definition
Objectives
Feature Extraction (reduction)
Definition
Objectives
Differences between the two techniques
-
Definition
A process that chooses an optimal subset of features according to an objective function
Objectives
To reduce dimensionality and remove noise
To improve mining performance
Speed of learning
Predictive accuracy
Simplicity and comprehensibility of mined results
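A minimal sketch of feature selection under these objectives (the ANOVA F-statistic scorer, k = 2, and the Iris data are assumptions, using scikit-learn's SelectKBest):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Score each feature against the class label (here the objective
# function is the ANOVA F-statistic) and keep the best 2 features.
selector = SelectKBest(score_func=f_classif, k=2)
X_reduced = selector.fit_transform(X, y)
print(selector.get_support())   # boolean mask of the chosen subset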
-
Feature reduction refers to the mapping of the original high-
dimensional data onto a lower dimensional space
Given a set of data points {x1, x2, ..., xn}, each of p variables
Criterion for feature reduction can be different based on different
problem settings.
Unsupervised setting: minimize the information loss
Supervised setting: maximize the class discrimination
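The sketch below contrasts the two settings on labeled data (the Iris data and the use of linear discriminant analysis as the supervised example are assumptions):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# Unsupervised: PCA ignores the labels and minimizes information loss.
Z_pca = PCA(n_components=2).fit_transform(X)

# Supervised: LDA uses the labels and maximizes class discrimination.
Z_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)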
-
Feature reduction
All original features are used
The transformed features are linear combinations of the original features
Feature selection
Only a subset of the original features are selected
Continuous versus discrete
-
It is computationally inexpensive, can be applied to ordered and unordered attributes, and can handle sparse and skewed data.
Multidimensional data of more than two dimensions can also be handled.
-
ALGORITHM:
The PCA algorithm consists of 5 main steps:
Subtract the mean: subtract the mean from each of the data dimensions.
The mean subtracted is the average across each dimension. This produces a
data set whose mean is zero.
Calculate the covariance matrix: the covariance matrix is a square matrix in which each entry is the result of calculating the covariance between two separate dimensions.
Calculate the eigenvectors and eigenvalues of the covariance matrix.
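A minimal NumPy sketch of steps 1 to 3 (the small random data set is an assumption for illustration):

import numpy as np

# Hypothetical data: 10 items in 3 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))

# Step 1: subtract the mean of each dimension, giving zero-mean data.
X_adjusted = X - X.mean(axis=0)

# Step 2: covariance matrix (rowvar=False: columns are dimensions).
cov = np.cov(X_adjusted, rowvar=False)

# Step 3: eigenvalues and eigenvectors; eigh suits symmetric matrices.
eigenvalues, eigenvectors = np.linalg.eigh(cov)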
-
Choose components and form a feature vector: once eigenvectors are found from the covariance matrix, the next step is to order them by eigenvalue, highest to lowest, so that the components are sorted in order of significance. The number of eigenvectors you choose will be the number of dimensions of the new data set. The objective of this step is to construct a feature vector (a matrix of vectors): from the list of eigenvectors, take the selected ones and form a matrix with them in the columns:
Feature Vector = (eig_1, eig_2, ..., eig_n)
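Step 4 in the same hedged NumPy style (setup repeated so the sketch stands alone; k = 2 is an arbitrary choice):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
X_adjusted = X - X.mean(axis=0)
cov = np.cov(X_adjusted, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)   # ascending order

# Order eigenvectors by eigenvalue, highest to lowest.
order = np.argsort(eigenvalues)[::-1]
eigenvectors = eigenvectors[:, order]   # columns are eigenvectors

# Keep the top k eigenvectors as the feature vector.
k = 2
feature_vector = eigenvectors[:, :k]    # shape (3, 2)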
-
Derive the new data set. Take the transpose of the Feature Vector and
multiply it on the left of the original data set, transposed:
Final Data = RowFeatureVector x RowDataAdjusted
where RowFeatureVector is the matrix with the eigenvectors in the columns, transposed (the eigenvectors are now in the rows, with the most significant at the top), and RowDataAdjusted is the mean-adjusted data, transposed (the data items are in each column, with each row holding a separate dimension).
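Putting all five steps together, a self-contained NumPy sketch of the final projection (the data and k = 2 remain assumptions):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))                 # 10 data items, 3 dimensions

X_adjusted = X - X.mean(axis=0)              # step 1: subtract the mean
cov = np.cov(X_adjusted, rowvar=False)       # step 2: covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)          # step 3
order = np.argsort(eigenvalues)[::-1]                    # step 4: sort
feature_vector = eigenvectors[:, order][:, :2]           # keep top 2

# Step 5: Final Data = RowFeatureVector x RowDataAdjusted
row_feature_vector = feature_vector.T        # eigenvectors in rows
row_data_adjusted = X_adjusted.T             # one data item per column
final_data = row_feature_vector @ row_data_adjusted      # shape (2, 10)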