PCA UNIT 4
TRANSCRIPT
-
N.MAHESWARI (10UCS29)
C.MALARVIZHI (10UCS30)
-
Reductionist
Principal component analysis is a variable-reduction procedure. It is useful when we have obtained data on a large number of variables and believe that there is some redundancy in those variables.
In this case, redundancy means that some of the variables are correlated with one another, possibly because they are measuring the same construct.
Because of this redundancy, we believe that it should be possible to reduce the observed variables into a smaller number of principal components (artificial variables) that will account for most of the variance in the observed variables.
-
What is Principal Component Analysis?
Principal component analysis (PCA)
Reduces the dimensionality of a data set by finding a new set of variables, smaller than the original set of variables.
Retains most of the sample's information.
Useful for the compression and classification of data.
By information we mean the variation present in the sample, given by the
correlations between the original variables.
The new variables, called principal components (PCs), are
uncorrelated, and are ordered by the amount of the total information each retains.
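As a minimal illustration (not part of the original slides; the Iris data and the choice of two components are assumptions), the sketch below applies scikit-learn's PCA and prints the share of total variation each PC retains:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                  # 150 samples, 4 original variables
pca = PCA(n_components=2)             # keep 2 uncorrelated PCs
scores = pca.fit_transform(X)         # the new variables, shape (150, 2)

# PCs are ordered by the amount of total information each retains.
print(pca.explained_variance_ratio_)  # roughly [0.92, 0.05]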
-
Why Dimensionality Reduction?
It is so easy and convenient to collect data, e.g. in an experiment.
Data is not collected only for data mining.
Data accumulates at an unprecedented speed.
Data preprocessing is an important part of effective machine learning and data mining.
Dimensionality reduction is an effective approach to downsizing data.
-
Why PCA?
Most machine learning and data mining techniques may not be effective for high-dimensional data.
Curse of Dimensionality
Classification accuracy and efficiency degrade rapidly as the dimension
increases.
The intrinsic dimension may be small (see the sketch after this slide).
For example, the number of genes responsible for a
certain type of disease may be small.
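As a hedged sketch (the synthetic data and the 99% threshold are assumptions), the code below builds 50-dimensional data driven by only 3 latent factors and uses PCA's cumulative explained variance to recover that small intrinsic dimension:

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical high-dimensional data with a small intrinsic dimension:
# 200 samples in 50 dimensions generated from only 3 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 50)) + rng.normal(scale=0.01, size=(200, 50))

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(np.argmax(cumulative >= 0.99) + 1)   # prints 3: few PCs suffice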
-
Visualization: projection of high-dimensional data onto 2D or 3D.
Data compression: efficient storage and retrieval.
Feature extraction: extract useful features.
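To illustrate the visualization use, here is a minimal sketch (the Iris dataset and plotting choices are assumptions) that projects 4-dimensional data onto its first two PCs for a 2D scatter plot:

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
pcs = PCA(n_components=2).fit_transform(iris.data)   # 4D -> 2D projection

plt.scatter(pcs[:, 0], pcs[:, 1], c=iris.target)     # color by class
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()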
-
PRINCIPAL COMPONENT ANALYSIS
Reduces the number of predictors by finding the weighted
linear combinations of predictors that retain most of the
variance in the data set.
These are called principal components. PCA works only with continuous variables.
-
DIMENSIONALITY REDUCTION
A prerequisite for dimensionality reduction is understanding the data, using e.g. data summaries (min, max, mean, median, stdev) and visualization.
Domain knowledge should always be applied first to remove predictors known to be inapplicable (e.g. height for predicting client income).
Common techniques include correlation analysis, principal component analysis, and binning.
-
CORRELATION ANALYSIS
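As a concrete illustration (not from the original slides; the data is made up), the sketch below computes pairwise correlations to spot redundant predictors; values close to +/-1 flag pairs where one variable can be dropped or combined with the other:

import numpy as np
import pandas as pd

# Hypothetical data with a redundant column.
rng = np.random.default_rng(1)
a = rng.normal(size=100)
df = pd.DataFrame({
    "x1": a,
    "x2": 2 * a + rng.normal(scale=0.1, size=100),  # nearly a copy of x1
    "x3": rng.normal(size=100),                     # independent
})

print(df.corr().round(2))   # |correlation| near 1 signals redundancy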
-
PCA is used in several scientific fields, such as psychometrics, telecommunications, electroencephalography, stock markets and others. Typical applications include:
Customer relationship management
Text mining
Image retrieval
Microarray data analysis
Protein classification
Face recognition
Handwritten digit recognition
Intrusion detection
-
Feature Selection
Definition
Objectives
Feature Extraction (reduction)
Definition
Objectives
Differences between the two techniques
-
Definition
A process that chooses an optimal subset of features according to an objective function
Objectives
To reduce dimensionality and remove noise
To improve mining performance
Speed of learning
Predictive accuracy
Simplicity and comprehensibility of mined results
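A minimal sketch of feature selection under these objectives (the ANOVA F-statistic scorer, k = 2, and the Iris data are assumptions, using scikit-learn's SelectKBest):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Score each feature against the class label (here the objective
# function is the ANOVA F-statistic) and keep the best 2 features.
selector = SelectKBest(score_func=f_classif, k=2)
X_reduced = selector.fit_transform(X, y)
print(selector.get_support())   # boolean mask of the chosen subset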
-
Feature reduction refers to the mapping of the original high-
dimensional data onto a lower dimensional space
Given a set of data points {x1, x2, ..., xn}, each of p variables
Criterion for feature reduction can be different based on different
problem settings.
Unsupervised setting: minimize the information loss
Supervised setting: maximize the class discrimination
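The sketch below contrasts the two settings on labeled data (the Iris data and the use of linear discriminant analysis as the supervised example are assumptions):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# Unsupervised: PCA ignores the labels and minimizes information loss.
Z_pca = PCA(n_components=2).fit_transform(X)

# Supervised: LDA uses the labels and maximizes class discrimination.
Z_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)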
-
Feature reduction
All original features are used
The transformed features are linear combinations of the original features
Feature selection
Only a subset of the original features are selected
Continuous versus discrete
-
It is computationally inexpensive, can be applied to ordered and unordered attributes, and can handle sparse and skewed data.
Multidimensional data of more than two dimensions can also be handled.
-
ALGORITHM:
The PCA algorithm consists of 5 main steps:
Subtract the mean: subtract the mean from each of the data dimensions.
The mean subtracted is the average across each dimension. This produces a
data set whose mean is zero.
Calculate the covariance matrix: the covariance matrix is a square matrix in which each entry is the result of calculating the covariance between two separate dimensions.
Calculate the eigenvectors and eigenvalues of the covariance matrix.
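A minimal NumPy sketch of steps 1 to 3 (the small random data set is an assumption for illustration):

import numpy as np

# Hypothetical data: 10 items in 3 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))

# Step 1: subtract the mean of each dimension, giving zero-mean data.
X_adjusted = X - X.mean(axis=0)

# Step 2: covariance matrix (rowvar=False: columns are dimensions).
cov = np.cov(X_adjusted, rowvar=False)

# Step 3: eigenvalues and eigenvectors; eigh suits symmetric matrices.
eigenvalues, eigenvectors = np.linalg.eigh(cov)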
-
Choose components and form a feature vector: once eigenvectors are found from the covariance matrix, the next step is to order them by eigenvalue, highest to lowest, so that the components are sorted in order of significance. The number of eigenvectors you choose will be the number of dimensions of the new data set. The objective of this step is to construct a feature vector (a matrix of vectors): from the list of eigenvectors, take the selected ones and form a matrix with them in the columns:
Feature Vector = (eig_1, eig_2, ..., eig_n)
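Step 4 in the same hedged NumPy style (setup repeated so the sketch stands alone; k = 2 is an arbitrary choice):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
X_adjusted = X - X.mean(axis=0)
cov = np.cov(X_adjusted, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)   # ascending order

# Order eigenvectors by eigenvalue, highest to lowest.
order = np.argsort(eigenvalues)[::-1]
eigenvectors = eigenvectors[:, order]   # columns are eigenvectors

# Keep the top k eigenvectors as the feature vector.
k = 2
feature_vector = eigenvectors[:, :k]    # shape (3, 2)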
-
Derive the new data set. Take the transpose of the Feature Vector and
multiply it on the left of the original data set, transposed:
Final Data = RowFeatureVector x RowDataAdjusted
where RowFeatureVector is the matrix with the eigenvectors in the columns, transposed (the eigenvectors are now in the rows, with the most significant at the top), and RowDataAdjusted is the mean-adjusted data, transposed (the data items are in each column, with each row holding a separate dimension).
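Putting all five steps together, a self-contained NumPy sketch of the final projection (the data and k = 2 remain assumptions):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))                 # 10 data items, 3 dimensions

X_adjusted = X - X.mean(axis=0)              # step 1: subtract the mean
cov = np.cov(X_adjusted, rowvar=False)       # step 2: covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)          # step 3
order = np.argsort(eigenvalues)[::-1]                    # step 4: sort
feature_vector = eigenvectors[:, order][:, :2]           # keep top 2

# Step 5: Final Data = RowFeatureVector x RowDataAdjusted
row_feature_vector = feature_vector.T        # eigenvectors in rows
row_data_adjusted = X_adjusted.T             # one data item per column
final_data = row_feature_vector @ row_data_adjusted      # shape (2, 10)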