PCA Unit 4



N. Maheswari (10UCS29)
C. Malarvizhi (10UCS30)


Reductionist

Principal component analysis is a variable-reduction procedure. It is useful when we have obtained data on a large number of variables and believe that there is some redundancy in those variables. In this case, redundancy means that some of the variables are correlated with one another, possibly because they are measuring the same construct. Because of this redundancy, it should be possible to reduce the observed variables to a smaller number of principal components (artificial variables) that account for most of the variance in the observed variables.


What is Principal Component Analysis?

Principal component analysis (PCA) reduces the dimensionality of a data set by finding a new set of variables, smaller than the original set, that retains most of the sample's information. It is useful for the compression and classification of data. By information we mean the variation present in the sample, given by the correlations between the original variables. The new variables, called principal components (PCs), are uncorrelated and are ordered by the amount of the total information each retains.
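As a concrete illustration, here is a minimal sketch of such a reduction using scikit-learn (assumed available); the toy data and the choice of two components are hypothetical.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                    # 100 samples, 5 variables
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=100)   # inject redundancy: column 3 ~ column 0

pca = PCA(n_components=2)                        # keep the 2 strongest components
Z = pca.fit_transform(X)                         # Z holds the uncorrelated PCs

print(Z.shape)                                   # (100, 2)
print(pca.explained_variance_ratio_)             # variance share per PC, highest first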


Why Dimensionality Reduction?

It is easy and convenient to collect data, e.g. from an experiment, and data is not collected only for data mining. Data accumulates at an unprecedented speed, so data preprocessing is an important part of effective machine learning and data mining. Dimensionality reduction is an effective approach to downsizing data.


Why PCA?

Most machine learning and data mining techniques may not be effective for high-dimensional data: this is the curse of dimensionality. Classification accuracy and efficiency degrade rapidly as the dimension increases. The intrinsic dimension, however, may be small; for example, the number of genes responsible for a certain type of disease may be small.


Visualization: projection of high-dimensional data onto 2D or 3D.
Data compression: efficient storage and retrieval.
Feature extraction: extract useful features.
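To illustrate the visualization use, a minimal sketch that projects hypothetical 10-dimensional data onto 2D (scikit-learn and matplotlib assumed available):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))            # 200 points in 10 dimensions

Z = PCA(n_components=2).fit_transform(X)  # project onto the first two PCs
plt.scatter(Z[:, 0], Z[:, 1], s=10)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()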


PRINCIPAL COMPONENT ANALYSIS

PCA reduces the number of predictors by finding the weighted linear combinations of predictors that retain most of the variance in the data set. These combinations are called principal components. PCA works only with continuous variables.


DIMENSIONALITY REDUCTION

A prerequisite for dimensionality reduction is understanding the data, using e.g. data summaries (min, max, mean, median, stdev) and visualization. Domain knowledge should always be applied first to remove predictors known to be inapplicable (e.g. height for predicting client income). Common techniques are correlation analysis, principal component analysis, and binning.
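A minimal sketch of such data summaries, assuming pandas is available (the columns and values are hypothetical):

import pandas as pd

df = pd.DataFrame({"age": [23, 35, 31, 52, 46],
                   "income": [21000, 54000, 47000, 88000, 62000]})
print(df.describe())   # count, mean, std, min, quartiles (incl. median), max per column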


CORRELATION ANALYSIS

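As an illustration of correlation analysis for flagging redundant predictors, a minimal sketch assuming pandas; the data and the 0.9 cut-off are hypothetical choices:

import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
a = rng.normal(size=100)
df = pd.DataFrame({"a": a,
                   "b": a + 0.05 * rng.normal(size=100),  # nearly a copy of "a"
                   "c": rng.normal(size=100)})

corr = df.corr()                       # pairwise Pearson correlations
print(corr.round(2))

# Flag pairs whose absolute correlation exceeds the cut-off; one of each
# pair is a candidate for removal.
high = [(i, j) for i in corr.columns for j in corr.columns
        if i < j and abs(corr.loc[i, j]) > 0.9]
print(high)                            # e.g. [('a', 'b')]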

PCA is used in several scientific fields, such as psychometrics, telecommunications, electroencephalography, stock markets and others. Applications include:

Customer relationship management
Text mining
Image retrieval
Microarray data analysis
Protein classification
Face recognition
Handwritten digit recognition
Intrusion detection


Feature selection: definition and objectives.
Feature extraction (reduction): definition and objectives.
Differences between the two techniques.


FEATURE SELECTION

Definition: a process that chooses an optimal subset of features according to an objective function.

Objectives: to reduce dimensionality and remove noise, and to improve mining performance in terms of speed of learning, predictive accuracy, and simplicity and comprehensibility of the mined results.
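A minimal sketch of feature selection under one such objective function (the ANOVA F-score), using scikit-learn's SelectKBest on hypothetical data:

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=4, random_state=0)

selector = SelectKBest(score_func=f_classif, k=4)  # objective: ANOVA F-score
X_sel = selector.fit_transform(X, y)               # keep the 4 best columns

print(X_sel.shape)                         # (200, 4)
print(selector.get_support(indices=True))  # indices of the surviving features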


FEATURE REDUCTION

Feature reduction refers to the mapping of the original high-dimensional data onto a lower-dimensional space. Given a set of n data points {x1, x2, ..., xn}, each described by p variables, the criterion for feature reduction differs with the problem setting:

Unsupervised setting: minimize the information loss.
Supervised setting: maximize the class discrimination.
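A minimal sketch contrasting the two settings, assuming scikit-learn: PCA ignores the class labels and preserves variance, while linear discriminant analysis (LDA, used here only as an example of a supervised reduction) uses the labels to maximize class discrimination. The data is hypothetical.

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=3, n_classes=3,
                           n_clusters_per_class=1, random_state=0)

Z_pca = PCA(n_components=2).fit_transform(X)                            # ignores y
Z_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)  # uses y

print(Z_pca.shape, Z_lda.shape)   # both (300, 2), chosen by different criteria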


Feature reduction:
All original features are used.
The transformed features are linear combinations of the original features.

Feature selection:
Only a subset of the original features is selected.

Continuous versus discrete.


It is computationally inexpensive, it can be applied to ordered and unordered attributes, and it can handle sparse data and skewed data. Multidimensional data of more than two dimensions can be handled.


ALGORITHM:

The PCA algorithm consists of five main steps:

1. Subtract the mean: subtract the mean from each of the data dimensions. The mean subtracted is the average across each dimension. This produces a data set whose mean is zero.

2. Calculate the covariance matrix C, in which each entry C_ij is the result of calculating the covariance between two separate dimensions, C_ij = cov(X_i, X_j).

3. Calculate the eigenvectors and eigenvalues of the covariance matrix.
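A minimal NumPy sketch of steps 1-3 on hypothetical data:

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 3))             # 50 samples, 3 dimensions

X_adj = X - X.mean(axis=0)               # step 1: zero-mean each dimension
C = np.cov(X_adj, rowvar=False)          # step 2: 3x3 covariance matrix
eig_vals, eig_vecs = np.linalg.eigh(C)   # step 3: eigh suits symmetric matrices

print(eig_vals)                          # variance along each eigenvector axis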


4. Choose components and form a feature vector: once the eigenvectors are found from the covariance matrix, the next step is to order them by eigenvalue, highest to lowest, so that the components are sorted in order of significance. The number of eigenvectors that you choose will be the number of dimensions of the new data set. The objective of this step is to construct a feature vector (a matrix of vectors): from the list of eigenvectors, take the selected ones and form a matrix with them in the columns:

Feature Vector = (eig_1, eig_2, ..., eig_n)
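Continuing the sketch, step 4 in NumPy: sort the eigenvectors by eigenvalue and keep the top k as the columns of the feature vector (k = 2 is a hypothetical choice):

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 3))
C = np.cov(X - X.mean(axis=0), rowvar=False)
eig_vals, eig_vecs = np.linalg.eigh(C)

order = np.argsort(eig_vals)[::-1]       # highest eigenvalue first
k = 2                                    # dimensions of the new data set
feature_vector = eig_vecs[:, order[:k]]  # one chosen eigenvector per column
print(feature_vector.shape)              # (3, 2)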


5. Derive the new data set: take the transpose of the feature vector and multiply it on the left of the original data set, transposed:

Final Data = RowFeatureVector x RowDataAdjusted

where RowFeatureVector is the matrix with the eigenvectors in the columns, transposed (the eigenvectors are now in the rows, with the most significant at the top), and RowDataAdjusted is the mean-adjusted data, transposed (the data items are in each column, with each row holding a separate dimension).
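And step 5, written in the slide's own form Final Data = RowFeatureVector x RowDataAdjusted (same hypothetical data as above):

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 3))
X_adj = X - X.mean(axis=0)
eig_vals, eig_vecs = np.linalg.eigh(np.cov(X_adj, rowvar=False))
order = np.argsort(eig_vals)[::-1]
feature_vector = eig_vecs[:, order[:2]]

row_feature_vector = feature_vector.T    # eigenvectors now in the rows
row_data_adjusted = X_adj.T              # one data item per column
final_data = row_feature_vector @ row_data_adjusted
print(final_data.shape)                  # (2, 50): 50 items in 2 dimensions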