
CS570 Introduction to Data Mining

Department of Mathematics and Computer Science

Li Xiong

Data Exploration and Data Preprocessing

• Data and Attributes
• Data exploration
• Data pre-processing
  • Data cleaning
  • Data integration
  • Data transformation
  • Data reduction

Data Transformation

• Aggregation: summarization (data reduction)
  • E.g. daily sales -> monthly sales
• Discretization and generalization
  • E.g. age -> youth, middle-aged, senior
• (Statistical) Normalization: scale values to fall within a small, specified range
  • E.g. income vs. age
• Attribute construction: construct new attributes from given ones
  • E.g. birthday -> age

Data Aggregation

• Data cubes store multidimensional aggregated information
• Multiple levels of aggregation support analysis at multiple granularities
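As a quick illustration of the daily-to-monthly aggregation mentioned above, here is a minimal pandas sketch; the table and column names are hypothetical:

```python
import pandas as pd

# Hypothetical daily sales table: one row per day with a numeric "sales" column.
daily = pd.DataFrame(
    {"sales": [120.0, 95.5, 210.0, 180.25]},
    index=pd.to_datetime(["2011-01-29", "2011-01-30", "2011-02-01", "2011-02-02"]),
)

# Aggregate (summarize) daily sales into monthly totals: one form of data reduction.
monthly = daily.resample("M").sum()
print(monthly)
```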

Normalization

• Scale values to fall within a small, specified range

• Min-max normalization: maps [min_A, max_A] to [new_min_A, new_max_A]

  v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A

  • Ex. Let income range [$12,000, $98,000] be normalized to [0.0, 1.0]. Then $73,600 is mapped to
    (73,600 - 12,000) / (98,000 - 12,000) * (1.0 - 0) + 0 = 0.716

• Z-score normalization (µ: mean, σ: standard deviation):

  v' = (v - µ_A) / σ_A

  • Ex. Let µ = 54,000, σ = 16,000. Then 73,600 is mapped to
    (73,600 - 54,000) / 16,000 = 1.225

• Normalization by decimal scaling:

  v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
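A minimal NumPy sketch of the three normalization schemes above; the function names are illustrative, and the decimal-scaling rule assumes a nonzero maximum:

```python
import numpy as np

def min_max_normalize(v, new_min=0.0, new_max=1.0):
    """Map values from [min_A, max_A] onto [new_min, new_max]."""
    v = np.asarray(v, dtype=float)
    return (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

def z_score_normalize(v):
    """Center at the mean and scale by the standard deviation."""
    v = np.asarray(v, dtype=float)
    return (v - v.mean()) / v.std()

def decimal_scaling_normalize(v):
    """Divide by 10^j, with j the smallest integer so that max(|v'|) < 1."""
    v = np.asarray(v, dtype=float)
    j = int(np.floor(np.log10(np.abs(v).max()))) + 1
    return v / 10.0 ** j

income = np.array([12_000.0, 54_000.0, 73_600.0, 98_000.0])
print(min_max_normalize(income))          # 73,600 maps to ~0.716
print(z_score_normalize(income))          # uses this sample's own mean and std
print(decimal_scaling_normalize(income))  # divides by 10^5
```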

Discretization and Generalization

• Discretization: transform continuous attributes into discrete counterparts (intervals)
  • Supervised vs. unsupervised
  • Split (top-down) vs. merge (bottom-up)
• Generalization: generalize/replace low-level concepts (such as age ranges) by higher-level concepts (such as young, middle-aged, or senior)

Discretization Methods

• Binning or histogram analysis
  • Unsupervised, top-down split
• Clustering analysis
  • Unsupervised, either top-down split or bottom-up merge
• Entropy-based discretization
  • Supervised, top-down split

Entropy-Based Discretization

• Entropy based on the class distribution of the samples in a set S1, with m classes, where p_i is the probability of class i in S1:

  Entropy(S1) = - Σ_{i=1..m} p_i log2(p_i)

• Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the class entropy after partitioning is

  I(S, T) = (|S1| / |S|) Entropy(S1) + (|S2| / |S|) Entropy(S2)

• The boundary T that minimizes the entropy function is selected for binary discretization
• The process is recursively applied to the resulting partitions
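A minimal sketch of one step of this split search, assuming a numeric attribute x with class labels y; the helper names are illustrative:

```python
import numpy as np

def entropy(labels):
    """Class entropy -sum p_i * log2(p_i) of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(x, y):
    """Pick the boundary T that minimizes the weighted class entropy I(S, T)."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_t, best_i = None, np.inf
    # Candidate boundaries: midpoints between consecutive distinct values.
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue
        t = (x[i] + x[i - 1]) / 2.0
        left, right = y[:i], y[i:]
        i_st = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        if i_st < best_i:
            best_t, best_i = t, i_st
    return best_t, best_i

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array(["a", "a", "a", "b", "b", "b"])
print(best_split(x, y))  # boundary near 6.5, weighted entropy 0.0
```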

Information Entropy

• Information entropy: a measure of the uncertainty associated with a random variable
• Quantifies the information contained in a message as the minimum message length (# of bits) needed to communicate it
• Illustrative example:
  • P(X=A) = ¼, P(X=B) = ¼, P(X=C) = ¼, P(X=D) = ¼
    • BAACBADCDADDDA…
    • Minimum 2 bits per symbol (e.g. A = 00, B = 01, C = 10, D = 11)
    • 0100001001001110110011111100…
  • What if P(X=A) = ½, P(X=B) = ¼, P(X=C) = 1/8, P(X=D) = 1/8?
    • Minimum # of bits? E.g. A = 0, B = 10, C = 110, D = 111
• High entropy vs. low entropy
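Working out the skewed case: the prefix code A = 0, B = 10, C = 110, D = 111 matches the entropy bound, so the minimum average message length is

  H(X) = -Σ p_i log2(p_i) = ½(1) + ¼(2) + 1/8(3) + 1/8(3) = 1.75 bits per symbol,

compared with 2 bits per symbol for the uniform distribution (the higher-entropy case).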

Generalization for Categorical Attributes

• Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts
  • street < city < state < country
• Specification of a hierarchy for a set of values by explicit data grouping
  • {Atlanta, Savannah, Columbus} < Georgia
• Automatic generation of hierarchies (or attribute levels) by analysis of the number of distinct values
  • E.g., for a set of attributes: {street, city, state, country}

Automatic Concept Hierarchy Generation

• Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set
• The attribute with the most distinct values is placed at the lowest level of the hierarchy
• Exceptions, e.g., weekday, month, quarter, year
• Example hierarchy (from most to fewest distinct values):
  • street (674,339 distinct values) < city (3,567) < province_or_state (365) < country (15)
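A minimal sketch of this distinct-value heuristic with pandas; the table and its values are made up for illustration:

```python
import pandas as pd

# Hypothetical location table; the values are illustrative only.
df = pd.DataFrame({
    "country":           ["USA", "USA", "USA", "USA", "Canada"],
    "province_or_state": ["Georgia", "Georgia", "Georgia", "Ohio", "Ontario"],
    "city":              ["Atlanta", "Atlanta", "Savannah", "Columbus", "Toronto"],
    "street":            ["10 Peachtree St", "20 Peachtree St", "5 Bay St",
                          "1 High St", "99 King St"],
})

# Heuristic: the attribute with the most distinct values sits at the lowest level.
levels = df.nunique().sort_values().index.tolist()   # fewest -> most distinct values
print(" < ".join(reversed(levels)))  # street < city < province_or_state < country
```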

Data Exploration and Data Preprocessing

• Data and Attributes
• Data exploration
• Data pre-processing
  • Data cleaning
  • Data integration
  • Data transformation
  • Data reduction

Data Reduction

• Why data reduction?
  • A database/data warehouse may store terabytes of data
    • Number of data points
    • Number of dimensions
  • Complex data analysis/mining may take a very long time to run on the complete data set
• Data reduction
  • Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results

Data Reduction

• Instance reduction
  • Sampling (instance selection)
  • Numerosity reduction
• Dimension reduction
  • Feature selection
  • Feature extraction

Instance Reduction: Sampling

• Sampling: obtaining a small representative sample s to represent the whole data set N
• A sample is representative if it has approximately the same property (of interest) as the original set of data
• Statisticians sample because obtaining the entire set of data is too expensive or time consuming
• Data miners sample because processing the entire set of data is too expensive or time consuming
• Sampling method
• Sampling size

Why Sampling

A statistics professor was describing sampling theory.

Student: I don’t believe it, why not study the whole population in the first place?

The professor continued explaining sampling methods, the central limit theorem, etc.

Student: Too much theory, too risky, I couldn’t trust just a few numbers in place of ALL of them.

The professor explained the Nielsen television ratings.

Student: You mean that just a sample of a few thousand can tell us exactly what over 250 MILLION people are doing?

Professor: Well, the next time you go to the campus clinic and they want to do a blood test… tell them that’s not good enough… tell them to TAKE IT ALL!!

Sampling Methods

• Simple random sampling
  • There is an equal probability of selecting any particular item
• Stratified sampling
  • Split the data into several partitions (strata); then draw random samples from each partition
• Cluster sampling
  • Used when "natural" groupings are evident in a statistical population
• Sampling without replacement
  • As each item is selected, it is removed from the population
• Sampling with replacement
  • Objects are not removed from the population as they are selected for the sample, so the same object can be picked more than once

Simple random sampling without or with replacement

[Illustration: raw data sampled without and with replacement]

Stratified Sampling Illustration

[Illustration: raw data vs. stratified sample]
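A minimal sketch of simple random and stratified sampling with pandas; the data set and its "group" column are hypothetical:

```python
import pandas as pd

# Hypothetical data set with a "group" column to stratify on.
df = pd.DataFrame({
    "group": ["A"] * 70 + ["B"] * 20 + ["C"] * 10,
    "value": range(100),
})

# Simple random sampling without replacement (10% of the rows).
srs = df.sample(frac=0.10, replace=False, random_state=42)

# Simple random sampling with replacement: the same row can be drawn more than once.
srs_wr = df.sample(frac=0.10, replace=True, random_state=42)

# Stratified sampling: draw 10% from each group (stratum) separately,
# so the small "C" stratum is still represented.
stratified = df.groupby("group", group_keys=False).sample(frac=0.10, random_state=42)

print(srs["group"].value_counts())
print(stratified["group"].value_counts())
```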

Sampling Size

[Illustration: the same data set shown at 8000 points, 2000 points, and 500 points]

Sample Size

• What sample size is necessary to get at least one object from each of 10 groups?
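One way to explore this question empirically, assuming 10 equally likely groups (the simulation setup is illustrative, not part of the slides):

```python
import random

def prob_all_groups(sample_size, num_groups=10, trials=10_000, seed=0):
    """Estimate P(a random sample of the given size hits every group),
    assuming each draw falls into one of num_groups equally likely groups."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        seen = {rng.randrange(num_groups) for _ in range(sample_size)}
        if len(seen) == num_groups:
            hits += 1
    return hits / trials

for size in (10, 20, 40, 60):
    print(size, prob_all_groups(size))
```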

Data Reduction

• Instance reduction
  • Sampling (instance selection)
  • Numerosity reduction
• Dimension reduction
  • Feature selection
  • Feature extraction

Numerosity Reduction

• Reduce data volume by choosing alternative, smaller forms of data representation
• Parametric methods
  • Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
  • E.g. regression
• Non-parametric methods
  • Do not assume models
  • Major families: histograms, clustering

Regression Analysis

• Assume the data fits some model and estimate the model parameters
  • Linear regression: Y = b0 + b1X1 + b2X2 + … + bPXP
  • Line fitting: Y = b1X + b0
  • Polynomial fitting: Y = b2X^2 + b1X + b0
• Regression techniques
  • Least squares fitting
  • Vertical vs. perpendicular offsets
  • Outliers
  • Robust regression
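A minimal least-squares line-fitting sketch with NumPy; the toy data is made up for illustration:

```python
import numpy as np

# Toy data: y roughly follows 3x + 2 with a little noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 3.0 * x + 2.0 + rng.normal(scale=0.5, size=x.size)

# Least-squares line fitting (vertical offsets): Y = b1*X + b0
b1, b0 = np.polyfit(x, y, deg=1)
print(f"fitted line: Y = {b1:.2f}*X + {b0:.2f}")

# Storing only (b0, b1) instead of the 50 (x, y) points is the
# parametric form of numerosity reduction described above.
```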

Instance Reduction: Histograms

• Divide data into buckets and store the average (or sum) for each bucket
• Partitioning rules:
  • Equi-width: equal bucket range
  • Equi-depth: equal frequency
  • V-optimal: the least frequency variance
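A minimal sketch contrasting equi-width and equi-depth bucket boundaries with NumPy; the data values are a toy example:

```python
import numpy as np

data = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], dtype=float)
k = 3  # number of buckets

# Equi-width: buckets of equal value range.
width_edges = np.linspace(data.min(), data.max(), k + 1)

# Equi-depth: buckets holding (roughly) equal numbers of points.
depth_edges = np.quantile(data, np.linspace(0, 1, k + 1))

# Store only the bucket edges and per-bucket means instead of the raw data.
for name, edges in [("equi-width", width_edges), ("equi-depth", depth_edges)]:
    idx = np.clip(np.searchsorted(edges, data, side="right") - 1, 0, k - 1)
    means = [data[idx == b].mean() for b in range(k)]
    print(name, np.round(edges, 1), np.round(means, 1))
```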

Instance Reduction: Clustering

• Partition the data set into clusters based on similarity, and store only the cluster representation (e.g., centroid and diameter)
• Can be very effective if the data is clustered, but not if the data is “smeared”
• Can use hierarchical clustering and be stored in multi-dimensional index tree structures
• Cluster analysis will be studied in depth later

Data Reduction

• Instance reduction
  • Sampling (instance selection)
  • Numerosity reduction
• Dimension reduction
  • Feature selection
  • Feature extraction

Feature Subset Selection

• Select a subset of features such that the resulting data does not affect the mining result
• Redundant features
  • Duplicate much or all of the information contained in one or more other attributes
  • Example: purchase price of a product and the amount of sales tax paid
• Irrelevant features
  • Contain no information that is useful for the data mining task at hand
  • Example: students' ID is often irrelevant to the task of predicting students' GPA

Correlation between attributes

• Correlation measures the linear relationship between objects

Correlation Analysis (Numerical Data)

• Correlation coefficient (also called Pearson’s product moment coefficient):

  r_{A,B} = Σ (a_i - mean(A))(b_i - mean(B)) / ((n - 1) σ_A σ_B) = (Σ a_i b_i - n · mean(A) · mean(B)) / ((n - 1) σ_A σ_B)

  where n is the number of tuples, mean(A) and mean(B) are the respective means of A and B, σ_A and σ_B are the respective standard deviations of A and B, and Σ a_i b_i is the sum of the AB cross-product.

• r_{A,B} > 0: A and B are positively correlated (A’s values increase as B’s do)
• r_{A,B} = 0: no linear correlation
• r_{A,B} < 0: A and B are negatively correlated
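A minimal NumPy sketch of the correlation coefficient, computed both directly from the formula above and with the library routine; the toy attributes are made up:

```python
import numpy as np

# Two toy attributes: B is roughly 2*A plus noise, so they should be
# strongly positively correlated.
rng = np.random.default_rng(1)
a = rng.normal(size=200)
b = 2.0 * a + rng.normal(scale=0.5, size=200)

# Direct implementation of r_{A,B} with the (n - 1) sample convention.
n = len(a)
r_manual = np.sum((a - a.mean()) * (b - b.mean())) / ((n - 1) * a.std(ddof=1) * b.std(ddof=1))

# Library version for comparison.
r_numpy = np.corrcoef(a, b)[0, 1]
print(round(r_manual, 3), round(r_numpy, 3))
```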

Visually Evaluating Correlation

Scatter plots showing the Pearson correlation ranging from –1 to 1.

Correlation Analysis (Categorical Data)

• Χ² (chi-square) test:

  χ² = Σ (Observed - Expected)² / Expected

• The larger the Χ² value, the more likely the variables are related
• The cells that contribute the most to the Χ² value are those whose actual count is very different from the expected count

Chi-Square Calculation: An Example

• Χ² (chi-square) calculation (numbers in parentheses are expected counts calculated based on the data distribution in the two categories):

                             Play chess   Not play chess   Sum (row)
  Like science fiction        250 (90)      200 (360)         450
  Not like science fiction     50 (210)    1000 (840)        1050
  Sum (col.)                   300          1200             1500

  χ² = (250 - 90)²/90 + (50 - 210)²/210 + (200 - 360)²/360 + (1000 - 840)²/840 = 507.93

• It shows that like_science_fiction and play_chess are correlated in the group (a value of 10.828 is needed to reject the independence hypothesis at the 0.001 significance level)
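A minimal sketch reproducing this calculation with SciPy; correction=False is used so the plain chi-square sum on the slide is recovered:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Contingency table from the slide: rows = like / not like science fiction,
# columns = play / not play chess.
observed = np.array([[250, 200],
                     [50, 1000]])

# correction=False skips the Yates continuity correction, matching the slide.
chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(expected)          # [[90, 360], [210, 840]]
print(round(chi2, 2))    # ~507.93
print(dof, p_value)      # 1 degree of freedom, p-value far below 0.001
```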

Metrics of (in)dependence

• Mutual information: measures the mutual dependence between two attributes
  • What is the mutual information between two completely independent attributes? (Answer: zero)
• Kullback–Leibler divergence: asymmetric

Feature Selection

• Brute-force approach:
  • Try all possible feature subsets
• Heuristic methods:
  • Step-wise forward selection
  • Step-wise backward elimination
  • Combining forward selection and backward elimination
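One way to run step-wise selection in practice is scikit-learn's SequentialFeatureSelector; a minimal sketch (the k-NN estimator and iris data are illustrative choices, and this is really a wrapper-style approach in the terms of the next slide):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

# Step-wise forward selection: greedily add the feature that most improves
# cross-validated accuracy of the chosen estimator (here a k-NN classifier).
X, y = load_iris(return_X_y=True)
selector = SequentialFeatureSelector(
    KNeighborsClassifier(n_neighbors=3),
    n_features_to_select=2,
    direction="forward",   # "backward" gives step-wise backward elimination
)
selector.fit(X, y)
print(selector.get_support())  # boolean mask of the selected features
```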

Feature Selection

• Filter approaches:
  • Features are selected independently of the data mining algorithm
  • E.g. minimal pair-wise correlation/dependence, top-k information entropy
• Wrapper approaches:
  • Use the data mining algorithm as a black box to find the best subset
  • E.g. best classification accuracy
• Embedded approaches:
  • Feature selection occurs naturally as part of the data mining algorithm
  • E.g. decision tree classification

Data Reduction

• Instance reduction
  • Sampling
  • Aggregation
• Dimension reduction
  • Feature selection
  • Feature extraction/creation