CS570: Introduction to Data Mining - Emory University

Page 1:

CS570: Introduction to Data Mining

Fall 2013

Reading: Chapter 3 Han, Chapter 2 Tan

Anca Doloc-Mihu, Ph.D.

Some slides courtesy of Li Xiong, Ph.D. and

©2011 Han, Kamber & Pei. Data Mining. Morgan Kaufmann.

Page 2:

Data Exploration and Data Preprocessing

Data and Attributes

Data exploration

Data pre-processing

Data cleaning

Data integration

Data transformation

Data reduction

Page 3:

Data Transformation

Aggregation: summarization (data reduction)

E.g. Daily sales -> monthly sales

(Statistical) Normalization: scaled to fall within a small, specified range

E.g. income vs. age

Discretization and generalization

E.g. age -> youth, middle-aged, senior

Attribute construction: construct new attributes from given ones

E.g. birthday -> age

Page 4:

Data Aggregation

A function that maps the entire set of values of a given attribute to a new set of replacement values s.t. each old value can be identified with one of the new values

Data cubes store multidimensional aggregated information

Multiple levels of aggregation for analysis at multiple granularities
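As a minimal sketch of aggregation as data reduction, the daily-to-monthly sales example from the previous slide in pandas; the table and column names are illustrative, not from the slides:

```python
import pandas as pd

# Illustrative daily sales table; dates and amounts are made up.
daily = pd.DataFrame({
    "date":  pd.date_range("2013-01-01", periods=90, freq="D"),
    "sales": range(90),
})

# Aggregate: one total per month instead of one value per day.
monthly = daily.resample("MS", on="date")["sales"].sum()
print(monthly)
```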

Page 5:

Normalization

scaled to fall within a small, specified range

Min-max normalization: [minA, maxA] to [new_minA, new_maxA]

$v' = \frac{v - \min_A}{\max_A - \min_A}(\text{new\_max}_A - \text{new\_min}_A) + \text{new\_min}_A$

Ex. Let income range [$12,000, $98,000] be normalized to [0.0, 1.0]. Then $73,600 is mapped to $\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716$

Z-score normalization (μ: mean, σ: standard deviation):

$v' = \frac{v - \mu_A}{\sigma_A}$

Ex. Let μ = 54,000, σ = 16,000. Then $73,600 is mapped to $\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$

Normalization by decimal scaling:

$v' = \frac{v}{10^j}$, where j is the smallest integer such that $\max(|v'|) < 1$
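A minimal NumPy sketch of the three normalizations, on illustrative income values; the min-max result reproduces the 0.716 worked example above:

```python
import numpy as np

v = np.array([12_000.0, 54_000.0, 73_600.0, 98_000.0])  # illustrative incomes

# Min-max normalization to [0.0, 1.0].
minmax = (v - v.min()) / (v.max() - v.min()) * (1.0 - 0.0) + 0.0

# Z-score normalization (mean/std of this small array, not the slide's mu/sigma).
zscore = (v - v.mean()) / v.std()

# Decimal scaling: divide by the smallest power of 10 making max(|v'|) < 1.
j = int(np.ceil(np.log10(np.abs(v).max() + 1)))
decimal = v / 10**j

print(minmax[2])  # 0.716..., matching the worked example
```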

Page 6:

Discretization and Generalization

Discretization: transform continuous attributes into discrete counterparts (intervals)

Supervised vs. unsupervised

Split (top-down) vs. merge (bottom-up)

Generalization: generalize/replace low level concepts (such as age ranges) by higher level concepts (such as young, middle-aged, or senior)

Page 7:

Discretization Methods

Binning or histogram analysis

Unsupervised, top-down split

Clustering analysis

Unsupervised, either top-down split or bottom-up merge

Entropy-based discretization

Supervised, top-down split

Page 8:

Entropy-Based Discretization

Entropy is based on the class distribution of the samples in a set $S_1$: given m classes, $p_i$ is the probability of class i in $S_1$:

$Entropy(S_1) = -\sum_{i=1}^{m} p_i \log_2(p_i)$

Given a set of samples S, if S is partitioned into two intervals $S_1$ and $S_2$ using boundary T, the class information entropy after partitioning is

$I(S,T) = \frac{|S_1|}{|S|} Entropy(S_1) + \frac{|S_2|}{|S|} Entropy(S_2)$

The boundary that minimizes the entropy function is selected for binary discretization

The process is applied recursively to the resulting partitions
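A sketch of one supervised, entropy-based binary split (NumPy); the function names and toy data are my own:

```python
import numpy as np

def entropy(labels):
    """Class entropy of a label set: -sum p_i * log2(p_i)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(values, labels):
    """Find the boundary T minimizing the weighted class entropy I(S, T)."""
    order = np.argsort(values)
    values, labels = np.asarray(values)[order], np.asarray(labels)[order]
    best_t, best_i = None, np.inf
    for k in range(1, len(values)):
        if values[k] == values[k - 1]:
            continue  # no boundary between equal values
        t = (values[k] + values[k - 1]) / 2
        left, right = labels[:k], labels[k:]
        i_st = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if i_st < best_i:
            best_t, best_i = t, i_st
    return best_t, best_i

values = [1, 2, 3, 10, 11, 12]
labels = ["a", "a", "a", "b", "b", "b"]
print(best_split(values, labels))  # boundary 6.5 separates the classes perfectly
```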

Page 9:

Generalization for Categorical Attributes

Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts

street < city < state < country

Specification of a hierarchy for a set of values by explicit data grouping

{Atlanta, Savannah, Columbus} < Georgia

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values

E.g., for a set of attributes: {street, city, state, country}

Page 10:

Automatic Concept Hierarchy Generation

Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set

The attribute with the most distinct values is placed at the lowest level of the hierarchy

Exceptions, e.g., weekday, month, quarter, year

country - 15 distinct values

province_or_state - 365 distinct values

city - 3,567 distinct values

street - 674,339 distinct values
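A small pandas sketch of that heuristic, ordering attributes by their number of distinct values; the table contents are illustrative:

```python
import pandas as pd

# Hypothetical location table; rows and column values are made up.
df = pd.DataFrame({
    "street":  ["10 Main St", "22 Oak Ave", "5 Pine Rd", "77 Elm St", "9 Lake Dr"],
    "city":    ["Atlanta", "Atlanta", "Savannah", "Columbus", "Birmingham"],
    "state":   ["GA", "GA", "GA", "GA", "AL"],
    "country": ["USA", "USA", "USA", "USA", "USA"],
})

# Fewest distinct values -> top of the hierarchy.
hierarchy = df.nunique().sort_values().index.tolist()
print(hierarchy)  # ['country', 'state', 'city', 'street']
```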

Page 11:

Data Exploration and Data Preprocessing

Data and Attributes

Data exploration

Data pre-processing

Data cleaning

Data integration

Data transformation

Data reduction

Page 12:

Data Reduction

Why data reduction?

A database/data warehouse may store terabytes of data

Number of data points

Number of dimensions

Number of dimensions

Complex data analysis/mining may take a very long time to run on the complete data set

Data reduction

Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results

Page 13:

Data Reduction

Instance reduction

Sampling (instance selection)

Aggregation

Parametric reduction

Dimension reduction

Feature selection

Feature extraction


Page 14:

Instance Reduction: Sampling

Sampling: obtaining a small representative sample s to represent the whole data set N

A sample is representative if it has approximately the same property (of interest) as the original set of data

Statisticians sample because obtaining the entire set of data is too expensive or time consuming.

Data miners sample because processing the entire set of data is too expensive or time consuming

Issues:

Sampling method

Sampling size

Page 15:

Why sampling


A statistics professor was describing sampling theory

Student: I don’t believe it, why not study the whole population in the first place?

The professor continued explaining sampling methods, the central limit theorem, etc.

Student: Too much theory, too risky, I couldn’t trust just a few numbers in place of ALL of them.

The professor explained the Nielsen television ratings

Student: You mean that just a sample of a few thousand can tell us exactly what over 250 MILLION people are doing?

Professor: Well, the next time you go to the campus clinic and they want to do a blood test… tell them that's not good enough… tell them to TAKE IT ALL!

Page 16:

Sampling Methods

Simple random sampling: there is an equal probability of selecting any particular item

Stratified sampling: split the data into several partitions (strata), then draw random samples from each partition

Cluster sampling: used when "natural" groupings are evident in a statistical population

Sampling without replacement: as each item is selected, it is removed from the population

Sampling with replacement: objects are not removed from the population as they are selected for the sample, so the same object can be picked more than once
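A minimal pandas sketch of these methods; the 'segment' column standing in for the strata is illustrative:

```python
import pandas as pd

# Hypothetical data frame with a 'segment' column used as the stratum label.
df = pd.DataFrame({"value": range(100), "segment": ["A", "B", "C", "D"] * 25})

srswor = df.sample(n=10, replace=False, random_state=42)  # without replacement
srswr  = df.sample(n=10, replace=True,  random_state=42)  # with replacement

# Stratified sample: 10% drawn from each segment.
stratified = df.groupby("segment", group_keys=False).apply(
    lambda g: g.sample(frac=0.1, random_state=42)
)
```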

Page 17:

Simple random sampling without or with replacement

[Illustration: raw data reduced by SRSWOR (simple random sample without replacement) and by SRSWR (simple random sample with replacement)]

Page 18:

Stratified Sampling Illustration

[Illustration: raw data vs. a stratified sample drawn from it]

Page 19:

Sampling size


Page 20:

Sampling Size

[Figure: the same data set sampled at 8,000, 2,000, and 500 points]

Page 21:

Data Reduction

Instance reduction

Sampling (instance selection)

Numerosity reduction

Dimension reduction

Feature selection

Feature extraction


Page 22:

Numerosity Reduction

Reduce data volume by choosing alternative, smaller forms of data representation

Parametric methods

Assume the data fits some model, estimate model parameters, store only the parameters, and discard the data (except possible outliers)

Regression

Non-parametric methods

Do not assume models

Major families: histograms, clustering

Page 23:

Regression Analysis

Assume the data fits some model and estimate model parameters

Multiple linear regression: $Y = b_0 + b_1 X_1 + \dots + b_P X_P$

Line fitting: $Y = b_1 X + b_0$

Polynomial fitting: $Y = b_2 X^2 + b_1 X + b_0$

Regression techniques:

Least-squares fitting

Vertical vs. perpendicular offsets

Outliers

Robust regression (when there are many outliers)
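A minimal least-squares sketch with NumPy on synthetic data (the coefficients 3.0 and 2.0 are made up), showing the parametric-reduction idea: store two parameters instead of fifty points:

```python
import numpy as np

# Synthetic 1-D data; replace with real observations.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 3.0 * x + 2.0 + rng.normal(scale=1.0, size=x.size)

# Least-squares line fitting y = b1*x + b0, minimizing vertical offsets.
b1, b0 = np.polyfit(x, y, deg=1)

# The reduced representation keeps only (b0, b1), discarding the 50 points.
print(f"y = {b1:.2f}*x + {b0:.2f}")
```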

Page 24:

Instance Reduction: Histograms

Divide data into buckets (bins) and store average (sum) for each bucket

Partitioning rules:

Equi-width: equal bucket range

Equi-depth: equal frequency

V-optimal: with the least frequency variance

http://www.mathcs.emory.edu/~cheung/Courses/584-StreamDB/Syllabus/06-Histograms/v-opt1.html

http://www.mathcs.emory.edu/~cheung/Courses/584-StreamDB/Syllabus/06-Histograms/v-opt2.html

http://www.mathcs.emory.edu/~cheung/Courses/584-StreamDB/Syllabus/06-Histograms/v-opt3.html
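A pandas sketch contrasting equi-width and equi-depth partitioning on an illustrative value list:

```python
import pandas as pd

data = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equi-width: 3 buckets with equal value ranges.
equi_width = pd.cut(data, bins=3)

# Equi-depth: 3 buckets with (approximately) equal frequency.
equi_depth = pd.qcut(data, q=3)

# Reduced representation: one mean per bucket instead of the raw values.
print(data.groupby(equi_width, observed=True).mean())
print(data.groupby(equi_depth, observed=True).mean())
```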

Page 25:

Instance Reduction: Clustering

Partition the data set into clusters based on similarity, and store only the cluster representation (e.g., centroid and diameter)

Can be very effective if data is clustered, but not if data is "smeared"

Can use hierarchical clustering, with results stored in multi-dimensional index tree structures

Cluster analysis will be studied in depth later
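A sketch with scikit-learn's KMeans (assumed available), reducing 1,000 illustrative points to 10 centroids plus their sizes:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D data set to be reduced.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))

# Replace 1,000 points by 10 cluster representatives.
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)
centroids = km.cluster_centers_   # the reduced representation
sizes = np.bincount(km.labels_)   # how many points each centroid stands for
```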

Page 26:

Data Reduction

Instance reduction

Sampling (instance selection)

Numerosity reduction

Dimension reduction

Feature selection

Feature extraction


Page 27:

Feature Subset Selection

Select a subset of features such that mining the reduced data set produces the same (or almost the same) results

Redundant features

duplicate much or all of the information contained in one or more other attributes

Example: purchase price of a product and the amount of sales tax paid

Irrelevant features

contain no information that is useful for the data mining task at hand

Example: students' ID is often irrelevant to the task of predicting students' GPA

Page 28:

Correlation between attributes


Correlation measures the linear relationship between two attributes

Page 29:

Correlation Analysis (Numerical Data)

Correlation coefficient (also called Pearson's product-moment coefficient):

$r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum (AB) - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}$

where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B, and $\sum(AB)$ is the sum of the AB cross-product.

$r_{A,B} > 0$: A and B are positively correlated (A's values increase as B's do)

$r_{A,B} = 0$: uncorrelated (no linear relationship; independence implies $r_{A,B} = 0$, but not conversely)

$r_{A,B} < 0$: negatively correlated
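A NumPy sketch checking the formula against np.corrcoef on illustrative vectors:

```python
import numpy as np

# Illustrative attribute vectors A and B.
A = np.array([6.0, 8.0, 12.0, 15.0, 18.0])
B = np.array([10.0, 12.0, 20.0, 25.0, 30.0])

n = len(A)
# Sample correlation using the (n - 1) convention from the formula above.
r = ((A - A.mean()) * (B - B.mean())).sum() / ((n - 1) * A.std(ddof=1) * B.std(ddof=1))
print(r)                         # manual computation
print(np.corrcoef(A, B)[0, 1])   # NumPy's built-in, same value
```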

Page 30:

Visually Evaluating Correlation

Scatter plots showing the Pearson correlation from –1 to 1.

Page 31:

Correlation Analysis (Categorical Data)

Χ² (chi-square) test:

$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$

The larger the Χ² value, the more likely the variables are related

The cells that contribute the most to the Χ² value are those whose actual count is very different from the expected count

Page 32:

Chi-Square Calculation: An Example

Χ² (chi-square) calculation (numbers in parentheses are expected counts, calculated from the marginal totals of the two categories):

                          Play chess    Not play chess    Sum (row)
Like science fiction      250 (90)      200 (360)         450
Not like science fiction  50 (210)      1000 (840)        1050
Sum (col.)                300           1200              1500

$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93$

It shows that like_science_fiction and play_chess are correlated in the group (507.93 far exceeds the 10.828 threshold needed to reject the independence hypothesis at the 0.001 significance level with one degree of freedom)
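The same calculation with SciPy; correction=False disables the Yates continuity correction so the result matches the hand computation:

```python
from scipy.stats import chi2_contingency

# Observed counts from the table above.
observed = [[250, 200],
            [50, 1000]]

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(chi2)      # ~507.93
print(expected)  # [[90, 360], [210, 840]]
```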

Page 33:

Feature Selection

Brute-force approach:

Try all possible feature subsets

Heuristic methods

Step-wise forward selection

Step-wise backward elimination

Combining forward selection and backward elimination
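A greedy sketch of stepwise forward selection (scikit-learn assumed; cross-validated accuracy is one possible evaluation criterion, in the wrapper style of the next slide):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Greedily add the single feature that most improves CV accuracy.
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

selected, remaining, best_score = [], list(range(X.shape[1])), 0.0
while remaining:
    scores = {f: cross_val_score(model, X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    f_best = max(scores, key=scores.get)
    if scores[f_best] <= best_score:
        break  # no remaining feature improves the score; stop
    selected.append(f_best)
    remaining.remove(f_best)
    best_score = scores[f_best]

print(selected, best_score)
```

Backward elimination is the mirror image: start with all features and greedily drop the one whose removal hurts the criterion least.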

Page 34:

Feature Selection

Filter approaches:

Features are selected independent of data mining algorithm (before)

E.g. Minimal pair-wise correlation/dependence, top k information entropy

Wrapper approaches:

Use the data mining algorithm as a black box to find best subset

E.g. best classification accuracy

Embedded approaches:

Feature selection occurs naturally as part of the data mining algorithm – algorithm decides which attribute to select

E.g. Decision tree classification


Page 35:

Data Reduction

Instance reduction

Sampling

Aggregation

Dimension reduction

Feature selection

Feature extraction/creation


Page 36:

Feature Extraction

Create new features (attributes) by combining/mapping existing ones

Methods

Principal Component Analysis

Data compression methods – Discrete Wavelet Transform

Regression analysis

Page 37:

Principal Component Analysis (PCA)

Principal component analysis: find the dimensions that capture the most variance

A linear mapping of the data to a new coordinate system such that the greatest variance lies on the first coordinate (the first principal component), the second greatest variance on the second coordinate, and so on.

Steps:

Normalize input data: each attribute falls within the same range

Compute k orthonormal (unit) vectors, i.e., principal components; each input data point (vector) is a linear combination of the k principal component vectors

The principal components are sorted in order of decreasing "significance"

Weak components, i.e., those with low variance, can be eliminated
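A NumPy sketch of these steps via SVD on illustrative data (simple centering stands in for the normalization step; full standardization would also divide each column by its standard deviation):

```python
import numpy as np

# Hypothetical data matrix: rows are records, columns are attributes.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

# Step 1: normalize (center) each attribute.
Xc = X - X.mean(axis=0)

# Step 2: principal components are the right singular vectors of the
# centered data; squared singular values rank their "significance".
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained_var = s**2 / (len(X) - 1)

# Step 3: keep the k strongest components and project the data.
k = 2
X_reduced = Xc @ Vt[:k].T   # each record becomes a k-dimensional vector
```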

Page 38:

Illustration of Principal Component Analysis

[Figure: 2-D data plotted in original axes X1, X2 with principal component axes Y1, Y2 overlaid]

Page 39:

Example of Principal Component Analysis for biological data

Page 40:

Data Compression

Data compression: reduced representation of original data

Lossless vs. lossy

Common lossless techniques (string)

Run-length encoding

Entropy encoding – Huffman encoding, arithmetic encoding

Common lossy techniques (audio/video)

Discrete cosine transform

Wavelet transform

[Diagram: lossless compression maps original data to compressed data and back exactly; lossy compression recovers only an approximation of the original]
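As a concrete instance of a lossless technique, a minimal run-length encoding sketch; the function names are my own:

```python
from itertools import groupby

def rle_encode(s: str) -> list[tuple[str, int]]:
    """Run-length encoding: collapse each run of repeated characters."""
    return [(ch, len(list(run))) for ch, run in groupby(s)]

def rle_decode(pairs: list[tuple[str, int]]) -> str:
    """Expand each (character, count) pair back into a run."""
    return "".join(ch * n for ch, n in pairs)

encoded = rle_encode("aaaabbbcc")   # [('a', 4), ('b', 3), ('c', 2)]
assert rle_decode(encoded) == "aaaabbbcc"
```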

Page 41:

Wavelet Transformation

Discrete wavelet transform (DWT): a linear signal processing technique that divides a signal into different frequency components

Data compression/reduction: store only a small fraction of the strongest wavelet coefficients

Discrete wavelet functions

Haar wavelet

Daubechies wavelets

Page 42:

DWT Algorithm

Pyramid algorithm (averaging and differencing method):

Input data of length L (an integer power of 2)

Each transform applies two functions: smoothing (sum or average), then (weighted) differencing

Applied to pairs of data, yielding two sets of data of length L/2

The two functions are applied recursively until the desired length is reached

Coefficients are then selected by thresholding

Haar Wavelet Transform

Haar matrix (sum and difference)

Example: (4,6,10,8,1,9,5,3)

Filtering of data

Low pass filter (averaging)

High pass filter (differencing)
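A sketch of the averaging-and-differencing pyramid on the slide's example (plain averages are used; the normalized Haar transform would scale each step by √2):

```python
def haar_dwt(data):
    """Pyramid algorithm: repeatedly replace pairs by (average, difference)."""
    data = list(data)
    output = []
    while len(data) > 1:
        avgs = [(a + b) / 2 for a, b in zip(data[::2], data[1::2])]   # low pass
        diffs = [(a - b) / 2 for a, b in zip(data[::2], data[1::2])]  # high pass
        output = diffs + output   # keep detail coefficients at each level
        data = avgs               # recurse on the smoothed half
    return data + output          # overall average, then detail coefficients

print(haar_dwt([4, 6, 10, 8, 1, 9, 5, 3]))
# [5.75, 1.25, -2.0, 0.5, -1.0, 1.0, -4.0, 1.0]
```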


Page 43:

Example of DWT Based Image Compression

DWT compression for test image Lenna (threshold = 1)

Page 44:

Summary

Data Exploration and Data Preprocessing

Data and Attributes

Data exploration

Descriptive statistics

Data visualization

Data pre-processing

Data cleaning

Data integration

Data transformation

Data reduction

Next lecture

Frequent itemsets mining and association analysis