CS570: Introduction to Data Mining
Fall 2013
Reading: Chapter 3 Han, Chapter 2 Tan
Anca Doloc-Mihu, Ph.D.
Some slides courtesy of Li Xiong, Ph.D. and
©2011 Han, Kamber & Pei. Data Mining. Morgan Kaufmann.
Data Exploration and Data Preprocessing
Data and Attributes
Data exploration
Data pre-processing
Data cleaning
Data integration
Data transformation
Data reduction
Data Transformation
Aggregation: summarization (data reduction)
E.g. Daily sales -> monthly sales
(Statistical) Normalization: scaled to fall within a small, specified range
E.g. income vs. age
Discretization and generalization
E.g. age -> youth, middle-aged, senior
Attribute construction: construct new attributes from given ones
E.g. birthday -> age
Data Aggregation
A function that maps the entire set of values of a given attribute to a new set of replacement values s.t. each old value can be identified with one of the new values
Data cubes store multidimensional aggregated information
Multiple levels of aggregation for analysis at multiple granularities
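As a concrete illustration of aggregation, here is a minimal Python sketch that rolls hypothetical daily sales up to monthly totals; the records and figures are made up for illustration:

```python
from collections import defaultdict

# Hypothetical daily sales records: (date "YYYY-MM-DD", amount in dollars)
daily_sales = [("2013-09-01", 120.0), ("2013-09-02", 80.5),
               ("2013-10-01", 200.0), ("2013-10-15", 99.9)]

monthly_sales = defaultdict(float)
for date, amount in daily_sales:
    month = date[:7]                 # coarser granularity: "YYYY-MM"
    monthly_sales[month] += amount   # summarization reduces data volume

print(dict(monthly_sales))  # {'2013-09': 200.5, '2013-10': 299.9}
```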
Normalization
Scale values to fall within a small, specified range.
Min-max normalization: maps [min_A, max_A] to [new_min_A, new_max_A]:
v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
Ex. Let income range [$12,000, $98,000] be normalized to [0.0, 1.0]. Then $73,600 is mapped to ((73,600 - 12,000) / (98,000 - 12,000)) * (1.0 - 0) + 0 = 0.716
Z-score normalization (μ: mean, σ: standard deviation):
v' = (v - μ_A) / σ_A
Ex. Let μ = 54,000, σ = 16,000. Then $73,600 is mapped to (73,600 - 54,000) / 16,000 = 1.225
Normalization by decimal scaling:
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
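The three normalizations above translate directly into code. A minimal Python sketch reproducing the income examples (the function names are illustrative, not from the slides):

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization: map [min_a, max_a] onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mu, sigma):
    """Z-score normalization: center on the mean, scale by the std. deviation."""
    return (v - mu) / sigma

def decimal_scaling(v, j):
    """Decimal scaling: j is the smallest integer such that max(|v'|) < 1."""
    return v / 10 ** j

print(round(min_max(73600, 12000, 98000), 3))  # 0.716
print(z_score(73600, 54000, 16000))            # 1.225
```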
Discretization and Generalization
Discretization: transform a continuous attribute into discrete counterparts (intervals)
Supervised vs. unsupervised
Split (top-down) vs. merge (bottom-up)
Generalization: generalize/replace low-level concepts (such as age ranges) by higher-level concepts (such as young, middle-aged, or senior)
Discretization Methods
Binning or histogram analysis
Unsupervised, top-down split
Clustering analysis
Unsupervised, either top-down split or bottom-up merge
Entropy-based discretization
Supervised, top-down split
Entropy based on the class distribution of the samples in a set S1: m classes, p_i is the probability of class i in S1
Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the class entropy after partitioning is given by I(S, T) below
The boundary that minimizes the entropy function is selected for binary discretization
The process is recursively applied to the resulting partitions
Entropy-Based Discretization
Entropy of a set S1 with m classes, where p_i is the probability of class i in S1:
Entropy(S1) = - sum_{i=1..m} p_i log2(p_i)
Class entropy after partitioning S into intervals S1 and S2 using boundary T:
I(S, T) = (|S1| / |S|) Entropy(S1) + (|S2| / |S|) Entropy(S2)
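A minimal Python sketch of supervised, entropy-based binary discretization following the formulas above; the toy ages/labels and the function names are illustrative, not from the slides:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Entropy of a class-label multiset: -sum(p_i * log2(p_i))."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Return the boundary T that minimizes the class entropy I(S, T)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_t, best_i = None, float("inf")
    for k in range(1, n):  # candidate boundary between positions k-1 and k
        t = (pairs[k - 1][0] + pairs[k][0]) / 2
        left = [lbl for _, lbl in pairs[:k]]
        right = [lbl for _, lbl in pairs[k:]]
        i_st = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if i_st < best_i:
            best_t, best_i = t, i_st
    return best_t

# Toy data: ages with a binary class label; the best cut separates the classes
print(best_split([23, 25, 30, 45, 52, 60], ["y", "y", "y", "n", "n", "n"]))  # 37.5
```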
Generalization for Categorical Attributes
Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts
street < city < state < country
Specification of a hierarchy for a set of values by explicit data grouping
{Atlanta, Savannah, Columbus} < Georgia
Automatic generation of hierarchies (or attribute levels) by analysis of the number of distinct values
E.g., for a set of attributes: {street, city, state, country}
Automatic Concept Hierarchy Generation
Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set
The attribute with the most distinct values is placed at the lowest level of the hierarchy
Exceptions, e.g., weekday, month, quarter, year
country: 15 distinct values
province_or_state: 365 distinct values
city: 3,567 distinct values
street: 674,339 distinct values
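A small Python sketch of this automatic generation, assuming a handful of hypothetical address records; the attribute with the fewest distinct values ends up at the top of the hierarchy:

```python
# Hypothetical address records; a real run would scan the database table.
records = [
    {"street": "10 Main St", "city": "Atlanta",  "state": "GA", "country": "USA"},
    {"street": "5 Oak Ave",  "city": "Atlanta",  "state": "GA", "country": "USA"},
    {"street": "7 Pine Ln",  "city": "Savannah", "state": "GA", "country": "USA"},
    {"street": "1 Elm Rd",   "city": "Boston",   "state": "MA", "country": "USA"},
]

attrs = ["street", "city", "state", "country"]
counts = {a: len({r[a] for r in records}) for a in attrs}

# Fewest distinct values -> highest level of the hierarchy
hierarchy = sorted(attrs, key=lambda a: counts[a])
print(hierarchy)  # ['country', 'state', 'city', 'street']
```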
Data Exploration and Data Preprocessing
Data and Attributes
Data exploration
Data pre-processing
Data cleaning
Data integration
Data transformation
Data reduction
Data Reduction
Why data reduction?
A database/data warehouse may store terabytes of data
Number of data points
Number of dimensions
Complex data analysis/mining may take a very long time to run on the complete data set
Data reduction
Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
Data Reduction
Instance reduction
Sampling (instance selection)
Aggregation
Numerosity reduction
Dimension reduction
Feature selection
Feature extraction
Instance Reduction: Sampling
Sampling: obtaining a small representative sample s to represent the whole data set N
A sample is representative if it has approximately the same property (of interest) as the original set of data
Statisticians sample because obtaining the entire set of data is too expensive or time consuming.
Data miners sample because processing the entire set of data is too expensive or time consuming
Issues:
Sampling method
Sampling size
Why Sampling
A statistics professor was describing sampling theory.
Student: I don't believe it. Why not study the whole population in the first place?
The professor continued explaining sampling methods, the central limit theorem, etc.
Student: Too much theory, too risky. I couldn't trust just a few numbers in place of ALL of them.
The professor explained the Nielsen television ratings.
Student: You mean that just a sample of a few thousand can tell us exactly what over 250 MILLION people are doing?
Professor: Well, the next time you go to the campus clinic and they want to do a blood test... tell them that's not good enough... tell them to TAKE IT ALL!
Sampling Methods
Simple random sampling: there is an equal probability of selecting any particular item
Stratified sampling: split the data into several partitions (strata), then draw random samples from each partition
Cluster sampling: used when "natural" groupings are evident in a statistical population
Sampling without replacement: as each item is selected, it is removed from the population
Sampling with replacement: objects are not removed from the population as they are selected for the sample, so the same object can be picked more than once
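Three of these methods have near one-line equivalents in Python's standard library. A minimal sketch on a toy population (the stratum key is an arbitrary illustration):

```python
import random
from collections import defaultdict

population = list(range(100))   # toy data set

# SRSWOR: each item appears at most once in the sample
srswor = random.sample(population, 10)

# SRSWR: the same item can be picked more than once
srswr = [random.choice(population) for _ in range(10)]

# Stratified sampling: draw a proportional random sample from each stratum
strata = defaultdict(list)
for x in population:
    strata[x % 4].append(x)     # hypothetical stratum key
stratified = [x for group in strata.values()
              for x in random.sample(group, len(group) // 10)]
```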
Simple random sampling without or with replacement
Raw Data
SRSWOR
(simple random
sample without
replacement)
Final Data
Raw Data
SRSWR
(simple random
sample with
replacement)
Final Data
Stratified Sampling Illustration
[Figure: raw data partitioned into strata, with a random sample drawn from each stratum]
Sampling Size
[Figure: the same data set sampled at 8000, 2000, and 500 points]
Data Reduction
Instance reduction
Sampling (instance selection)
Numerosity reduction
Dimension reduction
Feature selection
Feature extraction
Numerosity Reduction
Reduce data volume by choosing alternative, smaller forms of data representation
Parametric methods
Assume the data fits some model, estimate model parameters, store only the parameters, and discard the data (except possible outliers)
Regression
Non-parametric methods
Do not assume models
Major families: histograms, clustering
Regression Analysis
Assume the data fits some model and estimate the model parameters
Multiple linear regression: Y = b_0 + b_1*X_1 + ... + b_P*X_P
Line fitting: Y = b_1*X + b_0
Polynomial fitting: Y = b_2*x^2 + b_1*x + b_0
Regression techniques
Least-squares fitting
Vertical vs. perpendicular offsets
Outliers
Robust regression (when there are many outliers)
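A minimal sketch of parametric reduction by least-squares line fitting with NumPy; the synthetic data set and random seed are illustrative:

```python
import numpy as np

# Synthetic points scattered around the (hypothetical) line y = 2x + 1
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(scale=0.5, size=x.size)

# Least-squares line fit (vertical offsets); the two coefficients can then
# replace the 50 stored points, which is the data reduction
b1, b0 = np.polyfit(x, y, deg=1)
print(b1, b0)   # close to 2 and 1
```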
Instance Reduction: Histograms
Divide data into buckets (bins) and store average (sum) for each bucket
Partitioning rules:
Equi-width: equal bucket range
Equi-depth: equal frequency
V-optimal: with the least frequency variance
http://www.mathcs.emory.edu/~cheung/Courses/584-StreamDB/Syllabus/06-Histograms/v-opt1.html
http://www.mathcs.emory.edu/~cheung/Courses/584-StreamDB/Syllabus/06-Histograms/v-opt2.html
http://www.mathcs.emory.edu/~cheung/Courses/584-StreamDB/Syllabus/06-Histograms/v-opt3.html
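A minimal Python sketch contrasting equi-width and equi-depth bucket boundaries on a toy value list (the function names are illustrative):

```python
data = sorted([5, 7, 8, 9, 12, 15, 18, 22, 30, 41])

def equi_width(data, k):
    """k buckets spanning equal value ranges."""
    lo, hi = data[0], data[-1]
    w = (hi - lo) / k
    return [(lo + i * w, lo + (i + 1) * w) for i in range(k)]

def equi_depth(data, k):
    """k buckets holding (approximately) equal numbers of values."""
    n = len(data)
    return [data[i * n // k:(i + 1) * n // k] for i in range(k)]

print(equi_width(data, 3))  # [(5.0, 17.0), (17.0, 29.0), (29.0, 41.0)]
print(equi_depth(data, 3))  # [[5, 7, 8], [9, 12, 15], [18, 22, 30, 41]]
```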
Instance Reduction: Clustering
Partition the data set into clusters based on similarity, and store only the cluster representation (e.g., centroid and diameter)
Can be very effective if data is clustered, but not if data is "smeared"
Can use hierarchical clustering, stored in multi-dimensional index tree structures
Cluster analysis will be studied in depth later
Data Reduction
Instance reduction
Sampling (instance selection)
Numerosity reduction
Dimension reduction
Feature selection
Feature extraction
Feature Subset Selection
Select a subset of features such that mining on the reduced data produces the same (or almost the same) result
Redundant features
duplicate much or all of the information contained in one or more other attributes
Example: purchase price of a product and the amount of sales tax paid
Irrelevant features
contain no information that is useful for the data mining task at hand
Example: students' ID is often irrelevant to the task of predicting students' GPA
Correlation between attributes
Correlation measures the linear relationship between objects
Correlation Analysis (Numerical Data)
Correlation coefficient (also called Pearson's product-moment coefficient):

r_{A,B} = sum (a_i - Ā)(b_i - B̄) / ((n - 1) σ_A σ_B) = (sum a_i*b_i - n*Ā*B̄) / ((n - 1) σ_A σ_B)

where n is the number of tuples, Ā and B̄ are the respective means of A and B, σ_A and σ_B are the respective standard deviations of A and B, and sum a_i*b_i is the sum of the AB cross-products.
r_{A,B} > 0: A and B are positively correlated (A's values increase as B's do)
r_{A,B} = 0: no linear correlation (independent attributes have r = 0, but r = 0 does not imply independence)
r_{A,B} < 0: A and B are negatively correlated
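A minimal Python implementation of the coefficient above; written with sums of squared deviations, the (n - 1) and standard-deviation factors combine and cancel:

```python
from math import sqrt

def pearson(a, b):
    """Sample Pearson correlation coefficient r_{A,B}."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0  (perfectly correlated)
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))  # -1.0 (negatively correlated)
```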
Visually Evaluating Correlation
Scatter plots showing the Pearson correlation from –1 to 1.
Correlation Analysis (Categorical Data)
χ² (chi-square) test:
χ² = sum over cells of (Observed - Expected)² / Expected
The larger the χ² value, the more likely the variables are related
The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count
Chi-Square Calculation: An Example
χ² (chi-square) calculation (numbers in parentheses are expected counts, calculated from the data distribution in the two categories):

                          Play chess   Not play chess   Sum (row)
Like science fiction       250 (90)      200 (360)         450
Not like science fiction    50 (210)    1000 (840)        1050
Sum (col.)                 300          1200              1500

χ² = (250 - 90)²/90 + (50 - 210)²/210 + (200 - 360)²/360 + (1000 - 840)²/840 = 507.93

It shows that like_science_fiction and play_chess are correlated in the group (a χ² of at least 10.828 is needed to reject the independence hypothesis at the 0.001 significance level with 1 degree of freedom)
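The same χ² arithmetic as a minimal Python sketch, using the observed and expected cell counts from the table above:

```python
def chi_square(observed, expected):
    """Chi-square statistic: sum over cells of (o - e)^2 / e."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Cell counts from the contingency table above; each expected count is
# row_sum * col_sum / total, e.g. 450 * 300 / 1500 = 90
observed = [250, 50, 200, 1000]
expected = [90, 210, 360, 840]
print(chi_square(observed, expected))  # 507.936..., the 507.93 above
```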
Feature Selection
Brute-force approach:
Try all possible feature subsets
Heuristic methods
Step-wise forward selection
Step-wise backward elimination
Combining forward selection and backward elimination
Feature Selection
Filter approaches:
Features are selected independently of the data mining algorithm (before it runs)
E.g. minimal pair-wise correlation/dependence, top-k information entropy
Wrapper approaches:
Use the data mining algorithm as a black box to find the best subset (a sketch follows below)
E.g. best classification accuracy
Embedded approaches:
Feature selection occurs naturally as part of the data mining algorithm; the algorithm decides which attribute to select
E.g. decision tree classification
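A minimal sketch of wrapper-style step-wise forward selection, as referenced above; the score function and usefulness values are hypothetical stand-ins for, e.g., cross-validated classification accuracy:

```python
def forward_selection(all_features, score, k):
    """Greedy step-wise forward selection using a black-box score."""
    selected = []
    while len(selected) < k:
        remaining = [f for f in all_features if f not in selected]
        # Add the feature that most improves the black-box score
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
    return selected

# Toy score: pretend each feature has a fixed usefulness (hypothetical values)
usefulness = {"age": 0.9, "income": 0.7, "id": 0.0, "zip": 0.2}
score = lambda feats: sum(usefulness[f] for f in feats)
print(forward_selection(list(usefulness), score, 2))  # ['age', 'income']
```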
Data Reduction
Instance reduction
Sampling
Aggregation
Dimension reduction
Feature selection
Feature extraction/creation
Feature Extraction
Create new features (attributes) by combining/mapping existing ones
Methods
Principal Component Analysis
Data compression methods – Discrete Wavelet Transform
Regression analysis
Principal Component Analysis (PCA)
Principal component analysis: find the dimensions that capture the most variance
A linear mapping of the data to a new coordinate system such that the greatest variance lies on the first coordinate (the first principal component), the second greatest variance on the second coordinate, and so on.
Steps
Normalize the input data: each attribute falls within the same range
Compute k orthonormal (unit) vectors, i.e., principal components; each input data point (vector) is a linear combination of the k principal component vectors
The principal components are sorted in order of decreasing "significance"
Weak components, i.e., those with low variance, can be eliminated
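A minimal PCA sketch with NumPy via eigendecomposition of the covariance matrix, following the steps above; the synthetic data set is illustrative:

```python
import numpy as np

def pca(X, k):
    """Project X (n samples x d attributes) onto its top-k principal components."""
    Xc = X - X.mean(axis=0)            # center each attribute
    cov = np.cov(Xc, rowvar=False)     # covariance matrix of the attributes
    vals, vecs = np.linalg.eigh(cov)   # orthonormal eigenvectors, ascending values
    order = np.argsort(vals)[::-1]     # sort by decreasing variance
    return Xc @ vecs[:, order[:k]]     # keep only the k strongest components

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.01 * rng.normal(size=100)   # nearly redundant attribute
print(pca(X, 2).shape)  # (100, 2): dimension reduced from 3 to 2
```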
Illustration of Principal Component Analysis
[Figure: data plotted on original axes X1 and X2, with principal component axes Y1 and Y2 overlaid]
[Figure: example of Principal Component Analysis for biological data]
Data Compression
Data compression: reduced representation of original data
Lossless vs. lossy
Common lossless techniques (string)
Run-length encoding
Entropy encoding – Huffman encoding, arithmetic encoding
Common lossy techniques (audio/video)
Discrete cosine transform
Wavelet transform
[Figure: lossless compression maps the original data to compressed data and back exactly; lossy compression recovers only an approximation of the original data]
Wavelet Transformation
Discrete wavelet transform (DWT): a linear signal processing technique that divides a signal into different frequency components
Data compression/reduction: store only a small fraction of the strongest wavelet coefficients
Discrete wavelet functions
Haar wavelet
Daubechies wavelets
DWT Algorithm
Pyramid algorithm: averaging and differencing method
Input data of length L (an integer power of 2)
Each transform has 2 functions: smoothing (sum, avg), then (weighted) differencing
Applied to pairs of data, resulting in two sets of data of length L/2
The two functions are applied recursively until the desired length is reached
Select coefficients by threshold
Haar Wavelet Transform
Haar matrix (pairwise sum and difference)
Example: (4, 6, 10, 8, 1, 9, 5, 3); see the worked sketch below
Filtering of data
Low pass filter (averaging)
High pass filter (differencing)
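A minimal Python sketch of the pyramid algorithm (averaging and differencing, in the unnormalized convention the slide describes) applied to the example above; the output lists the overall average followed by detail coefficients from coarse to fine:

```python
def haar_dwt(data):
    """Pyramid algorithm: pairwise averaging and differencing, applied recursively."""
    data = list(data)
    out = []
    while len(data) > 1:
        avgs  = [(a + b) / 2 for a, b in zip(data[::2], data[1::2])]  # low pass
        diffs = [(a - b) / 2 for a, b in zip(data[::2], data[1::2])]  # high pass
        out = diffs + out        # keep the detail coefficients from this level
        data = avgs              # recurse on the smoothed half
    return data + out            # overall average, then coarse-to-fine details

print(haar_dwt([4, 6, 10, 8, 1, 9, 5, 3]))
# [5.75, 1.25, -2.0, 0.5, -1.0, 1.0, -4.0, 1.0]
```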
Example of DWT Based Image Compression
DWT compression for test image Lenna (threshold = 1)
Summary
Data Exploration and Data Preprocessing
Data and Attributes
Data exploration
Descriptive statistics
Data visualization
Data pre-processing
Data cleaning
Data integration
Data transformation
Data reduction
Next lecture
Frequent itemsets mining and association analysis