CS570: Introduction to Data Mining
Fall 2013
Reading: Chapter 3 Han, Chapter 2 Tan
Anca Doloc-Mihu, Ph.D.
Some slides courtesy of Li Xiong, Ph.D. and
©2011 Han, Kamber & Pei. Data Mining. Morgan Kaufmann.
Data Exploration and Data Preprocessing
Data and Attributes
Data exploration
Data pre-processing
Data cleaning
Data integration
Data transformation
Data reduction
Data Transformation
Aggregation: summarization (data reduction)
E.g. Daily sales -> monthly sales
(Statistical) Normalization: scaled to fall within a small, specified range
E.g. income vs. age
Discretization and generalization
E.g. age -> youth, middle-aged, senior
Attribute construction: construct new attributes from given ones
E.g. birthday -> age
Data Aggregation
A function that maps the entire set of values of a given attribute to a new set of replacement values s.t. each old value can be identified with one of the new values
Data cubes store multidimensional aggregated information
Multiple levels of aggregation for analysis at multiple granularities
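As a concrete illustration of aggregation, here is a minimal Python sketch that rolls hypothetical daily sales up to monthly totals; the records and figures are made up for illustration:

```python
from collections import defaultdict

# Hypothetical daily sales records: (date "YYYY-MM-DD", amount in dollars)
daily_sales = [("2013-09-01", 120.0), ("2013-09-02", 80.5),
               ("2013-10-01", 200.0), ("2013-10-15", 99.9)]

monthly_sales = defaultdict(float)
for date, amount in daily_sales:
    month = date[:7]                 # coarser granularity: "YYYY-MM"
    monthly_sales[month] += amount   # summarization reduces data volume

print(dict(monthly_sales))  # {'2013-09': 200.5, '2013-10': 299.9}
```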
Normalization
Scale values to fall within a small, specified range.
Min-max normalization: maps [min_A, max_A] to [new_min_A, new_max_A]:
v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
Ex. Let income range [$12,000, $98,000] be normalized to [0.0, 1.0]. Then $73,600 is mapped to ((73,600 - 12,000) / (98,000 - 12,000)) * (1.0 - 0) + 0 = 0.716
Z-score normalization (μ: mean, σ: standard deviation):
v' = (v - μ_A) / σ_A
Ex. Let μ = 54,000, σ = 16,000. Then $73,600 is mapped to (73,600 - 54,000) / 16,000 = 1.225
Normalization by decimal scaling:
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
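The three normalizations above translate directly into code. A minimal Python sketch reproducing the income examples (the function names are illustrative, not from the slides):

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization: map [min_a, max_a] onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mu, sigma):
    """Z-score normalization: center on the mean, scale by the std. deviation."""
    return (v - mu) / sigma

def decimal_scaling(v, j):
    """Decimal scaling: j is the smallest integer such that max(|v'|) < 1."""
    return v / 10 ** j

print(round(min_max(73600, 12000, 98000), 3))  # 0.716
print(z_score(73600, 54000, 16000))            # 1.225
```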
Discretization and Generalization
Discretization: transform a continuous attribute into discrete counterparts (intervals)
Supervised vs. unsupervised
Split (top-down) vs. merge (bottom-up)
Generalization: generalize/replace low-level concepts (such as age ranges) by higher-level concepts (such as young, middle-aged, or senior)
Discretization Methods
Binning or histogram analysis
Unsupervised, top-down split
Clustering analysis
Unsupervised, either top-down split or bottom-up merge
Entropy-based discretization
Supervised, top-down split
Entropy based on the class distribution of the samples in a set S1: m classes, p_i is the probability of class i in S1
Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the class entropy after partitioning is given by I(S, T) below
The boundary that minimizes the entropy function is selected for binary discretization
The process is recursively applied to the resulting partitions
Entropy-Based Discretization
Entropy of a set S1 with m classes, where p_i is the probability of class i in S1:
Entropy(S1) = - sum_{i=1..m} p_i log2(p_i)
Class entropy after partitioning S into intervals S1 and S2 using boundary T:
I(S, T) = (|S1| / |S|) Entropy(S1) + (|S2| / |S|) Entropy(S2)
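A minimal Python sketch of supervised, entropy-based binary discretization following the formulas above; the toy ages/labels and the function names are illustrative, not from the slides:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Entropy of a class-label multiset: -sum(p_i * log2(p_i))."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Return the boundary T that minimizes the class entropy I(S, T)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_t, best_i = None, float("inf")
    for k in range(1, n):  # candidate boundary between positions k-1 and k
        t = (pairs[k - 1][0] + pairs[k][0]) / 2
        left = [lbl for _, lbl in pairs[:k]]
        right = [lbl for _, lbl in pairs[k:]]
        i_st = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if i_st < best_i:
            best_t, best_i = t, i_st
    return best_t

# Toy data: ages with a binary class label; the best cut separates the classes
print(best_split([23, 25, 30, 45, 52, 60], ["y", "y", "y", "n", "n", "n"]))  # 37.5
```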
Generalization for Categorical Attributes
Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts
street < city < state < country
Specification of a hierarchy for a set of values by explicit data grouping
{Atlanta, Savannah, Columbus} < Georgia
Automatic generation of hierarchies (or attribute levels) by analysis of the number of distinct values
E.g., for a set of attributes: {street, city, state, country}
Automatic Concept Hierarchy Generation
Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set
The attribute with the most distinct values is placed at the lowest level of the hierarchy
Exceptions, e.g., weekday, month, quarter, year
country: 15 distinct values
province_or_state: 365 distinct values
city: 3,567 distinct values
street: 674,339 distinct values
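A small Python sketch of this automatic generation, assuming a handful of hypothetical address records; the attribute with the fewest distinct values ends up at the top of the hierarchy:

```python
# Hypothetical address records; a real run would scan the database table.
records = [
    {"street": "10 Main St", "city": "Atlanta",  "state": "GA", "country": "USA"},
    {"street": "5 Oak Ave",  "city": "Atlanta",  "state": "GA", "country": "USA"},
    {"street": "7 Pine Ln",  "city": "Savannah", "state": "GA", "country": "USA"},
    {"street": "1 Elm Rd",   "city": "Boston",   "state": "MA", "country": "USA"},
]

attrs = ["street", "city", "state", "country"]
counts = {a: len({r[a] for r in records}) for a in attrs}

# Fewest distinct values -> highest level of the hierarchy
hierarchy = sorted(attrs, key=lambda a: counts[a])
print(hierarchy)  # ['country', 'state', 'city', 'street']
```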
Data Exploration and Data Preprocessing
Data and Attributes
Data exploration
Data pre-processing
Data cleaning
Data integration
Data transformation
Data reduction
Data Reduction
Why data reduction?
A database/data warehouse may store terabytes of data
Number of data points
Number of dimensions
Complex data analysis/mining may take a very long time to run on the complete data set
Data reduction
Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
Data Reduction
Instance reduction
Sampling (instance selection)
Aggregation
Numerosity reduction
Dimension reduction
Feature selection
Feature extraction
Instance Reduction: Sampling
Sampling: obtaining a small representative sample s to represent the whole data set N
A sample is representative if it has approximately the same property (of interest) as the original set of data
Statisticians sample because obtaining the entire set of data is too expensive or time consuming.
Data miners sample because processing the entire set of data is too expensive or time consuming
Issues:
Sampling method
Sampling size
Why Sampling
A statistics professor was describing sampling theory.
Student: I don't believe it. Why not study the whole population in the first place?
The professor continued explaining sampling methods, the central limit theorem, etc.
Student: Too much theory, too risky. I couldn't trust just a few numbers in place of ALL of them.
The professor explained the Nielsen television ratings.
Student: You mean that just a sample of a few thousand can tell us exactly what over 250 MILLION people are doing?
Professor: Well, the next time you go to the campus clinic and they want to do a blood test... tell them that's not good enough... tell them to TAKE IT ALL!
Sampling Methods
Simple random sampling: there is an equal probability of selecting any particular item
Stratified sampling: split the data into several partitions (strata), then draw random samples from each partition
Cluster sampling: used when "natural" groupings are evident in a statistical population
Sampling without replacement: as each item is selected, it is removed from the population
Sampling with replacement: objects are not removed from the population as they are selected for the sample, so the same object can be picked more than once
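Three of these methods have near one-line equivalents in Python's standard library. A minimal sketch on a toy population (the stratum key is an arbitrary illustration):

```python
import random
from collections import defaultdict

population = list(range(100))   # toy data set

# SRSWOR: each item appears at most once in the sample
srswor = random.sample(population, 10)

# SRSWR: the same item can be picked more than once
srswr = [random.choice(population) for _ in range(10)]

# Stratified sampling: draw a proportional random sample from each stratum
strata = defaultdict(list)
for x in population:
    strata[x % 4].append(x)     # hypothetical stratum key
stratified = [x for group in strata.values()
              for x in random.sample(group, len(group) // 10)]
```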
Simple random sampling without or with replacement
Raw Data
SRSWOR
(simple random
sample without
replacement)
Final Data
Raw Data
SRSWR
(simple random
sample with
replacement)
Final Data
Stratified Sampling Illustration
[Figure: raw data partitioned into strata, with a random sample drawn from each stratum]
Sampling Size
[Figure: the same data set sampled at 8000, 2000, and 500 points]
Data Reduction
Instance reduction
Sampling (instance selection)
Numerosity reduction
Dimension reduction
Feature selection
Feature extraction
Numerosity Reduction
Reduce data volume by choosing alternative, smaller forms of data representation
Parametric methods
Assume the data fits some model, estimate model parameters, store only the parameters, and discard the data (except possible outliers)
Regression
Non-parametric methods
Do not assume models
Major families: histograms, clustering
Regression Analysis
Assume the data fits some model and estimate the model parameters
Multiple linear regression: Y = b_0 + b_1*X_1 + ... + b_P*X_P
Line fitting: Y = b_1*X + b_0
Polynomial fitting: Y = b_2*x^2 + b_1*x + b_0
Regression techniques
Least-squares fitting
Vertical vs. perpendicular offsets
Outliers
Robust regression (when there are many outliers)
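A minimal sketch of parametric reduction by least-squares line fitting with NumPy; the synthetic data set and random seed are illustrative:

```python
import numpy as np

# Synthetic points scattered around the (hypothetical) line y = 2x + 1
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(scale=0.5, size=x.size)

# Least-squares line fit (vertical offsets); the two coefficients can then
# replace the 50 stored points, which is the data reduction
b1, b0 = np.polyfit(x, y, deg=1)
print(b1, b0)   # close to 2 and 1
```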
Instance Reduction: Histograms
Divide data into buckets (bins) and store average (sum) for each bucket
Partitioning rules:
Equi-width: equal bucket range
Equi-depth: equal frequency
V-optimal: with the least frequency variance
http://www.mathcs.emory.edu/~cheung/Courses/584-StreamDB/Syllabus/06-Histograms/v-opt1.html
http://www.mathcs.emory.edu/~cheung/Courses/584-StreamDB/Syllabus/06-Histograms/v-opt2.html
http://www.mathcs.emory.edu/~cheung/Courses/584-StreamDB/Syllabus/06-Histograms/v-opt3.html
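A minimal Python sketch contrasting equi-width and equi-depth bucket boundaries on a toy value list (the function names are illustrative):

```python
data = sorted([5, 7, 8, 9, 12, 15, 18, 22, 30, 41])

def equi_width(data, k):
    """k buckets spanning equal value ranges."""
    lo, hi = data[0], data[-1]
    w = (hi - lo) / k
    return [(lo + i * w, lo + (i + 1) * w) for i in range(k)]

def equi_depth(data, k):
    """k buckets holding (approximately) equal numbers of values."""
    n = len(data)
    return [data[i * n // k:(i + 1) * n // k] for i in range(k)]

print(equi_width(data, 3))  # [(5.0, 17.0), (17.0, 29.0), (29.0, 41.0)]
print(equi_depth(data, 3))  # [[5, 7, 8], [9, 12, 15], [18, 22, 30, 41]]
```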
Instance Reduction: Clustering
Partition the data set into clusters based on similarity, and store only the cluster representation (e.g., centroid and diameter)
Can be very effective if data is clustered, but not if data is "smeared"
Can use hierarchical clustering, stored in multi-dimensional index tree structures
Cluster analysis will be studied in depth later
Data Reduction
Instance reduction
Sampling (instance selection)
Numerosity reduction
Dimension reduction
Feature selection
Feature extraction
Feature Subset Selection
Select a subset of features such that mining on the reduced data produces the same (or almost the same) result
Redundant features
duplicate much or all of the information contained in one or more other attributes
Example: purchase price of a product and the amount of sales tax paid
Irrelevant features
contain no information that is useful for the data mining task at hand
Example: students' ID is often irrelevant to the task of predicting students' GPA
Correlation between attributes
Correlation measures the linear relationship between objects
Correlation Analysis (Numerical Data)
Correlation coefficient (also called Pearson's product-moment coefficient):

r_{A,B} = sum (a_i - Ā)(b_i - B̄) / ((n - 1) σ_A σ_B) = (sum a_i*b_i - n*Ā*B̄) / ((n - 1) σ_A σ_B)

where n is the number of tuples, Ā and B̄ are the respective means of A and B, σ_A and σ_B are the respective standard deviations of A and B, and sum a_i*b_i is the sum of the AB cross-products.
r_{A,B} > 0: A and B are positively correlated (A's values increase as B's do)
r_{A,B} = 0: no linear correlation (independent attributes have r = 0, but r = 0 does not imply independence)
r_{A,B} < 0: A and B are negatively correlated
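A minimal Python implementation of the coefficient above; written with sums of squared deviations, the (n - 1) and standard-deviation factors combine and cancel:

```python
from math import sqrt

def pearson(a, b):
    """Sample Pearson correlation coefficient r_{A,B}."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0  (perfectly correlated)
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))  # -1.0 (negatively correlated)
```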
Visually Evaluating Correlation
Scatter plots showing the Pearson correlation from –1 to 1.
Correlation Analysis (Categorical Data)
χ² (chi-square) test:
χ² = sum over cells of (Observed - Expected)² / Expected
The larger the χ² value, the more likely the variables are related
The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count
Chi-Square Calculation: An Example
χ² (chi-square) calculation (numbers in parentheses are expected counts, calculated from the data distribution in the two categories):

                          Play chess   Not play chess   Sum (row)
Like science fiction       250 (90)      200 (360)         450
Not like science fiction    50 (210)    1000 (840)        1050
Sum (col.)                 300          1200              1500

χ² = (250 - 90)²/90 + (50 - 210)²/210 + (200 - 360)²/360 + (1000 - 840)²/840 = 507.93

It shows that like_science_fiction and play_chess are correlated in the group (a χ² of at least 10.828 is needed to reject the independence hypothesis at the 0.001 significance level with 1 degree of freedom)
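The same χ² arithmetic as a minimal Python sketch, using the observed and expected cell counts from the table above:

```python
def chi_square(observed, expected):
    """Chi-square statistic: sum over cells of (o - e)^2 / e."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Cell counts from the contingency table above; each expected count is
# row_sum * col_sum / total, e.g. 450 * 300 / 1500 = 90
observed = [250, 50, 200, 1000]
expected = [90, 210, 360, 840]
print(chi_square(observed, expected))  # 507.936..., the 507.93 above
```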
Feature Selection
Brute-force approach:
Try all possible feature subsets
Heuristic methods
Step-wise forward selection
Step-wise backward elimination
Combining forward selection and backward elimination
Feature Selection
Filter approaches:
Features are selected independently of the data mining algorithm (before it runs)
E.g. minimal pair-wise correlation/dependence, top-k information entropy
Wrapper approaches:
Use the data mining algorithm as a black box to find the best subset (a sketch follows below)
E.g. best classification accuracy
Embedded approaches:
Feature selection occurs naturally as part of the data mining algorithm; the algorithm decides which attribute to select
E.g. decision tree classification
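A minimal sketch of wrapper-style step-wise forward selection, as referenced above; the score function and usefulness values are hypothetical stand-ins for, e.g., cross-validated classification accuracy:

```python
def forward_selection(all_features, score, k):
    """Greedy step-wise forward selection using a black-box score."""
    selected = []
    while len(selected) < k:
        remaining = [f for f in all_features if f not in selected]
        # Add the feature that most improves the black-box score
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
    return selected

# Toy score: pretend each feature has a fixed usefulness (hypothetical values)
usefulness = {"age": 0.9, "income": 0.7, "id": 0.0, "zip": 0.2}
score = lambda feats: sum(usefulness[f] for f in feats)
print(forward_selection(list(usefulness), score, 2))  # ['age', 'income']
```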
Data Reduction
Instance reduction
Sampling
Aggregation
Dimension reduction
Feature selection
Feature extraction/creation
Feature Extraction
Create new features (attributes) by combining/mapping existing ones
Methods
Principal Component Analysis
Data compression methods – Discrete Wavelet Transform
Regression analysis
Principal Component Analysis (PCA)
Principal component analysis: find the dimensions that capture the most variance
A linear mapping of the data to a new coordinate system such that the greatest variance lies on the first coordinate (the first principal component), the second greatest variance on the second coordinate, and so on.
Steps
Normalize the input data: each attribute falls within the same range
Compute k orthonormal (unit) vectors, i.e., principal components; each input data point (vector) is a linear combination of the k principal component vectors
The principal components are sorted in order of decreasing "significance"
Weak components, i.e., those with low variance, can be eliminated
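A minimal PCA sketch with NumPy via eigendecomposition of the covariance matrix, following the steps above; the synthetic data set is illustrative:

```python
import numpy as np

def pca(X, k):
    """Project X (n samples x d attributes) onto its top-k principal components."""
    Xc = X - X.mean(axis=0)            # center each attribute
    cov = np.cov(Xc, rowvar=False)     # covariance matrix of the attributes
    vals, vecs = np.linalg.eigh(cov)   # orthonormal eigenvectors, ascending values
    order = np.argsort(vals)[::-1]     # sort by decreasing variance
    return Xc @ vecs[:, order[:k]]     # keep only the k strongest components

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.01 * rng.normal(size=100)   # nearly redundant attribute
print(pca(X, 2).shape)  # (100, 2): dimension reduced from 3 to 2
```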
Illustration of Principal Component Analysis
[Figure: data plotted on original axes X1 and X2, with principal component axes Y1 and Y2 overlaid]
[Figure: example of Principal Component Analysis for biological data]
Data Compression
Data compression: reduced representation of original data
Lossless vs. lossy
Common lossless techniques (string)
Run-length encoding
Entropy encoding – Huffman encoding, arithmetic encoding
Common lossy techniques (audio/video)
Discrete cosine transform
Wavelet transform
[Figure: lossless compression maps the original data to compressed data and back exactly; lossy compression recovers only an approximation of the original data]
Wavelet Transformation
Discrete wavelet transform (DWT): a linear signal processing technique that divides a signal into different frequency components
Data compression/reduction: store only a small fraction of the strongest wavelet coefficients
Discrete wavelet functions
Haar wavelet
Daubechies wavelets
DWT Algorithm
Pyramid algorithm: averaging and differencing method
Input data of length L (an integer power of 2)
Each transform has 2 functions: smoothing (sum, avg), then (weighted) differencing
Applied to pairs of data, resulting in two sets of data of length L/2
The two functions are applied recursively until the desired length is reached
Select coefficients by threshold
Haar Wavelet Transform
Haar matrix (pairwise sum and difference)
Example: (4, 6, 10, 8, 1, 9, 5, 3); see the worked sketch below
Filtering of data
Low pass filter (averaging)
High pass filter (differencing)
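A minimal Python sketch of the pyramid algorithm (averaging and differencing, in the unnormalized convention the slide describes) applied to the example above; the output lists the overall average followed by detail coefficients from coarse to fine:

```python
def haar_dwt(data):
    """Pyramid algorithm: pairwise averaging and differencing, applied recursively."""
    data = list(data)
    out = []
    while len(data) > 1:
        avgs  = [(a + b) / 2 for a, b in zip(data[::2], data[1::2])]  # low pass
        diffs = [(a - b) / 2 for a, b in zip(data[::2], data[1::2])]  # high pass
        out = diffs + out        # keep the detail coefficients from this level
        data = avgs              # recurse on the smoothed half
    return data + out            # overall average, then coarse-to-fine details

print(haar_dwt([4, 6, 10, 8, 1, 9, 5, 3]))
# [5.75, 1.25, -2.0, 0.5, -1.0, 1.0, -4.0, 1.0]
```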
Example of DWT Based Image Compression
DWT compression for test image Lenna (threshold = 1)
Summary
Data Exploration and Data Preprocessing
Data and Attributes
Data exploration
Descriptive statistics
Data visualization
Data pre-processing
Data cleaning
Data integration
Data transformation
Data reduction
Next lecture
Frequent itemsets mining and association analysis