cs570 introduction to data mining - emory university
TRANSCRIPT
Data Exploration and Data Preprocessing
- Data and Attributes
- Data exploration
- Data pre-processing
  - Data cleaning
  - Data integration
  - Data transformation
  - Data reduction
Data Mining: Concepts and Techniques
Data Transformation
- Aggregation: summarization (data reduction)
  - E.g. daily sales -> monthly sales
- Discretization and generalization
  - E.g. age -> youth, middle-aged, senior
- (Statistical) Normalization: scaled to fall within a small, specified range
  - E.g. income vs. age
- Attribute construction: construct new attributes from given ones
  - E.g. birthday -> age
January 25, 2011
Data Aggregation
- Data cubes store multidimensional aggregated information
- Multiple levels of aggregation for analysis at multiple granularities
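The daily-to-monthly aggregation mentioned earlier can be sketched in a few lines of Python. This is only an illustration; the function and data layout are my own, not from the slides.

```python
from collections import defaultdict

def aggregate_monthly(daily_sales):
    """Summarize daily sales into monthly totals (data reduction by aggregation)."""
    # daily_sales: iterable of (date_string 'YYYY-MM-DD', amount)
    monthly = defaultdict(float)
    for date, amount in daily_sales:
        monthly[date[:7]] += amount  # group by the 'YYYY-MM' prefix
    return dict(monthly)
```

The reduced table has one row per month instead of one per day, which is exactly the multi-granularity idea behind data cubes.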
Normalization
- Scaled to fall within a small, specified range
- Min-max normalization: maps [min_A, max_A] to [new_min_A, new_max_A]:

  v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A

  - Ex. Let income range [$12,000, $98,000] be normalized to [0.0, 1.0]. Then $73,600 is mapped to

    (73,600 - 12,000) / (98,000 - 12,000) * (1.0 - 0) + 0 = 0.716
- Z-score normalization (µ: mean, σ: standard deviation):

  v' = (v - µ_A) / σ_A

  - Ex. Let µ = 54,000, σ = 16,000. Then $73,600 is mapped to

    (73,600 - 54,000) / 16,000 = 1.225

- Normalization by decimal scaling:

  v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
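The three normalizations can be sketched in Python as follows; function names are my own, and the values reproduce the income example from the slide (min-max gives 0.716, z-score gives 1.225).

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    # v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    # v' = (v - mu_A) / sigma_A
    return (v - mean_a) / std_a

def decimal_scaling(values):
    # v' = v / 10^j, with j the smallest integer such that max(|v'|) < 1
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j
```

For the income range above, decimal scaling would use j = 5 (since 98,000 / 10^5 = 0.98 < 1).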
Discretization and Generalization
- Discretization: transform continuous attributes into discrete counterparts (intervals)
  - Supervised vs. unsupervised
  - Split (top-down) vs. merge (bottom-up)
- Generalization: generalize/replace low-level concepts (such as age ranges) by higher-level concepts (such as young, middle-aged, or senior)
Discretization Methods
- Binning or histogram analysis
  - Unsupervised, top-down split
- Clustering analysis
  - Unsupervised, either top-down split or bottom-up merge
- Entropy-based discretization
  - Supervised, top-down split
Entropy-Based Discretization
- Entropy based on the class distribution of the samples in a set S1 (m classes; p_i is the probability of class i in S1):

  Entropy(S1) = - Σ_{i=1..m} p_i log2(p_i)

- Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T (so |S| = |S1| + |S2|), the class information entropy after partitioning is

  I(S, T) = (|S1| / |S|) Entropy(S1) + (|S2| / |S|) Entropy(S2)

- The boundary that minimizes the entropy function is selected for binary discretization
- The process is recursively applied to the resulting partitions
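A minimal sketch of entropy-based binary discretization in Python, assuming candidate boundaries T are midpoints between adjacent sorted values (function names are mine):

```python
import math
from collections import Counter

def entropy(labels):
    # Entropy(S) = -sum_i p_i * log2(p_i) over the class distribution of S
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_boundary(values, labels):
    # Evaluate each candidate boundary T and pick the one minimizing
    # I(S,T) = |S1|/|S| * Entropy(S1) + |S2|/|S| * Entropy(S2)
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_t, best_i = None, float("inf")
    for k in range(1, n):
        t = (pairs[k - 1][0] + pairs[k][0]) / 2
        s1 = [label for v, label in pairs if v <= t]
        s2 = [label for v, label in pairs if v > t]
        i_st = len(s1) / n * entropy(s1) + len(s2) / n * entropy(s2)
        if i_st < best_i:
            best_t, best_i = t, i_st
    return best_t, best_i
```

Recursive application of `best_boundary` to each resulting partition yields the full supervised discretization.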
Information Entropy
- Information entropy: a measure of the uncertainty associated with a random variable
- Quantifies the information contained in a message: the minimum message length (# of bits) needed to communicate it
- Illustrative example:
  - P(X=A) = 1/4, P(X=B) = 1/4, P(X=C) = 1/4, P(X=D) = 1/4
    - BAACBADCDADDDA...
    - Minimum 2 bits per symbol (e.g. A = 00, B = 01, C = 10, D = 11)
    - 0100001001001110110011111100...
  - What if P(X=A) = 1/2, P(X=B) = 1/4, P(X=C) = 1/8, P(X=D) = 1/8?
    - Minimum # of bits? E.g. A = 0, B = 10, C = 110, D = 111
- High entropy vs. low entropy
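The two distributions above can be checked numerically. The uniform distribution has entropy 2.0 bits (the fixed 2-bit code is optimal), while the skewed one has entropy 1.75 bits, which the variable-length code A = 0, B = 10, C = 110, D = 111 achieves exactly:

```python
import math

def information_entropy(probs):
    # H(X) = -sum p * log2(p): the minimum average number of bits per symbol
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = information_entropy([0.25, 0.25, 0.25, 0.25])    # 2.0 bits
skewed = information_entropy([0.5, 0.25, 0.125, 0.125])    # 1.75 bits
avg_code_length = 0.5 * 1 + 0.25 * 2 + 0.125 * 3 + 0.125 * 3  # 1.75, matches
```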
Generalization for Categorical Attributes
- Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts
  - street < city < state < country
- Specification of a hierarchy for a set of values by explicit data grouping
  - {Atlanta, Savannah, Columbus} < Georgia
- Automatic generation of hierarchies (or attribute levels) by analysis of the number of distinct values
  - E.g., for a set of attributes: {street, city, state, country}
Automatic Concept Hierarchy Generation
- Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set
- The attribute with the most distinct values is placed at the lowest level of the hierarchy
- Exceptions, e.g., weekday, month, quarter, year

  country: 15 distinct values
  province_or_state: 365 distinct values
  city: 3,567 distinct values
  street: 674,339 distinct values
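The distinct-value heuristic is a one-liner in Python. This sketch (names and sample data are illustrative) orders attributes from most general to most specific:

```python
def auto_hierarchy(columns):
    # columns: dict mapping attribute name -> list of values
    # Heuristic from the slide: fewer distinct values -> higher (more general) level
    return sorted(columns, key=lambda attr: len(set(columns[attr])))
```

Exceptions such as weekday (only 7 distinct values, but a low concept level) would still need manual handling, as the slide notes.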
Data Exploration and Data Preprocessing
- Data and Attributes
- Data exploration
- Data pre-processing
  - Data cleaning
  - Data integration
  - Data transformation
  - Data reduction
Data Reduction
- Why data reduction?
  - A database/data warehouse may store terabytes of data
    - Number of data points
    - Number of dimensions
  - Complex data analysis/mining may take a very long time to run on the complete data set
- Data reduction
  - Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
Data Reduction
- Instance reduction
  - Sampling (instance selection)
  - Numerosity reduction
- Dimension reduction
  - Feature selection
  - Feature extraction
Instance Reduction: Sampling
- Sampling: obtaining a small representative sample s to represent the whole data set N
- A sample is representative if it has approximately the same property (of interest) as the original set of data
- Statisticians sample because obtaining the entire set of data is too expensive or time consuming
- Data miners sample because processing the entire set of data is too expensive or time consuming
- Key choices: sampling method and sampling size
Why Sampling?
A statistics professor was describing sampling theory.
Student: I don't believe it, why not study the whole population in the first place?
The professor continued explaining sampling methods, the central limit theorem, etc.
Student: Too much theory, too risky, I couldn't trust just a few numbers in place of ALL of them.
The professor explained the Nielsen television ratings.
Student: You mean that just a sample of a few thousand can tell us exactly what over 250 MILLION people are doing?
Professor: Well, the next time you go to the campus clinic and they want to do a blood test... tell them that's not good enough... tell them to TAKE IT ALL!!
Sampling Methods
- Simple random sampling
  - There is an equal probability of selecting any particular item
- Stratified sampling
  - Split the data into several partitions (strata); then draw random samples from each partition
- Cluster sampling
  - Used when "natural" groupings are evident in a statistical population
- Sampling without replacement
  - As each item is selected, it is removed from the population
- Sampling with replacement
  - Objects are not removed from the population as they are selected for the sample; the same object can be picked more than once
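The sampling schemes above map directly onto Python's standard library. A minimal sketch (the stratified version draws the same fraction from every stratum; function names are mine):

```python
import random

def sample_with_replacement(data, k):
    # The same object can be picked more than once
    return [random.choice(data) for _ in range(k)]

def sample_without_replacement(data, k):
    # Each selected item is effectively removed from the population
    return random.sample(data, k)

def stratified_sample(data, stratum_of, frac):
    # Partition into strata, then draw the same fraction from each stratum
    strata = {}
    for item in data:
        strata.setdefault(stratum_of(item), []).append(item)
    sample = []
    for group in strata.values():
        sample.extend(random.sample(group, max(1, round(len(group) * frac))))
    return sample
```

Stratification guarantees that small strata are represented, which simple random sampling may miss on skewed data.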
Data Reduction
- Instance reduction
  - Sampling (instance selection)
  - Numerosity reduction
- Dimension reduction
  - Feature selection
  - Feature extraction
Numerosity Reduction
- Reduce data volume by choosing alternative, smaller forms of data representation
- Parametric methods
  - Assume the data fits some model, estimate model parameters, store only the parameters, and discard the data (except possible outliers)
  - E.g. regression
- Non-parametric methods
  - Do not assume models
  - Major families: histograms, clustering
Regression Analysis
- Assume the data fits some model and estimate model parameters
- Linear regression: Y = b0 + b1*X1 + b2*X2 + ... + bP*XP
  - Line fitting: Y = b1*X + b0
  - Polynomial fitting: Y = b2*X^2 + b1*X + b0
- Regression techniques
  - Least squares fitting
    - Vertical vs. perpendicular offsets
  - Outliers
    - Robust regression
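Least-squares line fitting (using vertical offsets) can be sketched as follows; the closed-form slope is the covariance over the variance of X:

```python
def fit_line(xs, ys):
    # Least-squares fit of Y = b1*X + b0, minimizing vertical offsets
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    b0 = mean_y - b1 * mean_x
    return b0, b1
```

For numerosity reduction, only the two parameters (b0, b1) need to be stored; the raw points can be discarded (except possible outliers).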
Instance Reduction: Histograms
- Divide data into buckets and store the average (or sum) for each bucket
- Partitioning rules:
  - Equi-width: equal bucket range
  - Equi-depth: equal frequency
  - V-optimal: the least frequency variance
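The first two partitioning rules can be sketched directly (the equi-depth version assumes, for simplicity, that the number of values is divisible by the number of buckets):

```python
def equi_width(values, k):
    # Equal bucket range: each bucket spans (max - min) / k
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    buckets = [[] for _ in range(k)]
    for v in values:
        i = min(int((v - lo) / width), k - 1)  # clamp the maximum into the last bucket
        buckets[i].append(v)
    return buckets

def equi_depth(values, k):
    # Equal frequency: each bucket holds len(values) / k items
    ordered = sorted(values)
    size = len(ordered) // k
    return [ordered[i * size:(i + 1) * size] for i in range(k)]
```

V-optimal partitioning is harder: it requires a dynamic-programming search over boundaries to minimize the within-bucket frequency variance.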
Instance Reduction: Clustering
- Partition the data set into clusters based on similarity, and store only the cluster representations (e.g., centroid and diameter)
- Can be very effective if the data is clustered, but not if the data is "smeared"
- Clusterings can be hierarchical and stored in multi-dimensional index tree structures
- Cluster analysis will be studied in depth later
Data Reduction
- Instance reduction
  - Sampling (instance selection)
  - Numerosity reduction
- Dimension reduction
  - Feature selection
  - Feature extraction
Feature Subset Selection
- Select a subset of features such that the resulting data does not affect the mining result
- Redundant features
  - Duplicate much or all of the information contained in one or more other attributes
  - Example: the purchase price of a product and the amount of sales tax paid
- Irrelevant features
  - Contain no information that is useful for the data mining task at hand
  - Example: students' IDs are often irrelevant to the task of predicting students' GPA
Correlation Analysis (Numerical Data)
- Correlation coefficient (also called Pearson's product moment coefficient):

  r_{A,B} = Σ(a_i - mean_A)(b_i - mean_B) / ((n-1) σ_A σ_B) = (Σ(AB) - n mean_A mean_B) / ((n-1) σ_A σ_B)

  where n is the number of tuples, mean_A and mean_B are the respective means of A and B, σ_A and σ_B are the respective standard deviations of A and B, and Σ(AB) is the sum of the AB cross-product.
- r_{A,B} > 0: A and B are positively correlated (A's values increase as B's do)
- r_{A,B} = 0: uncorrelated (no linear relationship; this does not by itself imply independence)
- r_{A,B} < 0: negatively correlated
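A direct Python sketch of the sample correlation coefficient (function name is mine):

```python
import math

def pearson_r(a, b):
    # r_{A,B} = sum((a_i - mean_A)(b_i - mean_B)) / ((n-1) * sigma_A * sigma_B)
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)
    std_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    std_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    return cov / (std_a * std_b)
```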
Visually Evaluating Correlation
[Figure: scatter plots showing Pearson correlation values ranging from -1 to 1]
Correlation Analysis (Categorical Data)
- χ² (chi-square) test:

  χ² = Σ (Observed - Expected)² / Expected

- The larger the χ² value, the more likely the variables are related
- The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count
Chi-Square Calculation: An Example

                            Play chess   Not play chess   Sum (row)
  Like science fiction      250 (90)     200 (360)        450
  Not like science fiction  50 (210)     1000 (840)       1050
  Sum (col.)                300          1200             1500

- χ² (chi-square) calculation (numbers in parentheses are expected counts calculated based on the data distribution in the two categories):

  χ² = (250-90)²/90 + (50-210)²/210 + (200-360)²/360 + (1000-840)²/840 = 507.93

- It shows that like_science_fiction and play_chess are correlated in the group (a χ² of 10.828 is needed to reject the independence hypothesis)
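The calculation generalizes to any contingency table, with expected counts derived from the row and column totals. A sketch that reproduces the value above:

```python
def chi_square(table):
    # table: 2D list of observed counts
    # expected[i][j] = row_total[i] * col_total[j] / grand_total
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    return chi2
```

For the slide's table, `chi_square([[250, 200], [50, 1000]])` gives approximately 507.93.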
Metrics of (In)dependence
- Mutual information: the mutual dependence between two attributes
  - What's the mutual information between two completely independent attributes? (It is zero)
- Kullback–Leibler divergence: asymmetric
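Mutual information over empirical distributions can be sketched as follows; for independent attributes p(x,y) = p(x)p(y), so every log term vanishes and the result is 0:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    # I(X;Y) = sum over (x,y) of p(x,y) * log2( p(x,y) / (p(x) * p(y)) )
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())
```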
Feature Selection
- Brute-force approach:
  - Try all possible feature subsets
- Heuristic methods
  - Step-wise forward selection
  - Step-wise backward elimination
  - Combining forward selection and backward elimination
Feature Selection
- Filter approaches:
  - Features are selected independently of the data mining algorithm
  - E.g. minimal pair-wise correlation/dependence, top-k information entropy
- Wrapper approaches:
  - Use the data mining algorithm as a black box to find the best subset
  - E.g. best classification accuracy
- Embedded approaches:
  - Feature selection occurs naturally as part of the data mining algorithm
  - E.g. decision tree classification
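Step-wise forward selection in the wrapper style can be sketched as a greedy loop around a black-box `score` function (a stand-in for, e.g., cross-validated classification accuracy; the function and its signature are illustrative):

```python
def forward_selection(features, score):
    # Step-wise forward selection: greedily add the feature that most
    # improves score(subset); stop when no feature helps.
    selected, best = [], score([])
    remaining = list(features)
    while remaining:
        candidate = max(remaining, key=lambda f: score(selected + [f]))
        new_score = score(selected + [candidate])
        if new_score <= best:
            break  # no remaining feature improves the score
        selected.append(candidate)
        remaining.remove(candidate)
        best = new_score
    return selected
```

Backward elimination is the mirror image: start from all features and greedily drop the one whose removal hurts the score least.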