data mining - emory universitylxiong/cs570s08/share/slides/02.pdf · 2009-07-22 · january 24,...

Data Mining: Concepts and Techniques 1

Data Mining:Concepts and Techniques

— Chapter 2 —

Original Slides: Jiawei Han and Micheline Kamber

Modification: Li Xiong

Chapter 2: Data Preprocessing

Why preprocess the data?

Descriptive data summarization

Data cleaning

Data integration and transformation

Data reduction

Discretization and concept hierarchy generation

Summary

Why Data Preprocessing?

Data in the real world is dirty

incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data

e.g., occupation=“ ”

noisy: containing errors or outliers

e.g., Salary=“-10”

inconsistent: containing discrepancies in codes or names

e.g., Age=“42” Birthday=“03/07/1997”

e.g., Was rating “1,2,3”, now rating “A, B, C”

e.g., discrepancy between duplicate records

January 24, 2008 Data Mining: Concepts and Techniques 4

Why Is Data Dirty?

Incomplete data may come from

“Not applicable” data value when collected

Different considerations between the time when the data was collected and when it is analyzed

Human/hardware/software problems

Noisy data (incorrect values) may come from

Faulty data collection instruments

Human or computer error at data entry

Errors in data transmission

Inconsistent data may come from

Different data sources

Functional dependency violation (e.g., modify some linked data)

Duplicate records also need data cleaning

Multi-Dimensional Measure of Data Quality

A well-accepted multidimensional view:

Accuracy

Completeness

Consistency

Timeliness

Believability

Value added

Interpretability

Accessibility

Broad categories:

Intrinsic, contextual, representational, and accessibility

Major Tasks in Data Preprocessing

Data cleaning

Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies

Data integration

Integration of multiple databases, data cubes, or files

Data transformation

Normalization and aggregation

Data reduction

Obtains reduced representation in volume but produces the same or similar analytical results

Data discretization

Part of data reduction but with particular importance, especially for numerical data

Forms of Data Preprocessing

Chapter 2: Data Preprocessing

Why preprocess the data?

Descriptive data summarization

Data cleaning

Data integration and transformation

Data reduction

Discretization and concept hierarchy generation

Summary

Descriptive Data Summarization

Motivation

To better understand the data

Descriptive statistics: describe basic features of data

Graphical description

Tabular description

Summary statistics

Descriptive data summarization

Measuring central tendency – how data seem similar

Measuring statistical variability or dispersion of data – how data differ

Graphic display of descriptive data summarization

Measuring the Central Tendency

Mean (sample vs. population):

\[ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N} \]

Weighted arithmetic mean:

\[ \bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \]

Trimmed mean: chopping extreme values

Median

Middle value if odd number of values, or average of the middle two values otherwise

Estimated by interpolation (for grouped data):

\[ median = L_1 + \left( \frac{n/2 - (\sum f)_l}{f_{median}} \right) c \]

Mode

Value that occurs most frequently in the data

Unimodal, bimodal, trimodal

Empirical formula:

\[ mean - mode = 3 \times (mean - median) \]
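The central-tendency measures on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than part of the original slides: the function names and the `salaries` values are made up for the example, and the trimmed mean assumes the same fraction is chopped from each end.

```python
from collections import Counter


def mean(xs):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(xs) / len(xs)


def weighted_mean(xs, ws):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(xs, ws)) / sum(ws)


def trimmed_mean(xs, fraction):
    """Mean after chopping `fraction` of the values from each end (assumed symmetric)."""
    k = int(len(xs) * fraction)
    kept = sorted(xs)[k:len(xs) - k]
    return mean(kept)


def median(xs):
    """Middle value for odd n, average of the two middle values otherwise."""
    s = sorted(xs)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2


def modes(xs):
    """All values with the highest frequency (one entry = unimodal, two = bimodal, ...)."""
    counts = Counter(xs)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


# Illustrative data only (not from the slides).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(mean(salaries), median(salaries), modes(salaries))
print(trimmed_mean(salaries, 0.1))
```

On this sample the single large value pulls the mean above the median, and trimming one value from each end pulls the mean back toward the median.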

Symmetric vs. Skewed Data

Median, mean and mode of symmetric, positively and negatively skewed data

[Figure: distribution curves with the positions of the mean, median and mode marked]

Computational Issues

Different types of measures

Distributive measure – can be computed by partitioning the data into smaller subsets. E.g. sum, count

Algebraic measure – can be computed by applying an algebraic function to one or more distributive measures. E.g. ?

Holistic measure – must be computed on the entire dataset as a whole. E.g. ?

Selection algorithm: finding the kth smallest number in a list

E.g. min, max, median

Selection by sorting: O(n log n)

Linear-time algorithms based on quicksort partitioning (e.g. quickselect): O(n) on average
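A selection algorithm based on quicksort-style partitioning can be sketched as follows. The slide does not name a specific algorithm; quickselect with a random pivot is assumed here, and it runs in O(n) expected time (not worst case). The function name and the 1-based `k` are choices made for this example.

```python
import random


def quickselect(values, k):
    """Return the k-th smallest element of `values` (k = 1 is the minimum)."""
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and len(values)")
    items = list(values)
    while True:
        pivot = random.choice(items)
        lows = [v for v in items if v < pivot]
        pivots = [v for v in items if v == pivot]
        if k <= len(lows):
            # The answer lies among the values smaller than the pivot.
            items = lows
        elif k <= len(lows) + len(pivots):
            # The k-th smallest value is the pivot itself.
            return pivot
        else:
            # Skip the lows and the pivots, then search the larger values.
            k -= len(lows) + len(pivots)
            items = [v for v in items if v > pivot]


data = [7, 2, 9, 4, 4, 1]
print(quickselect(data, 1), quickselect(data, 3), quickselect(data, len(data)))
```

Unlike selection by sorting, each round only recurses into one side of the partition, which is where the linear expected cost comes from.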

The Long Tail

Long tail: low-frequency population (e.g. wealth distribution)

The Long Tail: the current and future business and economic models

Previous empirical studies: Amazon, Netflix

Products that are in low demand or have low sales volume can collectively make up a market share that rivals or exceeds the relatively few current bestsellers and blockbusters

The primary value of the internet: providing access to products in the long tail

Business and social implications

mass market retailers: Amazon, Netflix, eBay

content producers: YouTube

The Long Tail. Chris Anderson, Wired, Oct. 2004

The Long Tail: Why the Future of Business is Selling Less of More. Chris Anderson. 2006

Measuring the Dispersion of Data

Dispersion or variance: the degree to which numerical data tend to spread

Range and Quartiles

Range: difference between the largest and smallest values

Percentile: the value of a variable below which a certain percent of data fall

(algebraic or holistic?)

Quartiles: Q1 (25th percentile), Median (50th percentile), Q3 (75th percentile)

Inter-quartile range: IQR = Q3 – Q1

Five number summary: min, Q1, M, Q3, max (Boxplot)

Outlier: usually, a value at least 1.5 × IQR above Q3 or below Q1

Variance and standard deviation (sample: s, population: σ)

Variance: sample vs. population (algebraic or holistic?)

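Standard deviation s (or σ) is the square root of variance s² (or σ²)

\[ s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right] \]

\[ \sigma^2 = \frac{1}{N}\sum_{i=1}^{n}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{n} x_i^2 - \mu^2 \]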
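The dispersion measures above can be sketched in Python as follows. Two assumptions are made that the slides do not state: quartiles are computed with the standard library's `statistics.quantiles` using `method="inclusive"` (other quartile conventions give slightly different values), and the outlier test uses strict comparisons so that a zero IQR does not flag the quartiles themselves. The variance functions use only the count, the sum and the sum of squares, which is what makes variance an algebraic measure.

```python
import statistics


def five_number_summary(xs):
    """Return (min, Q1, median, Q3, max) using the 'inclusive' quartile convention."""
    q1, med, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return min(xs), q1, med, q3, max(xs)


def iqr_outliers(xs):
    """Values more than 1.5 * IQR above Q3 or below Q1."""
    _, q1, _, q3, _ = five_number_summary(xs)
    iqr = q3 - q1
    return [x for x in xs if x > q3 + 1.5 * iqr or x < q1 - 1.5 * iqr]


def sample_variance(xs):
    """s^2 computed from n, sum(x) and sum(x^2) only."""
    n = len(xs)
    total = sum(xs)
    total_sq = sum(x * x for x in xs)
    return (total_sq - total * total / n) / (n - 1)


def population_variance(xs):
    """sigma^2 = (1/N) * sum(x^2) - mu^2."""
    n = len(xs)
    mu = sum(xs) / n
    return sum(x * x for x in xs) / n - mu * mu


# Illustrative data: the price list used later in the slides plus one extreme value.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34, 200]
print(five_number_summary(prices))
print(iqr_outliers(prices))
print(sample_variance(prices) ** 0.5)  # sample standard deviation
```

The sum-of-squares form needs a single pass over the data, but it can lose precision when the values are large relative to their spread.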

Graphic Displays of Basic Statistical Descriptions

Histogram

Boxplot

Quantile plot

Quantile-quantile (q-q) plot

Scatter plot

Loess (local regression) curve

Histogram Analysis

Graphical display of tabulated frequencies

univariate graphical method (one attribute)

data partitioned into disjoint buckets (typically equal-width)

a set of rectangles that reflect the counts or frequencies of values at the bucket

Bar chart for categorical values
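The tabulation behind an equal-width histogram can be sketched as below. This is only an illustration: the function name is hypothetical, the buckets are half-open intervals except the last (which also includes the maximum), and the text rendering stands in for the rectangles of a real plot.

```python
def equal_width_histogram(values, num_buckets):
    """Return (bucket_edges, counts) for equal-width buckets over [min, max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; bucket width would be zero")
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    for v in values:
        # The maximum value falls into the last bucket rather than a new one.
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(num_buckets + 1)]
    return edges, counts


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
edges, counts = equal_width_histogram(prices, 3)
for left, right, count in zip(edges, edges[1:], counts):
    print(f"[{left:5.1f}, {right:5.1f}) {'#' * count}")
```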

Boxplot Analysis

Visualizes five-number summary:

The ends of the box are the first and third quartiles (Q1 and Q3), i.e., the height of the box is the IQR

The median (M) is marked by a line within the box

Whiskers: two lines outside the box extend to Minimum and Maximum

Example Boxplot: Profit Analysis

Quantile Plot

Displays all of the data for the given attribute

Plots quantile information

Each data point (x_i, f_i) indicates that approximately 100 f_i% of the data are below or equal to the value x_i
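The points of a quantile plot can be computed as in the sketch below. The slide does not say how f_i is obtained; the common convention f_i = (i - 0.5) / n for the i-th smallest value is assumed here, and the function name is made up for the example.

```python
def quantile_plot_points(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / n (assumed convention)."""
    xs = sorted(values)
    n = len(xs)
    return [(x, (i - 0.5) / n) for i, x in enumerate(xs, start=1)]


for x, f in quantile_plot_points([30, 10, 20, 40]):
    print(f"x = {x}, f = {f:.3f}")
```

Plotting x against f for every observation gives the quantile plot; every data value appears, which is why the plot shows all of the data for the attribute.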

Quantile-Quantile (Q-Q) Plot

Graphs the quantiles of one univariate distribution against the corresponding quantiles of another

Used to diagnose differences between two probability distributions

Scatter plot

Displays values for two numerical attributes (bivariate data)

Each pair of values plotted as a point in the plane

can suggest various kinds of correlations between variables with a certain confidence level: positive (rising), negative (falling), or null (uncorrelated).
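A scatter plot suggests correlation visually; one way to put a number on it is the Pearson correlation coefficient. The slide itself does not name a coefficient, so Pearson's r is an assumption made for this sketch, as are the function name and the toy data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # rising pattern
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # falling pattern
print(pearson_r([-1, 0, 1], [1, 0, 1]))        # no linear relationship
```

A value near +1 corresponds to the rising pattern, near -1 to the falling pattern, and near 0 to no linear relationship. The last example still has a clear non-linear pattern, which is one reason to look at the plot as well as the coefficient.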

Example Scatter Plot – Correlation between Wine Consumption and Heart Mortality

[Figure: scatter plot with labeled data points, including France and US]

Positively and Negatively Correlated Data

Not Correlated Data

Loess Curve

Locally weighted scatter plot smoothing to provide better perception of the pattern of dependence

Fitting simple models to localized subsets of the data

Chapter 2: Data Preprocessing

Why preprocess the data?

Descriptive data summarization

Data cleaning

Data integration and transformation

Data reduction

Discretization and concept hierarchy generation

Summary

Data Cleaning

Importance

“Data cleaning is one of the three biggest problems in data warehousing” (Ralph Kimball)

“Data cleaning is the number one problem in data warehousing” (DCI survey)

Data cleaning tasks

Fill in missing values

Identify outliers and smooth out noisy data

Correct inconsistent data

Resolve redundancy caused by data integration

Missing Data

Data is not always available

E.g., many tuples have no recorded value for several attributes, such as customer income in sales data

Missing data may be due to

equipment malfunction

data inconsistent with other recorded data and thus deleted

data not entered due to misunderstanding

certain data not considered important at the time of entry

history or changes of the data not registered

Missing data may need to be inferred.

How to Handle Missing Values?

Ignore the tuple: usually done when class label is missing (assuming

the tasks in

Fill in the missing value manually

Fill in the missing value automatically

a global constant: e.g., “unknown”, a new class?!

the attribute mean

the attribute mean for all samples belonging to the same class: smarter

the most probable value: inference-based such as Bayesian formula or decision tree (Chap 6)
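The first three automatic strategies can be sketched as follows. The record layout (a list of dicts with `None` marking a missing value), the attribute names and the function names are all assumptions made for this example; the inference-based strategy is not shown.

```python
def fill_constant(rows, attr, constant="unknown"):
    """Replace missing values of `attr` with a global constant."""
    return [{**r, attr: constant if r[attr] is None else r[attr]} for r in rows]


def fill_mean(rows, attr):
    """Replace missing values of `attr` with the mean of the observed values."""
    observed = [r[attr] for r in rows if r[attr] is not None]
    attr_mean = sum(observed) / len(observed)
    return [{**r, attr: attr_mean if r[attr] is None else r[attr]} for r in rows]


def fill_class_mean(rows, attr, label):
    """Replace missing values with the mean of `attr` within the same class.

    A class with no observed values keeps its missing entries as None.
    """
    groups = {}
    for r in rows:
        if r[attr] is not None:
            groups.setdefault(r[label], []).append(r[attr])
    means = {cls: sum(vals) / len(vals) for cls, vals in groups.items()}
    return [
        {**r, attr: means.get(r[label]) if r[attr] is None else r[attr]}
        for r in rows
    ]


# Hypothetical customer records; income is in thousands.
customers = [
    {"income": 40, "risk": "low"},
    {"income": None, "risk": "low"},
    {"income": 60, "risk": "low"},
    {"income": 100, "risk": "high"},
    {"income": None, "risk": "high"},
]
print([r["income"] for r in fill_mean(customers, "income")])
print([r["income"] for r in fill_class_mean(customers, "income", "risk")])
```

The class-conditional mean gives different fill values for the two classes, which is the sense in which the slide calls it "smarter" than the single attribute mean.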

Noisy Data

Noise: random error or variance in a measured variable

Incorrect attribute values may be due to

faulty data collection instruments

data entry problems

data transmission problems

technology limitation

inconsistency in naming convention

Other data problems that require data cleaning

duplicate records

incomplete data

inconsistent data

How to Handle Noisy Data?

Binning and smoothing

sort data and partition into bins (equal-frequency or equal-width)

then smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.

Regression

smooth by fitting the data into a function with regression

Clustering

detect and remove outliers that fall outside clusters

Combined computer and human inspection

detect suspicious values and check by human (e.g., deal with possible outliers)

Simple Discretization Methods: Binning

Equal-width (distance) partitioning

Divides the range into N intervals of equal size: uniform grid

if A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B – A)/N.

The most straightforward, but outliers may dominate presentation

Skewed data is not handled well

Equal-depth (frequency) partitioning

Divides the range into N intervals, each containing approximately the same number of samples

Good data scaling

Managing categorical attributes can be tricky
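The two partitioning schemes can be compared with a short sketch. The function names and the sample data (which contains one deliberate outlier) are made up for the example; the equal-width version uses W = (B - A)/N as defined above.

```python
def equal_width_bins(values, n_bins):
    """Partition values into n_bins intervals of width W = (B - A) / N."""
    a, b = min(values), max(values)
    width = (b - a) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in sorted(values):
        # The highest value B goes into the last bin; zero width puts all in bin 0.
        idx = min(int((v - a) / width), n_bins - 1) if width > 0 else 0
        bins[idx].append(v)
    return bins


def equal_depth_bins(values, n_bins):
    """Partition sorted values into n_bins bins of (approximately) equal size."""
    s = sorted(values)
    n = len(s)
    return [s[i * n // n_bins:(i + 1) * n // n_bins] for i in range(n_bins)]


data = [1, 2, 3, 4, 5, 6, 7, 8, 100]
print(equal_width_bins(data, 3))
print(equal_depth_bins(data, 3))
```

With the outlier 100 in the data, equal-width partitioning puts almost every value in the first bin and leaves the middle bin empty, while equal-depth partitioning still spreads the values evenly. This is the sense in which outliers dominate the equal-width presentation.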

Binning Methods for Data Smoothing

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34

* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29

* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
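The smoothing example can be reproduced with the sketch below. Two details are assumptions: bin means are rounded to the nearest integer (the exact means of bins 2 and 3 are 22.75 and 29.25), and a value equally close to both boundaries is moved to the lower one. Function names are illustrative.

```python
def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value in a bin by the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = min(b), max(b)
        # Ties go to the lower boundary (an assumption, not stated on the slide).
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```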

Regression

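[Figure: data points in the x-y plane with the fitted regression line y = x + 1; the observed value Y1 at X1 is replaced by the value Y1’ on the line]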
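Smoothing by regression can be sketched with an ordinary least-squares line: fit y = slope * x + intercept, then replace each observed y by the fitted value. The slide only shows the fitted line, so least squares is an assumed fitting method; the data points are invented so that the fit comes out as y = x + 1.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def smooth_by_regression(xs, ys):
    """Replace each observed y by the value on the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


xs = [0, 1, 2, 3]
ys = [1.1, 1.9, 2.9, 4.1]          # noisy observations around y = x + 1
print(fit_line(xs, ys))
print(smooth_by_regression(xs, ys))
```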

Cluster Analysis

Chapter 2: Data Preprocessing

Why preprocess the data?

Data cleaning

Data integration

Data transformation

Data reduction

Discretization and concept hierarchy generation

Summary

Data Integration

Data integration: combines data from multiple sources into a unified view

Architectures

Data warehouse (tightly coupled)

Federated database systems (loosely coupled)

Database heterogeneity

Semantic integration

Data Warehouse Approach

[Figure: data warehouse architecture. Sources are loaded through ETL into the warehouse (with metadata); clients run query & analysis against the warehouse]

Advantages and Disadvantages of Data Warehouse

Advantages

High query performance

Can operate when sources unavailable

Extra information at warehouse

Modification, summarization (aggregates), historical information

Local processing at sources unaffected

Disadvantages

Data freshness

Difficult to construct when only having access to query interface of local sources

Federated Database Systems

[Figure: federated database architecture. Clients query a mediator, which reaches each source through a wrapper]

Advantages and Disadvantages of Federated Database Systems

Advantage

No need to copy and store data at mediator

More up-to-date data

Only query interface needed at sources

Disadvantage

Query performance

Source availability

Database Heterogeneity

System Heterogeneity: use of different operating systems and hardware platforms

Schematic or Structural Heterogeneity: the native model or structure used to store data differs across data sources

Syntactic Heterogeneity: differences in the representation format of data

Semantic Heterogeneity: differences in the interpretation of the 'meaning' of data

Semantic Integration

Problem: reconciling semantic heterogeneity

Levels

Schema matching (schema mapping)

e.g., A.cust-id ≡ B.cust-#

Data matching (data deduplication, record linkage, entity/object matching)

e.g., Bill Clinton = William Clinton

Challenges

Semantics inferred from few information sources (data creators, documentation) -> rely on schema and data

Schema and data unreliable and incomplete

Global pair-wise matching computationally expensive

In practice, 60-80% of resources are spent on reconciling semantic heterogeneity in data sharing projects

Schema Matching

Techniques

Rule based

Learning based

Type of matches

1-1 matches vs. complex matches (e.g. list-price = price * (1 + tax_rate))

Information used

Schema information: element names, data types, structures, number of sub-elements, integrity constraints

Data information: value distributions, frequency of words

External evidence: past matches, corpora of schemas

Ontologies. E.g. Gene Ontology

Multi-matcher architecture

Data Matching Or … ?

record linkage

data matching

object identification

entity resolution

entity disambiguation

duplicate detection

record matching

instance identification

deduplication

reference reconciliation

database hardening

…

Data Matching

Techniques

Rule based

Probabilistic Record Linkage (Fellegi and Sunter, 1969)

Similarity between pairs of attributes

Combined scores representing probability of matching

Threshold based decision

Machine learning approaches

New challenges

Complex information spaces

Multiple classes
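The three ingredients listed for probabilistic record linkage (per-attribute similarity, a combined score, a threshold-based decision) can be illustrated with a deliberately simplified sketch. This is not the Fellegi-Sunter model itself, which derives attribute weights from match and non-match probabilities; here the weights, the thresholds, the field names and the use of `difflib.SequenceMatcher` as the similarity function are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """String similarity in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(rec_a, rec_b, weights):
    """Weighted average of per-attribute similarities (hypothetical weights)."""
    total = sum(weights.values())
    return sum(
        w * similarity(rec_a[field], rec_b[field]) for field, w in weights.items()
    ) / total


def classify(score, upper=0.85, lower=0.6):
    """Threshold-based decision with an undecided band for human review."""
    if score >= upper:
        return "match"
    if score < lower:
        return "non-match"
    return "possible match"


weights = {"name": 2.0, "city": 1.0}   # illustrative values only
a = {"name": "Bill Clinton", "city": "Little Rock"}
b = {"name": "William Clinton", "city": "Little Rock"}
score = match_score(a, b, weights)
print(round(score, 3), classify(score))
```

The undecided band between the two thresholds corresponds to pairs that would be handed to a human reviewer.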

Chapter 2: Data Preprocessing

Why preprocess the data?

Data cleaning

Data integration

Data transformation

Data reduction

Discretization and concept hierarchy generation

Summary

Data Transformation

Smoothing: remove noise from data (data cleaning)

Aggregation: summarization

E.g. Daily sales -> monthly sales

Discretization and generalization

E.g. age -> youth, middle-aged, senior

(Statistical) Normalization: scaled to fall within a small, specified range

E.g. income vs. age

Attribute construction: construct new attributes from given ones

E.g. birthday -> age

Data Aggregation

Data cubes store multidimensional aggregated information

Multiple levels of aggregation for analysis at multiple granularities

More on data warehouse and cube computation (chap 3, 4)

Normalization

Min-max normalization: [min_A, max_A] to [new_min_A, new_max_A]

\[ v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A \]

Ex. Let income [$12,000, $98,000] normalized to [0.0, 1.0]. Then $73,600 is mapped to

\[ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716 \]

Z-score normalization (μ: mean, σ: standard deviation):

\[ v' = \frac{v - \mu_A}{\sigma_A} \]

Ex. Let μ = 54,000, σ = 16,000. Then

\[ \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225 \]

Normalization by decimal scaling

\[ v' = \frac{v}{10^{j}} \]

where j is the smallest integer such that max(|v'|) < 1
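The three normalization methods can be checked numerically with the sketch below, using the income value $73,600 that appears in the slide's arithmetic. Function names are illustrative, and the decimal-scaling sample values are made up.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


def z_score(v, mean, std):
    """Number of standard deviations v lies from the mean."""
    return (v - mean) / std


def decimal_scaling(values):
    """Divide by 10^j, with j the smallest integer such that max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


print(round(min_max(73_600, 12_000, 98_000), 3))   # min-max example from the slide
print(round(z_score(73_600, 54_000, 16_000), 3))   # z-score example from the slide
print(decimal_scaling([-986, 917]))                # illustrative values
```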

Chapter 2: Data Preprocessing

Why preprocess the data?

Data cleaning

Data integration and transformation

Data reduction

Discretization and concept hierarchy generation

Summary

Data Reduction

Why data reduction?

A database/data warehouse may store terabytes of data

Complex data analysis/mining may take a very long time to run on the complete data set

Data reduction

Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results

Data reduction strategies

Dimensionality reduction

Feature selection – attribute subset selection

Feature extraction – mapping data to a smaller number of features

Instance reduction

Feature Selection

Select a set of attributes (features) such that the resulting probability distribution is as close as possible to the original distribution given all features

Benefits

Remove irrelevant or redundant attributes

reduce # of attributes in the patterns

Heuristic methods (# of choices?):

Step-wise forward selection

Step-wise backward elimination

Combining forward selection and backward elimination

Decision-tree induction (Chap 6. Classification)
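Step-wise forward selection can be sketched as a greedy loop: start from the empty set and repeatedly add the attribute that improves a score the most, stopping when no attribute helps. The scoring function is left abstract on the slide; the `score` callable, the `min_gain` stopping rule and the toy usefulness values below are all assumptions for illustration.

```python
def forward_selection(features, score, min_gain=1e-9):
    """Greedy step-wise forward selection.

    `score` is a caller-supplied function mapping a list of features to a
    number (higher is better), e.g. cross-validated accuracy of a model.
    """
    selected = []
    best = score(selected)
    remaining = list(features)
    while remaining:
        # Evaluate every remaining feature as the next addition.
        cand_score, cand = max(
            ((score(selected + [f]), f) for f in remaining), key=lambda t: t[0]
        )
        if cand_score - best <= min_gain:
            break  # no remaining feature improves the score
        selected.append(cand)
        remaining.remove(cand)
        best = cand_score
    return selected


# Toy score: each attribute contributes a fixed, made-up amount of usefulness.
usefulness = {"A1": 0.3, "A4": 0.5, "A6": 0.1}


def toy_score(subset):
    return sum(usefulness.get(f, 0.0) for f in subset)


print(forward_selection(["A1", "A2", "A3", "A4", "A5", "A6"], toy_score))
```

The greedy loop evaluates on the order of d² subsets for d attributes instead of all 2^d possible subsets, which is why heuristics are used; the price is that it can miss attribute combinations that are only useful together.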

Example of Decision Tree Induction

Initial attribute set: {A1, A2, A3, A4, A5, A6}

[Figure: decision tree with root test A4?, second-level tests A1? and A6?, and leaves labeled Class 1 and Class 2]

=> Reduced attribute set: {A1, A4, A6}

Feature Extraction

Create new features (attributes) by combining/mapping existing ones

Methods

Principal Component Analysis

Data compression methods – Discrete Wavelet Transform

Regression analysis

Principal Component Analysis (PCA)

Principal component analysis: find the dimensions that capture the most variance

A linear mapping of the data to a new coordinate system such that the greatest variance lies on the first coordinate (the first principal component), the second greatest variance on the second coordinate, and so on.

Steps

Normalize input data: each attribute falls within the same range

Compute k orthonormal (unit) vectors, i.e., principal components; each input data (vector) is a linear combination of the k principal component vectors

The principal components are sorted in order of decreasing “significance”

Weak components can be eliminated, i.e., those with low variance
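The idea of finding the direction of greatest variance can be sketched without any libraries: center the data, build the covariance matrix, and run power iteration to approximate its dominant eigenvector, which is the first principal component. This is a teaching sketch under stated assumptions: the data is only mean-centered (not rescaled to a common range), only the first component is computed, and power iteration fails if the start vector happens to be orthogonal to that component. Real use would rely on a linear algebra library.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return (unit direction, variance along it) for the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    v = [1.0] * d  # arbitrary start vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # start vector has no component along any variance direction
        v = [x / norm for x in w]
    cov_v = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
    variance = sum(v[i] * cov_v[i] for i in range(d))
    return v, variance


# Points lying exactly on the line x2 = 2 * x1 (illustrative data).
points = [(1, 2), (2, 4), (3, 6), (4, 8)]
direction, variance = first_principal_component(points)
print(direction, variance)
```

For these points all of the variance lies along the direction (1, 2), so a second component would carry zero variance and could be eliminated as a weak component.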

[Figure: Illustration of Principal Component Analysis, showing the original axes X1 and X2 and the principal component axes Y1 and Y2]