DM Lecture 2 (18-09-2012)


    Data Mining: Concepts and Techniques

    Slides for Textbook

    Chapters 2 & 3


    Chapter 3: Data Preprocessing

    Why preprocess the data?

    Data cleaning

    Data integration and transformation

    Data reduction

    Discretization and concept hierarchy generation

    Summary


    Why Data Preprocessing?

    Data in the real world is dirty:

    incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data

    noisy: containing errors or outliers

    inconsistent: containing discrepancies in codes or names

    No quality data, no quality mining results!

    Quality decisions must be based on quality data

    A data warehouse needs consistent integration of quality data


    Multi-Dimensional Measure of Data Quality

    A well-accepted multidimensional view:

    Accuracy

    Completeness

    Consistency

    Timeliness

    Believability

    Value added

    Interpretability

    Accessibility


    Major Tasks in Data Preprocessing

    Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies

    Data integration: integration of multiple databases, data cubes, or files

    Data transformation: normalization and aggregation

    Data reduction: obtains a reduced representation that is much smaller in volume yet produces the same or similar analytical results

    Data discretization: part of data reduction, of particular importance for numerical data


    Forms of data preprocessing


    Chapter 3: Data Preprocessing

    Why preprocess the data?

    Data cleaning

    Data integration and transformation

    Data reduction

    Discretization and concept hierarchy generation

    Summary


    Data Cleaning

    Data cleaning tasks

    Fill in missing values

    Identify outliers and smooth out noisy data

    Correct inconsistent data


    Missing Data

    Data is not always available

    E.g., many tuples have no recorded value for several attributes, such as customer income in sales data

    Missing data may be due to:

    equipment malfunction

    inconsistent with other recorded data and thus deleted

    data not entered due to misunderstanding

    certain data may not be considered important at the time of entry

    history or changes of the data not registered

    Missing data may need to be inferred.


    How to Handle Missing Data?

    Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably

    Fill in the missing value manually: tedious + infeasible?

    Use a global constant to fill in the missing value: e.g., "unknown", a new class?!

    Use the attribute mean to fill in the missing value

    Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter

    Use the most probable value to fill in the missing value: inference-based, such as a Bayesian formula or decision tree
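    The fill-in strategies above can be sketched in a few lines of pandas. This is a minimal illustration rather than anything from the textbook; the toy DataFrame, the attribute income, and the class column cls are made-up names.

    import pandas as pd

    # Toy data with missing income values (hypothetical example)
    df = pd.DataFrame({
        "cls":    ["A", "A", "B", "B", "B"],
        "income": [30.0, None, 50.0, None, 70.0],
    })

    # Global constant: replace every missing value with one sentinel value
    filled_const = df["income"].fillna(-1)

    # Attribute mean: replace missing values with the overall column mean
    filled_mean = df["income"].fillna(df["income"].mean())

    # Class-conditional mean: replace missing values with the mean of the same class
    filled_class_mean = df.groupby("cls")["income"].transform(
        lambda s: s.fillna(s.mean())
    )

    print(filled_const.tolist())
    print(filled_mean.tolist())
    print(filled_class_mean.tolist())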


    Noisy Data

    Noise: random error or variance in a measured variable

    Incorrect attribute values may be due to:

    faulty data collection instruments

    data entry problems

    data transmission problems

    technology limitations

    inconsistency in naming conventions

    Other data problems that require data cleaning:

    duplicate records

    incomplete data

    inconsistent data


    How to Handle Noisy Data?

    Binning method:

    first sort the data and partition it into (equi-depth) bins

    then smooth by bin means, bin medians, or bin boundaries, etc.

    Clustering:

    detect and remove outliers

    Combined computer and human inspection:

    detect suspicious values and have a human check them

    Regression:

    smooth by fitting the data to regression functions


    Simple Discretization Methods: Binning

    Equal-width (distance) partitioning:

    divides the range into N intervals of equal size (uniform grid)

    if A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B - A) / N

    the most straightforward approach, but outliers may dominate the presentation and skewed data is not handled well

    Equal-depth (frequency) partitioning:

    divides the range into N intervals, each containing approximately the same number of samples

    good data scaling

    Managing categorical attributes can be tricky.
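    A minimal Python sketch of both partitioning schemes, assuming pandas is available; the price values are arbitrary and only illustrate the difference between cut (equal-width) and qcut (equal-depth).

    import pandas as pd

    # Hypothetical attribute values
    prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

    # Equal-width partitioning: N intervals of width W = (B - A) / N
    equal_width = pd.cut(prices, bins=3)

    # Equal-depth (frequency) partitioning: each interval holds about the same number of samples
    equal_depth = pd.qcut(prices, q=3)

    print(equal_width.value_counts().sort_index())
    print(equal_depth.value_counts().sort_index())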


    Binning Methods for Data Smoothing

    * Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

    * Partition into (equi-depth) bins:
    - Bin 1: 4, 8, 9, 15
    - Bin 2: 21, 21, 24, 25
    - Bin 3: 26, 28, 29, 34

    * Smoothing by bin means:
    - Bin 1: 9, 9, 9, 9
    - Bin 2: 23, 23, 23, 23
    - Bin 3: 29, 29, 29, 29

    * Smoothing by bin boundaries:
    - Bin 1: 4, 4, 4, 15
    - Bin 2: 21, 21, 25, 25
    - Bin 3: 26, 26, 26, 34
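    The worked example above can be reproduced with a short NumPy sketch (my own illustration, not code from the slides); ties between the two boundaries are broken toward the lower boundary here, which the example data never exercises.

    import numpy as np

    # Sorted prices from the example above, split into 3 equi-depth bins of 4 values
    prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
    bins = prices.reshape(3, 4)

    # Smoothing by bin means: every value becomes its bin's (rounded) mean
    by_means = np.repeat(bins.mean(axis=1).round().astype(int), 4).reshape(3, 4)

    # Smoothing by bin boundaries: every value snaps to the nearer of its bin's min/max
    lo = bins.min(axis=1, keepdims=True)
    hi = bins.max(axis=1, keepdims=True)
    by_bounds = np.where(bins - lo <= hi - bins, lo, hi)

    print(by_means)    # [[ 9  9  9  9] [23 23 23 23] [29 29 29 29]]
    print(by_bounds)   # [[ 4  4  4 15] [21 21 25 25] [26 26 26 34]]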


    Cluster Analysis


    Chapter 3: Data Preprocessing

    Why preprocess the data?

    Data cleaning

    Data integration and transformation

    Data reduction

    Discretization and concept hierarchy generation

    Summary


    Data Integration

    Data integration: combines data from multiple sources into a coherent store

    Schema integration:

    integrate metadata from different sources

    Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#

    Detecting and resolving data value conflicts:

    for the same real-world entity, attribute values from different sources are different

    possible reasons: different representations, different scales, e.g., metric vs. British units


    Handling Redundant Data in Data Integration

    Redundant data often occur when integrating multiple databases:

    The same attribute may have different names in different databases

    One attribute may be a derived attribute in another table, e.g., annual revenue

    Redundant data may be detected by correlation analysis

    Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
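    A small sketch of the correlation check mentioned above, assuming NumPy; the two attribute names and their values are invented, with annual_revenue roughly 12 times monthly_revenue so the pair looks redundant.

    import numpy as np

    # Hypothetical columns from two integrated sources
    monthly_revenue = np.array([10.0, 12.0, 9.5, 15.0, 11.0])
    annual_revenue  = np.array([120.0, 145.0, 113.0, 181.0, 131.0])

    # Pearson correlation coefficient between the two attributes
    r = np.corrcoef(monthly_revenue, annual_revenue)[0, 1]

    # A coefficient close to +1 or -1 suggests one attribute is derivable from the other
    print(f"correlation = {r:.3f}")
    if abs(r) > 0.9:
        print("attributes look redundant; one could be dropped")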


    Data Transformation

    Smoothing: remove noise from data

    Aggregation: summarization, data cube construction

    Generalization: concept hierarchy climbing

    Normalization: scale values to fall within a small, specified range

    min-max normalization

    z-score normalization

    normalization by decimal scaling

    Attribute/feature construction:

    New attributes constructed from the given ones


    Data Transformation: Normalization

    min-max normalization:

    v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A

    z-score normalization:

    v' = \frac{v - mean_A}{stand\_dev_A}

    normalization by decimal scaling:

    v' = \frac{v}{10^j}, where j is the smallest integer such that \max(|v'|) < 1
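    The three normalizations can be sketched with NumPy as follows; this is my own illustration, with an arbitrary value array and an assumed target range [new_min, new_max] = [0, 1] for the min-max case.

    import numpy as np

    v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

    # min-max normalization into the new range [new_min, new_max]
    new_min, new_max = 0.0, 1.0
    v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

    # z-score normalization: subtract the mean, divide by the standard deviation
    v_zscore = (v - v.mean()) / v.std()

    # decimal scaling: divide by 10^j, with j the smallest integer making max(|v'|) < 1
    j = int(np.floor(np.log10(np.abs(v).max()))) + 1
    v_decimal = v / (10 ** j)

    print(v_minmax)
    print(v_zscore)
    print(v_decimal)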


    Chapter 3: Data Preprocessing

    Why preprocess the data?

    Data cleaning

    Data integration and transformation

    Data reduction

    Discretization and concept hierarchy generation

    Summary


    Data Reduction Strategies

    A warehouse may store terabytes of data: complex data analysis/mining may take a very long time to run on the complete data set

    Data reduction:

    Obtains a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results

    Data reduction strategies:

    Data cube aggregation

    Dimensionality reduction

    Numerosity reduction

    Discretization and concept hierarchy generation


    Example of Decision Tree Induction

    Initial attribute set: {A1, A2, A3, A4, A5, A6}

    [Decision tree: the root splits on A4; its branches split on A1 and A6, and each of those leads to leaves labeled Class 1 and Class 2]

    ==> Reduced attribute set: {A1, A4, A6}
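    A rough Python sketch of the idea, not the slides' procedure: induce a small decision tree over all attributes (here with scikit-learn on synthetic data I made up, where only A1, A4, and A6 carry signal) and keep only the attributes that actually appear in the splits.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)

    # Hypothetical data: 6 attributes A1..A6, class depends only on A1, A4, A6
    X = rng.random((200, 6))
    y = ((X[:, 3] > 0.5) & (X[:, 0] > 0.3) | (X[:, 5] > 0.8)).astype(int)

    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

    # Attributes used in at least one split form the reduced attribute set
    used = {f"A{i + 1}" for i in tree.tree_.feature if i >= 0}
    print("Reduced attribute set:", sorted(used))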


    Data Compression

    String compression:

    There are extensive theories and well-tuned algorithms

    Typically lossless

    But only limited manipulation is possible without expansion

    Audio/video compression:

    Typically lossy compression, with progressive refinement

    Sometimes small fragments of the signal can be reconstructed without reconstructing the whole

    Time sequence is not audio:

    Typically short and vary slowly with time


    Data Compression

    [Figure: original data mapped to compressed data and recovered exactly (lossless), versus original data recovered only approximately]


    Histograms

    A popular data reduction technique

    Divide data into buckets and store the average (or sum) for each bucket

    Can be constructed optimally in one dimension using dynamic programming

    Related to quantization problems

    [Figure: example histogram showing counts per bucket]
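    A minimal equal-width histogram in NumPy, using made-up price values: only the per-bucket counts and sums are stored, which is the reduced representation this slide describes.

    import numpy as np

    # Hypothetical values to be reduced into a histogram
    prices = np.array([1200, 3400, 5600, 7100, 2300, 8800, 9900, 4100, 6600, 300])

    # Equal-width buckets over the data range; keep only counts (or sums) per bucket
    counts, edges = np.histogram(prices, bins=5)
    sums, _ = np.histogram(prices, bins=edges, weights=prices)

    for lo, hi, c, s in zip(edges[:-1], edges[1:], counts, sums):
        print(f"[{lo:7.1f}, {hi:7.1f})  count={c}  sum={s:.0f}")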


    Clustering

    Partition the data set into clusters, and store only the cluster representation

    Can be very effective if the data is clustered, but not if the data is smeared

    Can use hierarchical clustering and store the result in multi-dimensional index tree structures

    There are many choices of clustering definitions and clustering algorithms, further detailed in Chapter 8
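    One way to sketch "store only the cluster representation" is with k-means (my own example using scikit-learn on synthetic, well-clustered 2-D points): the reduced representation is a few centroids plus one small label per point instead of all coordinates.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(1)

    # Hypothetical well-clustered 2-D data around three centres
    X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2))
                   for c in ([0, 0], [5, 5], [0, 5])])

    km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)

    # Reduced representation: 3 centroids plus one label per point
    print("centroids:\n", km.cluster_centers_)
    print("labels shape:", km.labels_.shape)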


    Sampling

    Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data

    Choose a representative subset of the data:

    Simple random sampling may have very poor performance in the presence of skew

    Develop adaptive sampling methods

    Stratified sampling:

    Approximate the percentage of each class (or subpopulation of interest) in the overall database

    Used in conjunction with skewed data

    Sampling may not reduce database I/Os (page at a time).
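    A short pandas sketch contrasting simple random sampling with stratified sampling on skewed data; the DataFrame, the class column cls, and the 5% "rare" class are assumptions made for illustration.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(2)

    # Hypothetical skewed data: class "rare" is only about 5% of the rows
    df = pd.DataFrame({
        "cls": np.where(rng.random(1000) < 0.05, "rare", "common"),
        "x": rng.random(1000),
    })

    # Simple random sample: may badly under-represent the rare class
    srs = df.sample(frac=0.1, random_state=2)

    # Stratified sample: take ~10% of each class, preserving class percentages
    strat = df.groupby("cls").sample(frac=0.1, random_state=2)

    print(df["cls"].value_counts(normalize=True))
    print(srs["cls"].value_counts(normalize=True))
    print(strat["cls"].value_counts(normalize=True))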




    Sampling

    [Figure: raw data alongside a cluster/stratified sample]


    Hierarchical Reduction

    Use multi-resolution structures with different degrees of reduction

    Hierarchical clustering is often performed but tends to define partitions of data sets rather than clusters

    Parametric methods are usually not amenable to hierarchical representation

    Hierarchical aggregation:

    An index tree hierarchically divides a data set into partitions by the value range of some attributes

    Each partition can be considered as a bucket

    Thus an index tree with aggregates stored at each node is a hierarchical histogram


    Chapter 3: Data Preprocessing

    Why preprocess the data?

    Data cleaning

    Data integration and transformation

    Data reduction

    Discretization and concept hierarchy generation

    Summary


    Discretization

    Three types of attributes:

    Nominal: values from an unordered set

    Ordinal: values from an ordered set

    Continuous: real numbers

    Discretization:

    divide the range of a continuous attribute into intervals

    Some classification algorithms only accept categorical attributes

    Reduce data size by discretization

    Prepare for further analysis


    Discretization and Concept Hierarchy

    Discretization:

    reduces the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.

    Concept hierarchies:

    reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior).
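    The age example can be sketched with pandas: continuous values are discretized into intervals and then replaced by the higher-level concept labels from this slide. The specific ages and the interval boundaries (30 and 60) are my own assumptions.

    import pandas as pd

    # Hypothetical ages; replace numeric values with higher-level concept labels
    ages = pd.Series([22, 25, 31, 38, 45, 52, 61, 70])

    age_concepts = pd.cut(
        ages,
        bins=[0, 30, 60, 120],                      # assumed interval boundaries
        labels=["young", "middle-aged", "senior"],  # concepts named on the slide
    )
    print(age_concepts.tolist())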


    Specification of a set of attributes

    A concept hierarchy can be automatically generated based on the number of distinct values per attribute in the given attribute set. The attribute with the most distinct values is placed at the lowest level of the hierarchy.

    country: 15 distinct values

    province_or_state: 65 distinct values

    city: 3,567 distinct values

    street: 674,339 distinct values
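    The heuristic on this slide is just a sort by distinct-value count; a minimal Python sketch using the counts listed above (the dictionary and printing format are mine).

    # Distinct-value counts for the location attribute set from the slide
    distinct_counts = {
        "country": 15,
        "province_or_state": 65,
        "city": 3567,
        "street": 674339,
    }

    # Fewest distinct values at the top of the hierarchy, most at the bottom
    hierarchy = sorted(distinct_counts, key=distinct_counts.get)
    print(" < ".join(hierarchy))   # country < province_or_state < city < street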


    Chapter 3: Data Preprocessing

    Why preprocess the data?

    Data cleaning

    Data integration and transformation

    Data reduction

    Discretization and concept hierarchy generation

    Summary