Data Mining: Concepts and Techniques
Data Preprocessing — Chapter 2 —

TRANSCRIPT

Page 1:

Data Preprocessing

— Chapter 2 —

Page 2:

Data Preprocessing Step in the KDD Process

[Figure: Databases → Relevant Data → Data Preprocessing → Data Mining → Patterns → Evaluation/Interpretation → Knowledge]

Page 3:

Main Steps of a KDD Process (Full)

Domain knowledge acquisition: learning relevant prior knowledge and the goals of the application

Data collection and preprocessing (may take 60% of the effort!)
- Data selection and integration: creating a target data set
- Data cleaning, data transformation, and data reduction

Data mining
- Choosing the functions of data mining: association, classification, clustering, regression, summarization, etc.
- Choosing the mining algorithm(s)
- Searching for patterns of interest

Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.

Use of discovered knowledge

Page 4:

Chapter 2: Data Preprocessing

Why preprocess the data?

Data cleaning

Data integration and transformation

Data reduction

Discretization and concept hierarchy generation

Summary

Page 5:

Why Data Preprocessing?

Data in the real world is dirty:
- incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
- noisy: containing errors or outliers
- inconsistent: containing discrepancies in codes or names

No quality data, no quality mining results!
- Quality decisions must be based on quality data
- Data quality is important to data mining

Page 6:

Multi-Dimensional Measure of Data Quality

A well-accepted multidimensional view (people, events, time, place, things): Accuracy, Completeness, Consistency, Timeliness, Believability, Value added, Interpretability, Accessibility

Page 7:

Major Tasks in Data Preprocessing

Data cleaning: process missing values and noisy data

Data integration: integration of multiple files, tables, databases, or DBMSs' DBs

Data transformation: normalization and aggregation

Data reduction: reducing the data representation in volume while producing the same or similar analytical results (especially for Cloud computing)

Data discretization: part of data reduction; reducing the number of attribute values (especially for numerical attributes)

Page 8:

Forms of data preprocessing

Page 9:

Chapter 2: Data Preprocessing

Why preprocess the data?

Data cleaning

Data integration and transformation

Data reduction

Discretization and concept hierarchy generation

Summary

Page 10:

Data Cleaning

Data cleaning tasks mainly include

Fill in missing values

Identify outliers and smooth out noisy data

Correct inconsistent data

Page 11:

Missing Data

Data is not always available
- e.g., many tuples may have no recorded value for several attributes, such as customer income in sales data

Missing data may be due to
- equipment malfunction
- inconsistency with other recorded data, leading to deletion
- data not entered due to misunderstanding
- data not considered important at the time of entry
- no recorded history or changes of the data

Missing data may need to be inferred.

Page 12:

How to Handle Missing Data?

Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably

Fill in the missing value manually: tedious + infeasible?

Use a global constant to fill in the missing value, e.g., "unknown": a new value!

Use the attribute mean to fill in the missing value

Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter

Use the most probable value to fill in the missing value: inference-based, such as a Bayesian formula, decision tree, or linear regression
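
A minimal pandas sketch of the constant, attribute-mean, and class-conditional-mean options; the small sales table with a "cls" class column and missing "income" values is a made-up illustration, not the textbook's own code.

```python
import numpy as np
import pandas as pd

# Hypothetical sales data with missing customer income values
df = pd.DataFrame({
    "cls":    ["low", "low", "high", "high", "high"],
    "income": [30_000, np.nan, 80_000, 75_000, np.nan],
})

# Global constant: mark missing income with a sentinel "unknown" value
by_constant = df["income"].fillna(-1)

# Attribute mean over all tuples
by_mean = df["income"].fillna(df["income"].mean())

# Attribute mean of the samples in the same class (smarter)
by_class_mean = df.groupby("cls")["income"].transform(lambda s: s.fillna(s.mean()))

print(pd.concat([df, by_class_mean.rename("income_filled")], axis=1))
```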

Page 13:

Noisy Data

Noise: random error or variance in a measured variable

Incorrect attribute values may be due to
- faulty data collection instruments
- data entry problems
- data transmission problems
- technology limitations
- inconsistency in naming conventions

Other data problems that require data cleaning
- duplicate records
- incomplete data
- inconsistent data

Page 14:

How to Handle Noisy Data?

Binning method: first sort the data and partition it into (equi-depth) bins; then smooth by bin means, bin medians, or bin boundaries, etc.

Clustering: detect and remove outliers

Combined computer and human inspection: detect suspicious values and have a human check them

Regression: smooth by fitting the data to regression functions

Examples follow on the next slides.

Page 15:

Simple Discretization Methods: Binning

Equal-width (distance) partitioning:
- The most straightforward method
- It divides the range of the attribute values into N intervals of equal size
- If A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B - A)/N
- Outliers may dominate the presentation
- Skewed data cannot be handled well

Equal-depth (frequency) partitioning:
- It divides the range into N intervals, each containing approximately the same number of samples
- Good data scaling
- Managing categorical attributes can be tricky
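
A short NumPy sketch of both partitionings; the price values reused here are the ones from the next slide's example.

```python
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
N = 3

# Equal-width: interval width W = (B - A) / N
A, B = prices.min(), prices.max()
width_edges = A + (B - A) / N * np.arange(N + 1)          # [ 4., 14., 24., 34.]

# Equal-depth: edges at quantiles so each bin holds roughly the same number of samples
depth_edges = np.quantile(prices, np.linspace(0, 1, N + 1))

print(width_edges)
print(depth_edges)
```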

Page 16:

Data Smoothing for Equal-Depth Binning

1) Sorted data for price: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

2) Partition into 3 (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34

3) Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29

Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
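
The same example can be reproduced with a small NumPy sketch (bin means are rounded to integers, as on the slide):

```python
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = np.split(prices, 3)                     # three equi-depth bins of four values

# Smoothing by bin means: replace every value by its bin's (rounded) mean
by_means = [np.full(len(b), int(round(b.mean()))) for b in bins]

# Smoothing by bin boundaries: replace every value by the nearer bin boundary
def smooth_by_boundaries(b):
    lo, hi = b.min(), b.max()
    return np.where(b - lo <= hi - b, lo, hi)

by_bounds = [smooth_by_boundaries(b) for b in bins]

print(by_means)    # [9 9 9 9], [23 23 23 23], [29 29 29 29]
print(by_bounds)   # [4 4 4 15], [21 21 25 25], [26 26 26 34]
```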

Page 17:

Detect Outliers by Cluster Analysis

1. Cluster analysis  2. Outlier detection

[Figure: data points grouped into clusters; points X, Y, and Z lie outside the clusters and are identified as outliers.]

Page 18:

Data Smoothing by Regression

[Figure: 1. Regression: fit the data with the line y = x + 1.  2. Detect the outlier (X1, Y1).  3. Outlier smoothing: replace it with (X1, Y1'), where Y1' is computed from y = x + 1.]
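
A sketch of the same idea in NumPy, using synthetic points near y = x + 1 with one injected outlier; the 3-sigma residual cutoff is an assumption for illustration, not part of the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = x + 1 + rng.normal(scale=0.1, size=50)           # points near y = x + 1
y[10] = 25.0                                         # inject one outlier

a, b = np.polyfit(x, y, deg=1)                       # 1. fit the regression line
residuals = y - (a * x + b)
outliers = np.abs(residuals) > 3 * residuals.std()   # 2. detect outliers
y_smoothed = np.where(outliers, a * x + b, y)        # 3. replace them by fitted values

print(np.flatnonzero(outliers), y_smoothed[10])
```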

Page 19:

Chapter 2: Data Preprocessing

Why preprocess the data?

Data cleaning

Data integration and transformation

Data reduction

Discretization and concept hierarchy generation

Summary

Page 20:

Data Integration

Data integration: combines data from multiple sources into a coherent store (multiple files, databases, or DBMSs' DBs)

Schema (table) integration: integrate metadata from different sources
- Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#

Detecting and resolving data value conflicts
- for the same real-world entity, attribute values from different sources may be different
- possible reasons: different representations, different scales, e.g., metric vs. British units

Page 21:

Handling Redundant Data in Data Integration

Redundant data occur often when integrating multiple databases
- The same attribute may have different names in different databases
- One attribute may be a "derived" attribute in another table, e.g., annual revenue, total price

Redundant data may be detected by correlation analysis

Careful integration of data from multiple sources may help reduce/avoid data redundancies and inconsistencies and improve mining speed and quality

Page 22:

Data Transformation

Smoothing: remove noise from the data

Aggregation: summarization, data cube construction

Generalization: concept hierarchy climbing

Normalization: scale values to fall within a small, specified range
- min-max normalization
- z-score normalization
- normalization by decimal scaling

Attribute/feature construction: new attributes constructed from the given ones

Page 23:

Data Transformation: Normalization

min-max normalization for attribute A:
v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A

z-score normalization for attribute A:
v' = (v - mean_A) / standard_dev_A

normalization by decimal scaling for attribute A:
v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
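
A minimal NumPy sketch of the three normalizations; the sample values and the target range [0, 1] are illustrative assumptions.

```python
import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])   # sample values of attribute A

# min-max normalization into [new_min, new_max]
new_min, new_max = 0.0, 1.0
v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# z-score normalization
v_zscore = (v - v.mean()) / v.std()

# decimal scaling: divide by 10^j with the smallest j such that max |v'| < 1
j = 0
while np.abs(v / 10 ** j).max() >= 1:
    j += 1
v_decimal = v / 10 ** j

print(v_minmax, v_zscore, v_decimal, sep="\n")
```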

Page 24:

Chapter 2: Data Preprocessing

Why preprocess the data?

Data cleaning

Data integration and transformation

Data reduction

Discretization and concept hierarchy generation

Summary

Page 25:

Data Reduction Strategies

A data warehouse/Cloud may store terabytes of data, so complex data analysis/mining may take a very long time to run on the complete data set

Data reduction: obtains a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results

Data reduction strategies:
- Dimensionality reduction
- Numerosity reduction
- Discretization and concept hierarchy generation

Page 26:

Dimensionality Reduction

Feature selection (i.e., attribute subset selection): select a minimum set of features from the original features such that the class probability distribution using the minimal feature set is as close as possible to the distribution using the original features

Heuristic methods (due to the exponential number of choices):
- decision-tree induction
- step-wise forward selection
- step-wise backward elimination
- combined forward selection and backward elimination
- principal component analysis

Page 27:

Example of Decision Tree Induction

Initial attribute set: {A1, A2, A3, A4, A5, A6}

[Figure: decision tree with A4 at the root, A1 and A6 as its children, and leaves labeled Class 1 and Class 2.]

Reduced attribute set: {A1, A4, A6}

Page 28:

Other Heuristic Feature Selection Methods

There are 2^d possible subsets of d features. Several heuristic feature selection methods:

Best/worst single feature: choose by significance tests (under the feature independence assumption)

Best step-wise feature selection:
- The best single feature is picked first
- Then pick the next best feature given the first feature, ...

Step-wise feature elimination: repeatedly eliminate the worst feature

Best combined feature selection and elimination

Optimal branch and bound: use feature elimination and backtracking
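
A sketch of greedy step-wise forward selection; the relevance score (a least-squares R²) and the synthetic data are assumptions made only to keep the example self-contained.

```python
import numpy as np

def forward_selection(X, y, score, k):
    """Repeatedly add the single feature that most improves score(X[:, subset], y)."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < k:
        best_f, best_s = None, -np.inf
        for f in remaining:
            s = score(X[:, selected + [f]], y)
            if s > best_s:
                best_f, best_s = f, s
        selected.append(best_f)
        remaining.remove(best_f)
    return selected

def r2(Xs, y):
    """How well a least-squares fit on the chosen columns explains y (higher is better)."""
    Xb = np.c_[Xs, np.ones(len(y))]
    coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return 1 - (y - Xb @ coef).var() / y.var()

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))
y = 2 * X[:, 0] - X[:, 3] + 0.1 * rng.normal(size=200)
print(forward_selection(X, y, r2, k=2))   # expected: features 0 and 3
```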

Page 29:

Principal Component Analysis

Given N data vectors in k dimensions, find c (c <= k) k-dimensional orthogonal vectors that can best be used to represent the data
- Transformation of the data representation: O(N × k) → O(N × c)
- Each original data vector is represented by a linear combination of the c principal component vectors
- Each original vector is transformed into a c-dimensional vector
- The original data set is reduced to a new data set consisting of N data vectors on c principal components
- Space complexity of the representation: O(N × k) → O(N × c)

Works for numeric data only

Used when the number of dimensions is large

Example
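
A minimal NumPy sketch of the transformation (PCA via SVD of the centered data); the synthetic 5-dimensional data and the choice c = 2 are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 5))                    # N = 500 vectors, k = 5 dimensions
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=500)   # introduce correlated dimensions
X[:, 4] = X[:, 1] - X[:, 0]

c = 2                                            # keep c principal components
Xc = X - X.mean(axis=0)                          # center the data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt[:c]                              # c orthogonal k-dimensional vectors
X_reduced = Xc @ components.T                    # each vector -> a c-dimensional vector

print(X.shape, "->", X_reduced.shape)            # (500, 5) -> (500, 2)
```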

Page 30:

Principal Component Analysis (data representation transformation: X, Y → α, β)

X, Y: old coordinates; α, β: new coordinates

[Figure: points plotted against the original X-Y axes and against the rotated principal-component axes α-β; a point (a, b) in the old coordinates maps to (a', b') in the new coordinates, and (c, d) maps to (c', d').]

Page 31:

Numerosity Reduction(Especially for Big Data Analysis)

Parametric methods
- Assume the data fits a certain model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
- Example: in linear regression, if a set of points (x1, y1), (x2, y2), …, (xn, yn) is approximated by a line Y = α + βX, then the coefficients α and β represent the given points

Non-parametric methods
- Do not assume models
- Major families: histograms, clustering, sampling

Example

Page 32:

Numerosity Reduction by Regression

Given n sets (classes) of data: S1, S2, …, Si, …, Sn

Linear regression:
- The i-th set of data is modeled to fit a straight line: Y = ai X + bi
- Therefore, Si is represented by <ai, bi>
- The n data sets are translated into <a1, b1>, <a2, b2>, …, <an, bn>

Multiple regression:
- Y = a10 + a11 x1 + a12 x2 + … + a1n xn = A1 · X, where A1 = <a10, a11, a12, …, a1n> and X = <1, x1, x2, …, xn>
- The i-th set of data is modeled to fit such a linear function of several variables
- Therefore, Si is represented by Ai
- The n data sets are translated into n coefficient vectors: A1, A2, …, An
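
A small NumPy sketch of the linear-regression case: each hypothetical data set S_i is replaced by its fitted pair <a_i, b_i>.

```python
import numpy as np

rng = np.random.default_rng(3)

# Three hypothetical data sets, each a cloud of (x, y) points near a line
data_sets = []
for a, b in [(2.0, 1.0), (-0.5, 3.0), (1.5, -2.0)]:
    x = rng.uniform(0, 10, size=200)
    y = a * x + b + rng.normal(scale=0.1, size=200)
    data_sets.append((x, y))

# Parametric reduction: store only the coefficients <a_i, b_i> per data set
params = [tuple(np.polyfit(x, y, deg=1)) for x, y in data_sets]
print(params)   # approximately [(2.0, 1.0), (-0.5, 3.0), (1.5, -2.0)]
```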

Page 33:

Numerosity Reduction by Histograms (Non-parametric methods)

A popular data reduction technique

Divide the data into buckets and store the average (or sum) and count for each bucket

Related to quantization problems.

[Figure: equal-width histogram of prices from 10,000 to 100,000, with bucket counts on a 0-40 scale.]
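
A sketch of histogram-based reduction with NumPy, storing only a (count, mean) pair per equal-width bucket; the synthetic prices mimic the figure's 10,000-100,000 range.

```python
import numpy as np

prices = np.random.default_rng(4).integers(10_000, 100_000, size=5_000)

edges = np.linspace(10_000, 100_000, 10)                 # 9 equal-width buckets
idx = np.digitize(prices, edges[1:-1])                   # bucket index of each price
counts = np.bincount(idx, minlength=len(edges) - 1)
sums = np.bincount(idx, weights=prices, minlength=len(edges) - 1)

# The reduced representation: one (low, high, count, mean) tuple per bucket
buckets = list(zip(edges[:-1], edges[1:], counts, sums / counts))
for b in buckets:
    print(b)
```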

Page 34:

Numerosity Reduction by Clustering (Non-parametric methods)

Partition the data set into clusters, and store only the cluster representations

Can be very effective if the data is clustered, but not if the data is "smeared"

There are many choices of clustering definitions and clustering algorithms, further detailed in Chapter 7

Example

Page 35:

Represent Data by Cluster Centers (for Numerosity Reduction)

[Figure: a data set partitioned into three clusters whose centers A, B, and C are stored as the cluster representations of the data set.]
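
A sketch using scikit-learn's KMeans (an assumed, commonly available library) on synthetic 2-D data: only the three centers and their member counts are kept in place of the 300 raw points.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
# Synthetic data forming three groups (stand-ins for the clusters A, B, C)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2))
               for c in [(0, 0), (5, 5), (0, 5)]])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

centers = km.cluster_centers_            # the cluster representations
counts = np.bincount(km.labels_)         # how many points each center stands for
print(centers)
print(counts)
```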

Page 36:

Sampling(Non-parametric methods)

Allows a mining algorithm to run with complexity that depends on the sample size rather than the size of the data

Choose a representative subset of the data
- Note that simple random sampling may perform very poorly in the presence of skew

Adaptive sampling methods
- Stratified sampling: approximate the percentage of each class (or subpopulation) in the overall data; used in conjunction with skewed data

Examples of simple random sampling (2 methods) and stratified sampling follow on the next slides.

Page 37:

Simple Random Sampling

SRSWOR (simple random sampling without replacement)

SRSWR (simple random sampling with replacement): after a sample is drawn, it is placed back so that it may be drawn again

[Figure: samples drawn from the raw data by both methods.]

Page 38:

Example of Stratified Sampling

[Figure: the raw data and the stratified samples drawn from it.]
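
A pandas sketch of the three sampling schemes on a hypothetical, skewed data set (90 tuples of class A, 10 of class B):

```python
import pandas as pd

df = pd.DataFrame({"cls": ["A"] * 90 + ["B"] * 10, "value": range(100)})

srswor = df.sample(n=20, replace=False, random_state=0)   # SRS without replacement
srswr  = df.sample(n=20, replace=True,  random_state=0)   # SRS with replacement

# Stratified sampling: draw the same fraction from every class so that the
# sample approximates the class percentages of the full data
stratified = df.groupby("cls").sample(frac=0.2, random_state=0)

print(stratified["cls"].value_counts())   # 18 of class A, 2 of class B
```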

Page 39:

Chapter 2: Data Preprocessing

Why preprocess the data?

Data cleaning

Data integration and transformation

Data reduction

Discretization and concept hierarchy generation

Summary

Page 40:

Discretization(Reduce the Number of Attribute Values)

Consider three types of attributes:
- Nominal — values from an unordered set
- Ordinal — values from an ordered set
- Continuous — real numbers

Discretization:
- Divide the range of a continuous attribute into intervals
- Each interval is denoted by a new categorical value
- Reduce the total number of attribute values
- Some classification algorithms only accept categorical attributes
- Prepare for further analysis

Page 41:

Discretization and Concept Hierarchy

Reduce the number of different values for a given attribute

Discretization (for continuous attributes): reduce the number of different values of a continuous attribute by dividing its data range into intervals; interval labels are then used to replace the actual data values.

Concept hierarchies (for nominal attributes): reduce the number of different values of a nominal attribute by replacing low-level concepts (such as low-fat milk, 2% milk, coke, orange juice) with higher-level concepts (such as milk, drink).

Page 42:

Discretization and Concept Hierarchy Generation for Numeric Data

Binning (see sections before)

Histogram analysis (see sections before)

Clustering analysis (see sections before)

Entropy-based discretization

Segmentation by natural partitioning

Page 43:

Entropy-Based Discretization

Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T on a continuous attribute A, the entropy after partitioning is

E(S, A, T) = |S1|/|S| * Ent(S1) + |S2|/|S| * Ent(S2)

The boundary that minimizes the entropy function over all possible boundaries is selected as the binary discretization.

The process is applied recursively to the obtained partitions until some stopping criterion is met, e.g., until the information gain Ent(S) - E(S, A, T) falls below a small threshold δ.

Experiments show that it may reduce data size and improve classification accuracy.
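
A sketch of picking the single best boundary T; the tiny value/label arrays are made up for illustration, and the recursion/stopping test is only indicated in the comments.

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def best_boundary(values, labels):
    """Boundary T minimizing E(S, A, T) = |S1|/|S|*Ent(S1) + |S2|/|S|*Ent(S2)."""
    order = np.argsort(values)
    v, y = np.asarray(values)[order], np.asarray(labels)[order]
    best_t, best_e = None, np.inf
    for i in range(1, len(v)):
        if v[i] == v[i - 1]:
            continue                       # only split between distinct values
        t = (v[i] + v[i - 1]) / 2
        e = (i * entropy(y[:i]) + (len(y) - i) * entropy(y[i:])) / len(y)
        if e < best_e:
            best_t, best_e = t, e
    return best_t, best_e

vals = [4, 8, 9, 15, 21, 24, 25, 28]
cls  = ["a", "a", "a", "a", "b", "b", "b", "b"]
t, e = best_boundary(vals, cls)
print(t, e)   # 18.0, 0.0 -- recurse on each side while Ent(S) - E(S, A, T) > delta
```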

Page 44:

Segmentation by natural partitioning

The 3-4-5 rule can be used to segment numeric data into relatively uniform, "natural" intervals.

* If an interval covers 3, 6, 7 or 9 distinct values at the most significant decimal digit, partition the range into 3 equi-width intervals

* If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals

* If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals
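
A rough sketch of the rule for a single level (rounding conventions at the most significant digit vary, so treat this as an approximation rather than the book's algorithm):

```python
import math

def three_four_five(low, high):
    """Split [low, high] into 3, 4, or 5 equi-width intervals per the 3-4-5 rule."""
    msd = 10 ** int(math.floor(math.log10(max(abs(low), abs(high)))))
    lo = math.floor(low / msd) * msd          # round low down at the msd
    hi = math.ceil(high / msd) * msd          # round high up at the msd
    distinct = round((hi - lo) / msd)         # distinct msd values covered
    if distinct in (3, 6, 7, 9):
        n = 3
    elif distinct in (2, 4, 8):
        n = 4
    elif distinct in (1, 5, 10):
        n = 5
    else:
        n = 3                                 # fallback outside the rule's table
    width = (hi - lo) / n
    return [(lo + i * width, lo + (i + 1) * width) for i in range(n)]

# Using the Low/High values of the next slide's profit example:
print(three_four_five(-159, 1838))   # -> (-1000, 0), (0, 1000), (1000, 2000)
```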

Page 45:

Example of 3-4-5 rule(Generation of Concept Hierarchy for Profits)

(-$400 -$5,000)

(-$400 - 0)

(-$400 - -$300)

(-$300 - -$200)

(-$200 - -$100)

(-$100 - 0)

(0 - $1,000)

(0 - $200)

($200 - $400)

($400 - $600)

($600 - $800) ($800 -

$1,000)

($2,000 - $5, 000)

($2,000 - $3,000)

($3,000 - $4,000)

($4,000 - $5,000)

($1,000 - $2, 000)

($1,000 - $1,200)

($1,200 - $1,400)

($1,400 - $1,600)

($1,600 - $1,800) ($1,800 -

$2,000)

msd = 1,000 Low = -$1,000 High = $2,000Step 2:

Step 4:

Step 1: -$351 -$159 profit $1,838 $4,700

Min Low (i.e, 5%-tile) High (i.e, 95%- tile) Max

count

(-$1,000 - $2,000)

(-$1,000 - 0) (0 -$ 1,000)

Step 3:

($1,000 - $2,000) Generate concept hierarchy according to Low and High

Adjust left/right boundaries according to Min and Max

Step 3

Step 4

Page 46:

Automatic Generation of Concept Hierarchy

Alternative Methods

Specification of a partial ordering of attributes explicitly at the schema level by users or experts, e.g., day -> month -> quarter -> year

Specification of a portion of a hierarchy by explicit data grouping, e.g., day -> month, quarter -> year, …

Specification of a set of attributes, but not of their partial ordering, in the hierarchy; the system can then try to generate the concept hierarchy automatically

Page 47:

Specification of a set of attributes

A concept hierarchy can be generated automatically based on the number of distinct values per attribute in the given attribute set. The attribute with the most distinct values is placed at the lowest level of the hierarchy.

- country: 15 distinct values
- province_or_state: 65 distinct values
- city: 3,567 distinct values
- street: 674,339 distinct values

Resulting hierarchy (lowest to highest level): street < city < province_or_state < country
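
A pandas sketch of the heuristic: count the distinct values per attribute and order the hierarchy from most distinct (lowest level) to fewest (highest level). The tiny location table is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "country":           ["CA", "CA", "CA", "US", "US"],
    "province_or_state": ["ON", "ON", "QC", "NY", "NY"],
    "city":              ["Toronto", "Toronto", "Montreal", "New York", "Buffalo"],
    "street":            ["King St", "Queen St", "Rue A", "5th Ave", "Main St"],
})

# Fewer distinct values -> higher level of the concept hierarchy
levels = df.nunique().sort_values()            # ascending: country ... street
print(" < ".join(reversed(levels.index.tolist())))
# street < city < province_or_state < country
```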

Page 48:

Summary

Data preparation is a big issue for both warehousing and mining

Data preparation includes
- Data cleaning, data integration, and data transformation
- Data reduction and feature selection
- Discretization

Many methods have been developed, but data preparation is still an active area of research