2011 Data Mining Industrial & Information Systems Engineering Chapter 2: Overview of Data Mining Process Pilsung Kang Industrial & Information Systems Engineering Seoul National University of Science & Technology


2011 Data Mining
Industrial & Information Systems Engineering

Chapter 2: Overview of Data Mining Process

Pilsung Kang
Industrial & Information Systems Engineering
Seoul National University of Science & Technology


2011 Data Mining, IISE, SNUT

Data Mining Definition Revisited

Extracting useful information from large datasets. (Hand et al., 2001)

Data mining is the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules. (Berry and Linoff, 1997, 2000)

Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques. (Gartner Group, 2004)


Descriptive vs. Predictive (purpose)

Descriptive Modeling

• Look back to the past.

• To extract compact and easily understood information from large, sometimes gigantic databases.

• OLAP (online analytical processing), SQL (structured query language).

Predictive Modeling

• Predict the future.

• Identify strong links between variables of data.

• To predict the unknown consequence (dependent variable) based on the information provided (independent variables).

• y = f(x1, x2, ..., xn) + ε


Supervised vs. Unsupervised (methods)

Supervised Learning

• Goal: predict a single “target” or “outcome” variable.

• Finds relations between X and Y.

• Train (learn) on data where the target value is known.

• Score data where the target value is not known.

Unsupervised Learning

• Explores intrinsic characteristics.

• Estimates underlying distribution.

• Segment data into meaningful groups or detect patterns.

• There is no target (outcome) variable to predict or classify.


Data Mining Techniques

Data Visualization

Graphs and plots of data.

Histograms, boxplots, bar charts, scatterplots.

Especially useful to examine relationships between pairs of variables.

Descriptive & Unsupervised


Data Mining Techniques

Data Reduction

Distillation of complex/large data into simpler/smaller data.

Reducing the number of variables/columns.

• Also called dimensionality reduction (variable selection, variable extraction, e.g., principal component analysis).

Reducing the number of records/rows.

• Also called data compression (e.g., sampling and clustering).

Descriptive & Unsupervised

Data Visualization + Data Reduction = Data Exploration


Data Mining Techniques

Segmentation/Clustering

Goal: divide the entire data into a small number of subgroups.

Homogeneous within groups while heterogeneous between groups.

Examples: Market segmentation, social network analysis.

Descriptive & Unsupervised


Data Mining Techniques

Segmentation/Clustering example: hierarchical clustering


Data Mining Techniques

Classification

Goal: predict categorical target (outcome) variable.

Examples: Purchase/no purchase, fraud/no fraud, creditworthy/not creditworthy.

Each row is a case/record/instance.

Each column is a variable.

Target variable is often binary (yes/no).

Predictive & Supervised


Data Mining Techniques

Classification Example: Decision Tree


Data Mining Techniques

Classification Example: Logistic Regression

[Figure: logistic (sigmoid) curve; x-axis from -5 to 5, y-axis from 0 to 1]

Play if 1/(1+exp(-0.2*outlook+0.4*humidity+0.8*windy)) > 0.5

Else, do not play
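The decision rule above can be sketched in Python. The coefficients are copied from the slide; the 0/1 encoding of outlook, humidity, and windy and the function names are assumptions made only for illustration, since the slide does not say how the variables are coded.

```python
import math

def play_probability(outlook, humidity, windy):
    """Evaluate 1 / (1 + exp(-0.2*outlook + 0.4*humidity + 0.8*windy))."""
    exponent = -0.2 * outlook + 0.4 * humidity + 0.8 * windy
    return 1.0 / (1.0 + math.exp(exponent))

def decide(outlook, humidity, windy, threshold=0.5):
    """Return 'play' when the estimated probability exceeds the threshold."""
    probability = play_probability(outlook, humidity, windy)
    return "play" if probability > threshold else "do not play"

# Hypothetical inputs: 1 means the condition is present, 0 means it is absent.
print(decide(outlook=1, humidity=0, windy=0))
print(decide(outlook=0, humidity=1, windy=1))
```

The 0.5 threshold is the cut-off used on the slide; a different cut-off would trade false positives against false negatives.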


Data Mining Techniques

Classification Examples

“Separate the riding mower buyers (●) from non-buyers (○)”

[Figure: scatter plot; x-axis: income (x $1,000), y-axis: lot size (x 1,000 sq ft)]


Data Mining Techniques

Prediction

Goal: predict numerical target (outcome) variable.

Examples: sales, revenue, performance.

As in classification:

• Each row is a case/record/instance.

• Each column is a variable.

Taken together, classification and prediction constitute “predictive analytics”.

Predictive & Supervised


Data Mining Techniques

Prediction Example: Neural Networks


Data Mining Techniques

Association Rule

Goal: produce rules that define “what goes with what”.

Example: “If X was purchased, Y was also purchased”.

Rows are transactions.

Used in recommender systems: “Our records show you bought X, you may also like Y”.

Also called “affinity analysis” or “market basket analysis”.

Predictive & Unsupervised


Data Mining Techniques

Association Rule Example: Market Basket Analysis

Walmart (USA), E-Mart (Korea)


Data Mining Techniques

Novelty Detection

Goal: identify whether a new case is similar to the given ‘normal’ cases.

Examples: medical diagnosis, fault detection, identity verification.

Each row is a case/record/instance.

Each column is a variable.

No explicit target variable, but it is assumed that all records have the same target.

Also called “outlier detection” or “one-class classification”.

Predictive & Unsupervised


Data Mining Techniques

Novelty Detection Example: Keystroke Dynamics-based User Authentication

http://ksd.snu.ac.kr


Data Mining Techniques

                        Descriptive Modeling          Predictive Modeling

Supervised Learning     • …                           • Classification
                                                      • Prediction

Unsupervised Learning   • Data Visualization          • Association Rules
                        • Data Reduction              • Novelty Detection
                        • Segmentation/clustering


Steps in Data Mining

1. Define and understand the purpose of the data mining project

2. Formulate the data mining problem

3. Obtain/verify/modify the data

4. Explore and customize the data

5. Build data mining models

6. Evaluate and interpret the results

7. Deploy and monitor the model


Steps in Data Mining

Define and understand the purpose of the data mining project

Why do we have to conduct this project?

What would be achieved if the project succeeds?

(Jun, 2010: http://www.kdnuggets.com)


Steps in Data Mining

Formulate the data mining problem

What is the purpose?

Increase sales.

Detect cancer patients.

What data mining task is appropriate?

Classification.

Prediction.

Association rules, …


Steps in Data Mining

Obtain/verify/modify the data: Data acquisition

Data source

Data warehouse, data mart, …

Define input variables and target variable if necessary

Ex: Churn prediction for credit card service

• Inputs: age, sex, tenure, amount of spending, risk grade, …

• Target: whether he/she leaves the company.


Steps in Data Mining

Obtain/verify/modify the data: Outlier detection

Outlier

“A value that the variable cannot have” or “An extremely rare value” (ex: age 990, height -150 cm, …)

Real databases contain a number of outliers for many reasons.

How to deal with outliers?

Ignore the records with outliers if the total number of records is sufficient.

Replace with another value (mean, median, estimate from a certain pdf, etc.) if the total number of records is insufficient.
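The two ways of handling outliers can be sketched as follows. The validity range for age (0 to 120) is an assumption chosen for illustration; the slide only gives 990 as an example of an impossible age. The function names and the toy data are hypothetical.

```python
from statistics import median

def is_valid_age(age):
    # Assumed plausible range; adjust to the variable at hand.
    return 0 <= age <= 120

def drop_outliers(ages):
    """Option 1: ignore the records that contain an outlier."""
    return [a for a in ages if is_valid_age(a)]

def replace_outliers(ages):
    """Option 2: replace each outlier with the median of the valid values."""
    fill = median(a for a in ages if is_valid_age(a))
    return [a if is_valid_age(a) else fill for a in ages]

ages = [23, 35, 990, 41, 29]
print(drop_outliers(ages))
print(replace_outliers(ages))
```

The median is used here because, unlike the mean, it is not distorted by any extreme values that remain in the data.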


Steps in Data Mining

Obtain/verify/modify the data: Missing Value Imputation

Missing value

A value is missing when the variable is null in the database although it should have a certain real value.

Causes: operational errors, human errors.

How to deal with missing values?

Ignore the records with missing values if the total number of records is sufficient.

Replace with another value (mean, median, estimate from a certain pdf, etc.) if the total number of records is insufficient.
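A minimal sketch of both options, assuming each record is a dictionary and a missing value is stored as None. The field names and the toy records are hypothetical.

```python
from statistics import mean

def drop_missing(records):
    """Option 1: ignore every record that has at least one missing value."""
    return [r for r in records if None not in r.values()]

def impute_mean(records, field):
    """Option 2: replace missing values of one field with the observed mean."""
    observed = [r[field] for r in records if r[field] is not None]
    fill = mean(observed)
    return [{**r, field: fill if r[field] is None else r[field]} for r in records]

records = [
    {"id": 1, "income": 100},
    {"id": 2, "income": None},  # missing value
    {"id": 3, "income": 300},
]
print(drop_missing(records))
print(impute_mean(records, "income"))
```

Mean imputation keeps every record but shrinks the variance of the imputed variable, which is one reason the slide lists the median and pdf-based estimates as alternatives.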


Steps in Data Mining

Obtain/verify/modify the data: Variable handling

Type of variables

Binary: 0/1 (ex: benign/malignant in medical diagnosis).

Categorical: more than two values, ordered (high, middle, low) or not ordered (ex: color, job).

Ordinal: continuous, differences between two consecutive values are not identical (ex: rank of the final exam).

Interval: continuous, differences between two consecutive values are identical (ex: age, height, weight).


Steps in Data Mining

Obtain/verify/modify the data: Variable handling

Variable transformation

Binning:
• interval → binary or ordered categorical (ex: Low, Mid, High).

1-of-C coding:
• unordered categorical → binary.

Example: “Color: yellow, red, blue, green”

          d1   d2   d3
yellow    1    0    0
red       0    1    0
blue      0    0    1
green     0    0    0
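Both transformations can be sketched in a few lines. The coding function reproduces the color table above, where the last category is represented by all zeros; the cut points used for binning and the function names are assumptions for illustration.

```python
def bin_value(x, low_cut, high_cut):
    """Binning: map an interval variable to an ordered category."""
    if x < low_cut:
        return "Low"
    if x < high_cut:
        return "Mid"
    return "High"

def one_of_c(value, categories):
    """1-of-C coding with C-1 dummy variables; the last category is all zeros."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories[:-1]]

COLORS = ["yellow", "red", "blue", "green"]

for color in COLORS:
    print(color, one_of_c(color, COLORS))

# Hypothetical cut points for an age variable.
print(bin_value(27, low_cut=30, high_cut=50))
```

Using C-1 dummies rather than C avoids a redundant column, since the last category is implied when all the others are zero.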


Steps in Data Mining

Explore and customize the data: Data Visualization

Single variable

Histogram:
• shows the distribution of a single variable.
• possible to check the normality.

[Figure: histogram of MEDV; x-axis: MEDV (5 to 50), y-axis: Frequency (0 to 180)]

Box plot

[Figure: box plot annotated with “min”, quartile 1, median, mean, quartile 3, “max”, and outliers]


Steps in Data Mining

Explore and customize the data: Data Visualization

Multiple variables

Correlation table:

• Indicates which variables are highly (positively or negatively) correlated.

• Helps to remove irrelevant variables or select representative variables.

          CRIM       ZN         INDUS      CHAS       NOX        RM
CRIM      1
ZN        -0.20047   1
INDUS     0.406583   -0.53383   1
CHAS      -0.05589   -0.0427    0.062938   1
NOX       0.420972   -0.5166    0.763651   0.091203   1
RM        -0.21925   0.311991   -0.39168   0.091251   -0.30219   1
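A correlation table like the one above can be computed with the Pearson correlation coefficient. The sketch below uses a small made-up data set, not the housing data behind the slide, and the column names x, y, and z are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

def correlation_table(columns):
    """Return a nested dict holding the correlation of every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Toy data for illustration only.
data = {"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [4, 3, 2, 1]}
table = correlation_table(data)
for name, row in table.items():
    print(name, {other: round(r, 3) for other, r in row.items()})
```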


Steps in Data Mining

Explore and customize the data: Data Visualization

Multiple variables

Scatter plot matrix:

• Shows the relation between each pair of variables.

[Figure: scatter plot matrix of Var. 1, Var. 2, Var. 3, and Var. 4]


Steps in Data Mining

Explore and customize the data: Dimensionality Reduction

Curse of dimensionality

The number of records needed to sustain the same explanatory power increases exponentially as the number of variables increases.

2^1 = 2, 2^2 = 4, 2^3 = 8

“If there are various logical ways to explain a certain phenomenon, the simplest is the best” - Occam’s Razor


Steps in Data Mining

Explore and customize the data: Dimensionality Reduction

Variable reduction

Select a small set of relevant variables.

Correlation analysis, Kolmogorov-Smirnov test, …

       V1     V2     V3     V4     V5     V6
V1     1      0.9    -0.8   0.1    0.2    0
V2            1      -0.7   0.2    0.1    0.1
V3                   1      -0.1   0.1    -0.1
V4                          1      0.9    0.3
V5                                 1      -0.9
V6                                        1

Select V1 & V4
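One way to arrive at the selection of V1 and V4 is to group variables that are linked by a strong correlation and keep one representative per group. The slide does not state the procedure, so the grouping rule, the 0.7 threshold, and the function name below are assumptions; only the correlation values come from the table.

```python
NAMES = ["V1", "V2", "V3", "V4", "V5", "V6"]

# Upper triangle of the correlation table shown on the slide.
CORR = {
    ("V1", "V2"): 0.9, ("V1", "V3"): -0.8, ("V1", "V4"): 0.1,
    ("V1", "V5"): 0.2, ("V1", "V6"): 0.0,
    ("V2", "V3"): -0.7, ("V2", "V4"): 0.2, ("V2", "V5"): 0.1, ("V2", "V6"): 0.1,
    ("V3", "V4"): -0.1, ("V3", "V5"): 0.1, ("V3", "V6"): -0.1,
    ("V4", "V5"): 0.9, ("V4", "V6"): 0.3,
    ("V5", "V6"): -0.9,
}

def select_representatives(names, corr, threshold=0.7):
    """Group variables chained by |r| >= threshold; keep the first of each group."""
    grouped = set()
    representatives = []
    for name in names:
        if name in grouped:
            continue
        representatives.append(name)
        grouped.add(name)
        stack = [name]
        while stack:  # walk through everything strongly linked to this group
            current = stack.pop()
            for other in names:
                if other in grouped:
                    continue
                r = corr.get((current, other), corr.get((other, current), 0.0))
                if abs(r) >= threshold:
                    grouped.add(other)
                    stack.append(other)
    return representatives

print(select_representatives(NAMES, CORR))
```

With this rule V1, V2, and V3 form one group, and V4, V5, and V6 form another because V6 is tied to V4 through V5.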


Steps in Data Mining

Explore and customize the data: Dimensionality Reduction

Variable extraction

Construct new variables that carry more condensed information than the original variables.

Principal component analysis (PCA), …

Example:

Original variables:

• Age, sex, height, weight

• Income, property, tax paid

Constructed variables:

• Var1: age + 3*I(sex = female) + 0.2*height - 0.3*weight

• Var2: income + 0.1*property + 2*tax paid


Steps in Data Mining

Explore and customize the data: Instance Reduction

Random sampling

Select a small set of records with a uniform sampling rate.

In classification, class ratios are preserved.

Stratified sampling

Select a set of records such that rare events have a higher probability of being selected.

In classification, class ratios are modified.

• Under-sampling: preserve minority, reduce majority.

• Over-sampling: preserve majority, increase minority.
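Random sampling and under-sampling can be sketched as follows. Balancing the classes to a 1:1 ratio, the fixed random seed, and the toy data set (5 positive records out of 100) are assumptions for illustration.

```python
import random

def random_sample(records, rate, seed=0):
    """Random sampling: every record has the same chance of being selected."""
    rng = random.Random(seed)
    return rng.sample(records, round(len(records) * rate))

def under_sample(records, label_of, seed=0):
    """Under-sampling: keep the smallest class and shrink the others to its size."""
    rng = random.Random(seed)
    by_class = {}
    for record in records:
        by_class.setdefault(label_of(record), []).append(record)
    minority_size = min(len(group) for group in by_class.values())
    sampled = []
    for group in by_class.values():
        if len(group) == minority_size:
            sampled.extend(group)
        else:
            sampled.extend(rng.sample(group, minority_size))
    return sampled

# Toy records: (id, label) with label 1 for the rare class.
records = [(i, 1 if i < 5 else 0) for i in range(100)]
balanced = under_sample(records, label_of=lambda record: record[1])
print(len(random_sample(records, 0.1)), len(balanced))
```

Over-sampling would work the other way round, for example by drawing minority records with replacement until the classes are balanced.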


Steps in Data Mining

Explore and customize the data: Data separation

Over-fitting

Occurs when data mining algorithms ‘memorize’ the given data, including unnecessary details (noise, outliers, etc.).

[Figure: two plots, each with both axes ranging from 0 to 10]


Steps in Data Mining

Explore and customize the data: Data partition

Training Data

Used to build a model or to train a data mining algorithm.

Validation Data

Used to select the best parameters for the model.

Test Data

Used to select the best model among the algorithms considered.

[Diagram: Algorithms A-1, A-2, A-3, B-1, B-2, and B-3 shown for each of the Training Data, Validation Data, and Test Data]
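A data partition can be sketched as a shuffle followed by a three-way split. The 60/20/20 proportions, the fixed seed, and the function name are assumptions; the slide does not prescribe the split sizes.

```python
import random

def partition(records, train=0.6, valid=0.2, seed=0):
    """Shuffle the records, then split them into training, validation, and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * valid)
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains
    return training, validation, test

training, validation, test = partition(range(100))
print(len(training), len(validation), len(test))
```

Shuffling before splitting guards against any ordering in the source data, such as records sorted by date or by class.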


Steps in Data Mining

Explore and customize the data: Data normalization

Normalization (Standardization)

Eliminate the effect caused by different measurement scales or units.

z-score: (value - mean) / (standard deviation).

Original data

Id       Age    Income
1        25     1,000,000
2        35     2,000,000
3        45     3,000,000
…        …      …
Mean     35     2,000,000
Stdev    5      1,000,000

Normalized data

Id       Age    Income
1        -2     -1
2        0      0
3        2      1
…        …      …
Mean     0      0
Stdev    1      1
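The z-score arithmetic in the table can be checked with a few lines of Python. The means and standard deviations are taken from the slide, where they describe the full data set (only three records are shown); the function name is hypothetical.

```python
def z_score(value, mean, stdev):
    """Standardize a value: (value - mean) / (standard deviation)."""
    return (value - mean) / stdev

ages = [25, 35, 45]
incomes = [1_000_000, 2_000_000, 3_000_000]

# Mean and standard deviation as listed on the slide.
normalized_ages = [z_score(a, mean=35, stdev=5) for a in ages]
normalized_incomes = [z_score(x, mean=2_000_000, stdev=1_000_000) for x in incomes]

print(normalized_ages)     # [-2.0, 0.0, 2.0]
print(normalized_incomes)  # [-1.0, 0.0, 1.0]
```

After normalization both variables are on the same scale, so income no longer dominates age in distance-based methods.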


Steps in Data Mining

Build data mining models

Data mining algorithm

Classification

• Logistic regression, k-nearest neighbor, naïve Bayes, classification trees, neural networks, linear discriminant analysis.

Prediction

• Linear regression, k-nearest neighbor, regression trees, neural networks.

Association rules: Apriori algorithm.

Clustering: Hierarchical clustering, K-Means clustering.


Steps in Data Mining

Evaluate and interpret the results

Classification performance

Confusion matrix

                Predicted 1(+)                      Predicted 0(-)
Actual 1(+)     True positive, Sensitivity (A)      False negative, Type II error (B)
Actual 0(-)     False positive, Type I error (C)    True negative, Specificity (D)

Simple accuracy: (A+D)/(A+B+C+D)

Balanced correction rate: \sqrt{\frac{A}{A+B} \cdot \frac{D}{C+D}}

Lift charts, receiver operating characteristic (ROC) curve, etc.
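The classification measures can be computed directly from the four confusion-matrix counts, with A = true positives, B = false negatives, C = false positives, and D = true negatives. Simple accuracy is the share of correct predictions, (A+D)/(A+B+C+D). The balanced correction rate is taken here as the geometric mean of sensitivity and specificity, which is an assumption about the intended formula. The labels below are made up.

```python
import math

def confusion_counts(actual, predicted):
    """Return (A, B, C, D) = (TP, FN, FP, TN) for 0/1 labels."""
    pairs = list(zip(actual, predicted))
    a = sum(1 for y, p in pairs if y == 1 and p == 1)  # true positives
    b = sum(1 for y, p in pairs if y == 1 and p == 0)  # false negatives
    c = sum(1 for y, p in pairs if y == 0 and p == 1)  # false positives
    d = sum(1 for y, p in pairs if y == 0 and p == 0)  # true negatives
    return a, b, c, d

def simple_accuracy(a, b, c, d):
    return (a + d) / (a + b + c + d)

def balanced_correction_rate(a, b, c, d):
    sensitivity = a / (a + b)  # share of actual positives found
    specificity = d / (c + d)  # share of actual negatives found
    return math.sqrt(sensitivity * specificity)

actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
counts = confusion_counts(actual, predicted)
print(counts, simple_accuracy(*counts), balanced_correction_rate(*counts))
```

Unlike simple accuracy, a measure built from sensitivity and specificity does not reward a model for predicting only the majority class.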


Steps in Data Mining

Evaluate and interpret the results

Prediction performance

y: actual target value, y’: predicted target value

• Mean squared error: MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - y'_i)^2

• Root mean squared error: RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - y'_i)^2}

• Mean absolute error: MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - y'_i|

• Mean absolute percentage error: MAPE = \frac{1}{n} \sum_{i=1}^{n} \frac{|y_i - y'_i|}{|y_i|}

[Figure: plot with both axes ranging from 0 to 10]
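The four error measures can be written as short Python functions, where y holds the actual target values and y_pred the predicted ones. MAPE is returned as a fraction rather than a percentage, and the toy values are made up for illustration.

```python
import math

def mse(y, y_pred):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(y, y_pred)) / len(y)

def rmse(y, y_pred):
    """Root mean squared error."""
    return math.sqrt(mse(y, y_pred))

def mae(y, y_pred):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(y, y_pred)) / len(y)

def mape(y, y_pred):
    """Mean absolute percentage error; undefined when an actual value is 0."""
    return sum(abs(a - p) / abs(a) for a, p in zip(y, y_pred)) / len(y)

y = [2, 4, 5]
y_pred = [1, 4, 7]
print(mse(y, y_pred), rmse(y, y_pred), mae(y, y_pred), mape(y, y_pred))
```

MSE and RMSE penalize large errors more heavily than MAE, while MAPE expresses the error relative to the size of the actual value.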


Steps in Data Mining

Evaluate and interpret the results

Clustering

Within variance: variance among the records in a single cluster.

Between variance: variance between clusters.

Good clustering: high between variance and low within variance.

Association rules (for a rule “if B, then A”)

Support: P(A, B)

Confidence: P(A|B) = \frac{P(A, B)}{P(B)}

Lift: \frac{P(A|B)}{P(A)} = \frac{P(A, B)}{P(A)P(B)}
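The association-rule measures can be estimated from a list of transactions by counting. In this sketch, support is the share of transactions containing all the given items, confidence is the conditional probability of the consequent given the antecedent, and lift divides the confidence by the support of the consequent. The transactions and item names are made up for illustration.

```python
def support(transactions, items):
    """Share of transactions that contain every item in `items`."""
    wanted = set(items)
    return sum(1 for t in transactions if wanted <= set(t)) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

def lift(transactions, antecedent, consequent):
    """Confidence relative to how common the consequent is on its own."""
    return confidence(transactions, antecedent, consequent) / support(transactions, consequent)

transactions = [
    {"beer", "diapers"},
    {"beer", "diapers", "milk"},
    {"milk"},
    {"beer"},
    {"diapers", "milk"},
]
rule = ({"beer"}, {"diapers"})  # "if beer was purchased, diapers were also purchased"
print(support(transactions, rule[0] | rule[1]))
print(confidence(transactions, *rule))
print(lift(transactions, *rule))
```

A lift above 1 suggests the two item sets occur together more often than they would if they were independent.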


Steps in Data Mining

Deploy and monitor the model

Deployment

Integrate the data mining model into the operational system.

Run the model on real data to produce decisions or actions.

• “Send Mr. Kang a coupon because his likelihood to leave the company next month is 80%”

Monitoring

Evaluate the performance of the model after deployment.

Update or redevelop if necessary.