Copyright © 2004 by Jinyan Li and Limsoon Wong Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong



Rule-Based Data Mining Methods for

Classification Problems in Biomedical Domains

Jinyan Li, Limsoon Wong


Part 2: Rule-Based Approaches

Outline

• Overview of Supervised Learning
• Decision Tree Ensembles
  – Bagging
  – Boosting
  – Random forest
  – Randomization trees
  – CS4


Overview of Supervised Learning

Computational Supervised Learning

• Also called classification
• Learn from past experience, and use the learned knowledge to classify new data
• Knowledge learned by intelligent algorithms
• Examples:
  – Clinical diagnosis for patients
  – Cell type classification

Data

• A classification application involves more than one class of data. E.g.,
  – Normal vs disease cells for a diagnosis problem
• Training data is a set of instances (samples, points) with known class labels
• Test data is a set of instances whose class labels are to be predicted

Notation

• Training data: $\{\langle x_1, y_1\rangle, \langle x_2, y_2\rangle, \ldots, \langle x_m, y_m\rangle\}$, where the $x_j$ are $n$-dimensional vectors and the $y_j$ are from a discrete space $Y$. E.g., $Y = \{\text{normal}, \text{disease}\}$.
• Test data: $\{\langle u_1, ?\rangle, \langle u_2, ?\rangle, \ldots, \langle u_k, ?\rangle\}$

Process

• Training data X, together with class labels Y, is used to learn f(X): a classifier, a mapping, a hypothesis
• Test data U is given to the classifier, which outputs the predicted class labels f(U)

Relational Representation of Gene Expression Data

gene1  gene2  gene3  gene4  …  genen   class
x11    x12    x13    x14    …  x1n     P
x21    x22    x23    x24    …  x2n     N
x31    x32    x33    x34    …  x3n     P
…      …      …      …      …  …       …
xm1    xm2    xm3    xm4    …  xmn     N

• m samples (rows)
• n features (columns), on the order of 1000

Features

• Also called attributes
• Categorical features
  – feature color = {red, blue, green}
• Continuous or numerical features
  – gene expression
  – age
  – blood pressure
• Discretization

An Example

Overall Picture of Supervised Learning

• Data: Biomedical, Financial, Government, Scientific
• Classifiers (M-Doctors): Decision trees, Emerging patterns, SVM, Neural networks

Evaluation of a Classifier

• Performance on independent blind test data
• K-fold cross validation: given a dataset, divide it into k even parts; k-1 of them are used for training, and the remaining part is treated as test data
• LOOCV (leave-one-out cross validation), a special case of k-fold CV
• Accuracy, error rate
• False positive rate, false negative rate, sensitivity, specificity, precision
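The evaluation ideas on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `k_fold_indices` and `confusion_metrics` and the toy labels are hypothetical and do not come from the slides, and the rate formulas are the standard two-class definitions.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    # Fold sizes differ by at most one when n_samples is not divisible by k.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def confusion_metrics(y_true, y_pred, positive):
    """Standard two-class rates; assumes none of the denominators is zero."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "error_rate": (fp + fn) / len(pairs),
        "sensitivity": tp / (tp + fn),            # true positive rate
        "specificity": tn / (tn + fp),            # true negative rate
        "precision": tp / (tp + fp),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
    }


folds = list(k_fold_indices(14, 5))      # 5-fold CV on 14 instances
loocv = list(k_fold_indices(14, 14))     # LOOCV: k equals the number of instances
metrics = confusion_metrics(
    ["disease", "disease", "normal", "normal", "normal"],   # hypothetical true labels
    ["disease", "normal", "normal", "normal", "disease"],   # hypothetical predictions
    positive="disease",
)
```

In a full cross validation run, a classifier would be trained on each `train_idx` and scored on the matching `test_idx`, and the per-fold metrics would then be averaged.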

Requirements of Biomedical Classification

• High accuracy
• High comprehensibility

Importance of Rule-Based Methods

• Systematic selection of a small number of features used for decision making, which increases the comprehensibility of the knowledge patterns
• C4.5 and CART are two commonly used rule induction algorithms, also called decision tree induction algorithms

Structure of Decision Trees

[Figure: a decision tree with a root node, internal nodes, and leaf nodes. The nodes test the features x1, x2, x3, x4, branches carry tests such as "> a1" and "> a2", and the leaf nodes are labelled A or B.]

• If x1 > a1 & x2 > a2, then it’s class A
• C4.5 and CART are two of the most widely used
• Easy interpretation, but accuracy generally unattractive

Elegance of Decision Trees

[Figure: a decision tree whose leaf nodes are labelled A or B.]

Brief History of Decision Trees

• CLS (Hunt et al., 1966): cost-driven
• ID3 (Quinlan, 1986, MLJ): information-driven
• C4.5 (Quinlan, 1993): gain ratio + pruning ideas
• CART (Breiman et al., 1984): Gini index

A Simple Dataset

• 9 Play samples
• 5 Don’t samples
• A total of 14

A Decision Tree

outlook
├─ sunny: humidity
│   ├─ <= 75: Play (2)
│   └─ > 75: Don’t (3)
├─ overcast: Play (4)
└─ rain: windy
    ├─ false: Play (3)
    └─ true: Don’t (2)

(Numbers in parentheses are the training instances reaching each leaf.)

• Constructing an optimal decision tree is an NP-complete problem
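The decision tree on this slide can be read as nested if-then rules. The following is a minimal Python sketch of that reading; the function name `classify` is hypothetical, and the rows are copied from the 14-instance training table used in this deck.

```python
def classify(outlook, humidity, windy):
    """Apply the outlook/humidity/windy tree as nested if-then rules."""
    if outlook == "sunny":
        return "Play" if humidity <= 75 else "Don't"
    if outlook == "overcast":
        return "Play"
    # Remaining case: outlook == "rain"
    return "Don't" if windy else "Play"


# (outlook, temperature, humidity, windy, class) for the 14 training instances
DATA = [
    ("sunny", 75, 70, True, "Play"), ("sunny", 80, 90, True, "Don't"),
    ("sunny", 85, 85, False, "Don't"), ("sunny", 72, 95, True, "Don't"),
    ("sunny", 69, 70, False, "Play"), ("overcast", 72, 90, True, "Play"),
    ("overcast", 83, 78, False, "Play"), ("overcast", 64, 65, True, "Play"),
    ("overcast", 81, 75, False, "Play"), ("rain", 71, 80, True, "Don't"),
    ("rain", 65, 70, True, "Don't"), ("rain", 75, 80, False, "Play"),
    ("rain", 68, 80, False, "Play"), ("rain", 70, 96, False, "Play"),
]

# Temperature is not tested by this tree, so it is ignored here.
correct = sum(classify(o, h, w) == label for o, _temp, h, w, label in DATA)
```

Each training instance follows exactly one root-to-leaf path, which is the single-coverage property of a decision tree.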


Construction of a Decision Tree

• Determination of the root node of the tree and the root node of its sub-trees

Most Discriminatory Feature

• Every feature can be used to partition the training data
• If each partition contains training instances of a single class only (a pure partition), then this feature is most discriminatory

Example of Partitions

• Categorical feature
  – Number of partitions of the training data is equal to the number of values of this feature
• Numerical feature
  – Two partitions

Instance #  Outlook   Temp  Humidity  Windy  Class
1           Sunny     75    70        true   Play
2           Sunny     80    90        true   Don’t
3           Sunny     85    85        false  Don’t
4           Sunny     72    95        true   Don’t
5           Sunny     69    70        false  Play
6           Overcast  72    90        true   Play
7           Overcast  83    78        false  Play
8           Overcast  64    65        true   Play
9           Overcast  81    75        false  Play
10          Rain      71    80        true   Don’t
11          Rain      65    70        true   Don’t
12          Rain      75    80        false  Play
13          Rain      68    80        false  Play
14          Rain      70    96        false  Play

Total 14 training instances

• Outlook = sunny: instances 1, 2, 3, 4, 5; classes P, D, D, D, P
• Outlook = overcast: instances 6, 7, 8, 9; classes P, P, P, P
• Outlook = rain: instances 10, 11, 12, 13, 14; classes D, D, P, P, P

Total 14 training instances

• Temperature <= 70: instances 5, 8, 11, 13, 14; classes P, P, D, P, P
• Temperature > 70: instances 1, 2, 3, 4, 6, 7, 9, 10, 12; classes P, D, D, D, P, P, P, D, P
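The two kinds of partition shown on these slides can be reproduced with a short Python sketch. The function names `partition_categorical` and `partition_numerical` are hypothetical; the rows are the 14 training instances of this deck, numbered from 1 as in the table.

```python
from collections import defaultdict

# (outlook, temperature, humidity, windy, class), in instance order 1..14
DATA = [
    ("sunny", 75, 70, True, "Play"), ("sunny", 80, 90, True, "Don't"),
    ("sunny", 85, 85, False, "Don't"), ("sunny", 72, 95, True, "Don't"),
    ("sunny", 69, 70, False, "Play"), ("overcast", 72, 90, True, "Play"),
    ("overcast", 83, 78, False, "Play"), ("overcast", 64, 65, True, "Play"),
    ("overcast", 81, 75, False, "Play"), ("rain", 71, 80, True, "Don't"),
    ("rain", 65, 70, True, "Don't"), ("rain", 75, 80, False, "Play"),
    ("rain", 68, 80, False, "Play"), ("rain", 70, 96, False, "Play"),
]


def partition_categorical(rows, col):
    """One partition per distinct value of a categorical feature."""
    parts = defaultdict(list)
    for number, row in enumerate(rows, start=1):
        parts[row[col]].append(number)
    return dict(parts)


def partition_numerical(rows, col, threshold):
    """Two partitions for a numerical feature: <= threshold and > threshold."""
    low = [n for n, row in enumerate(rows, start=1) if row[col] <= threshold]
    high = [n for n, row in enumerate(rows, start=1) if row[col] > threshold]
    return low, high


by_outlook = partition_categorical(DATA, col=0)
by_temperature = partition_numerical(DATA, col=1, threshold=70)
```

The overcast partition contains only Play instances, so it is pure; neither temperature partition is.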

Three Measures

• Gini index
• Information gain
• Gain ratio
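The slides name the three measures without giving formulas. The sketch below uses the standard textbook definitions (entropy-based information gain, gain ratio as gain divided by split information, and the reduction in Gini impurity), which is an assumption about what the deck intends. It is applied to the Outlook split of the 14-instance dataset, using class counts in the order [Play, Don't].

```python
from math import log2


def entropy(counts):
    """Entropy in bits of a class-count vector."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)


def gini(counts):
    """Gini impurity of a class-count vector."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def weighted(measure, partitions):
    """Size-weighted average of an impurity measure over the partitions."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * measure(p) for p in partitions)


def information_gain(parent, partitions):
    return entropy(parent) - weighted(entropy, partitions)


def gain_ratio(parent, partitions):
    # Split information: entropy of the partition sizes themselves.
    split_info = entropy([sum(p) for p in partitions])
    return information_gain(parent, partitions) / split_info


def gini_reduction(parent, partitions):
    return gini(parent) - weighted(gini, partitions)


parent = [9, 5]                        # 9 Play, 5 Don't
outlook = [[2, 3], [4, 0], [3, 2]]     # sunny, overcast, rain

ig = information_gain(parent, outlook)
gr = gain_ratio(parent, outlook)
gd = gini_reduction(parent, outlook)
```

A larger value of any of the three indicates a more discriminatory feature; gain ratio additionally penalizes features that split the data into many small partitions.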

Steps of Decision Tree Construction

• Select the best feature as the root node of the whole tree
• After partitioning by this feature, select the best feature (with respect to each subset of the training data) as the root node of the corresponding sub-tree
• Repeat recursively, until the partitions become pure or almost pure
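These steps can be sketched as a short recursive procedure. The version below is an illustration, not C4.5 itself: it assumes information gain as the selection measure, tries every observed value as a threshold for numerical features, stops only when a partition is pure, and does no pruning. All names (`build`, `predict`, `candidate_splits`) are hypothetical.

```python
from collections import Counter
from math import log2

FEATURES = ["outlook", "temperature", "humidity", "windy"]
DATA = [  # the 14 training instances: features followed by the class label
    ("sunny", 75, 70, True, "Play"), ("sunny", 80, 90, True, "Don't"),
    ("sunny", 85, 85, False, "Don't"), ("sunny", 72, 95, True, "Don't"),
    ("sunny", 69, 70, False, "Play"), ("overcast", 72, 90, True, "Play"),
    ("overcast", 83, 78, False, "Play"), ("overcast", 64, 65, True, "Play"),
    ("overcast", 81, 75, False, "Play"), ("rain", 71, 80, True, "Don't"),
    ("rain", 65, 70, True, "Don't"), ("rain", 75, 80, False, "Play"),
    ("rain", 68, 80, False, "Play"), ("rain", 70, 96, False, "Play"),
]


def entropy(rows):
    counts = Counter(r[-1] for r in rows)
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())


def candidate_splits(rows):
    """Yield (feature index, threshold or None, partitions keyed by branch)."""
    for col in range(len(FEATURES)):
        values = sorted({r[col] for r in rows})
        if type(rows[0][col]) is int:          # numerical feature: two partitions
            for t in values[:-1]:              # keeps both sides non-empty
                yield col, t, {"<=": [r for r in rows if r[col] <= t],
                               ">": [r for r in rows if r[col] > t]}
        elif len(values) > 1:                  # categorical: one partition per value
            yield col, None, {v: [r for r in rows if r[col] == v] for v in values}


def build(rows):
    labels = Counter(r[-1] for r in rows)
    splits = list(candidate_splits(rows))
    if len(labels) == 1 or not splits:
        return labels.most_common(1)[0][0]     # leaf: the (majority) class

    def gain(split):
        parts = split[2].values()
        return entropy(rows) - sum(len(p) / len(rows) * entropy(p) for p in parts)

    col, threshold, parts = max(splits, key=gain)   # best feature for this subset
    return {"feature": col, "threshold": threshold,
            "children": {branch: build(p) for branch, p in parts.items()}}


def predict(tree, row):
    while isinstance(tree, dict):
        value = row[tree["feature"]]
        if tree["threshold"] is None:
            branch = value
        else:
            branch = "<=" if value <= tree["threshold"] else ">"
        tree = tree["children"][branch]
    return tree


tree = build(DATA)
root_feature = FEATURES[tree["feature"]]
```

On this dataset the sketch selects outlook at the root and classifies all 14 training instances correctly; real inducers add stopping criteria and pruning to avoid fitting noise.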

Characteristics of C4.5 Trees

• Single coverage of training data (elegance)
• Divide-and-conquer splitting strategy
• Fragmentation problem
• Locally reliable but globally insignificant rules
• Many globally significant rules are missed, which may mislead the system

Decision Tree Ensembles

• Bagging
• Boosting
• Random forest
• Randomization trees
• CS4

Motivating Example

• $h_1$, $h_2$, $h_3$ are independent classifiers with accuracy = 60%
• $C_1$, $C_2$ are the only classes
• $t$ is a test instance in $C_1$
• $h(t) = \operatorname{argmax}_{C \in \{C_1, C_2\}} |\{h_j \in \{h_1, h_2, h_3\} \mid h_j(t) = C\}|$
• Then

$$
\begin{aligned}
\mathrm{prob}(h(t) = C_1)
 &= \mathrm{prob}(h_1(t) = C_1 \wedge h_2(t) = C_1 \wedge h_3(t) = C_1) \\
 &\quad + \mathrm{prob}(h_1(t) = C_1 \wedge h_2(t) = C_1 \wedge h_3(t) = C_2) \\
 &\quad + \mathrm{prob}(h_1(t) = C_1 \wedge h_2(t) = C_2 \wedge h_3(t) = C_1) \\
 &\quad + \mathrm{prob}(h_1(t) = C_2 \wedge h_2(t) = C_1 \wedge h_3(t) = C_1) \\
 &= 60\% \cdot 60\% \cdot 60\% + 60\% \cdot 60\% \cdot 40\% + 60\% \cdot 40\% \cdot 60\% + 40\% \cdot 60\% \cdot 60\% \\
 &= 64.8\%
\end{aligned}
$$
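The 64.8% figure can be checked by enumerating every combination of correct and incorrect votes. The sketch below assumes, as the slide does, independent classifiers with equal accuracy and two classes; the function name `majority_vote_accuracy` is hypothetical, and an odd committee size is assumed so that ties cannot occur.

```python
from itertools import product


def majority_vote_accuracy(p, n_classifiers):
    """Probability that a majority of independent classifiers is correct."""
    total = 0.0
    for outcome in product([True, False], repeat=n_classifiers):
        if sum(outcome) > n_classifiers / 2:       # majority voted correctly
            prob = 1.0
            for is_correct in outcome:
                prob *= p if is_correct else 1 - p
            total += prob
    return total


three = majority_vote_accuracy(0.6, 3)   # the case on the slide
five = majority_vote_accuracy(0.6, 5)    # a larger committee, for comparison
```

Under the independence assumption, adding members raises the committee accuracy further (about 68.3% for five members), which is the motivation for building ensembles.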

Bagging

• Proposed by Breiman (1996)
• Also called bootstrap aggregating
• Makes use of randomness injected into the training data

Main Ideas

• Original training set: 50 p + 50 n
• Bootstrap samples drawn from it with replacement, e.g. 48 p + 52 n, 49 p + 51 n, 53 p + 47 n, …
• A base inducer such as C4.5 is applied to each sample
• The result is a committee H of classifiers: h1, h2, …, hk

Decision Making by Bagging

Given a new test sample T
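A minimal Python sketch of bagging with equal voting follows. It is an illustration under stated assumptions: the base inducer here is a hypothetical one-feature threshold rule (`train_stump`) rather than C4.5, and the training set is a made-up 1-D dataset with 50 n and 50 p instances.

```python
import random
from collections import Counter


def train_stump(sample):
    """Hypothetical base inducer: the best single-threshold rule on a 1-D feature."""
    best = None
    for threshold in sorted({x for x, _ in sample}):
        for low, high in (("n", "p"), ("p", "n")):
            errors = sum((low if x <= threshold else high) != y for x, y in sample)
            if best is None or errors < best[0]:
                best = (errors, threshold, low, high)
    _, threshold, low, high = best
    return lambda x: low if x <= threshold else high


def bagging(train, k, seed=0):
    """Build a committee of k classifiers, each from its own bootstrap sample."""
    rng = random.Random(seed)
    committee = []
    for _ in range(k):
        bootstrap = rng.choices(train, k=len(train))   # sampling with replacement
        committee.append(train_stump(bootstrap))
    return committee


def vote(committee, x):
    """Equal voting: every committee member has one vote."""
    return Counter(h(x) for h in committee).most_common(1)[0][0]


# Made-up training set: 50 n (x = 0..49) and 50 p (x = 50..99)
train = [(float(i), "n") for i in range(50)] + [(float(i), "p") for i in range(50, 100)]
committee = bagging(train, k=11)
```

Because each bootstrap sample differs slightly from the original data, the committee members differ too, and the vote smooths out their individual errors.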

Boosting

• AdaBoost by Freund & Schapire (1995)
• Also called adaptive boosting
• Makes use of weighted instances and weighted voting

Main Ideas

• Start with 100 instances with equal weight
• Train a classifier h1 and compute its error e1
• If the error is 0 or > 0.5, stop
• Otherwise re-weight: multiply the weights of the correctly classified instances by e1/(1-e1)
• Renormalize the weights so that they sum to 1
• This gives 100 instances with different weights, on which a classifier h2 is trained and its error computed, and so on

Decision Making by AdaBoost.M1

Given a new test sample T
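The boosting loop and its weighted vote can be sketched as follows. This is an illustrative reading of AdaBoost.M1 under several assumptions: the base learner is a hypothetical weighted threshold rule on a 1-D feature (not C4.5), the vote weight of each classifier is log(1/β) with β = e/(1-e) as in the standard AdaBoost.M1 formulation, and the loop stops at e >= 0.5 rather than the slide's e > 0.5 so that a zero-weight vote is never added. The toy data are made up.

```python
from math import log


def train_weighted_stump(data, weights):
    """Hypothetical base learner: threshold rule minimizing the weighted error."""
    best = None
    for threshold in sorted({x for x, _ in data}):
        for low, high in (("n", "p"), ("p", "n")):
            error = sum(w for (x, y), w in zip(data, weights)
                        if (low if x <= threshold else high) != y)
            if best is None or error < best[0]:
                best = (error, threshold, low, high)
    error, threshold, low, high = best
    return (lambda x: low if x <= threshold else high), error


def adaboost_m1(data, rounds):
    weights = [1.0 / len(data)] * len(data)        # equal weights to start
    committee = []                                 # (classifier, vote weight) pairs
    for _ in range(rounds):
        h, e = train_weighted_stump(data, weights)
        if e == 0 or e >= 0.5:                     # stopping rule (see assumptions)
            break
        beta = e / (1 - e)
        # Down-weight correctly classified instances, then renormalize to 1.
        weights = [w * beta if h(x) == y else w for (x, y), w in zip(data, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
        committee.append((h, log(1 / beta)))
    return committee


def predict(committee, x):
    """Weighted voting: each classifier votes with weight log(1/beta)."""
    scores = {}
    for h, alpha in committee:
        scores[h(x)] = scores.get(h(x), 0.0) + alpha
    return max(scores, key=scores.get)


# Made-up 1-D data that no single threshold rule can classify perfectly
data = [(float(x), "p") for x in (0, 1, 2)] + \
       [(float(x), "n") for x in (3, 4, 5, 6)] + \
       [(float(x), "p") for x in (7, 8, 9)]
committee = adaboost_m1(data, rounds=3)
```

No single threshold rule gets more than 7 of these 10 instances right, but the weighted vote of the three boosted rules classifies all of them correctly. A fuller implementation would also keep a first classifier whose error is 0 instead of returning an empty committee.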

Bagging vs Boosting

• Bagging
  – Construction of Bagging classifiers is independent
  – Equal voting
• Boosting
  – Construction of a new Boosting classifier depends on the performance of its previous classifier, i.e. sequential construction (a series of classifiers)
  – Weighted voting

Random Forest

• Proposed by Breiman (2001)
• Similar to Bagging, but the base inducer is not the standard C4.5
• Makes use of randomness twice

Main Ideas

• Original training set: 50 p + 50 n
• Resampled training sets, e.g. 48 p + 52 n, 49 p + 51 n, 53 p + 47 n, …
• A base inducer (not C4.5 but revised) is applied to each sample
• The result is a committee H of classifiers: h1, h2, …, hk

A Revised C4.5 as Base Classifier

[Figure: at the root node, instead of selecting from the original n features, the selection is from mtry randomly chosen features.]
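The feature-selection step of the revised base classifier can be sketched as below. This is an assumption-laden illustration: information gain stands in for whatever measure the base inducer uses, only the two categorical features of the weather data are included, and the name `choose_split_feature` is hypothetical.

```python
import random
from collections import Counter, defaultdict
from math import log2

# (outlook, windy, class) for the 14 training instances of this deck
ROWS = [
    ("sunny", True, "Play"), ("sunny", True, "Don't"), ("sunny", False, "Don't"),
    ("sunny", True, "Don't"), ("sunny", False, "Play"), ("overcast", True, "Play"),
    ("overcast", False, "Play"), ("overcast", True, "Play"), ("overcast", False, "Play"),
    ("rain", True, "Don't"), ("rain", True, "Don't"), ("rain", False, "Play"),
    ("rain", False, "Play"), ("rain", False, "Play"),
]


def entropy(labels):
    counts = Counter(labels)
    return -sum(c / len(labels) * log2(c / len(labels)) for c in counts.values())


def information_gain(rows, col):
    parts = defaultdict(list)
    for row in rows:
        parts[row[col]].append(row[-1])
    remainder = sum(len(p) / len(rows) * entropy(p) for p in parts.values())
    return entropy([r[-1] for r in rows]) - remainder


def choose_split_feature(rows, n_features, mtry, rng):
    """Pick the best feature among mtry features drawn at random (no replacement)."""
    candidates = rng.sample(range(n_features), mtry)
    best = max(candidates, key=lambda col: information_gain(rows, col))
    return best, candidates


rng = random.Random(0)
best_all, _ = choose_split_feature(ROWS, n_features=2, mtry=2, rng=rng)
best_one, drawn = choose_split_feature(ROWS, n_features=2, mtry=1, rng=rng)
```

With mtry equal to n the choice reduces to the usual best feature (outlook here); with a smaller mtry a weaker feature can be chosen, which is what makes the trees in the forest differ from one another.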

Decision Making by Random Forest

Given a new test sample T

Randomization Trees

• Proposed by Dietterich (2000)
• Makes use of randomness in the selection of the best split point

Main Ideas

• At the root node, with the original n features, select one split randomly from the candidates {feature 1: choice 1, 2, 3; feature 2: choice 1, 2, …; feature 8: choice 1, 2, 3}, a total of 20 candidates
• Equal voting on the committee of such decision trees
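The random choice among candidate splits can be sketched in a few lines. The sketch assumes that the 20 candidates are the 20 best-scoring (feature, split point) pairs and that one of them is drawn uniformly; the function name `pick_random_split`, the scores, and the candidate list are all hypothetical.

```python
import random


def pick_random_split(scored_candidates, rng, top=20):
    """Choose uniformly at random among the `top` best-scoring candidate splits.

    scored_candidates: list of (score, feature name, split point index) tuples.
    """
    ranked = sorted(scored_candidates, key=lambda c: c[0], reverse=True)
    return rng.choice(ranked[:top])


# Hypothetical candidates: 30 (score, feature, split point) triples,
# already listed in decreasing order of score.
candidates = [(1.0 / (i + 1), f"feature {i % 8 + 1}", i // 8 + 1) for i in range(30)]

rng = random.Random(0)
chosen = [pick_random_split(candidates, rng) for _ in range(5)]
```

Repeating the tree construction with a different random draw at each node yields a committee of different trees, all grown from the same training data.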

CS4

• Proposed by Li et al. (2003)
• CS4: Cascading and Sharing for decision trees
• Does not make use of randomness

Main Ideas

• Selection of root nodes is in a cascading manner!

[Figure: root nodes 1, 2, …, k head tree-1, tree-2, …, tree-k, for a total of k trees.]

Decision Making by CS4

• Not equal voting

Summary of Ensemble Classifiers

[Figure: compares Bagging, Random Forest, AdaBoost.M1, Randomization Trees, and CS4 under two labels: "Rules may not be correct when applied to training data" and "Rules correct".]

Any Questions?