
Classification by search partition analysis: an alternative to tree methods in medical problems.

Roger Marshall, School of Population Health, University of Auckland, New Zealand ([email protected])

Why classification?

Uses:

to develop diagnostic/prognostic decision and classification rules

“discover” homogeneous subgroups, e.g. groups at risk in epidemiology, or groups who respond to treatment

Methods:

Regression methods (including neural networks) – model based

Trees – no model (unless hierarchical tree considered as such)

Empirical density methods – smoothers of parameter space

Support Vector machines – find margins of maximum separation

Boolean classifiers (including SPAN, rough sets) – based on logical structures

Attractions of trees (in medicine):

regression models perceived as unrealistic

regression models based on arithmetic scores

trees demarcate individuals with “clusters” of characteristics

closer affinity to clinical reasoning

Feinstein (circa 1971):

“….clinicians don’t think in terms of weighted averages of clinical variables … they think about demarcated subgroups of people who possess combinations of clinical attributes…..”

Suggested trees for “prognostic stratification”.

[Tree diagram] High risk of diabetes: A=age 45+, O=obese, H=hypertension, S=sedentary (Herman et al, Diabetes Care 1995). Terminal high-risk nodes: AO, A^OH, A^O^HG, ^AOS.

[Tree diagram] High risk asthma (Lieu et al, Am J Respir Crit Care Med, 1998): H=regularly hospitalized, C=cromolyn, B=beta agonist, P=regular prescriptions, U=urgent admissions, V=regular visits.

High risk if: (H C) or (^H P U) or (^H ^P B V).

Trees History:

1960–70s: AID (Sonquist and Morgan), CHAID; 1980s: CART, C4.5, “machine learning”; 1990s onward: new ideas – bagging, boosting, Bayes trees, random forests

Software:

CART, S+, KnowledgeSeeker, C4.5 (C5?), SPSS AnswerTree, RECAMP, QUEST, CAL5, SAS macros, SAS Enterprise Miner, R rpart

Measure of best split: “Goodness of split”

e.g. statistical-test-based measures: chi-square (categorical y), t and F statistics (continuous y), log-rank for “survival trees”

e.g. decrease in “impurity” on splitting (CART)

e.g. likelihood (deviance) statistics (Ripley, S+)

CART ideas on impurity:

Binary outcome classes D1 and D2. Define an impurity measure i(t) for node t, where p is the proportion of D1 cases at node t.

e.g. Gini diversity impurity: i(t) = p(1 - p)

e.g. entropy measure: i(t) = -p log p - (1 - p) log(1 - p)

Change in impurity on splitting node t into left (L) and right (R) child nodes:

i(t) - p_L i(t_L) - p_R i(t_R)

where p_L and p_R are the proportions of cases sent to the left and right children.

e.g. relative to its maximum, entropy deems a node with p = 0.25 more “impure” than Gini does, as the sketch below shows.
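A minimal sketch (added for illustration, not from the talk) of the two impurity measures and the impurity decrease for a binary split; the node proportions below are made up.

```python
import math

def gini(p):
    # Gini diversity impurity i(t) = p(1 - p) for a two-class node
    return p * (1 - p)

def entropy(p):
    # Entropy impurity i(t) = -p log p - (1 - p) log(1 - p); 0 log 0 is taken as 0
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def impurity_decrease(i, p_parent, p_left, p_right, w_left):
    # i(t) - p_L i(t_L) - p_R i(t_R), where w_left is the fraction of cases sent left
    return i(p_parent) - w_left * i(p_left) - (1 - w_left) * i(p_right)

# Relative to its maximum at p = 0.5, entropy rates a node with p = 0.25 as more
# impure than Gini does (about 0.81 of the maximum versus 0.75)
print(gini(0.25) / gini(0.5), entropy(0.25) / entropy(0.5))

# Illustrative split: parent p = 0.5, children p = 0.2 and p = 0.8, half the cases each side
print(impurity_decrease(gini, 0.5, 0.2, 0.8, 0.5))
```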

“Right size trees”

1. Stopping rules (subgroup sample size, P-values) (pre-pruning)

2. Grow big tree and prune (post pruning)

e.g. MCCP (minimum cost-complexity pruning), where complexity c = number of terminal nodes.

Use cross-validation to estimate prediction error (see the sketch after this list).

Many other methods
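As a concrete sketch of “grow a big tree and prune” with cross-validated error, the snippet below uses scikit-learn's minimal cost-complexity pruning on synthetic data; it is only an illustration of the idea, not the software used in the talk.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; in practice X, y would be the clinical attributes and outcome
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Grow an unpruned tree and obtain the sequence of cost-complexity penalties (alphas)
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Estimate prediction error by 10-fold cross-validation for each pruned subtree
scores = [
    (alpha, cross_val_score(
        DecisionTreeClassifier(ccp_alpha=alpha, random_state=0), X, y, cv=10).mean())
    for alpha in path.ccp_alphas
]
best_alpha, best_accuracy = max(scores, key=lambda s: s[1])
print(best_alpha, 1 - best_accuracy)  # chosen penalty and its cross-validated error
```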

New methods/extensions:

Multivariate trees (multivariate splits)

Multivariate y

Survival trees (Segal, LeBlanc)

Bagging/boosting (Breiman)

Bayes trees (Buntine, Chipman)

Forests of Trees (Breiman)

Some Troubles with trees:

Net (main) effects not evaluated

Misleading decision rules

Rules hard to interpret

Simple rules probably missed

Tree itself as a model

[Tree diagram repeated] High risk of diabetes: A=age 45+, O=obese, H=hypertension, S=sedentary (Herman et al, Diabetes Care 1995). Terminal high-risk nodes: AO, A^OH, A^O^HG, ^AOS.

Misleading classification rules

High risk of diabetes:

(A O) or (^A O S) or (A ^O H) or (A ^O ^H G)

which is the same as

(A O) or (O S) or (A H) or (A G)

i.e. ^A, ^O are redundant.

e.g. one doesn't need to be young (^A) in conjunction with O and S; O and S alone suffice.
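A brute-force truth-table check (added here for illustration) confirms the two rules are logically equivalent:

```python
from itertools import product

def tree_rule(A, O, S, H, G):
    # (A O) or (^A O S) or (A ^O H) or (A ^O ^H G)
    return (A and O) or (not A and O and S) or (A and not O and H) \
        or (A and not O and not H and G)

def simplified_rule(A, O, S, H, G):
    # (A O) or (O S) or (A H) or (A G)
    return (A and O) or (O and S) or (A and H) or (A and G)

# Check every combination of the five binary attributes
assert all(tree_rule(*v) == simplified_rule(*v)
           for v in product([False, True], repeat=5))
print("the tree rule and the simplified rule are equivalent")
```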

Simple rules may require complex trees (“replication problem”)

E.g. (A B) or (C D), where A=age under 18, B=black, C=cigarette smoker and D=drinks alcohol, needs a tree with 7 nodes.

E.g. (A B C) or (D E F) or (G H K)

needs a tree with 80 terminal nodes!

[Tree diagram] Tree for (A B) or (C D), with terminal high-risk nodes AB, A^BCD and ^ACD.

Positive attributes

Usually (in medicine at least) an attribute can be considered in advance to be either “positively” or “negatively” associated with y

e.g. obese, sedentary, hypertensive are “positive” for diabetes

e.g. smoking, older age, high cholesterol are “positive” for ischaemia

e.g. presence of an adverse gene

“Regular” classification rules

Combinations of “positive” attributes only are used to define high risk

Tree rules not usually regular

(though occasionally a tree rule may reduce to a regular rule, as in the diabetes example).

[Tree diagram repeated] High risk asthma (Lieu et al, Am J Respir Crit Care Med, 1998): H=regularly hospitalized, C=cromolyn, B=beta agonist, P=regular prescriptions, U=urgent admissions, V=regular visits.

High risk if: (H C) or (^H P U) or (^H ^P B V). This rule is not regular: e.g. (^H P U) requires not being regularly hospitalized.

Tree model

Is the hierarchical tree “model” sensible?

Probably not

Even if it is……

…… does the process of subdivision estimate the best tree?

Maybe?

These considerations suggest:

Why not consider non-hierarchical procedures?

and

Why not focus on regular combinations directly?

SPAN attempts to do both.

SPAN (Search Partition Analysis)

Generates regular decision rules of the form:

A = K1 or[and] K2 or[and] … Kq

where Ki is the conjunction [disjunction] of p_i attributes.

Binary partitions of the predictor space into A and A-.

Non-hierarchical
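For illustration only, a regular rule of this form can be written as a simple predicate over dichotomised “positive” attributes; the attribute names, cut-points and the rule below are invented, not SPAN output.

```python
def to_attributes(record):
    # Dichotomise predictors at pre-chosen cut-points, coded in the "positive" direction
    # (these names and thresholds are illustrative assumptions)
    return {
        "A": record["age"] >= 45,
        "O": record["bmi"] >= 30,
        "H": record["systolic_bp"] >= 140,
        "S": record["exercise_hours_per_week"] < 1,
    }

def high_risk(record):
    # A regular rule A = K1 or K2 or K3: each Ki is a conjunction of positive attributes only
    x = to_attributes(record)
    return (x["A"] and x["O"]) or (x["O"] and x["S"]) or (x["A"] and x["H"])

print(high_risk({"age": 52, "bmi": 33, "systolic_bp": 128,
                 "exercise_hours_per_week": 3}))  # True: aged 45+ and obese
```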

Example: a SPAN rule for detecting malignant cells from cell characteristics (bcw data).

SPAN carries out a search to find the best possible combinations of attributes.

Unless the search is somehow limited, it becomes impossibly large!

There are 2^(2^m - 1) - 1 ways to form a binary partition on m attributes, e.g. about 2147 million for m = 5.
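A quick check of that count, added for illustration:

```python
def n_binary_partitions(m):
    # Number of distinct binary partitions definable on m binary attributes: 2^(2^m - 1) - 1
    return 2 ** (2 ** m - 1) - 1

for m in range(1, 6):
    print(m, n_binary_partitions(m))
# m = 5 gives 2,147,483,647, i.e. roughly 2147 million as quoted above
```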

How SPAN limits extent of search:

• By restricting to a set of m attributes Tm = {X1, …, Xm}

• Typically m < 15. These may be the m “best” attributes

• By not allowing “mixed” combinations of the attributes in Tm

• By restricting the complexity of the Boolean expressions, i.e. the p_i and q parameters

Attribute set Tm

If the set of m attributes Tm = {X1, …, Xm} consists of attributes labelled “positive”, SPAN will generate only regular partitions.

Natural to consider the best ranked attributes

[Plot] Ranked plot of attributes: GI cancer and tumour markers

Extent of search for different parameters

Based on combinatoric formulae of the “lock and key” algorithm for generating partitions.

Iterated search procedure:

1. Set j = 1 and T = Tm.
2. Search over T; the best partition Aj defines a new attribute aj.
3. Set T = {Tm, aj} and j = j + 1.
4. Continue until Aj = Aj+1.

Produces a sequence of new attributes with increasingly better discrimination (no proof of this assertion!).

[Plot] SPAN rank plot: a_1 … a_5 are the partition attributes on 5 iterations (hea data)
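A schematic sketch of the loop (added for illustration); best_partition stands in for SPAN's actual search over the current attribute set, which the slides do not specify.

```python
def iterated_search(base_attributes, data, best_partition):
    # base_attributes: the set Tm; best_partition: a hypothetical stand-in for SPAN's
    # search, returning the best Boolean partition over the supplied attribute set
    T = list(base_attributes)          # j = 1, T = Tm
    previous = None
    while True:
        A_j = best_partition(T, data)  # best partition A_j over the current set T
        if A_j == previous:            # stop when A_j = A_{j+1}
            return A_j
        T = list(base_attributes) + [A_j]  # T = {Tm, a_j}: the partition becomes a new attribute
        previous = A_j
```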

Complexity penalising

To avoid “overfitting”, penalise “complex” Boolean expressions. How to measure complexity? By the number of subgroups (minus 1).

e.g. 3 subgroups in A and 2 in A-: complexity c = 3 + 2 - 1 = 4

Penalise the measure (e.g. entropy) G by using G - c.
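For illustration, the complexity count and penalty from this example can be written as:

```python
def complexity(n_subgroups_in_A, n_subgroups_in_A_complement):
    # Complexity = total number of subgroups (conjunctions) across A and A-, minus 1
    return n_subgroups_in_A + n_subgroups_in_A_complement - 1

def penalised(G, c):
    # Penalise a goodness measure G (e.g. entropy-based) by subtracting the complexity
    return G - c

c = complexity(3, 2)          # 3 subgroups in A, 2 in A-  ->  c = 4
print(c, penalised(10.0, c))  # the value G = 10.0 is purely illustrative
```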

Visualising subgroups

Extension to more than 2 ordinal states, e.g. categories 0, 1, 2.

Can find a binary partition A2 of {0, 1} vs {2}

and also A1 of {0} vs {1, 2}.

[Diagram] Nested binary partitions A1 and A2 over the ordered categories C0, C1, C2.

Need to ensure A2 is a subset of A1, so the search is constrained.

E.g. diabetes: 0=none, 1=impaired glucose tolerance, 2=diabetes.

A2 = (F U) or (F E T); A1 = (F U) or (F T) or (F H)

F, T and U denote positive fructosamine, triglyceride and urinary albumin tests; E is ethnic Polynesian.

It can be shown that A2 is a subset of A1.
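A brute-force check of the nesting, added for illustration:

```python
from itertools import product

def A2(F, T, U, E, H):
    # A2 = (F U) or (F E T)
    return (F and U) or (F and E and T)

def A1(F, T, U, E, H):
    # A1 = (F U) or (F T) or (F H)
    return (F and U) or (F and T) or (F and H)

# Every attribute combination classified into A2 must also fall in A1
assert all(A1(*v) for v in product([False, True], repeat=5) if A2(*v))
print("A2 is a subset of A1")
```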

Comparisons of SPAN and other methods

• Lim, Loh and Shih (Machine Learning) compared 33 methods on 32 data sets

• Methods – 22 tree, 9 “statistical”, 2 neural networks

• 16 data sets (plus 16 with added noise)

• Seems to provide benchmarks for other methods

• I tried SPAN on the 24 2-state and 3-state classification data sets

Data   # classes   SPAN error     LLS range of 33 methods
bcw    2           0.035          0.03-0.09
bcw+               0.035          0.03-0.08
bld    2           0.365          0.28-0.43
bld+               0.373          0.29-0.44
bos    3           0.236          0.221-0.314
bos+               0.236          0.225-0.422
cmc    3           0.449          0.43-0.60
cmc+               0.444          0.43-0.58
dna    3           0.075          0.05-0.38
dna+               0.075          0.04-0.38
hea    2           0.170          0.14-0.34
hea+               0.170          0.15-0.31
pid    2           0.251          0.22-0.31
pid+               0.252          0.22-0.32
smo    3           0.305 (0.44)   0.30-0.45
smo+               0.305 (0.44)   0.31-0.45
tae    3           0.510          0.325-0.693
tae+               0.701          0.445-0.696
thy    3           0.0134         0.005-0.89
thy+               0.0134         0.01-0.88
vot    2           0.044          0.04-0.06
vot+               0.044          0.04-0.07
wav    3           0.266          0.151-0.477
wav+               0.266          0.160-0.446

Data   POL    SPAN         QI0    LOG    LDA    IC0    RBF    ST0
bcw    19     7            2.5    5.5    12     23     5.5    30
bcw+   11     8            1.5    8      9      27     4      20
bld    3      26           7      9      18.5   20     24     10
bld+   1      25           26     17     15     7      18     6
bos    2.5    6            28     11.5   14.5   18.5   2.5    22.5
bos+   3      2            22.5   13     20     11     28     17.5
cmc    1      5            14     19     22     7.5    10     9
cmc+   1      4            17     18.5   22.5   5      15.5   9
dna    2      21           18     16     12     10     31     10
dna+   3      20           19     17     13.5   6      31     6
pid    16.5   31           5      11     1.5    16.5   11     16.5
pid+   1      30           2.5    7      4      13.5   15     30
hea    9      7.5          4.5    6      1      18     11.5   30
hea+   16     6            6      9      2      16     8      23
smo    9.5    9.5 (32)     9.5    9.5    9.5    26     18.5   9.5
smo+   7.5    9.5 (32)     7.5    16.5   21     7.5    23.3   7.5
tae    20     19           10     11     6      3      13     30.5
tae+   16     11           9      4      2      20     10     32
thy    14.5   14.5         17     26     28     8      23     5.5
thy+   15     15           17.5   25     27     10     29     7.5
vot    25.5   9            1      21     15     17.5   25.5   21
vot+   16     5            21     5      16     26     21     19
wav    5      21           8.5    2      10.5   29     1      26
wav+   6      21           9      3.5    7.5    27.5   31     26

Mean rank    9.3     11.1 (12.9)    11.8    12.1    12.9    15.6    17.1    17.7
Mean error   0.210   0.200 (0.230)  0.219   0.215   0.216   0.223   0.249   0.247

Limitations/Criticisms

Multi-class problems difficult

“Data dredging”

Loss of information from “cutpoints” on continuous variables

Complexity penalising somewhat ad hoc

Computationally intensive, unless search sensibly restricted

Not “black-box” – requires user judgements

Needs (temperamental!) SPAN software – no R algorithms

Conclusion

Despite their popularity, trees have weaknesses that stem from their hierarchical structure.

SPAN offers a non-hierarchical alternative.

SPAN generally performs as well as, or better than, trees.

It offers decision rules that are generally easy to understand.