Classification by search partition analysis: an alternative to tree methods in medical problems.
Roger Marshall, School of Population Health, University of Auckland, New Zealand ([email protected])
Why classification?
Uses:
to develop diagnostic/prognostic decision and classification rules
“discover” homogeneous subgroups, e.g. at risk in epidemiology, or who respond to treatment
Methods:
Regression methods (including neural networks) – model based
Trees – no model (unless the hierarchical tree is considered as such)
Empirical density methods – smoothers of parameter space
Support vector machines – find margins of maximum separation
Boolean classifiers (including SPAN, rough sets) – based on logical structures
Attractions of trees (in medicine):
regression models perceived as unrealistic
regression models based on arithmetic scores
trees demarcate individuals with “clusters” of characteristics
closer affinity to clinical reasoning
Feinstein (circa 1971):
“…. clinicians don’t think in terms of weighted averages of clinical variables … they think about demarcated subgroups of people who possess combinations of clinical attributes…..”
Suggested trees for “prognostic stratification”.
[Tree diagram: hierarchical splits on A, O, S, H and G, with high-risk terminal nodes A O, ^A O S, A ^O H and A ^O ^H G]
High risk of diabetes: A = age 45+, O = obese, H = hypertension, S = sedentary (Herman et al, Diabetes Care 1995).
[Tree diagram: hierarchical splits on H, C, P, U, V and B]
High-risk asthma (Lieu et al, Am J Respir Crit Care Med 1998): H = regularly hospitalized, C = cromolyn, B = beta agonist, P = regular prescriptions, U = urgent admissions, V = regular visits.
High risk if: (H C) or (^H P U) or (^H ^P B V).
Trees History:
1960s-70s: AID (Sonquist and Morgan), CHAID
1980s: CART, C4.5, “Machine Learning”
1990s: new ideas – bagging, boosting, Bayes, random forests
Software:
CART, S+, KnowledgeSeeker, C4.5 (C5?), SPSS AnswerTree, RECAMP, QUEST, CAL5, SAS macros, SAS Enterprise Miner, R rpart
Measure of best split: “Goodness of split”
e.g. statistical-test-based measures: chi-square (categorical y), t and F statistics (continuous y), log-rank for “survival trees”
e.g. decrease in “impurity” from splitting (CART)
e.g. likelihood (deviance) statistics (Ripley, S+)
CART ideas on impurity:
Binary outcome classes D1 and D2. Define an impurity measure i(t) of node t, with p = proportion of D1’s at node t.
e.g. Gini diversity impurity: i(t) = p(1 - p)
e.g. entropy measure: i(t) = -p log p - (1 - p) log(1 - p)
The decrease in impurity from splitting node t into left (L) and right (R) child nodes is
i(t) - pL i(tL) - pR i(tR)
where pL and pR are the proportions of node t’s cases sent to each child.
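A minimal Python sketch of these measures (function names are mine, not CART’s):

import math

def gini(p):
    # Gini diversity impurity i(t) = p(1 - p), maximal at p = 0.5.
    return p * (1 - p)

def entropy(p):
    # Entropy impurity i(t) = -p log p - (1 - p) log(1 - p); 0 at pure nodes.
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def split_gain(n_L, d_L, n_R, d_R, impurity=gini):
    # Decrease in impurity i(t) - pL i(tL) - pR i(tR) for a split sending
    # n_L cases (d_L of them in class D1) left and n_R (d_R in D1) right.
    n = n_L + n_R
    return (impurity((d_L + d_R) / n)
            - (n_L / n) * impurity(d_L / n_L)
            - (n_R / n) * impurity(d_R / n_R))

print(split_gain(60, 45, 40, 5))  # gain of one candidate split: 0.09375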
“Right size trees”
1. Stopping rules (subgroup sample size, P-values) – “pre-pruning”
2. Grow a big tree and prune – “post-pruning”
e.g. MCCP (minimum cost-complexity pruning); complexity c = number of terminal nodes
Use cross-validation to estimate prediction error (see the sketch after this list).
Many other methods
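The talk names MCCP but shows no code; as one concrete illustration (not the talk’s software), scikit-learn’s DecisionTreeClassifier implements minimal cost-complexity pruning, and cross-validation can choose the penalty:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Recover the pruning path of the fully grown tree: each ccp_alpha is the
# complexity penalty at which the next "weakest" subtree gets cut.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Estimate prediction error of each pruned tree by 5-fold cross-validation.
cv_scores = [cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a),
                             X, y, cv=5).mean()
             for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[int(np.argmax(cv_scores))]
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X, y)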
New methods/extensions:
Multivariate trees (multivariate splits)
Multivariate y
Survival trees (Segal, LeBlanc)
Bagging/boosting (Breiman)
Bayes trees (Buntine, Chipman)
Forests of Trees (Breiman)
Some Troubles with trees:
Net (main) effects not evaluated
Misleading decision rules
Rules hard to interpret
Simple rules probably missed
Tree itself as a model
[Diabetes tree repeated from earlier: high-risk terminal nodes A O, ^A O S, A ^O H and A ^O ^H G (Herman et al, Diabetes Care 1995)]
Misleading classification rules
High risk of diabetes:
(A O) or (^A O S) or (A ^O H) or (A ^O ^H G)
which is the same as
(A O) or (O S) or (A H) or (A G)
i.e. ^A and ^O are redundant.
e.g. you don’t need to be young (^A) in conjunction with O and S: obese, sedentary people aged 45+ are already captured by (A O).
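This kind of redundancy is easy to verify by brute force; a small Python sketch (illustrative, not part of SPAN) enumerating all 2^5 attribute patterns:

from itertools import product

tree_rule = lambda A, O, S, H, G: ((A and O) or (not A and O and S)
                                   or (A and not O and H)
                                   or (A and not O and not H and G))
simple_rule = lambda A, O, S, H, G: ((A and O) or (O and S)
                                     or (A and H) or (A and G))

# The two rules agree on every pattern, so ^A and ^O are indeed redundant.
assert all(tree_rule(*bits) == simple_rule(*bits)
           for bits in product([False, True], repeat=5))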
Simple rules may require complex trees (the “replication problem”)
e.g. (A B) or (C D), where A = age under 18, B = black, C = cigarette smoker and D = drinks alcohol, needs a tree with 7 nodes.
e.g. (A B C) or (D E F) or (G H K) needs a tree with 80 terminal nodes!
Positive attributes
Usually (in medicine at least) an attribute can be labelled in advance as either “positively” or “negatively” associated with y
e.g. obese, sedentary and hypertensive are “positive” for diabetes
e.g. smoking, old age and high cholesterol are “positive” for ischaemia
e.g. presence of an adverse gene
“Regular” classification rules
High risk is defined by combinations of “positive” attributes only
Tree rules not usually regular
(though occasionally may reduce to a regular rule,as in diabetes example).
[Asthma tree repeated from earlier (Lieu et al, Am J Respir Crit Care Med 1998): H = regularly hospitalized, C = cromolyn, B = beta agonist, P = regular prescriptions, U = urgent admissions, V = regular visits]
High risk if: (H C) or (^H P U) or (^H ^P B V). e.g. (^H P U) requires not being regularly hospitalized (^H), so the rule is not regular.
Tree model
Is the hierarchical tree “model” sensible?
Probably not
Even if it is……
…… does the process of recursive subdivision estimate the best tree?
Maybe?
These considerations suggest:
Why not consider non-hierarchical procedures?
and
Why not focus on regular combinations directly?
SPAN attempts to do both.
SPAN (Search Partition Analysis)
Generates regular decision rules of the form:
A = K1 or[and] K2 or[and] … Kq
where Ki is the conjunction [disjunction] of pi attributes.
These are binary partitions of the predictor space into A and its complement A-.
Non-hierarchical
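For concreteness, a minimal Python encoding of a rule in the disjunctive form (illustrative names, not SPAN’s internals): each Ki is a conjunction of required attribute values, and the rule is their disjunction.

def make_rule(conjunctions):
    # conjunctions: the K_i, each a list of (attribute, required_value) pairs.
    # x falls in A if it satisfies every term of some K_i.
    def in_A(x):
        return any(all(x[a] == v for a, v in K) for K in conjunctions)
    return in_A

# The simplified diabetes rule (A O) or (O S) or (A H) or (A G):
high_risk = make_rule([[("A", 1), ("O", 1)], [("O", 1), ("S", 1)],
                       [("A", 1), ("H", 1)], [("A", 1), ("G", 1)]])
print(high_risk({"A": 1, "O": 0, "S": 0, "H": 1, "G": 0}))  # True, via (A H)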
SPAN carries out a search to find the best possible combinations of attributes.
Unless the search is somehow limited, it becomes impossibly large!
There are 2^(2^m - 1) - 1 ways to form a binary partition on the cells defined by m binary attributes, e.g. about 2147 million for m = 5.
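A quick numerical check of this count (plain Python):

# Binary partitions of the 2^m cells formed by m binary attributes:
# 2^(2^m - 1) - 1 (halving for unordered {A, A-}, dropping the trivial one).
for m in range(1, 6):
    print(m, 2 ** (2 ** m - 1) - 1)
# m = 5 prints 2147483647, the "2147 million" on the slide.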
How SPAN limits extent of search:
• By restricting to a set of m attributes Tm = {X1, …, Xm}
• Typically m < 15; these may be the m “best” attributes
• By not allowing “mixed” combinations of the attributes in Tm
• By restricting the complexity of the Boolean expressions, i.e. the pi and q parameters
Attribute set Tm
If the set of m attributes
Tm = {X1, …, Xm}
consists of attributes labelled “positive”, SPAN will generate only regular partitions.
It is natural to consider the best-ranked attributes.
Extent of search for different parameters
Based on combinatoric formulae of the “lock and key” algorithm for generating partitions.
Iterated search procedure:
1. Set j = 1 and T = Tm.
2. Search over T; the best partition Aj defines a new attribute aj.
3. Set T = {Tm, aj}, increment j, and repeat the search.
4. Continue until Aj = Aj+1.
This produces a sequence of new attributes with increasingly better discrimination (no proof of this assertion!).
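A sketch of this loop in Python; search_best_partition is a hypothetical stand-in for SPAN’s lock-and-key search, not a published API:

def iterated_search(T_m, data, search_best_partition):
    # T_m: initial attribute set. search_best_partition(T, data) returns
    # the best binary partition over attributes T (hypothetical helper).
    T = list(T_m)
    previous = None
    while True:
        A_j = search_best_partition(T, data)   # best partition this round
        if A_j == previous:                    # stop when A_j = A_{j+1}
            return A_j
        T = list(T_m) + [A_j]   # treat the partition as a new attribute a_j
        previous = A_j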
Complexity penalising
To avoid “overfitting”, penalise “complex” Boolean expressions. How to measure complexity? By the number of subgroups (minus 1).
e.g. 3 subgroups in A and 2 in A-: complexity c = 3 + 2 - 1 = 4
Penalise the measure (e.g. entropy) G by using G - c
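As a worked version of the slide’s arithmetic (illustrative Python, names are mine):

def complexity(subgroups_in_A, subgroups_in_A_complement):
    # Complexity = total number of subgroups minus 1; the slide's example
    # has 3 subgroups in A and 2 in A-, giving c = 3 + 2 - 1 = 4.
    return subgroups_in_A + subgroups_in_A_complement - 1

def penalised(G, c):
    # Penalise a goodness-of-partition measure G (e.g. entropy) by G - c.
    return G - c

print(penalised(G=10.0, c=complexity(3, 2)))  # 10.0 - 4 = 6.0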
Extension to >2 ordinal states, e.g. categories 0, 1, 2:
Can find a binary partition A2 of {0, 1} v {2}
and also A1 of {0} v {1, 2}
[Diagram: nested binary partitions A1 and A2 over ordered categories C0, C1, C2]
Need to ensure A2 is a subset of A1, so constrain the search.
e.g. diabetes: 0 = none, 1 = impaired glucose tolerance, 2 = diabetes
A2 = (F U) or (F E T); A1 = (F U) or (F T) or (F H)
F, T and U denote positive fructosamine, triglyceride and urinary albumin tests; E is ethnic Polynesian.
It can be shown that A2 is a subset of A1 (see the check below).
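The nesting claim can again be checked by enumeration (illustrative Python mirroring the slide’s rules):

from itertools import product

A2 = lambda F, U, E, T, H: (F and U) or (F and E and T)
A1 = lambda F, U, E, T, H: (F and U) or (F and T) or (F and H)

# Every attribute pattern in A2 is also in A1, so A2 is a subset of A1.
assert all(A1(*bits)
           for bits in product([False, True], repeat=5) if A2(*bits))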
Comparisons of SPAN and other methods
• Lim, Loh and Shih (Machine Learning) compared 33 methods on 32 data sets
• Methods: 22 tree, 9 “statistical”, 2 neural networks
• 16 data sets (plus the same 16 with added noise)
• Seems to provide benchmarks for other methods
• I tried SPAN on the 24 two-class and three-class classification data sets
Data   # classes   SPAN error   LLS range (33 methods)
bcw    2           0.035        0.03-0.09
bcw+   2           0.035        0.03-0.08
bld    2           0.365        0.28-0.43
bld+   2           0.373        0.29-0.44
bos    3           0.236        0.221-0.314
bos+   3           0.236        0.225-0.422
cmc    3           0.449        0.43-0.60
cmc+   3           0.444        0.43-0.58
dna    3           0.075        0.05-0.38
dna+   3           0.075        0.04-0.38
hea    2           0.170        0.14-0.34
hea+   2           0.170        0.15-0.31
Data   # classes   SPAN error     LLS range (33 methods)
pid    2           0.251          0.22-0.31
pid+   2           0.252          0.22-0.32
smo    3           0.305 (0.44)   0.30-0.45
smo+   3           0.305 (0.44)   0.31-0.45
tae    3           0.510          0.325-0.693
tae+   3           0.701          0.445-0.696
thy    3           0.0134         0.005-0.89
thy+   3           0.0134         0.01-0.88
vot    2           0.044          0.04-0.06
vot+   2           0.044          0.04-0.07
wav    3           0.266          0.151-0.477
wav+   3           0.266          0.160-0.446
Data   POL    SPAN   QI0    LOG    LDA    IC0    RBF    ST0
bcw    19     7      2.5    5.5    12     23     5.5    30
bcw+   11     8      1.5    8      9      27     4      20
bld    3      26     7      9      18.5   20     24     10
bld+   1      25     26     17     15     7      18     6
bos    2.5    6      28     11.5   14.5   18.5   2.5    22.5
bos+   3      2      22.5   13     20     11     28     17.5
cmc    1      5      14     19     22     7.5    10     9
cmc+   1      4      17     18.5   22.5   5      15.5   9
dna    2      21     18     16     12     10     31     10
dna+   3      20     19     17     13.5   6      31     6
pid    16.5   31     5      11     1.5    16.5   11     16.5
pid+   1      30     2.5    7      4      13.5   15     30
hea    9      7.5    4.5    6      1      18     11.5   30
hea+   16     6      6      9      2      16     8      23
Data         POL     SPAN            QI0     LOG     LDA     IC0     RBF     ST0
smo          9.5     9.5 (32)        9.5     9.5     9.5     26      18.5    9.5
smo+         7.5     9.5 (32)        7.5     16.5    21      7.5     23.3    7.5
tae          20      19              10      11      6       3       13      30.5
tae+         16      11              9       4       2       20      10      32
thy          14.5    14.5            17      26      28      8       23      5.5
thy+         15      15              17.5    25      27      10      29      7.5
vot          25.5    9               1       21      15      17.5    25.5    21
vot+         16      5               21      5       16      26      21      19
wav          5       21              8.5     2       10.5    29      1       26
wav+         6       21              9       3.5     7.5     27.5    31      26
Mean rank    9.3     11.1 (12.9)     11.8    12.1    12.9    15.6    17.1    17.7
Mean error   0.210   0.200 (0.230)   0.219   0.215   0.216   0.223   0.249   0.247
Limitations/Criticisms
Multi-class problems difficult
“Data dredging”
Loss of information from “cutpoints” of continuous variables
Complexity penalising somewhat ad hoc
Computationally intensive unless the search is sensibly restricted
Not “black-box” – requires user judgements
Needs (temperamental!) SPAN software – no R implementation