Tree-Based Methods (V&R 9.1)
Demeke Kasaw, Andreas Nguyen, Mariana Alvaro
STAT 6601 Project
Overview of Tree-based Methods
What are they? How do they work? Examples…
Tree diagrams are common: they are a simple way to depict relationships in data.
Tree-based methods use this pictorial form to represent relationships between random variables.
Trees can be used for both classification and regression.
[Figure: classification tree, Presence of Surgery Complications vs. Patient Age and Treatment Start Date. Split labels: Start >= 8.5 months, Start >= 14.5, Age < 12 yrs, Sex = F. Leaves: Absent or Present.]
[Figure: regression tree, Time to Next Eruption vs. Length of Last Eruption. Split labels: Last Eruption < 3.0 min, Last Eruption < 4.1 min. Leaf values: 54.49, 76.83, 81.18.]
General Computation Issues and Unique Solutions
Over-fitting: When do we stop splitting? Stop generating new nodes when further splits yield only little improvement.
Evaluate the quality of the prediction: Prune the tree to select, ideally, the simplest tree that is still accurate.
Methods:
– Cross-validation: Apply the tree computed from one set of observations (learning sample) to another, completely independent set of observations (testing sample).
– V-fold cross-validation: Repeat the analysis with different randomly drawn samples from the data. Use the tree that shows the best average accuracy for cross-validated predicted classifications or predicted values.
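As a rough sketch of V-fold cross-validation, assuming the rpart package and R's built-in iris data (the fold assignment and the names fold, err and cv_error are illustrative, not from the slides), each fold is held out once as the testing sample while the tree is fit to the remaining learning sample:

```r
library(rpart)

set.seed(1)                      # make the random fold assignment repeatable
V <- 10                          # number of folds
# Randomly assign each observation to one of V folds of near-equal size
fold <- sample(rep(1:V, length.out = nrow(iris)))

err <- numeric(V)
for (v in 1:V) {
  learn <- iris[fold != v, ]     # learning sample
  test  <- iris[fold == v, ]     # testing sample
  fit   <- rpart(Species ~ ., data = learn)
  pred  <- predict(fit, newdata = test, type = "class")
  err[v] <- mean(pred != test$Species)   # misclassification rate in this fold
}
cv_error <- mean(err)            # average error over the V folds
cv_error
```

Note that rpart already does this internally (its xval setting), so a hand-written loop like this is mainly useful for seeing what the cross-validated error measures.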
Computational Details
Specify the criteria for predictive accuracy
– Minimum costs: lowest misclassification rate
– Case weights
Selecting splits
– Define a measure of impurity for a node. A node is “pure” if it contains observations of a single class.
Determine when to stop splitting
– Until all nodes are pure or contain no more than n cases
– Until all nodes contain no more than a specified fraction of objects
Selecting the “right-size” tree
– Test sample cross-validation
– V-fold cross-validation
– Tree selection after pruning: if there are several trees with costs close to the minimum, select the smallest (least complex)
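In rpart, stopping rules of this kind are set through rpart.control. The sketch below is only an illustration on the iris data: it compares the default rules with very loose ones (the names loose, fit_full and n_leaves are made up for this example).

```r
library(rpart)

# Default stopping rules: minsplit = 20, cp = 0.01
fit_default <- rpart(Species ~ ., data = iris)

# Very loose rules: try to split any node with at least 2 cases and
# accept a split however small its improvement (cp = 0)
loose    <- rpart.control(minsplit = 2, minbucket = 1, cp = 0)
fit_full <- rpart(Species ~ ., data = iris, control = loose)

# Count terminal nodes (leaves) in each tree
n_leaves <- function(fit) sum(fit$frame$var == "<leaf>")
c(default = n_leaves(fit_default), full = n_leaves(fit_full))
```

The loosely controlled tree is the kind of over-grown tree that pruning is meant to cut back.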
Computational Formulas
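Estimation of Accuracy in Classification Trees
– Resubstitution estimate
$$R(d) = \frac{1}{N}\sum_{n=1}^{N} X\big(d(x_n) \neq j_n\big)$$
d(x) is the classifier and j_n is the observed class of case n. The indicator X(·) equals 1 if the statement inside it is true and 0 if it is false, so R(d) is the proportion of misclassified cases.
Estimation of Accuracy in Regression Trees
– Resubstitution estimate
$$R(d) = \frac{1}{N}\sum_{i=1}^{N} \big(y_i - d(x_i)\big)^2$$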
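Both resubstitution estimates can be computed directly from a fitted tree. This is a minimal sketch, assuming the rpart and MASS packages and default rpart settings; the variable names are illustrative.

```r
library(rpart)
library(MASS)   # provides the cpus data

# Classification: proportion of cases where d(x_n) differs from j_n
fit_class <- rpart(Species ~ ., data = iris)
d_class   <- predict(fit_class, type = "class")   # d(x_n) on the training data
R_class   <- mean(d_class != iris$Species)

# Regression: mean squared difference between y_i and d(x_i)
fit_reg <- rpart(log(perf) ~ ., data = cpus[, 2:8])
y       <- log(cpus$perf)
d_reg   <- predict(fit_reg)                       # d(x_i) on the training data
R_reg   <- mean((y - d_reg)^2)

c(R_class = R_class, R_reg = R_reg)
```

Because the same data are used to fit and to evaluate the tree, these estimates are optimistic, which is why cross-validation is used to choose the tree size.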
Computational Formulas: Estimation of Node Impurity
Gini Index
$$g(t) = \sum_{j \neq i} p(j \mid t)\, p(i \mid t)$$
– Reaches zero when only one class is present at a node
– p(j|t): probability of category j at node t
Entropy or Information
$$-\sum_{k} p_{ik} \log p_{ik}$$
– p_ik: probability of class k at node i
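The two impurity measures are easy to check numerically. The helper functions below are a small sketch (gini and entropy are names chosen here, not rpart functions); each takes a vector of class probabilities at one node.

```r
# Gini index: sum over all pairs j != i of p(j|t) * p(i|t)
gini <- function(p) {
  P <- outer(p, p)          # all products p_j * p_i
  sum(P) - sum(diag(P))     # drop the j == i terms
}

# Entropy: -sum of p * log(p), treating 0 * log(0) as 0
entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log(p))
}

pure  <- c(1, 0, 0)         # only one class present at the node
mixed <- c(1, 1, 1) / 3     # three equally likely classes

c(gini_pure = gini(pure), gini_mixed = gini(mixed),
  entropy_pure = entropy(pure), entropy_mixed = entropy(mixed))
```

Both measures are zero for the pure node and largest when the classes are equally likely.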
Classification Tree Example: What species are these flowers?
[Figure labels: Sepal Length, Sepal Width, Petal Length, Petal Width; tree; Versicolor, Virginica, Setosa.]
Iris Classification Data
The iris dataset relates species to petal and sepal dimensions, reported in centimeters. Originally used by R.A. Fisher and E. Anderson for a discriminant analysis example.
The data are pre-packaged in the R datasets library and are available on DASYL.
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
6.7 3.0 5.0 1.7 versicolor
5.8 2.7 3.9 1.2 versicolor
7.3 2.9 6.3 1.8 virginica
5.2 4.1 1.5 0.1 setosa
4.4 3.2 1.3 0.2 setosa
Iris Classification: Method and Code
library(rpart)   # Load tree-fitting package
data(iris)       # Load iris data
# Let x = tree object fitting Species vs. all other
# variables in iris with 10-fold cross-validation
x = rpart(Species ~ ., iris, xval = 10)
# Plot tree diagram with uniform spacing,
# diagonal branches, a 10% margin, and a title
plot(x, uniform = TRUE, branch = 0, margin = 0.1, main = "Classification Tree\nIris Species by Petal and Sepal Length")
# Add labels to tree with final counts,
# fancy shapes, and blue text color
text(x, use.n = TRUE, fancy = TRUE, col = "blue")
Results:
Classification Tree: Iris Species by Petal and Sepal Length
Petal.Length < 2.45: setosa 50/0/0
Petal.Length >= 2.45:
  Petal.Width < 1.75: versicolor 0/49/5
  Petal.Width >= 1.75: virginica 0/1/45
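Classifying a new flower with the fitted tree only requires following the splits. As a sketch (the measurements below are illustrative values in centimeters, and the object names are chosen for this example), predict does this automatically:

```r
library(rpart)

fit <- rpart(Species ~ ., data = iris)

# Illustrative measurements (in cm) for one unidentified flower
flower <- data.frame(Sepal.Length = 6, Sepal.Width = 3.4,
                     Petal.Length = 4.5, Petal.Width = 1.6)

# With the tree shown above, this flower has Petal.Length >= 2.45
# and Petal.Width < 1.75, which leads to the versicolor leaf
pred <- predict(fit, newdata = flower, type = "class")
as.character(pred)
```

Only the two petal measurements are actually consulted; the sepal measurements do not appear in any split.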
The tree-based approach is much simpler than the alternative, linear discriminant analysis.
Classification with Cross-validation
                 True Group
Put into Group   setosa  versicolor  virginica
setosa               50           0          0
versicolor            0          48          1
virginica             0           2         49
Total N              50          50         50
N correct            50          48         49
Proportion        1.000       0.960      0.980
N = 150   N Correct = 147
Linear Discriminant Function for Groups
                 setosa  versicolor  virginica
Constant         -85.21      -71.75    -103.27
Sepal.Length      23.54       15.70      12.45
Sepal.Width       23.59        7.07       3.69
Petal.Length     -16.43        5.21      12.77
Petal.Width      -17.40        6.43      21.08
Identify this flower…
Sepal Length 6
Sepal Width 3.4
Petal Length 4.5
Petal Width 1.6
Scores, using coefficients rounded to whole numbers:
Setosa: -85 + 24*6 + 24*3.4 - 16*4.5 - 17*1.6 ≈ 41
Versicolor: -72 + 16*6 + 7*3.4 + 5*4.5 + 6*1.6 ≈ 80
Virginica: -103 + 12*6 + 4*3.4 + 13*4.5 + 21*1.6 ≈ 75
Since Versicolor has the highest score, we classify this flower as an Iris versicolor.
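The same score calculation can be written as a single matrix product. This sketch uses the unrounded coefficients from the linear discriminant function table, so the scores differ from the rounded hand calculation, but the winning species is the same. The names coefs, flower and scores are chosen for this example.

```r
# Coefficients copied from the linear discriminant function table
coefs <- rbind(
  Constant     = c(-85.21, -71.75, -103.27),
  Sepal.Length = c( 23.54,  15.70,   12.45),
  Sepal.Width  = c( 23.59,   7.07,    3.69),
  Petal.Length = c(-16.43,   5.21,   12.77),
  Petal.Width  = c(-17.40,   6.43,   21.08)
)
colnames(coefs) <- c("setosa", "versicolor", "virginica")

# The leading 1 multiplies the constant term
flower <- c(1, 6, 3.4, 4.5, 1.6)

scores <- drop(flower %*% coefs)   # one score per species
scores
names(which.max(scores))           # species with the highest score
```

Compared with the tree, every classification needs all four measurements and one score per group.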
Regression Tree Example
Software used: R, rpart package
Goal:
– Apply the regression tree method to the CPU data and predict the response variable, ‘performance’.
CPU Data
CPU performance of 209 different processors.
name syct mmin mmax cach chmin chmax perf
1 ADVISOR 32/60 125 256 6000 256 16 128 198
2 AMDAHL 470V/7 29 8000 32000 32 8 32 269
3 AMDAHL 470/7A 29 8000 32000 32 8 32 220
4 AMDAHL 470V/7B 29 8000 32000 32 8 32 172
5 AMDAHL 470V/7C 29 8000 16000 32 8 16 132
6 AMDAHL 470V/8 26 8000 32000 64 8 32 318
...
[Column group labels from the slide: Memory (kb), System Speed (mhz), Cache (kb), Channels, Performance Benchmark]
R Code
library(MASS); library(rpart); data(cpus); attach(cpus)
# Fit regression tree to data
cpus.rp <- rpart(log(perf) ~ ., cpus[, 2:8], cp = 0.001)
# Print and plot complexity parameter (cp) table
printcp(cpus.rp); plotcp(cpus.rp)
# Prune and display tree
cpus.rp <- prune(cpus.rp, cp = 0.0055)
plot(cpus.rp, uniform = TRUE, main = "Regression Tree")
text(cpus.rp, digits = 3)
# Plot residual vs. predicted
plot(predict(cpus.rp), resid(cpus.rp)); abline(h = 0)
Determine the Best Complexity Parameter (cp) Value for the Model
          CP  nsplit  rel error   xerror      xstd
 1  0.5492697      0    1.00000  1.00864  0.096838
 2  0.0893390      1    0.45073  0.47473  0.048229
 3  0.0876332      2    0.36139  0.46518  0.046758
 4  0.0328159      3    0.27376  0.33734  0.032876
 5  0.0269220      4    0.24094  0.32043  0.031560
 6  0.0185561      5    0.21402  0.30858  0.030180
 7  0.0167992      6    0.19546  0.28526  0.028031
 8  0.0157908      7    0.17866  0.27781  0.027608
 9  0.0094604      9    0.14708  0.27231  0.028788
10  0.0054766     10    0.13762  0.25849  0.026970
11  0.0052307     11    0.13215  0.24654  0.026298
12  0.0043985     12    0.12692  0.24298  0.027173
13  0.0022883     13    0.12252  0.24396  0.027023
14  0.0022704     14    0.12023  0.24256  0.027062
15  0.0014131     15    0.11796  0.24351  0.027246
16  0.0010000     16    0.11655  0.24040  0.026926
Column notes: CP = complexity parameter; nsplit = # splits; rel error = 1 - R^2; xerror = cross-validated error; xstd = cross-validated error SD.
[Figure: plotcp output. X-val Relative Error (y-axis) vs. cp (bottom axis) and size of tree (top axis).]
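The cp table is stored in the fitted object, so a cp value can also be picked programmatically. The sketch below assumes the same cpus model; the one-standard-error rule it applies is a common convention that is not described in the slides, and the names cpt, best, limit and one_se are illustrative.

```r
library(MASS)    # cpus data
library(rpart)

set.seed(1)      # cross-validated errors depend on random folds
fit <- rpart(log(perf) ~ ., data = cpus[, 2:8], cp = 0.001)

cpt  <- fit$cptable
best <- which.min(cpt[, "xerror"])          # row with the smallest xerror

# One-standard-error rule (an assumption here, not from the slides):
# the smallest tree whose xerror is within one xstd of the minimum
limit  <- cpt[best, "xerror"] + cpt[best, "xstd"]
one_se <- min(which(cpt[, "xerror"] <= limit))

pruned <- prune(fit, cp = cpt[one_se, "CP"])
cpt[c(one_se, best), c("CP", "nsplit", "xerror")]
```

Because the folds are random, the selected row can change from run to run, so the result is best read alongside the plotcp display rather than taken as final.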
Regression Tree Before Pruning
[Figure: regression tree with 16 splits and 17 leaves. Split labels: cach< 27, mmax< 6100, mmax< 1750, mmax< 2500, chmax< 4.5, syct< 110, syct>=360, chmin< 5.5, cach< 0.5, chmin>=1.5, mmax< 1.4e+04, mmax< 2.8e+04, cach< 96.5, mmax< 1.124e+04, chmax< 14, cach< 56. Leaf values: 2.51, 3.05, 3.12, 3.26, 3.54, 2.95, 3.52, 3.89, 4.04, 4.31, 4.55, 4.21, 4.69, 5.14, 5.35, 5.22, 6.14.]
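Regression Tree After Pruning
[Figure: regression tree with 10 splits and 11 leaves. Split labels: cach< 27, mmax< 6100, mmax< 1750, syct>=360, chmin< 5.5, cach< 0.5, mmax< 2.8e+04, cach< 96.5, mmax< 1.1e+04, cach< 56. Leaf values: 2.51, 3.29, 2.95, 3.52, 4.03, 4.55, 4.21, 4.92, 5.35, 5.22, 6.14.]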
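The effect of pruning can be checked by counting leaves before and after. This is a small sketch assuming the same model and cp values as the R code for this example; n_leaves is a helper written for the illustration, and the exact counts may depend on the rpart version.

```r
library(MASS)    # cpus data
library(rpart)

full   <- rpart(log(perf) ~ ., data = cpus[, 2:8], cp = 0.001)
pruned <- prune(full, cp = 0.0055)

# Leaves are the rows of the frame whose split variable is "<leaf>"
n_leaves <- function(fit) sum(fit$frame$var == "<leaf>")
c(before = n_leaves(full), after = n_leaves(pruned))

# Each leaf value is the mean of log(perf) for the cases in that leaf
leaf_values <- pruned$frame$yval[pruned$frame$var == "<leaf>"]
round(sort(leaf_values), 2)
```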
Summary
Advantages of C&RT
Simplicity of results:
– The interpretation of results summarized in a tree is very simple.
– This simplicity is useful for rapid classification of new observations.
– It is much easier to evaluate just one or two logical conditions than to compute a classification score for each group.
Tree methods are nonparametric and nonlinear:
– There is no implicit assumption that the underlying relationships between the predictor variables and the dependent variable are linear or follow some specific non-linear link function.
References
Venables, W. N. and Ripley, B. D. (2002), Modern Applied Statistics with S, 251-266.
StatSoft (2003), “Classification and Regression Trees”, Electronic Textbook, StatSoft, retrieved on 11/8/2004 from http://www.statsoft.com/textbook/stcart.html
Fisher, R. A. (1936), “The use of multiple measurements in taxonomic problems”, Annals of Eugenics, 7, Part II, 179-188.
Using Trees in R (the 30 second version)
1) Load the rpart library
library(rpart)
2) For classification trees, make sure the response is of type factor. If you don’t know how to do this, look up help(as.factor) or consult a general R reference.
y = as.factor(y)
3) Fit the tree model
f = rpart(y ~ x1 + x2 + …, data = …, cp = 0.001)
If the variables are in a data frame that is not attached, you must specify data. If using global variables, then data= can be omitted. The value cp = 0.001 is a good starting point for cp, which controls the complexity of the tree.
4) Plot and check the model
plot(f, uniform = T, margin = 0.1); text(f, use.n = T)
plotcp(f); printcp(f)
Look at the xerror column in the summary and choose the smallest number of splits that achieves the smallest xerror. Consider the tradeoff between model fit and complexity (i.e., overfitting). Based on your judgement, repeat step 3 with the cp value of your choice.
5) Predict results
predict(f, newdata, type = "class")
where newdata is a data frame with the independent variables.
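For a version of these five steps that runs as-is, the sketch below uses the kyphosis data that ships with rpart. The choice of dataset and the values in newdata are assumptions made for illustration; they are not part of the recipe.

```r
# 1) Load the rpart library
library(rpart)

# 2) Make sure the response is a factor (Kyphosis already is)
kyphosis$Kyphosis <- as.factor(kyphosis$Kyphosis)

# 3) Fit the tree model, starting from a small cp
f <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, cp = 0.001)

# 4) Check the model (plot(f); text(f); plotcp(f) draw the pictures)
printcp(f)

# 5) Predict the class of two made-up patients
newdata <- data.frame(Age = c(12, 100), Number = c(3, 5), Start = c(15, 5))
pred <- predict(f, newdata, type = "class")
pred
```

The printcp output is where the xerror column in step 4 is read before refitting or pruning with a chosen cp.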