Classification and Regression Trees · Deviance/Entropy/Information Criterion
TRANSCRIPT
Construction of Tree
• Starting with the root, find the best split at each node using an exhaustive search: for each variable Xi, generate all possible splits, compute the homogeneity of the resulting subsets, and select the best split over all variables
Gini Index
• Probabilistic view: for each node i we have class probabilities p_{ik} (with sample size n_i)
• Definition: G(i) = \sum_{k=1}^{K} p_{ik}(1 - p_{ik}), with p_{ik} = \frac{1}{n_i} \sum_{j=1}^{n_i} I(Y_j = k)
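As a sketch (not from the slides; the function name is mine), the node-level Gini index can be computed directly from the class labels falling into a node:

```python
from collections import Counter

def gini(labels):
    """Gini index of one node: G(i) = sum_k p_ik * (1 - p_ik),
    where p_ik is the relative frequency of class k in node i."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())
```

For a node with nine on-time and one delayed flight this gives 2 · 0.9 · 0.1 = 0.18; a pure node gives 0.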
Deviance/Entropy/Information Criterion
• Entropy: for node i, E(i) = -\sum_{k=1}^{K} p_{ik} \log p_{ik}
• Deviance: D(i) = -2 \sum_{k=1}^{K} n_{ik} \log p_{ik}
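A minimal sketch of the two formulas above (function names are my own; since n_{ik} = n_i p_{ik}, the deviance equals 2 n_i E(i)):

```python
import math
from collections import Counter

def entropy(labels):
    """E(i) = -sum_k p_ik * log(p_ik), natural log; absent classes contribute 0."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def deviance(labels):
    """D(i) = -2 * sum_k n_ik * log(p_ik), which equals 2 * n_i * E(i)."""
    return 2 * len(labels) * entropy(labels)
```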
Find best split
• For a split at X = x_0, we get subsets Y_1 and Y_2 of lengths n_1 and n_2, respectively:
g(Y) = 1 - \sum_{i=1}^{k} p_i^2
g(Y \mid X = x_0) = \frac{1}{n_1 + n_2}\left(n_1 g(Y_1) + n_2 g(Y_2)\right)
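The exhaustive split search can be sketched as follows (a toy version for a single numeric predictor; function names are assumptions, not the slides'):

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a set of class labels."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def best_split(x, y):
    """Try a cut between every pair of consecutive distinct x values and
    return (cut, score) minimizing the weighted impurity
    g(Y | X = x0) = (n1 * g(Y1) + n2 * g(Y2)) / (n1 + n2)."""
    pairs = sorted(zip(x, y))
    xs = [a for a, _ in pairs]
    ys = [b for _, b in pairs]
    n = len(ys)
    best_cut, best_score = None, float("inf")
    for j in range(1, n):
        if xs[j] == xs[j - 1]:
            continue  # no cut possible between tied x values
        score = (j * gini(ys[:j]) + (n - j) * gini(ys[j:])) / n
        if score < best_score:
            best_cut, best_score = (xs[j - 1] + xs[j]) / 2, score
    return best_cut, best_score
```

Running this per variable and keeping the overall winner is the "best split for best variable" step of the construction.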
Formula recap:
• Marginal, conditional, and joint densities: f(happy) and f(sex); f(happy | sex); f(sex | happy); f(sex, happy); with f(x, y) = f(x \mid y) f(y)
• Mean squared error: \frac{1}{n} \sum_i (y_i - \hat{y}_i)^2
• Log odds ratios: \log \theta_{ij} = \log \frac{m_{i+1,j+1} m_{ij}}{m_{i+1,j} m_{i,j+1}} = \ldots = \beta_{i+1} - \beta_i, and \log \theta_{ij} = \log \frac{m_{i+1,j+1} m_{ij}}{m_{i+1,j} m_{i,j+1}} = \ldots = \beta
• Degrees of freedom: k = \frac{(s_1^2 + s_2^2)^2}{s_1^4/(n_1 - 1) + s_2^4/(n_2 - 1)}
• Test statistics: \frac{\bar{X}_1 - \bar{X}_2 - d}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}, \quad \frac{\bar{X} - \mu_0}{s/\sqrt{n}}, \quad \frac{\hat{p} - p_0}{\sqrt{p(1 - p)/n}}, \quad \frac{\hat{p}_1 - \hat{p}_2 - d}{\sqrt{\hat{p}_1(1 - \hat{p}_1)/n_1 + \hat{p}_2(1 - \hat{p}_2)/n_2}}
• Log-linear association terms: \lambda_{ij}^{XY} = \beta u_i v_j and \lambda_{ij}^{XY} = \beta_i v_j
[Figure: classification tree for flight delays. Root split: Distance < 2459; child splits: Distance >= 728 and Distance >= 4228; leaf proportions of delayed flights: 0.03486, 0.05107, 0, 0.5]
[Figure: histograms of Distance by Delayed status (FALSE/TRUE), and the corresponding conditional proportions of delayed flights over Distance]
Gini Index of Homogeneity
[Figure: Gini homogeneity gain ("ginis") as a function of the Distance split point]
• Split the data at the maximum, then repeat with each subset
Some Stopping Rules
• Nodes are homogeneous “enough”
• Nodes are small (ni < 20 in rpart, ni < 10 in tree)
• “Elbow” criterion: gain in homogeneity levels out
• Minimize Error (e.g. cross-validation)
• Minimize a cost-complexity measure: R_a = R + a · size, where R is the homogeneity measure evaluated at the leaves, size is the number of leaves, and a > 0 is a real-valued penalty for complexity
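The cost-complexity measure can be sketched like this (using the case-weighted sum of leaf Gini values as the homogeneity measure R; this choice and the function names are assumptions, not the slides'):

```python
from collections import Counter

def gini(labels):
    """Gini impurity of one leaf's class labels."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def cost_complexity(leaves, alpha):
    """R_a = R + a * size: R is the case-weighted sum of leaf Gini values,
    size the number of leaves, alpha > 0 the complexity penalty."""
    n = sum(len(leaf) for leaf in leaves)
    R = sum(len(leaf) / n * gini(leaf) for leaf in leaves)
    return R + alpha * len(leaves)
```

Larger alpha favors smaller trees: a pure two-leaf tree scores 2·alpha, while collapsing it into one mixed leaf trades impurity against one unit of size.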
Diagnostics
• Prediction error (in training/testing scheme)
• misclassification matrix (for categorical response)
• loss matrix: adjust according to risk. Are all types of misclassification equally bad? In the binary situation: is a false positive as bad as a false negative?
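A minimal sketch of a misclassification matrix and a loss-weighted error rate (the loss values below are made up for illustration):

```python
def confusion(actual, predicted, labels):
    """Misclassification matrix: rows index the actual class,
    columns the predicted class."""
    m = {a: {p: 0 for p in labels} for a in labels}
    for a, p in zip(actual, predicted):
        m[a][p] += 1
    return m

def mean_loss(actual, predicted, loss):
    """Average loss per case; loss[(actual, predicted)] lets a false
    negative cost more than a false positive (0 for correct calls)."""
    return sum(loss.get((a, p), 0) for a, p in zip(actual, predicted)) / len(actual)
```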
Random Forests
• Breiman (2001), Breiman & Cutler (2004)
• Tree Ensemble built by randomly sampling cases and variables
• Each case classified once for each tree in the ensemble
• Overall values determined by ‘voting’ for category or (weighted) averaging in case of continuous response
• Random Forests apply a Bootstrap Aggregating Technique
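The case-resampling and voting steps of bootstrap aggregating can be sketched as follows (the "classifiers" here are stand-ins for fitted trees; per-split variable subsampling is omitted):

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Each tree in the ensemble is grown on n cases drawn with replacement."""
    return [rng.choice(data) for _ in data]

def vote(classifiers, x):
    """Ensemble prediction by 'voting': each tree casts one vote for a
    category; for a continuous response one would (weighted-)average instead."""
    return Counter(clf(x) for clf in classifiers).most_common(1)[0][0]
```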
Bootstrapping
• Efron 1982
• ‘Pull ourselves out of the swamp by our shoe laces (= bootstraps)’
• We have one dataset, which gives us one specific statistic of interest. We do not know the distribution of that statistic.
• The idea is that we use the data to ‘create’ a distribution.
Bootstrapping
• Resampling technique: from a dataset D of size n, sample n times with replacement to get D1, and again to get D2, D3, ..., DM for some fairly large M.
• Compute statistic of interest for each Di
This yields a distribution against which we can compare the original value.
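The resampling scheme above, as a sketch (function name is mine):

```python
import random

def bootstrap_distribution(data, statistic, M=1000, seed=0):
    """From a dataset D of size n, draw n values with replacement to form
    D1, ..., DM, and compute the statistic of interest on each resample."""
    rng = random.Random(seed)
    n = len(data)
    return [statistic([rng.choice(data) for _ in range(n)]) for _ in range(M)]
```

The list returned is the "created" distribution against which the original statistic can be compared.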
Example: Law Schools
• average GPA and LSAT for admission from 15 law schools
• What is correlation between GPA and LSAT?
• cor(LSAT, GPA) = 0.78. What would be a confidence interval for this?
[Figure: scatterplot of GPA vs. LSAT for the 15 law schools]
Percentile Bootstrap CI
(1) Sample with replacement from the data
(2) Compute the correlation
(3) Repeat M = 1000 times
(4) Get a (1 − a)·100% confidence interval by excluding the top a/2 and bottom a/2 percent of the values
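Steps (1)–(4) as a sketch (shown for a generic statistic; for the law-school example the statistic would be the correlation computed on resampled (LSAT, GPA) pairs):

```python
import random

def percentile_ci(data, statistic, M=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample M times with replacement, sort the
    M statistic values, and cut off the bottom and top alpha/2 of them."""
    rng = random.Random(seed)
    n = len(data)
    stats = sorted(statistic([rng.choice(data) for _ in range(n)])
                   for _ in range(M))
    return stats[int(M * alpha / 2)], stats[int(M * (1 - alpha / 2)) - 1]
```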
> quantile(cors, probs=c(0.025, 0.975))
     2.5%     97.5%
0.4478862 0.9605759
> summary(cors)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-0.02534  0.68990  0.79350  0.76910  0.88130  0.99390
Bootstrap Results
[Figure: histogram of bootstrap correlations, M = 1000]
> quantile(cors, probs=c(0.025, 0.975))
     2.5%     97.5%
0.4478862 0.9605759
[Figure: histogram of bootstrap correlations, M = 5000]
> quantile(cors, probs=c(0.025, 0.975))
     2.5%     97.5%
0.4654083 0.9629974
Influence of size of M
• for M ≥ 1000 the estimates of the correlation coefficient look reasonable.
[Figure: bootstrap estimate of the correlation vs. M, for M from 10,000 to 50,000]
Compare to all 82 Schools
• Unique situation: here, we have data on the whole population (of all 82 law schools)
[Figure: scatterplot of GPA vs. LSAT for all 82 law schools]
• The actual population value of the correlation is 0.76
Limitations of Bootstrap
• Bootstrap approaches are not good in boundary situations, e.g. finding a min or max:
• Assume u ~ U[0, β]; the estimate of β is max(u)
> summary(bhat)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.867   1.978   1.995   1.986   1.995   1.995
> quantile(bhat, probs=c(0.025, 0.975))
    2.5%    97.5%
1.956715 1.994739
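This failure is easy to reproduce in a quick simulation (a sketch; seed and sample sizes are arbitrary): a resample can never contain a value larger than the observed maximum, so the bootstrap distribution of max(u) piles up at max(u), which is itself below β.

```python
import random

rng = random.Random(0)
beta = 2.0
u = [rng.uniform(0, beta) for _ in range(100)]   # observed sample, true beta = 2

# bootstrap the estimator max(u): resample n values with replacement, M times
bhat = [max(rng.choice(u) for _ in range(len(u))) for _ in range(1000)]

# every bootstrap value is capped at max(u) < beta, so a percentile CI
# built from bhat can never bracket the true boundary beta
```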
Bootstrap Estimates
• Percentile CI: works well if the bootstrap distribution is symmetric and centered on the observed statistic; if not, it underestimates variability (the same happens if the sample is very small, n < 50)
• other approaches: Basic Bootstrap, Studentized Bootstrap, Bias-Corrected Bootstrap, Accelerated Bootstrap