Classification and Regression Trees · Deviance/Entropy/Information Criterion
TRANSCRIPT
Construction of Tree
• Starting with the root, find the best split at each node using an exhaustive search: for each variable Xi, generate all possible splits, compute the homogeneity of the resulting subsets, and select the best split over all variables
Gini Index
• Probabilistic view: for each node i we have class probabilities p_{ik} (with sample size n_i)
• Definition: G(i) = \sum_{k=1}^{K} p_{ik}(1 - p_{ik}), with p_{ik} = \frac{1}{n_i} \sum_{j=1}^{n_i} I(Y_j = k)
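As a sketch (not from the slides; the function name is mine), the node-level Gini index can be computed directly from the class labels falling into a node:

```python
from collections import Counter

def gini(labels):
    """Gini index of one node: G(i) = sum_k p_ik * (1 - p_ik),
    where p_ik is the relative frequency of class k in node i."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())
```

For a node with nine on-time and one delayed flight this gives 2 · 0.9 · 0.1 = 0.18; a pure node gives 0.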
Deviance/Entropy/Information Criterion
• Entropy: for node i, E(i) = -\sum_{k=1}^{K} p_{ik} \log p_{ik}
• Deviance: D(i) = -2 \sum_{k=1}^{K} n_{ik} \log p_{ik}
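A minimal sketch of the two formulas above (function names are my own; since n_{ik} = n_i p_{ik}, the deviance equals 2 n_i E(i)):

```python
import math
from collections import Counter

def entropy(labels):
    """E(i) = -sum_k p_ik * log(p_ik), natural log; absent classes contribute 0."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def deviance(labels):
    """D(i) = -2 * sum_k n_ik * log(p_ik), which equals 2 * n_i * E(i)."""
    return 2 * len(labels) * entropy(labels)
```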
Find best split
• For a split at X = x_0, we get subsets Y_1 and Y_2 of lengths n_1 and n_2, respectively:
g(Y) = 1 - \sum_{i=1}^{k} p_i^2
g(Y \mid X = x_0) = \frac{1}{n_1 + n_2}\left(n_1 g(Y_1) + n_2 g(Y_2)\right)
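The exhaustive split search can be sketched as follows (a toy version for a single numeric predictor; function names are assumptions, not the slides'):

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a set of class labels."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def best_split(x, y):
    """Try a cut between every pair of consecutive distinct x values and
    return (cut, score) minimizing the weighted impurity
    g(Y | X = x0) = (n1 * g(Y1) + n2 * g(Y2)) / (n1 + n2)."""
    pairs = sorted(zip(x, y))
    xs = [a for a, _ in pairs]
    ys = [b for _, b in pairs]
    n = len(ys)
    best_cut, best_score = None, float("inf")
    for j in range(1, n):
        if xs[j] == xs[j - 1]:
            continue  # no cut possible between tied x values
        score = (j * gini(ys[:j]) + (n - j) * gini(ys[j:])) / n
        if score < best_score:
            best_cut, best_score = (xs[j - 1] + xs[j]) / 2, score
    return best_cut, best_score
```

Running this per variable and keeping the overall winner is the "best split for best variable" step of the construction.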
Formula recap:
• Marginal, conditional, and joint densities: f(happy) and f(sex); f(happy | sex); f(sex | happy); f(sex, happy); with f(x, y) = f(x \mid y) f(y)
• Mean squared error: \frac{1}{n} \sum_i (y_i - \hat{y}_i)^2
• Log odds ratios: \log \theta_{ij} = \log \frac{m_{i+1,j+1} m_{ij}}{m_{i+1,j} m_{i,j+1}} = \ldots = \beta_{i+1} - \beta_i, and \log \theta_{ij} = \log \frac{m_{i+1,j+1} m_{ij}}{m_{i+1,j} m_{i,j+1}} = \ldots = \beta
• Degrees of freedom: k = \frac{(s_1^2 + s_2^2)^2}{s_1^4/(n_1 - 1) + s_2^4/(n_2 - 1)}
• Test statistics: \frac{\bar{X}_1 - \bar{X}_2 - d}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}, \quad \frac{\bar{X} - \mu_0}{s/\sqrt{n}}, \quad \frac{\hat{p} - p_0}{\sqrt{p(1 - p)/n}}, \quad \frac{\hat{p}_1 - \hat{p}_2 - d}{\sqrt{\hat{p}_1(1 - \hat{p}_1)/n_1 + \hat{p}_2(1 - \hat{p}_2)/n_2}}
• Log-linear association terms: \lambda_{ij}^{XY} = \beta u_i v_j and \lambda_{ij}^{XY} = \beta_i v_j
[Figure: classification tree for flight delays. Root split: Distance < 2459; child splits: Distance >= 728 and Distance >= 4228; leaf proportions of delayed flights: 0.03486, 0.05107, 0, 0.5]
[Figure: histograms of Distance by Delayed status (FALSE/TRUE), and the corresponding conditional proportions of delayed flights over Distance]
Gini Index of Homogeneity
[Figure: Gini homogeneity gain ("ginis") as a function of the Distance split point]
• Split the data at the maximum, then repeat with each subset
Some Stopping Rules
• Nodes are homogeneous “enough”
• Nodes are small (ni < 20 in rpart, ni < 10 in tree)
• “Elbow” criterion: gain in homogeneity levels out
• Minimize Error (e.g. cross-validation)
• Minimize a cost-complexity measure: R_a = R + a · size, where R is the homogeneity measure evaluated at the leaves, size is the number of leaves, and a > 0 is a real-valued penalty for complexity
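The cost-complexity measure can be sketched like this (using the case-weighted sum of leaf Gini values as the homogeneity measure R; this choice and the function names are assumptions, not the slides'):

```python
from collections import Counter

def gini(labels):
    """Gini impurity of one leaf's class labels."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def cost_complexity(leaves, alpha):
    """R_a = R + a * size: R is the case-weighted sum of leaf Gini values,
    size the number of leaves, alpha > 0 the complexity penalty."""
    n = sum(len(leaf) for leaf in leaves)
    R = sum(len(leaf) / n * gini(leaf) for leaf in leaves)
    return R + alpha * len(leaves)
```

Larger alpha favors smaller trees: a pure two-leaf tree scores 2·alpha, while collapsing it into one mixed leaf trades impurity against one unit of size.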
Diagnostics
• Prediction error (in training/testing scheme)
• misclassification matrix (for categorical response)
• loss matrix: adjust according to risk. Are all types of misclassification equally bad? In the binary situation: is a false positive as bad as a false negative?
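A minimal sketch of a misclassification matrix and a loss-weighted error rate (the loss values below are made up for illustration):

```python
def confusion(actual, predicted, labels):
    """Misclassification matrix: rows index the actual class,
    columns the predicted class."""
    m = {a: {p: 0 for p in labels} for a in labels}
    for a, p in zip(actual, predicted):
        m[a][p] += 1
    return m

def mean_loss(actual, predicted, loss):
    """Average loss per case; loss[(actual, predicted)] lets a false
    negative cost more than a false positive (0 for correct calls)."""
    return sum(loss.get((a, p), 0) for a, p in zip(actual, predicted)) / len(actual)
```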
Random Forests
• Breiman (2001), Breiman & Cutler (2004)
• Tree Ensemble built by randomly sampling cases and variables
• Each case classified once for each tree in the ensemble
• Overall values determined by ‘voting’ for category or (weighted) averaging in case of continuous response
• Random Forests apply a Bootstrap Aggregating Technique
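The case-resampling and voting steps of bootstrap aggregating can be sketched as follows (the "classifiers" here are stand-ins for fitted trees; per-split variable subsampling is omitted):

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Each tree in the ensemble is grown on n cases drawn with replacement."""
    return [rng.choice(data) for _ in data]

def vote(classifiers, x):
    """Ensemble prediction by 'voting': each tree casts one vote for a
    category; for a continuous response one would (weighted-)average instead."""
    return Counter(clf(x) for clf in classifiers).most_common(1)[0][0]
```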
Bootstrapping
• Efron 1982
• ‘Pull ourselves out of the swamp by our shoe laces (= bootstraps)’
• We have one dataset, which gives us one specific statistic of interest. We do not know the distribution of that statistic.
• The idea is that we use the data to ‘create’ a distribution.
Bootstrapping
• Resampling technique: from a dataset D of size n, sample n times with replacement to get D1, and again to get D2, D3, ..., DM for some fairly large M.
• Compute statistic of interest for each Di
This yields a distribution against which we can compare the original value.
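The resampling scheme above, as a sketch (function name is mine):

```python
import random

def bootstrap_distribution(data, statistic, M=1000, seed=0):
    """From a dataset D of size n, draw n values with replacement to form
    D1, ..., DM, and compute the statistic of interest on each resample."""
    rng = random.Random(seed)
    n = len(data)
    return [statistic([rng.choice(data) for _ in range(n)]) for _ in range(M)]
```

The list returned is the "created" distribution against which the original statistic can be compared.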
Example: Law Schools
• average GPA and LSAT for admission from 15 law schools
• What is correlation between GPA and LSAT?
• cor(LSAT, GPA) = 0.78. What would be a confidence interval for this?
[Figure: scatterplot of GPA vs. LSAT for the 15 law schools]
Percentile Bootstrap CI
(1) Sample with replacement from the data
(2) Compute the correlation
(3) Repeat M = 1000 times
(4) Get a (1 − a)·100% confidence interval by excluding the top a/2 and bottom a/2 percent of the values
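Steps (1)–(4) as a sketch (shown for a generic statistic; for the law-school example the statistic would be the correlation computed on resampled (LSAT, GPA) pairs):

```python
import random

def percentile_ci(data, statistic, M=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample M times with replacement, sort the
    M statistic values, and cut off the bottom and top alpha/2 of them."""
    rng = random.Random(seed)
    n = len(data)
    stats = sorted(statistic([rng.choice(data) for _ in range(n)])
                   for _ in range(M))
    return stats[int(M * alpha / 2)], stats[int(M * (1 - alpha / 2)) - 1]
```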
> quantile(cors, probs=c(0.025, 0.975))
     2.5%     97.5%
0.4478862 0.9605759
> summary(cors)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-0.02534  0.68990  0.79350  0.76910  0.88130  0.99390
Bootstrap Results
[Figure: histogram of bootstrap correlations, M = 1000]
> quantile(cors, probs=c(0.025, 0.975))
     2.5%     97.5%
0.4478862 0.9605759
[Figure: histogram of bootstrap correlations, M = 5000]
> quantile(cors, probs=c(0.025, 0.975))
     2.5%     97.5%
0.4654083 0.9629974
Influence of size of M
• for M ≥ 1000 the estimates of the correlation coefficient look reasonable.
[Figure: bootstrap estimate of the correlation vs. M, for M from 10,000 to 50,000]
Compare to all 82 Schools
• Unique situation: here, we have data on the whole population (of all 82 law schools)
[Figure: scatterplot of GPA vs. LSAT for all 82 law schools]
• The actual population value of the correlation is 0.76
Limitations of Bootstrap
• Bootstrap approaches are not good in boundary situations, e.g. finding a min or max:
• Assume u ~ U[0, β]; the estimate of β is max(u)
> summary(bhat)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.867   1.978   1.995   1.986   1.995   1.995
> quantile(bhat, probs=c(0.025, 0.975))
    2.5%    97.5%
1.956715 1.994739
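This failure is easy to reproduce in a quick simulation (a sketch; seed and sample sizes are arbitrary): a resample can never contain a value larger than the observed maximum, so the bootstrap distribution of max(u) piles up at max(u), which is itself below β.

```python
import random

rng = random.Random(0)
beta = 2.0
u = [rng.uniform(0, beta) for _ in range(100)]   # observed sample, true beta = 2

# bootstrap the estimator max(u): resample n values with replacement, M times
bhat = [max(rng.choice(u) for _ in range(len(u))) for _ in range(1000)]

# every bootstrap value is capped at max(u) < beta, so a percentile CI
# built from bhat can never bracket the true boundary beta
```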
Bootstrap Estimates
• Percentile CI: works well if the bootstrap distribution is symmetric and centered on the observed statistic; if not, it underestimates variability (the same happens if the sample is very small, n < 50)
• other approaches: Basic Bootstrap, Studentized Bootstrap, Bias-Corrected Bootstrap, Accelerated Bootstrap