
HW4 ISL Exercises, Chapter 5: Comments, Solutions

ISYE67400; GT Shenzhen 2018 Fall
James E. Gentle

5.2 Bootstrap samples from a set of n observations.

5.2(a) There are n observations, so the probability that any one of them is drawn first is 1/n; hence, the probability that any given one is not the one drawn first is (n − 1)/n.

5.2(b) For the same reasons as above, the probability that any given one is not the second one drawn is (n − 1)/n.

5.2(c) The draws are independent, so the probability that a particular observation is not drawn in any of the n draws is the probability that it is not drawn first, times the probability that it is not drawn second, times the probabilities that it is not drawn on any of the subsequent draws; that is, ((n − 1)/n)^n = (1 − 1/n)^n.

5.2(d) If n = 5, the probability that a given observation is not in the bootstrap sample is 0.8^5, so the probability that it is in the sample is 1 − 0.8^5 = 0.67232.

5.2(e) If n = 100, the probability that a given observation is in the bootstrap sample is 1 − 0.99^100 = 0.6339677.

5.2(f) If n = 10,000, the probability that a given observation is in the bootstrap sample is 1 − 0.9999^10000 = 0.632139. (R handles those large exponents without difficulty; it exponentiates using logarithms.)
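
These values can be checked directly in R (a quick verification, not part of the original solution):

n <- c(5, 100, 10000)
1 - (1 - 1/n)^n   # 0.67232, 0.6339677, 0.632139, matching the values above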

5.2(g) This would appear to be a numerically unstable process; R, however, handles it. The limit involved is

lim_{n→∞} (1 − 1/n)^n = e^{−1}

This is a very large range, and the convergence is very rapid relative to 100,000. A more meaningful plot would be for n up to 100; the value is already very close to its limit at n = 100.

N <- 100000

n <- 1:N

p <- 1-((n-1)/n)^n

plot(n,p)
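
As suggested above, a small sketch (not part of the original code) restricted to n up to 100, with the limiting value marked, shows the convergence more clearly:

n <- 1:100
p <- 1 - ((n - 1)/n)^n
plot(n, p, type = "l")
abline(h = 1 - exp(-1), lty = 2)   # limiting value 1 - 1/e, about 0.632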

[Figure: plot of p = 1 − ((n − 1)/n)^n against n, for n = 1 to 100,000.]

5.2(h) This is a computational investigation of the proportion of times that 4 occurs at least once in a sample of size 100 drawn with replacement from the set {1, 2, . . . , 100}.


> n <- 10000

> store <- rep(NA, n)

> set.seed(555)

> for (i in 1:n){

+ store[i] <- sum(sample(1:100, rep=TRUE)==4)>0

+ }

> mean(store)

[1] 0.6387
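
For comparison, the theoretical probability from 5.2(e) can be computed directly; the simulated proportion 0.6387 is close to it:

1 - (1 - 1/100)^100   # about 0.634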

5.3 k-fold cross-validation.

5.3(a) Cross-validation is a systematic process. In k-fold cross-validation, we divide the dataset into k approximately equal subsets. Then we leave out the first subset, fit the model on the remaining k − 1 subsets, and compute the error on the held-out subset using that fit. We repeat this, holding out each of the k subsets in turn, and average the k error estimates.
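
As a concrete illustration, here is a minimal sketch of k-fold cross-validation in R on simulated data (the variables dat, folds, and cv.err are hypothetical, not part of the original solution):

set.seed(1)
dat <- data.frame(x = rnorm(100))
dat$y <- dat$x + rnorm(100)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(dat)))   # random fold assignment
cv.err <- numeric(k)
for (j in 1:k) {
  fit <- lm(y ~ x, data = dat[folds != j, ])        # fit on the other k - 1 folds
  pred <- predict(fit, newdata = dat[folds == j, ])
  cv.err[j] <- mean((dat$y[folds == j] - pred)^2)   # MSE on the held-out fold
}
mean(cv.err)   # k-fold CV estimate of the test MSE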

5.3(b) i. The single validation-set approach is very simple and requires fewer computations. On the other hand, this approach does not use all of the data in fitting the model; therefore, the fit is not as good as it could be. The estimates of the MSE from the validation set are biased upward because the fit is based on only part of the data. Furthermore, this approach has large variance because of the selection of the validation and training sets; another division of the original data would yield different results.

5.3(b) ii. The LOOCV approach is more computationally intensive; it performs n separate fits. The upward bias of LOOCV is smaller, but LOOCV has a larger variance than might be expected because, although it is an average of several separate values (which should reduce the variance), those values have large covariances due to the large overlap in the individual training sets.
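
Both approaches are available through cv.glm in the boot package; a small sketch on simulated data (assumed setup, not part of the original solution):

library(boot)
set.seed(1)
dat <- data.frame(x = rnorm(100))
dat$y <- dat$x + rnorm(100)
fit <- glm(y ~ x, data = dat)        # gaussian glm, equivalent to lm here
cv.glm(dat, fit)$delta[1]            # LOOCV estimate of the test MSE (K defaults to n)
cv.glm(dat, fit, K = 10)$delta[1]    # 10-fold CV estimate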

5.4 We might try a cross-validation approach on this, but a bootstrap would be better. In the bootstrap method, we would generate, say, 1,000 bootstrap samples, fit the model in each case, and compute the predicted response for each. We would then compute the standard deviation of the 1,000 predictions. This would be the bootstrap estimate of the standard deviation of the predicted value.
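
A sketch of this bootstrap procedure on simulated data (the data, the model, and the prediction point x0 are hypothetical placeholders, not part of the original solution):

set.seed(1)
dat <- data.frame(x = rnorm(100))
dat$y <- 2 + 3 * dat$x + rnorm(100)
x0 <- data.frame(x = 1)              # point at which we want the predicted response
B <- 1000
preds <- numeric(B)
for (b in 1:B) {
  idx <- sample(nrow(dat), replace = TRUE)   # draw a bootstrap sample of the rows
  fit <- lm(y ~ x, data = dat[idx, ])
  preds[b] <- predict(fit, newdata = x0)
}
sd(preds)   # bootstrap estimate of the SD of the predicted response at x0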

5.5 5.5(a)

> library(ISLR)

> attach(Default)

> logfit <- glm(default ~ income+balance, family="binomial")

> summary(logfit)

Call:

glm(formula = default ~ income + balance, family = "binomial")

Deviance Residuals:

Min 1Q Median 3Q Max

-2.4725 -0.1444 -0.0574 -0.0211 3.7245

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***

income 2.081e-05 4.985e-06 4.174 2.99e-05 ***

balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 2920.6 on 9999 degrees of freedom

Residual deviance: 1579.0 on 9997 degrees of freedom

AIC: 1585

Number of Fisher Scoring iterations: 8


5.5(b) There are 10,000 observations. We will randomly form a validation set from the full set. The examples in the text took half and half for validation and training. It is usually better to put more observations into the training set. I will put 2,500 into the validation set, and the remaining 7,500 observations will form the training set.

5.5(b) i.

library(ISLR)

attach(Default)

n <- length(Default$default)

set.seed(555)

valind <- sort(sample(1:n, n/4)) # sorted for convenience

Default.val <- Default[valind, ]

Default.training <- Default[-valind, ]

5.5(b) ii.

> logfit.training <- glm(default ~ income+balance,

+ data=Default.training, family="binomial")

> summary(logfit.training)

Call:

glm(formula = default ~ income + balance, family = "binomial",

data = Default.training)

Deviance Residuals:

Min 1Q Median 3Q Max

-2.4440 -0.1496 -0.0609 -0.0227 3.6846

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -1.134e+01 4.887e-01 -23.204 < 2e-16 ***

income 2.243e-05 5.619e-06 3.992 6.56e-05 ***

balance 5.499e-03 2.548e-04 21.580 < 2e-16 ***

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 2198.9 on 7499 degrees of freedom

Residual deviance: 1213.8 on 7497 degrees of freedom

AIC: 1219.8

Number of Fisher Scoring iterations: 8

5.5(b) iii.

pred.val.prob <- predict(logfit.training, Default.val, type="response")

pred.val <- ifelse(pred.val.prob>0.5,"Yes","No")

5.5(b) iv.

> mean(pred.val != Default.val$default)

[1] 0.026

The error rate in the validation set is 2.6%. (This is smaller than I would have expected.)

5.5(c) We can just use the same code as above, because if the seed is not reset, R simply goes on to new "random numbers".

> errors <- numeric(3)

> for (i in 1:3) {

+ valind <- sort(sample(1:n, n/4))

+ Default.val <- Default[valind, ]

+ Default.training <- Default[-valind, ]


+ logfit.training <- glm(default ~ income+balance,

+ data=Default.training, family="binomial")

+ pred.val.prob <- predict(logfit.training, Default.val, type="response")

+ pred.val <- ifelse(pred.val.prob>0.5,"Yes","No")

+ errors[i] <- mean(pred.val != Default.val$default)

+ }

> mean(errors)

[1] 0.02746667

> sd(errors)

[1] 0.00244404

The mean of the three trials was 2.75%, with a standard deviation of about 0.24%. This estimated classification error seems to be a good estimate.

5.5(d) I will just repeat what was done before.

set.seed(555)

errors1 <- numeric(3)

errors2 <- numeric(3)

for (i in 1:3) {

valind <- sort(sample(1:n, n/4))

Default.val <- Default[valind, ]

Default.training <- Default[-valind, ]

logfit1.training <- glm(default ~ income+balance,

data=Default.training, family="binomial")

pred.val1.prob <- predict(logfit1.training, Default.val, type="response")

pred.val1 <- ifelse(pred.val1.prob>0.5,"Yes","No")

errors1[i] <- mean(pred.val1 != Default.val$default)

logfit2.training <- glm(default ~ income+balance+student,

data=Default.training, family="binomial")

pred.val2.prob <- predict(logfit2.training, Default.val, type="response")

pred.val2 <- ifelse(pred.val2.prob>0.5,"Yes","No")

errors2[i] <- mean(pred.val2 != Default.val$default)

}

mean(errors1)

mean(errors2)

I got no difference to 3 significant digits. That predictor does not seem to help. If there had been a difference, I would look at things such as AIC to decide between the models. (Actually, the AIC was slightly lower for the model with the student predictor, but the difference has no practical significance.)
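
The AIC comparison mentioned above can be read off directly from the fitted objects (assuming the objects from the 5.5(d) code are still in the workspace):

AIC(logfit1.training, logfit2.training)   # lists df and AIC for each model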
