
STAD29 / STA 1007 assignment 5

Due Wednesday February 25 2:00pm

A reminder that allowable sources for help on this assignment are only:

• the instructor

• my textbook

• your class notes

Table of points for each question:

Question: 1 2 3 4 Total

Points: 17 6 25 27 75

This seems to be a rather long assignment. I’ll try to give you a break next time, but you do have two weeks (including reading week) to do this one.

1. A phone company commissioned a survey of their customers’ satisfaction with their mobile devices. The responses to the survey were on a so-called Likert scale of “very unsatisfied”, “unsatisfied”, “satisfied”, “very satisfied”. Also recorded were each customer’s gender and age group (under 18, 18–24, 25–30, 31 or older). (A survey of this kind cannot ask its respondents for their exact age, only which age group they fall in.) The data, as frequencies of people falling into each category combination, are in http://www.utsc.utoronto.ca/~butler/d29/mobile.txt.

(a) (4 marks) Read in the data and take a look at the format. Use a tool that you know about to arrange the frequencies in one column, with other columns labelling the response categories that the frequencies belong to. Save the new data frame. (Take a look at it if you like.)

Solution:

mobile=read.table("mobile.txt",header=T)

mobile

## gender age.group very.unsat unsat sat very.sat

## 1 male 0-17 3 9 18 24

## 2 male 18-24 6 13 16 28

## 3 male 25-30 9 13 17 20

## 4 male 31+ 5 7 16 16

## 5 female 0-17 4 8 11 25

## 6 female 18-24 8 14 20 18

## 7 female 25-30 10 15 16 12

## 8 female 31+ 5 14 12 8

With multiple columns that are all frequencies, this is a job for gather out of tidyr:


library(tidyr)

mobile.long=gather(mobile,satisfied,frequency,very.unsat:very.sat)

mobile.long

## gender age.group satisfied frequency

## 1 male 0-17 very.unsat 3

## 2 male 18-24 very.unsat 6

## 3 male 25-30 very.unsat 9

## 4 male 31+ very.unsat 5

## 5 female 0-17 very.unsat 4

## 6 female 18-24 very.unsat 8

## 7 female 25-30 very.unsat 10

## 8 female 31+ very.unsat 5

## 9 male 0-17 unsat 9

## 10 male 18-24 unsat 13

## 11 male 25-30 unsat 13

## 12 male 31+ unsat 7

## 13 female 0-17 unsat 8

## 14 female 18-24 unsat 14

## 15 female 25-30 unsat 15

## 16 female 31+ unsat 14

## 17 male 0-17 sat 18

## 18 male 18-24 sat 16

## 19 male 25-30 sat 17

## 20 male 31+ sat 16

## 21 female 0-17 sat 11

## 22 female 18-24 sat 20

## 23 female 25-30 sat 16

## 24 female 31+ sat 12

## 25 male 0-17 very.sat 24

## 26 male 18-24 very.sat 28

## 27 male 25-30 very.sat 20

## 28 male 31+ very.sat 16

## 29 female 0-17 very.sat 25

## 30 female 18-24 very.sat 18

## 31 female 25-30 very.sat 12

## 32 female 31+ very.sat 8

Yep, all good. See how mobile.long contains what it should?

(b) (2 marks) Fit ordered logistic models to predict satisfaction from (i) gender and age group, (ii) gender only, (iii) age group only. (You don’t need to examine the models.) Don’t forget a suitable weights!

Solution: (i):

library(MASS)

mobile.1=polr(satisfied~gender+age.group,weights=frequency,data=mobile.long)

For (ii) and (iii), copy, paste and remove the explanatory variable you don’t want:

mobile.2=polr(satisfied~gender,weights=frequency,data=mobile.long)

mobile.3=polr(satisfied~age.group,weights=frequency,data=mobile.long)


We’re not going to look at these, because the output from summary is not very illuminating. What we do next is to try to figure out which (if either) of the explanatory variables age.group and gender we need.

(c) (2 marks) Are we justified in removing gender from a model containing both gender and age.group? Do a suitable test.

Solution: This is a comparison of the model with both variables (mobile.1) and the model with gender removed (mobile.3). Use anova for this, smaller (fewer-x) model first:

anova(mobile.3,mobile.1)

## Likelihood ratio tests of ordinal regression models

##

## Response: satisfied

## Model Resid. df Resid. Dev Test Df LR stat. Pr(Chi)

## 1 age.group 414 1092

## 2 gender + age.group 413 1088 1 vs 2 1 4.409 0.03575

The P-value is (just) less than 0.05, so the models are significantly different. That means that the model with both variables in fits significantly better than the model with only age.group, and therefore that taking gender out is a mistake.
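As a quick sanity check (not required by the question), the P-value in the anova table can be reproduced by hand: the LR statistic is compared to a chi-squared distribution with df equal to the difference in residual df, here 1.

```r
# Reproduce the anova P-value from the reported LR statistic of 4.409 on 1 df.
# pchisq with lower.tail=FALSE gives the upper-tail probability.
p <- pchisq(4.409, df = 1, lower.tail = FALSE)
p  # just under 0.05, matching Pr(Chi) = 0.03575 in the anova output
```

This is the same test anova does internally, so it should (and does) agree.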

(d) (2 marks) Are we justified in removing age.group from a model containing both gender and age.group? Do a suitable test.

Solution: Exactly the same idea as the last part. In my case, I’m comparing models mobile.2 and mobile.1:

anova(mobile.2,mobile.1)

## Likelihood ratio tests of ordinal regression models

##

## Response: satisfied

## Model Resid. df Resid. Dev Test Df LR stat. Pr(Chi)

## 1 gender 416 1101

## 2 gender + age.group 413 1088 1 vs 2 3 13.16 0.004295

This one is definitely significant, so I need to keep age.group for sure.

(e) (2 marks) Which of the models you have fit so far is the most appropriate one? Explain briefly.

Solution: I can’t drop either of my variables, so I have to keep them both: mobile.1, with both age.group and gender.

(f) (2 marks) Obtain predicted probabilities of a customer falling in the various satisfaction categories, as it depends on gender and age group. To do that, you need to feed predict three things: the fitted model that contains both age group and gender, the data frame that you read in from the file back in part (a) (which contains all the combinations of age group and gender), and an appropriate type.

Solution: My model containing both xs was mobile.1, the data frame read in from the file was called mobile, and I need type="p" to get probabilities:


probs=predict(mobile.1,mobile,type="p")

cbind(mobile[,1:2],probs)

## gender age.group very.unsat unsat sat very.sat

## 1 male 0-17 0.06071 0.1420 0.2753 0.5220

## 2 male 18-24 0.09213 0.1932 0.3045 0.4102

## 3 male 25-30 0.13341 0.2438 0.3085 0.3143

## 4 male 31+ 0.11668 0.2253 0.3098 0.3482

## 5 female 0-17 0.08591 0.1840 0.3012 0.4289

## 6 female 18-24 0.12858 0.2387 0.3091 0.3235

## 7 female 25-30 0.18291 0.2854 0.2920 0.2397

## 8 female 31+ 0.16112 0.2693 0.3009 0.2687

I only included the first two columns of mobile in the cbind, because the rest of the columns of mobile were frequencies, which I don’t need to see. (Having said that, it would be interesting to make a plot using the observed proportions and predicted probabilities, but I didn’t ask you for that.)

(g) (3 marks) Describe any patterns you see in the predictions, bearing in mind the significance or not of the explanatory variables.

Solution: I had both explanatory variables being significant, so I would expect to see both an age-group effect and a gender effect.

For both males and females, there seems to be a decrease in satisfaction as the customers get older, at least until age 30 or so. I can see this because the predicted prob. of “very satisfied” decreases, and the predicted prob. of “very unsatisfied” increases. The 31+ age group are very similar to the 25–30 group for both males and females. So that’s the age group effect.

What about a gender effect? Well, for all the age groups, the males are more likely to be very satisfied than the females of the corresponding age group, and also less likely to be very unsatisfied. So the gender effect is that males are more satisfied than females overall. (Or, the males are less discerning. Take your pick.)

2. This is to prepare you for something in the next question. It’s meant to be easy.

In R, the code NA stands for “missing value” or “value not known”. In R, NA should not have quotes around it. (It is a special code, not a piece of text.)

(a) (1 mark) Create a vector v that contains some numbers and some missing values, using c().

Solution: Like this. The arrangement of numbers and missing values doesn’t matter, as long as you have some of each:

v=c(1,2,NA,4,5,6,9,NA,11)

v

## [1] 1 2 NA 4 5 6 9 NA 11

(b) (1 mark) Obtain is.na(v). When is this true and when is this false?

Solution:

is.na(v)

## [1] FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE


This is TRUE if the corresponding element of v is missing (in my case, the third value and the second-last one), and FALSE otherwise (when there is an actual value there).

(c) (1 mark) The symbol ! means “not” in R (and other programming languages). What does !is.na(v) do?

Solution: Try it and see:

!is.na(v)

## [1] TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE

This is the logical opposite of is.na: it’s true if there is a value, and false if it’s missing.
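For a bare vector (before it goes into a data frame), the same logical vector can be used directly with square-bracket subsetting, which is the base-R counterpart of the filter we use below:

```r
# Square-bracket subsetting with a logical vector keeps the TRUE positions,
# so !is.na(v) keeps exactly the non-missing values.
v <- c(1, 2, NA, 4, 5, 6, 9, NA, 11)
v[!is.na(v)]  # 1 2 4 5 6 9 11
```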

(d) (1 mark) Turn your vector v into a data frame vv by feeding v into the data.frame function.

Solution: Just this:

vv=data.frame(v=v)

vv

## v

## 1 1

## 2 2

## 3 NA

## 4 4

## 5 5

## 6 6

## 7 9

## 8 NA

## 9 11

(e) (2 marks) Use filter from dplyr to select just the rows of vv that have a non-missing value of v.

Solution:

filter, in its non-chained form, requires two things: a data frame, and something that gives back a TRUE or a FALSE for each row of the data frame. Such as !is.na(v), for example:


library(dplyr)

##

## Attaching package: ’dplyr’

##

## The following object is masked from ’package:MASS’:

##

## select

##

## The following object is masked from ’package:stats’:

##

## filter

##

## The following objects are masked from ’package:base’:

##

## intersect, setdiff, setequal, union

filter(vv,!is.na(v))

## v

## 1 1

## 2 2

## 3 4

## 4 5

## 5 6

## 6 9

## 7 11

Yes, that pulls out the rows of vv that have v not missing.

The chained way looks like this, omitting the data frame inside filter:

vv %>% filter(!is.na(v))

## v

## 1 1

## 2 2

## 3 4

## 4 5

## 5 6

## 6 9

## 7 11

3. The European Social Survey is a giant survey carried out across Europe covering demographic information, attitudes to and amount of education, politics and so on. In this question, we will investigate what might make British people vote for a certain political party.

The information for this question is in a (large) spreadsheet at http://www.utsc.utoronto.ca/~butler/d29/ess.csv. There is also a “codebook” at http://www.utsc.utoronto.ca/~butler/d29/ess-codebook.pdf that tells you what all the variables are. The ones we will use are the last five columns of the spreadsheet, described on pages 11 onwards of the codebook. (I could have given you more, but I didn’t want to make this any more complicated than it already was.)

(a) (2 marks) Read in the .csv file, and use dim to verify that you have lots of rows and columns.

Solution:


ess=read.csv("ess.csv",header=T)

dim(ess)

## [1] 2286 17

2286 rows and 17 columns.

(b) (3 marks) Use the codebook to find out what the columns prtvtgb, gndr, agea, eduyrs and inwtm are. What do the values 1 and 2 for gndr mean? (You don’t, at this point, have to worry about the values for the other variables.)

Solution: Respectively, political party voted for at last election, gender (of respondent), age at interview, years of full-time education, length of interview (in minutes). For gndr, male is 1 and female is 2.

(c) (1 mark) The three major political parties in Britain are the Conservative, Labour and Liberal Democrat. (These, for your information, correspond roughly to the Canadian Progressive Conservative, NDP and Liberal parties.) For the variable that corresponds to “political party voted for at the last election”, which values correspond to these three parties?

Solution: 1, 2 and 3 respectively.

(d) (4 marks) Normally, for an assignment, I would give you a tidied-up data set. But I figure you could use some practice tidying this one up. As the codebook shows, there are some numerical codes for missing values, and we want to omit those.

We want just the columns prtvtgb through inwtm from the right side of the spreadsheet. Use dplyr or tidyr tools to (i) select only these columns, (ii) include the rows that correspond to people who voted for one of the three major parties, (iii) include the rows that have an age at interview less than 999, (iv) include the rows that have less than 40 years of education, (v) include the rows that are not missing on inwtm (use the idea from Question 2 for (v)). The last four of those (the inclusion of rows) can be done in one go.

Solution:

This seems to call for a chain. The major parties are numbered 1, 2 and 3, so we can select the ones less than 4 (or <=3). The reference back to the last question is a hint to use !is.na().

library(dplyr)

library(tidyr)

ess %>% select(prtvtgb:inwtm) %>%

filter(prtvtgb<4,agea<999,eduyrs<40,!is.na(inwtm)) -> ess.major

If you don’t like the right-arrow assignment at the end, you can say ess.major = and then do the chain. Or, if you like, you can avoid the chain entirely and save the results of each step, either back in ess or into a new variable:

ess2=select(ess,prtvtgb:inwtm)

ess2=filter(ess2,prtvtgb<4,agea<999,eduyrs<40,!is.na(inwtm))

If you do the chain, you will probably not get it right the first time. (I didn’t.) For debugging, try out one step of the chain at a time, and summarize what you have so far, so that you can check it for correctness. A handy trick for that is to make the last piece of your chain apply(2,summary), which produces a summary of the columns of the resulting data frame. For example, I first did this (note that my filter is a lot simpler than the one above):


ess %>% select(prtvtgb:inwtm) %>%

filter(prtvtgb<4,!is.na(inwtm)) %>%

apply(2,summary)

## prtvtgb gndr agea eduyrs inwtm

## Min. 1.0 1.00 18.0 0.0 7.0

## 1st Qu. 1.0 1.00 44.0 11.0 35.0

## Median 2.0 2.00 58.0 13.0 41.0

## Mean 1.8 1.57 61.7 14.2 43.5

## 3rd Qu. 2.0 2.00 71.0 16.0 50.0

## Max. 3.0 2.00 999.0 88.0 160.0

The mean of a categorical variable like party voted for or gender doesn’t make much sense, but it looks as if all the values are sensible ones (1 to 3 and 1, 2 respectively). However, the maximum values of age and years of education look like missing value codes, hence the other requirements I put in the question. (If you don’t take out the NA values, the output is not nearly so pretty.)

(e) (1 mark) Why is my response variable nominal rather than ordinal? How can I tell? Which R function should I use, therefore, to fit my model?

Solution: The response variable is political party voted for. There is no (obvious) ordering to this (unless you want to try to place the parties on a left-right spectrum), so this is nominal, and you’ll need multinom from package nnet.

If I had included the minor parties and you were working on a left-right spectrum, you would have had to decide where to put the somewhat libertarian Greens or the parties that exist only in Northern Ireland.

(f) (2 marks) Take the political party voted for, and turn it into a factor, by feeding it into factor. Fit an appropriate model to predict political party voted for at the last election (as a factor) from all the other variables. Gender is really a categorical variable too, but since there are only two possible values it can be treated as a number.

Solution: This, or something like it. multinom lives in package nnet, which you’ll have to install first if you haven’t already:

attach(ess.major)

party=factor(prtvtgb)

library(nnet)

ess.1=multinom(party~gndr+agea+eduyrs+inwtm)

## # weights: 18 (10 variable)

## initial value 1343.602829

## iter 10 value 1256.123798

## final value 1247.110080

## converged

You can also do this without attaching if you create the new factor variable in the data frame first:

ess.major$party=factor(ess.major$prtvtgb)

Or, if you’re a dplyr fan (mutate comes from dplyr, not tidyr):

library(dplyr)

ess.major=mutate(ess.major,party=factor(prtvtgb))

In all of those cases where you don’t attach the data frame, you need to put it on the modelling line, viz.

ess.1=multinom(party~gndr+agea+eduyrs+inwtm,data=ess.major)

## # weights: 18 (10 variable)

## initial value 1343.602829

## iter 10 value 1256.123798

## final value 1247.110080

## converged

(g) (2 marks) We have a lot of explanatory variables. The standard way to test whether we need all of them is to take one of them out at a time, and test which ones we can remove. This is a lot of work. We won’t do that. Instead, the R function step does what you want. You feed step two things: a fitted model object, and the option trace=0 (otherwise you get a lot of output). The final part of the output from step tells you which explanatory variables you need to keep.

Run step on your fitted model. Which explanatory variables need to stay in the model here?

Solution: I tried to give you lots of hints here:

step(ess.1,trace=0)

## trying - gndr

## trying - agea

## trying - eduyrs

## trying - inwtm

## # weights: 15 (8 variable)

## initial value 1343.602829

## iter 10 value 1248.343563

## final value 1248.253638

## converged

## trying - agea

## trying - eduyrs

## trying - inwtm

## Call:

## multinom(formula = party ~ agea + eduyrs + inwtm, data = ess.major)

##

## Coefficients:

## (Intercept) agea eduyrs inwtm

## 2 1.632 -0.02154 -0.05938 0.009615

## 3 -1.281 -0.01869 0.08865 0.009337

##

## Residual Deviance: 2497

## AIC: 2513

The end of the output gives us coefficients for (and thus tells us we need to keep) age, years of education and interview length.

The actual numbers don’t mean much; it’s the indication that the variable has stayed in the model that makes a difference.

If you’re wondering about the process: first step tries to take out each explanatory variable, one at a time, from the starting model (the one that contains all the variables). Then it finds the best model out of those and fits it. (It doesn’t tell us which model this is, but evidently it’s the one without gender.) Then it takes that model and tries to remove its explanatory variables one at a time (there are only three of them left). Having decided it cannot remove any of them, it stops, and shows us what’s left.

Leaving out the trace=0 shows more output and more detail on the process, but I figured this was enough (and this way, the marker doesn’t have to wade through all of that output!).

(h) (2 marks) Fit the model indicated by step (in the last part).

Solution: Copy and paste, and take out the variables you don’t need. I found that gender needed to be removed, but if yours is different, follow through with whatever your step said to do.

ess.2=multinom(party~agea+eduyrs+inwtm,data=ess.major)

## # weights: 15 (8 variable)

## initial value 1343.602829

## iter 10 value 1248.343563

## final value 1248.253638

## converged

(i) (3 marks) I didn’t think that interview length could possibly be relevant to which party a person voted for. Test whether interview length can be removed from your model of the last part. What do you conclude? (Note that step and this test may disagree.)

Solution: Fit the model without inwtm:

ess.3=multinom(party~agea+eduyrs,data=ess.major)

## # weights: 12 (6 variable)

## initial value 1343.602829

## iter 10 value 1250.418281

## final value 1250.417597

## converged

and then use anova to compare them:

anova(ess.3,ess.2)

## Likelihood ratio tests of Multinomial Models

##

## Response: party

## Model Resid. df Resid. Dev Test Df LR stat. Pr(Chi)

## 1 agea + eduyrs 2440 2501

## 2 agea + eduyrs + inwtm 2438 2497 1 vs 2 2 4.328 0.1149

The P-value, 0.1149, is not small, which says that the smaller model is good, i.e. the one without interview length.

The reason for the disagreement is that step will tend to keep marginal explanatory variables, that is, ones that are “potentially interesting” but whose P-values might not be less than 0.05. There is still no substitute for your judgement in figuring out what to do! step uses a thing called AIC to decide what to do, rather than actually doing a test. If you know about “adjusted R-squared” in choosing explanatory variables for a regression, it’s the same idea: a variable can be not quite significant but still make the adjusted R-squared go up (typically only a little).
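To see the disagreement in numbers: AIC is twice the negative log-likelihood (the “final value” that multinom prints) plus twice the number of parameters (the “variable” count in the weights line). A back-of-the-envelope version, using the values printed above:

```r
# Rough AIC by hand from the multinom output: 2*(-loglik) + 2*(parameters).
# The model with inwtm printed "final value 1248.253638" and "(8 variable)";
# the model without printed "final value 1250.417597" and "(6 variable)".
aic.with    <- 2 * 1248.2536 + 2 * 8  # agea + eduyrs + inwtm
aic.without <- 2 * 1250.4176 + 2 * 6  # agea + eduyrs
c(aic.with, aic.without)  # the model with inwtm is (barely) smaller
```

The AICs differ by a fraction of a point, with inwtm just barely ahead, which is exactly the kind of marginal call where step and the LR test part company.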

(j) (3 marks) Use your best model to obtain predictions from some suitably chosen combinations of values of the explanatory variables that remain. (If you have quantitative explanatory variables left, you could use their first and third quartiles as values to predict from.)

Solution: First make our new data frame of values to predict from. expand.grid is our friend. You can use quantile or summary to find the quartiles. I only had agea and eduyrs left, having decided that interview time really ought to come out:

quantile(ess.major$agea)

## 0% 25% 50% 75% 100%

## 18 44 58 71 94

quantile(ess.major$eduyrs)

## 0% 25% 50% 75% 100%

## 0 11 13 16 33

ess.new=expand.grid(agea=c(44,71),eduyrs=c(11,16))

ess.new

## agea eduyrs

## 1 44 11

## 2 71 11

## 3 44 16

## 4 71 16

Now we feed this into predict. An annoying feature of this kind of prediction is that type may not be what you expect. The best model is the one I called ess.3:

pp=predict(ess.3,ess.new,type="probs")

pp

## 1 2 3

## 1 0.3332 0.5094 0.1574

## 2 0.4551 0.4085 0.1363

## 3 0.3472 0.3956 0.2572

## 4 0.4676 0.3128 0.2197

cbind(ess.new,pp)

## agea eduyrs 1 2 3

## 1 44 11 0.3332 0.5094 0.1574

## 2 71 11 0.4551 0.4085 0.1363

## 3 44 16 0.3472 0.3956 0.2572

## 4 71 16 0.4676 0.3128 0.2197

(k) (2 marks) What is the effect of increasing age? What is the effect of an increase in years of education?

Solution: To assess the effect of age, hold years of education constant. Thus, compare the first two lines (or the last two): increasing age tends to increase the chance that a person will vote Conservative (party 1), and decrease the chance that a person will vote Labour (party 2). There doesn’t seem to be much effect of age on the chance that a person will vote Liberal Democrat.

To assess education, hold age constant, and thus compare rows 1 and 3 (or rows 2 and 4). This time, there isn’t much effect on the chances of voting Conservative, but as education increases, the chance of voting Labour goes down, and the chance of voting Liberal Democrat goes up.

A little history: back 150 or so years ago, Britain had two political parties, the Tories and the Whigs. The Tories became the Conservative party (and hence, in Britain and in Canada, the Conservatives are nicknamed Tories). The Whigs became Liberals. At about the same time as working people got to vote (not women, yet, but working men) the Labour Party came into existence. The Labour Party has always been affiliated with working people and trades unions, like the NDP here. But power has typically alternated between Conservative and Labour governments, with the Liberals as a third party. In the 1980s a new party called the Social Democrats came onto the scene, but on realizing that they couldn’t make much of a dent by themselves, they merged with the Liberals to form the Liberal Democrats, which became a slightly stronger third party.

I was curious about what the effect of interview length would be. Presumably, the effect is small, but I have no idea which way it would be. To assess this, it’s the same idea over again: create a new data frame with all combinations of agea, eduyrs and inwtm. I need the quartiles of interview time first:

quantile(ess.major$inwtm)

## 0% 25% 50% 75% 100%

## 7 35 41 50 160

and then

ess.new=expand.grid(agea=c(44,71),eduyrs=c(11,16),inwtm=c(35,50))

ess.new

## agea eduyrs inwtm

## 1 44 11 35

## 2 71 11 35

## 3 44 16 35

## 4 71 16 35

## 5 44 11 50

## 6 71 11 50

## 7 44 16 50

## 8 71 16 50

and then predict using the model that contained interview time:

pp=predict(ess.2,ess.new,type="probs")

cbind(ess.new,pp)

## agea eduyrs inwtm 1 2 3

## 1 44 11 35 0.3456 0.4994 0.1550

## 2 71 11 35 0.4811 0.3886 0.1303

## 3 44 16 35 0.3607 0.3873 0.2521

## 4 71 16 35 0.4945 0.2969 0.2086

## 5 44 11 50 0.3140 0.5240 0.1620

## 6 71 11 50 0.4455 0.4157 0.1388

## 7 44 16 50 0.3285 0.4074 0.2641

## 8 71 16 50 0.4590 0.3183 0.2227

The effects of age and education are as they were before. A longer interview time is associated with a slightly decreased chance of voting Conservative and a slightly increased chance of voting Labour. Compare, for example, lines 1 and 5. But, as we suspected, the effect is small and not really worth worrying about.

4. The Worcester heart attack survey was a long-term survey of all myocardial-infarction victims admitted to hospitals in the Worcester, Massachusetts area. (Worcester is pronounced, by locals, “Woo-stuh”.) The data have been well studied, and can be found in the file http://www.utsc.utoronto.ca/~butler/d29/whas100.csv.

(a) (2 marks) Read the data and display the first few rows of the data frame.


For your information, the variables are:

• patient ID code

• admission date

• date of last followup (this is the date of death if the patient died)

• length of hospital stay (days)

• followup time (days) (time between admission and last followup)

• followup status: 1=dead, 0=alive

• Age in years (at admission)

• gender (0=male, 1=female)

• body mass index (kg/m2)

Solution:

whas100=read.csv("whas100.csv",header=T)

head(whas100)

## X id admitdate foldate los lenfol fstat age gender bmi

## 1 1 1 3/13/1995 3/19/1995 4 6 1 65 0 31.38

## 2 2 2 1/14/1995 1/23/1996 5 374 1 88 1 22.66

## 3 3 3 2/17/1995 10/4/2001 5 2421 1 77 0 27.88

## 4 4 4 4/7/1995 7/14/1995 9 98 1 81 1 21.48

## 5 5 5 2/9/1995 5/29/1998 4 1205 1 78 0 30.71

## 6 6 6 1/16/1995 9/11/2000 7 2065 1 82 1 26.45

(b) (2 marks) Create a suitable response variable for a Cox proportional hazards model for time of survival, using the followup time and followup status.

Solution: Surv. The event here is death, so the two parts of the response variable are followup time lenfol and followup status fstat, 1 being “dead”:

library(survival)

## Loading required package: splines

attach(whas100)

y=Surv(lenfol,fstat==1)

y

## [1] 6 374 2421 98 1205 2065 1002 2201 189 2719+ 2638+

## [12] 492 302 2574+ 2610+ 2641+ 1669 2624 2578+ 2595+ 123 2613+

## [23] 774 2012 2573+ 1874 2631+ 1907 538 104 6 1401 2710

## [34] 841 148 2137+ 2190+ 2173+ 461 2114+ 2157+ 2054+ 2124+ 2137+

## [45] 2031 2003+ 2074+ 274 1984+ 1993+ 1939+ 1172 89 128 1939+

## [56] 14 1011 1497 1929+ 2084+ 107 451 2183+ 1876+ 936 363

## [67] 1048 1889+ 2072+ 1879+ 1870+ 1859+ 2052+ 1846+ 2061+ 1912+ 1836+

## [78] 114 1557 1278 1836+ 1916+ 1934+ 1923+ 44 1922+ 274 1860+

## [89] 1806 2145+ 182 2013+ 2174+ 1624 187 1883+ 1577 62 1969+

## [100] 1054

Just using fstat alone as the second thing in Surv also works, because anything that gives TRUE or 1 when the event (death) occurs is equally good. (In R, TRUE as a number is 1 and FALSE as a number is 0.)

I listed the values by way of checking. The ones with a + are censored: that is, the patient was still alive the last time the doctor saw them. Most of the censored values are longer times. Usually this happens because the patient was still alive at the end of the study.
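The TRUE-is-1 point is easy to check in base R with a made-up status vector (this toy vector is just for illustration, not part of the data):

```r
# Logical values silently convert to 0/1 in arithmetic, so fstat and
# fstat==1 describe the same events.
status <- c(1, 0, 1, 1, 0)                 # made-up fstat-like values
all((status == 1) == as.logical(status))   # TRUE: the two forms agree
sum(status)                                # 0/1 values add up: 3 events here
```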


(c) (3 marks) Fit a Cox proportional hazards model predicting survival time from age, gender and BMI. Obtain the summary (but you don’t need to comment on it yet).

Solution: This, using the response variable that we just created:

whas100.1=coxph(y~age+gender+bmi)

summary(whas100.1)

## Call:

## coxph(formula = y ~ age + gender + bmi)

##

## n= 100, number of events= 51

##

## coef exp(coef) se(coef) z Pr(>|z|)

## age 0.0371 1.0378 0.0127 2.92 0.0035 **

## gender 0.1432 1.1540 0.3060 0.47 0.6397

## bmi -0.0708 0.9316 0.0361 -1.96 0.0496 *

## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##

## exp(coef) exp(-coef) lower .95 upper .95

## age 1.038 0.964 1.012 1.06

## gender 1.154 0.867 0.633 2.10

## bmi 0.932 1.073 0.868 1.00

##

## Concordance= 0.683 (se = 0.043 )

## Rsquare= 0.194 (max possible= 0.985 )

## Likelihood ratio test= 21.5 on 3 df, p=8.14e-05

## Wald test = 19.5 on 3 df, p=0.00022

## Score (logrank) test = 20.8 on 3 df, p=0.000115

(d) (1 mark) Test the overall fit of the model. What does the result mean?

Solution: Look at those three P-values at the bottom. They are all small, so something in the model is helping to predict survival. As to what? Well, that’s the next part.

(e) (2 marks) Can any of your explanatory variables be removed from the model? Explain briefly.

Solution: gender has a (very) large P-value, so that can be taken out of the model. The other two variables have small P-values (bmi only just under 0.05), so they need to stay.

(f) (2 marks) Remove your most non-significant explanatory variable from the model and fit again. Take a look at the results. Are all your remaining explanatory variables significant? (If all your explanatory variables are significant, you can skip this part.)

Solution: So, take out gender:


whas100.2=coxph(y~age+bmi)

summary(whas100.2)

## Call:

## coxph(formula = y ~ age + bmi)

##

## n= 100, number of events= 51

##

## coef exp(coef) se(coef) z Pr(>|z|)

## age 0.0393 1.0401 0.0119 3.31 0.00094 ***

## bmi -0.0712 0.9313 0.0361 -1.97 0.04895 *

## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##

## exp(coef) exp(-coef) lower .95 upper .95

## age 1.040 0.961 1.016 1.06

## bmi 0.931 1.074 0.868 1.00

##

## Concordance= 0.681 (se = 0.043 )

## Rsquare= 0.192 (max possible= 0.985 )

## Likelihood ratio test= 21.3 on 2 df, p=2.35e-05

## Wald test = 19 on 2 df, p=7.48e-05

## Score (logrank) test = 20 on 2 df, p=4.57e-05

Both explanatory variables are significant: age definitely, bmi only just.
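As an extra check (not required here), the removal of gender can also be tested formally with a likelihood-ratio test: anova on two nested coxph fits does this. A sketch of the idea, illustrated on the built-in lung data (which ships with the survival package) so that the code is self-contained; with the whas100 fits you would compare whas100.2 and whas100.1 the same way:

```r
library(survival)
# Nested Cox models: does the extra term earn its keep?
fit.small=coxph(Surv(time,status)~age,data=lung)
fit.big=coxph(Surv(time,status)~age+sex,data=lung)
anova(fit.small,fit.big)  # likelihood-ratio test for the extra term
```

The last line of the anova output gives the likelihood-ratio statistic and its P-value for the added variable; a large P-value says the smaller model is good enough.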

(g) (2 marks) Calculate the 1st quartile, median, and 3rd quartile of age and BMI (quantile). Round these off to the nearest whole number. (Do the rounding off yourself, though R has a function round that does this, which you can investigate if you want.)

Solution:

quantile(age)

## 0% 25% 50% 75% 100%

## 32.00 59.75 71.00 80.25 92.00

quantile(bmi)

## 0% 25% 50% 75% 100%

## 14.92 23.54 27.19 30.35 39.94

60, 71 and 80 for age, 24, 27 and 30 for BMI.

Or, for example,

round(quantile(bmi))

## 0% 25% 50% 75% 100%

## 15 24 27 30 40

(h) (2 marks) Make a data frame out of all the combinations of age and BMI values (that you obtained in the previous part) suitable for predicting with.

Solution: The inevitable expand.grid:


whas100.new=expand.grid(age=c(60,71,80),bmi=c(24,27,30))

whas100.new

## age bmi

## 1 60 24

## 2 71 24

## 3 80 24

## 4 60 27

## 5 71 27

## 6 80 27

## 7 60 30

## 8 71 30

## 9 80 30

Or, with some setup beforehand to make the expand.grid clearer:

ages=c(60,71,80)

bmis=c(24,27,30)

whas100.new=expand.grid(age=ages,bmi=bmis)

whas100.new

## age bmi

## 1 60 24

## 2 71 24

## 3 80 24

## 4 60 27

## 5 71 27

## 6 80 27

## 7 60 30

## 8 71 30

## 9 80 30

(i) (2 marks) Obtain predicted survival probabilities for each of the values in your new data frame. Use your best model. (You don’t need to look at the results, though you can if you want to.)

Solution: The magic word is survfit (which plays the role of predict here). The best model is whas100.2, with the non-significant gender removed:

pp=survfit(whas100.2,whas100.new)

This is kind of long to look at (summary(pp) would be the thing), so we will need to make a graph of it.

(j) (4 marks) We are going to put the predictions on a graph, in the same way as in class.

Make a data frame in the same style as the one in (h) that has colour in place of age and line type in place of BMI. Show the data frame of (h) and the one you create here side by side to demonstrate that you have the right thing. (Bear in mind that you will have to figure out how many colours and line types you need, and which ones they should be. Should you need them, there are line types called dotted and dotdash, the last being dots and dashes alternated.) Also, don’t forget the mysterious stringsAsFactors=F to stop R converting the names (text) into factors (categorical variables). We need them as text.

Solution: We have three ages and three BMIs, so we need three colours and three line types:


colours=c("red","blue","darkgreen")

linetypes=c("solid","dashed","dotdash")

I like dotdash better than dotted, but that’s just my personal preference. Also, I find the default green rather pale, so I’m using a darker green instead.

Now make the data frame using expand.grid:

draw.new=expand.grid(colour=colours,linetype=linetypes,stringsAsFactors=F)

Now we’ll put the data frames of (h) and here side by side:

cbind(whas100.new,draw.new)

## age bmi colour linetype

## 1 60 24 red solid

## 2 71 24 blue solid

## 3 80 24 darkgreen solid

## 4 60 27 red dashed

## 5 71 27 blue dashed

## 6 80 27 darkgreen dashed

## 7 60 30 red dotdash

## 8 71 30 blue dotdash

## 9 80 30 darkgreen dotdash

As you see, the colour matches up with age and line type with BMI. That’s good.

(k) (3 marks) Make a graph depicting the survival curves from survfit with different colours and line types, as we did in class. Add legends showing what the colours and line types represent. (You will have to experiment to get the legends in the right places.)

Solution: I started out by putting one legend top right and the other one bottom left. I have variables ages and bmis from earlier, but if you don’t, you’ll have to enter the values again:

plot(pp,col=draw.new$colour,lty=draw.new$linetype)

legend("topright",legend=ages,title="Age",fill=colours)

legend("bottomleft",legend=bmis,title="BMI",lty=linetypes)


[Plot: predicted survival curves over 0–2500 days (survival probability 0 to 1), one curve per age/BMI combination; legend “Age” (colours: 60, 71, 80) at top right, legend “BMI” (line types: 24, 27, 30) at bottom left.]

This seems to have worked. Maybe the two legends could be next to each other, but this looks OK to me.

(l) (2 marks) What is the effect of age on survival? What is the effect of BMI on survival? Explain briefly.

Solution: Bear in mind that up-and-to-the-right is best for a survival curve, since that means that people in the upper-right group have a higher chance of surviving for longer.

With that in mind, the effect of age is that a younger person has a better survival rate. (For example, though you don’t need to say this, a 60-year-old with the middle BMI of 27 has a 0.6 chance of surviving to 2500 days, but an 80-year-old with the same BMI has only about a 0.3 chance.)

The effect of BMI, though, seems backwards: a higher BMI is associated with a higher chance of survival. (For example, for a 60-year-old with BMI 30, the 2500-day survival chance is above 0.6, but with BMI 24, it is less than 0.5.)

That’s the end of what I wanted you to do, but:


A higher BMI is usually associated with being obese (and therefore unhealthy), so you’d expect the effect of BMI to be the other way around. According to Wikipedia (http://en.wikipedia.org/wiki/Body_mass_index), the BMI values here are “overweight” or close to it. Maybe being heavier helps the body recover from a heart attack. Or maybe the relationship is nonlinear, something that we didn’t explore:

bmi.sq=bmi^2

whas100.3=coxph(y~age+bmi+bmi.sq)

summary(whas100.3)

## Call:

## coxph(formula = y ~ age + bmi + bmi.sq)

##

## n= 100, number of events= 51

##

## coef exp(coef) se(coef) z Pr(>|z|)

## age 0.04054 1.04137 0.01203 3.37 0.00076 ***

## bmi -0.84895 0.42786 0.23156 -3.67 0.00025 ***

## bmi.sq 0.01450 1.01461 0.00423 3.43 0.00060 ***

## ---

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##

## exp(coef) exp(-coef) lower .95 upper .95

## age 1.041 0.960 1.017 1.066

## bmi 0.428 2.337 0.272 0.674

## bmi.sq 1.015 0.986 1.006 1.023

##

## Concordance= 0.693 (se = 0.043 )

## Rsquare= 0.264 (max possible= 0.985 )

## Likelihood ratio test= 30.7 on 3 df, p=9.78e-07

## Wald test = 32.6 on 3 df, p=3.98e-07

## Score (logrank) test = 36.6 on 3 df, p=5.67e-08

Ah, that seems to be it. The significant positive coefficient on bmi.sq means that the “hazard of dying” increases faster with increasing bmi, so there ought to be an optimal BMI beyond which survival chances decrease again. Let’s explore that on a graph.
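In fact, given the coefficients in the summary above, the log-hazard is a quadratic in BMI, so we can locate that optimal BMI directly: a quadratic b·bmi + c·bmi² with c > 0 is minimized at bmi = −b/(2c). A quick base-R check, plugging in the fitted coefficients from the whas100.3 summary:

```r
# Log-hazard (ignoring terms without BMI) is b*bmi + c*bmi^2;
# with c > 0 it is minimized -- so survival is best -- at bmi = -b/(2c).
b.bmi=-0.84895   # coefficient on bmi from whas100.3
c.bmi=0.01450    # coefficient on bmi.sq from whas100.3
-b.bmi/(2*c.bmi)
## [1] 29.27414
```

So the fitted hazard is lowest around BMI 29, consistent with BMI 28 coming out best among the BMIs plotted on the graph below.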

I’m going to focus on a close-to-median age of 70, since, in this model, the effect of BMI is the same for all ages (to make it different, we would need an interaction term, ANOVA-style). Let me use colours for BMI this time, and take several different BMIs. Below is a lot of code, but it’s exactly the same ideas as above, bearing in mind that I’ve switched the roles of colour and line type:


bmis=seq(20,36,4)

ages=c(70)

whas100.new=expand.grid(age=ages,bmi=bmis)

whas100.new$bmi.sq=whas100.new$bmi^2

whas100.new

## age bmi bmi.sq

## 1 70 20 400

## 2 70 24 576

## 3 70 28 784

## 4 70 32 1024

## 5 70 36 1296

pp=survfit(whas100.3,whas100.new)

colours=c("red","blue","darkgreen","brown","purple")

linetypes=c("solid")

draw.new=expand.grid(linetype=linetypes,colour=colours,stringsAsFactors=F)

cbind(whas100.new,draw.new)

## age bmi bmi.sq linetype colour

## 1 70 20 400 solid red

## 2 70 24 576 solid blue

## 3 70 28 784 solid darkgreen

## 4 70 32 1024 solid brown

## 5 70 36 1296 solid purple

This took me a couple of goes to get right. I got my predictions from the wrong model (whas100.2 instead of the whas100.3 that has the BMI-squared term in), then I forgot that whas100.new had to have a column called bmi.sq in it, which I had to calculate. And in the plot (below), I forgot to switch colour and line type around. (This goes to show that what you see here is not what first came out of my head, but the result of sometimes considerable editing.)

plot(pp,col=draw.new$colour,lty=draw.new$linetype)

legend("topright",legend=bmis,title="BMI",fill=colours)

legend("bottomleft",legend=ages,title="Age",lty=linetypes)


[Plot: predicted survival curves over 0–2500 days (survival probability 0 to 1) at age 70; legend “BMI” (colours: 20, 24, 28, 32, 36) at top right, legend “Age” (70) at bottom left.]

This time, the dark green survival curve is best, which means that survival is best at BMI 28, and worse for both higher BMIs and lower BMIs. You can follow the sequence of colours: red, blue, dark green, brown, purple, that goes up and then down again. But it’s still true that having a very low BMI is worst, which is why our (linear) model said that having a higher BMI was better.

It would have been better to have you put a squared term in the model, but the question was already long and complicated enough, and I didn’t want to make your lives more of a nightmare than they are already becoming!
