
Lecture 3 - Residual Analysis + Generalized Linear Models

Colin Rundel

1/23/2017


Residual Analysis


Atmospheric CO2 (ppm) from Mauna Loa

[Figure: co2 (ppm) versus date, 1988 to 1996]


Where to start?

Well, it looks like stuff is going up on average …
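A minimal sketch of this first step. The lecture's `co2_df` is not shown, so the block assumes R's built-in `co2` series (monthly Mauna Loa concentrations) from 1985 onward as a stand-in. The names `l1` and `resid` are illustrative, chosen to match the axis labels in the slides.

```r
# Assumption: the built-in `co2` series (monthly, Jan 1959 to Dec 1997)
# stands in for the lecture's co2_df, restricted to 1985 onward.
year <- rep(1959:1997, each = 12)
mnum <- rep(1:12, times = 39)
co2_df <- data.frame(
  date = year + (mnum - 1) / 12,   # decimal year
  year = year,
  co2  = as.numeric(co2)
)
co2_df <- co2_df[co2_df$year >= 1985, ]

# Linear trend in time
l1 <- lm(co2 ~ date, data = co2_df)
co2_df$resid <- residuals(l1)

coef(l1)[["date"]]     # average increase in ppm per year
range(co2_df$resid)    # what the straight line leaves unexplained
```

Plotting `resid` against `date` should show the seasonal cycle that the straight line cannot capture.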

[Figure: two panels against date (1988 to 1996): co2 and resid]


and then?

Well, there is some periodicity, so let's add the month …
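A sketch of adding a month effect, again assuming the built-in `co2` series as a stand-in for `co2_df`. The names `l1`, `l2`, `resid` and `resid2` are illustrative.

```r
# Assumption: built-in `co2` series, 1985 onward, stands in for co2_df.
year <- rep(1959:1997, each = 12)
mnum <- rep(1:12, times = 39)
co2_df <- data.frame(
  date  = year + (mnum - 1) / 12,   # decimal year
  year  = year,
  month = month.abb[mnum],          # lm() treats this as a factor
  co2   = as.numeric(co2)
)
co2_df <- co2_df[co2_df$year >= 1985, ]

l1 <- lm(co2 ~ date, data = co2_df)           # trend only
l2 <- lm(co2 ~ date + month, data = co2_df)   # trend + one offset per month
co2_df$resid  <- residuals(l1)
co2_df$resid2 <- residuals(l2)

# Spread of what is left over after each step
c(trend = sd(co2_df$resid), trend_month = sd(co2_df$resid2))
```

Because `month` enters as a factor, the model estimates eleven offsets relative to a baseline month and does not impose any particular seasonal shape.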

[Figure: two panels against date (1988 to 1996): resid and resid2]


and then and then?

Maybe there is some different effect by year …

[Figure: two panels against date (1988 to 1996): resid2 and resid3]


Too much

(lm = lm(co2~date + month + as.factor(year), data=co2_df))
##
## Call:
## lm(formula = co2 ~ date + month + as.factor(year), data = co2_df)
##
## Coefficients:
##         (Intercept)                 date             monthAug
##          -2.645e+03            1.508e+00           -4.177e+00
##            monthDec             monthFeb             monthJan
##          -3.612e+00           -2.008e+00           -2.705e+00
##            monthJul             monthJun             monthMar
##          -2.035e+00           -3.251e-01           -1.227e+00
##            monthMay             monthNov             monthOct
##           4.821e-01           -4.838e+00           -6.135e+00
##            monthSep  as.factor(year)1986  as.factor(year)1987
##          -6.064e+00           -2.585e-01            9.722e-03
## as.factor(year)1988  as.factor(year)1989  as.factor(year)1990
##           1.065e+00            9.978e-01            7.726e-01
## as.factor(year)1991  as.factor(year)1992  as.factor(year)1993
##           7.067e-01            1.236e-02           -7.911e-01
## as.factor(year)1994  as.factor(year)1995  as.factor(year)1996
##          -4.146e-01            1.119e-01            3.768e-01
## as.factor(year)1997
##                  NA
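`lm()` reports `NA` for a coefficient whose column it dropped because the design matrix is rank deficient. One plausible reading is that `date` is a decimal year, which makes it an exact linear combination of the year and month indicators. The sketch below checks that reading on a stand-in data set built from R's built-in `co2` series; the construction of `co2_df` is an assumption, since the lecture's data frame is not shown.

```r
# Assumption: built-in `co2` series, 1985 onward, stands in for co2_df,
# with date = year + (month - 1) / 12.
year <- rep(1959:1997, each = 12)
mnum <- rep(1:12, times = 39)
co2_df <- data.frame(
  date  = year + (mnum - 1) / 12,
  year  = year,
  month = month.abb[mnum],
  co2   = as.numeric(co2)
)
co2_df <- co2_df[co2_df$year >= 1985, ]

fit <- lm(co2 ~ date + month + as.factor(year), data = co2_df)
X <- model.matrix(fit)

c(columns = ncol(X), rank = fit$rank)   # rank falls short of the column count
names(which(is.na(coef(fit))))          # the aliased coefficient(s)
```

Beyond the aliasing, the model spends one parameter per year and one per month, which is the sense in which it is "too much".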


[Figure: co2 versus date, 1988 to 1996]


Generalized Linear Models


Background

A generalized linear model has three key components:

1. a probability distribution (from the exponential family) that describes your response variable

2. a linear predictor η = Xβ,

3. and a link function g such that g(E(Y|X)) = g(µ) = η.
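The three components map onto R's family objects. The sketch below uses base R's `poisson()` family; the design matrix `X` and coefficients `beta` are toy values assumed for illustration.

```r
fam <- poisson(link = "log")   # components 1 and 3: distribution and link

# Component 2: linear predictor eta = X %*% beta (toy values, assumed)
X    <- cbind(1, c(0, 1, 2))
beta <- c(0.5, 0.3)
eta  <- drop(X %*% beta)

mu <- fam$linkinv(eta)            # E(Y|X) = g^{-1}(eta) = exp(eta)
all.equal(fam$linkfun(mu), eta)   # g(mu) recovers eta
fam$variance(mu)                  # Poisson: Var(Y|X) = mu
```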


Poisson Regression


Model Specification

A generalized linear model for count data where we assume the outcome variable follows a Poisson distribution (mean = variance).

$$ Y_i \sim \text{Poisson}(\lambda_i) $$

$$ \log E(Y \mid X) = \log \lambda = X\beta $$
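One way to see the specification at work is to simulate from it and check that `glm()` recovers the coefficients. Everything here (sample size, covariate range, true `beta`) is an assumed toy setup, not data from the lecture.

```r
set.seed(1)
n    <- 5000
x    <- runif(n, 0, 2)
beta <- c(0.5, 0.3)                     # assumed true coefficients

lambda <- exp(beta[1] + beta[2] * x)    # log(lambda) = X beta
y      <- rpois(n, lambda)              # Y_i ~ Poisson(lambda_i)

fit <- glm(y ~ x, family = poisson)     # log link is the default
coef(fit)                               # should land close to beta
```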


Example - AIDS in Belgium

[Figure: AIDS cases in Belgium; cases (0 to 250) versus year]


Frequentist glm fit

g = glm(cases~year, data=aids, family=poisson)
# Prediction grid over 1981 to 1993
pred = data.frame(year=seq(1981,1993,by=0.1))
pred$cases = predict(g, newdata=pred, type = "response")

[Figure: AIDS cases in Belgium; cases (0 to 300) versus year]


Residuals

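Standard residuals:

$$ r_i = Y_i - \hat{Y}_i = Y_i - \hat\lambda_i $$

Pearson residuals:

$$ r_i = \frac{Y_i - E(Y_i \mid X)}{\sqrt{\text{Var}(Y_i \mid X)}} = \frac{Y_i - \hat\lambda_i}{\sqrt{\hat\lambda_i}} $$

Deviance residuals:

$$ d_i = \text{sign}(y_i - \hat\lambda_i)\sqrt{2\left(y_i \log(y_i/\hat\lambda_i) - (y_i - \hat\lambda_i)\right)} $$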
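The three definitions can be checked against `residuals()` in base R. The `aids` data frame is not shown in the slides, so the counts below are an assumption: they are the commonly cited Belgian series for 1981 to 1993, and they reproduce the mean and variance printed later in the slides.

```r
# Assumption: Belgian AIDS counts for 1981 to 1993 (mean 124.7692,
# variance 8124.526, matching the values reported in the slides).
aids <- data.frame(
  year  = 1981:1993,
  cases = c(12, 14, 33, 50, 67, 74, 123, 141, 165, 204, 253, 246, 240)
)
g <- glm(cases ~ year, data = aids, family = poisson)

y      <- aids$cases
lambda <- fitted(g)    # lambda-hat_i

r_std  <- y - lambda
r_pear <- (y - lambda) / sqrt(lambda)
# Valid as written because every y_i > 0 here
r_dev  <- sign(y - lambda) * sqrt(2 * (y * log(y / lambda) - (y - lambda)))

all.equal(r_std,  residuals(g, type = "response"))
all.equal(r_pear, residuals(g, type = "pearson"))
all.equal(r_dev,  residuals(g, type = "deviance"))
```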


Deviance and deviance residuals

Deviance can be interpreted as the difference between your model’s fit and the fit of an ideal model (where E(Y_i) = Y_i).

$$ D = 2\left(\mathcal{L}(Y \mid \theta_{\text{best}}) - \mathcal{L}(Y \mid \hat\theta)\right) = \sum_{i=1}^{n} d_i^2 $$

where $\mathcal{L}$ is the log-likelihood. Deviance is a measure of goodness of fit in a similar way to the residual sum of squares (which is just the sum of squared standard residuals).
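A sketch of the identity for the linear Poisson fit, computing the deviance three ways. The `aids` counts are an assumption (the data frame is not shown in the slides); they reproduce the mean and variance reported there.

```r
# Assumption: Belgian AIDS counts for 1981 to 1993.
aids <- data.frame(
  year  = 1981:1993,
  cases = c(12, 14, 33, 50, 67, 74, 123, 141, 165, 204, 253, 246, 240)
)
g <- glm(cases ~ year, data = aids, family = poisson)
y <- aids$cases

loglik_best <- sum(dpois(y, lambda = y, log = TRUE))           # ideal model: lambda_i = y_i
loglik_fit  <- sum(dpois(y, lambda = fitted(g), log = TRUE))   # fitted model

D <- 2 * (loglik_best - loglik_fit)
c(from_loglik    = D,
  from_residuals = sum(residuals(g, type = "deviance")^2),
  from_glm       = deviance(g))    # all three should agree
```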


Deviance residuals derivation
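The slide leaves the derivation blank. One standard route for the Poisson case, offered as a sketch and not as the lecture's own working, starts from the log-likelihood $\ell$ and compares the fitted model with the ideal model in which $\lambda_i = y_i$:

```latex
\begin{align*}
\ell(\lambda; y) &= \sum_{i=1}^{n} \left( y_i \log \lambda_i - \lambda_i - \log y_i! \right) \\
\ell(y; y) &= \sum_{i=1}^{n} \left( y_i \log y_i - y_i - \log y_i! \right)
  && \text{(ideal model, } \lambda_i = y_i \text{)} \\
D &= 2\left( \ell(y; y) - \ell(\hat\lambda; y) \right)
   = \sum_{i=1}^{n} 2\left( y_i \log\frac{y_i}{\hat\lambda_i} - (y_i - \hat\lambda_i) \right)
   = \sum_{i=1}^{n} d_i^2 \\
d_i &= \operatorname{sign}(y_i - \hat\lambda_i)
       \sqrt{2\left( y_i \log\frac{y_i}{\hat\lambda_i} - (y_i - \hat\lambda_i) \right)}
\end{align*}
```

Each summand is non-negative, so the square root is well defined; the sign is attached so that $d_i$ points the same way as the standard residual. For $y_i = 0$ the term $y_i \log y_i$ is taken to be 0.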


Residual plots

[Figure: residual versus year in three panels: standard, pearson, deviance]


print(aids_fit)

[Figure: AIDS cases in Belgium; cases (0 to 300) versus year]


Quadratic fit

g2 = glm(cases~year+I(year^2), data=aids, family=poisson)
# Predict on the pred2 grid
pred2 = data.frame(year=seq(1981,1993,by=0.1))
pred2$cases = predict(g2, newdata=pred2, type = "response")
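To put a number on how much the quadratic term helps, the two fits can be compared by deviance, AIC, and a likelihood-ratio test. This is a sketch; the `aids` counts are an assumption (the data frame is not shown in the slides) that reproduces the mean and variance reported there.

```r
# Assumption: Belgian AIDS counts for 1981 to 1993.
aids <- data.frame(
  year  = 1981:1993,
  cases = c(12, 14, 33, 50, 67, 74, 123, 141, 165, 204, 253, 246, 240)
)
g  <- glm(cases ~ year, data = aids, family = poisson)
g2 <- glm(cases ~ year + I(year^2), data = aids, family = poisson)

c(linear = deviance(g), quadratic = deviance(g2))
AIC(g, g2)
anova(g, g2, test = "Chisq")   # likelihood-ratio test for the added term
```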

[Figure: AIDS cases in Belgium; cases (0 to 250) versus year]


Quadratic fit - residuals

[Figure: residual versus year in three panels: standard, pearson, deviance]


Bayesian Poisson Regression Model

## model{
##   # Likelihood
##   for(i in 1:length(Y)){
##     Y[i] ~ dpois(lambda[i])
##     log(lambda[i]) <- beta[1] + beta[2]*X[i]
##
##     # In-sample prediction
##     Y_hat[i] ~ dpois(lambda[i])
##   }
##
##   # Prior for beta
##   for(j in 1:2){
##     beta[j] ~ dnorm(0,1/100)
##   }
## }
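When reading the prior, note that JAGS parametrizes `dnorm` by mean and precision, whereas R's `dnorm` takes a standard deviation. A small sketch of the conversion (the names `tau` and `sd_prior` are illustrative):

```r
tau      <- 1 / 100        # precision used in the JAGS model
sd_prior <- 1 / sqrt(tau)  # equivalent standard deviation
sd_prior                   # 10

# Central 95% prior interval for each beta[j] on R's sd scale
qnorm(c(0.025, 0.975), mean = 0, sd = sd_prior)
```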


Bayesian Model fit

library(rjags)

m = jags.model(
  textConnection(poisson_model1), quiet = TRUE,
  data = list(Y=aids$cases, X=aids$year)
)
# Burn-in, then draw posterior samples
update(m, n.iter=1000, progress.bar="none")
samp = coda.samples(
  m, variable.names=c("beta","lambda","Y_hat"),
  n.iter=5000, progress.bar="none"
)


MCMC Diagnostics

[Figure: trace and density plots for beta[1] and beta[2], iterations 2000 to 7000 (N = 5000). The beta[1] trace covers roughly −5 to −1; the beta[2] density covers roughly 0.002 to 0.005.]


Model fit?

[Figure: AIDS cases in Belgium; cases (0 to 300) versus year]


What went wrong?

summary(g)
##
## Call:
## glm(formula = cases ~ year, family = poisson, data = aids)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -4.6784  -1.5013  -0.2636   2.1760   2.7306
##
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.971e+02  1.546e+01  -25.68   <2e-16 ***
## year         2.021e-01  7.771e-03   26.01   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
##     Null deviance: 872.206  on 12  degrees of freedom
## Residual deviance:  80.686  on 11  degrees of freedom
## AIC: 166.37
##
## Number of Fisher Scoring iterations: 4


summary(glm(cases~I(year-1981), data=aids, family=poisson))
##
## Call:
## glm(formula = cases ~ I(year - 1981), family = poisson, data = aids)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -4.6784  -1.5013  -0.2636   2.1760   2.7306
##
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)
## (Intercept)    3.342711   0.070920   47.13   <2e-16 ***
## I(year - 1981) 0.202121   0.007771   26.01   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
##     Null deviance: 872.206  on 12  degrees of freedom
## Residual deviance:  80.686  on 11  degrees of freedom
## AIC: 166.37
##
## Number of Fisher Scoring iterations: 4
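One plausible reading of the two summaries is that the raw year scale pushes the intercept far outside the prior and makes intercept and slope almost perfectly correlated, both of which make sampling hard. The sketch below checks this with the frequentist fits; the `aids` counts and the name `g_shift` are assumptions, and the prior standard deviation of 10 comes from `dnorm(0,1/100)` in the JAGS model.

```r
# Assumption: Belgian AIDS counts for 1981 to 1993.
aids <- data.frame(
  year  = 1981:1993,
  cases = c(12, 14, 33, 50, 67, 74, 123, 141, 165, 204, 253, 246, 240)
)
g       <- glm(cases ~ year, data = aids, family = poisson)
g_shift <- glm(cases ~ I(year - 1981), data = aids, family = poisson)

# Same slope and same fitted values; only the intercept moves
rbind(raw = coef(g), shifted = coef(g_shift))
all.equal(fitted(g), fitted(g_shift), tolerance = 1e-6)

# Correlation between the intercept and slope estimates
c(raw     = cov2cor(vcov(g))[1, 2],
  shifted = cov2cor(vcov(g_shift))[1, 2])

# Intercepts measured in prior standard deviations (sd = 10)
abs(c(raw = coef(g)[[1]], shifted = coef(g_shift)[[1]])) / 10
```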


Revising the model

## model{
##   # Likelihood
##   for(i in 1:length(Y)){
##     Y[i] ~ dpois(lambda[i])
##     log(lambda[i]) <- beta[1] + beta[2]*(X[i] - 1981)
##
##     # In-sample prediction
##     Y_hat[i] ~ dpois(lambda[i])
##   }
##
##   # Prior for beta
##   for(j in 1:2){
##     beta[j] ~ dnorm(0,1/100)
##   }
## }


MCMC Diagnostics

[Figure: trace and density plots for beta[1] and beta[2], iterations 2000 to 7000 (N = 5000). The beta[1] values cover roughly 3.1 to 3.6; the beta[2] values cover roughly 0.18 to 0.23.]


Model fit

[Figure: AIDS cases in Belgium (Linear Poisson Model); cases (0 to 300) versus year]


Bayesian Residual Plots

[Figure: post_mean of the residuals versus year in three panels: standard, pearson, deviance]


Model fit

[Figure: AIDS cases in Belgium (Quadratic Poisson Model); cases (0 to 300) versus year]


Bayesian Residual Plots

[Figure: post_mean of the residuals versus year in three panels: standard, pearson, deviance]


Negative Binomial Regression


Overdispersion

One of the properties of the Poisson distribution is that if X ∼ Pois(λ) then E(X) = Var(X) = λ.

If we are constructing a model where we claim that our response variable Y follows a Poisson distribution then we are making a very strong assumption which has implications for both inference and prediction.

mean(aids$cases)
## [1] 124.7692
var(aids$cases)
## [1] 8124.526
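The raw mean and variance pool all years together, so part of that gap reflects the trend and not extra noise. A complementary check, sketched here under the assumption that the counts below match the lecture's `aids` data, is the Pearson dispersion statistic computed from a fitted model; the helper `dispersion()` is illustrative and not from the slides.

```r
# Assumption: Belgian AIDS counts for 1981 to 1993.
aids <- data.frame(
  year  = 1981:1993,
  cases = c(12, 14, 33, 50, 67, 74, 123, 141, 165, 204, 253, 246, 240)
)
g  <- glm(cases ~ year, data = aids, family = poisson)
g2 <- glm(cases ~ year + I(year^2), data = aids, family = poisson)

# Sum of squared Pearson residuals over residual degrees of freedom.
# Values well above 1 suggest more variance than the Poisson model allows.
dispersion <- function(fit) {
  sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
}
c(linear = dispersion(g), quadratic = dispersion(g2))
```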


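Negative binomial regression

If we define

$$ Y_i \mid Z_i \sim \text{Pois}(\lambda_i Z_i) $$

$$ Z_i \sim \text{Gamma}(\theta_i, \theta_i) $$

then the marginal distribution of $Y_i$ will be negative binomial with

$$ E(Y_i) = \lambda_i $$

$$ \text{Var}(Y_i) = \lambda_i + \lambda_i^2/\theta_i $$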
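The mean and variance can be checked by simulating the mixture. The values of `lambda` and `theta` are assumed toy values, and a single `theta` is used for all observations.

```r
set.seed(42)
n      <- 200000
lambda <- 10    # assumed mean
theta  <- 2     # assumed Gamma shape and rate

z <- rgamma(n, shape = theta, rate = theta)   # E(Z) = 1, Var(Z) = 1/theta
y <- rpois(n, lambda * z)                     # Y | Z ~ Pois(lambda * Z)

c(mean = mean(y), var = var(y))                     # simulated
c(mean = lambda, var = lambda + lambda^2 / theta)   # theory: 10 and 60

# The same marginal distribution drawn directly
y_nb <- rnbinom(n, size = theta, mu = lambda)
c(mean = mean(y_nb), var = var(y_nb))
```

As `theta` grows, `Z` concentrates at 1 and the variance shrinks back toward the Poisson value `lambda`.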


Model

## model{
##   for(i in 1:length(Y))
##   {
##     Z[i] ~ dgamma(theta, theta)
##     log(lambda[i]) <- beta[1] + beta[2]*(X[i] - 1981) + beta[3]*(X[i] - 1981)^2
##
##     lambda_Z[i] <- Z[i]*lambda[i]
##
##     Y[i] ~ dpois(lambda_Z[i])
##     Y_hat[i] ~ dpois(lambda_Z[i])
##   }
##
##   for(j in 1:3){
##     beta[j] ~ dnorm(0, 1/100)
##   }
##
##   log_theta ~ dnorm(0, 1/100)
##   theta <- exp(log_theta)
## }


Negative Binomial Model fit

[Figure: AIDS cases in Belgium (Quadratic Poisson Model); cases (0 to 300) versus year]

[Figure: AIDS cases in Belgium (Quadratic Negative Binomial Model); cases (0 to 300) versus year]


Bayesian Residual Plots

[Figure: post_mean of the residuals versus year in two panels: standard, pearson]
