

3. Generalized linear models

Use models to investigate the relationships (associations) among categorical and continuous variables.

Reason for using models (p. 65):

- Helps describe the pattern of association and interaction
- Inferences for model parameters help determine which explanatory variables affect the response while controlling for other variables
- Estimate model parameters to determine the strength and importance of effects
- Models can more easily handle complicated problems

A general class of models is “generalized linear models” (GLMs). You have already studied a special case of GLMs, linear models, in previous regression and ANOVA courses.

In addition to Agresti (2002), other references on GLMs include:

McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models. 2nd edition. London: Chapman and Hall.

McCulloch, C. and Searle, S. R. (2000). Generalized, Linear, and Mixed Models. New York: Wiley.

2010 Christopher R. Bilder

3.1


3.1 Components of a generalized linear model

Review of regression models

Yi = β0 + β1xi1 + β2xi2 + … + βkxik + εi

where εi ~ independent N(0, σ²).

Note that

E(Yi) = β0 + β1xi1 + β2xi2 + … + βkxik

E(Yi) is what one would expect Yi to be on average for a set of xi1, xi2, …, xik values.

One of the important things to realize here is that Y has a normal distribution. What if this is not true? Suppose Y is a nominal categorical variable. Suppose Y has a Poisson distribution. There are many other possibilities. GLMs allow us to generalize the model structure!

Three components of a GLM

Random

For a sample of size n, denote the observations of the response variable Y as Y1, Y2, …, Yn. Assume Y1, Y2, …, Yn are obtained independently here. We will specifically be interested in E(Y) = μ.


The distribution chosen for Y defines the “random” component of a GLM.

For example, suppose the Y1, Y2, …, Yn are responses from a Bernoulli random variable Y. Thus, the Y1, Y2, …, Yn are all 0 or 1.

Suppose Y1, Y2, …, Yn are responses from a Binomial random variable Y. Thus, the Y1, Y2, …, Yn are all nonnegative integers and denote the number of successes out of a certain number of trials.

Suppose Y1, Y2, …, Yn are responses from a Poisson random variable Y. The Y1, Y2, …, Yn are all non-negative integers and could denote cell counts in a contingency table.

In regression and ANOVA, Y1, Y2, …, Yn are responses from a normal random variable Y.

Systematic

This component specifies the explanatory variables:

α + β1x1 + β2x2 + … + βkxk

Notice that this is a “linear” combination of the explanatory variables. This is often called the “linear predictor”. Note


that the x’s above could be a transformation of an original explanatory variable(s) – such as a quadratic or interactions.

Link

This component “links” the random and systematic component. In other words, this shows how the mean of the distribution for Y is related to the explanatory variables.

Let g(μ) be a function of E(Y) = μ. This is the link function. Specifically, the GLM is

g(μ) = α + β1x1 + β2x2 + … + βkxk

Link functions:

Identity – g(μ) = μ

μ = E(Y) = α + β1x1 + β2x2 + … + βkxk

This is used for regression and ANOVA models!

Log – g(μ) = log(μ)

log(μ) = α + β1x1 + β2x2 + … + βkxk

μ = exp(α + β1x1 + β2x2 + … + βkxk)


The log link is used for “loglinear” models in Chapter 7. Most often, Y is assumed to have a Poisson distribution. Notice that all values of μ will be positive. This is why the link is used when modeling counts in a contingency table!

Logit – g(μ) = log[μ/(1 − μ)] = logit(μ)

logit(π) = α + β1x1 + β2x2 + … + βkxk

π = exp(α + β1x1 + β2x2 + … + βkxk) / [1 + exp(α + β1x1 + β2x2 + … + βkxk)]

The logit link is used for “logit” and logistic regression models in Chapters 4-5. Notice that all values of π will be between 0 and 1 (the numerator is always less than the denominator; try a few sample cases to see this). This is why the link is used when modeling probabilities! Remember that the mean of a Bernoulli random variable is π.
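To see concretely that the logit link and its inverse keep the mean between 0 and 1, here is a small self-contained sketch (in Python rather than the R used elsewhere in these notes; the linear-predictor values are arbitrary):

```python
import math

def logit(pi):
    # logit link: g(pi) = log(pi / (1 - pi))
    return math.log(pi / (1 - pi))

def inv_logit(eta):
    # inverse link: pi = exp(eta) / (1 + exp(eta)), always in (0, 1)
    return math.exp(eta) / (1 + math.exp(eta))

# any real-valued linear predictor maps into (0, 1)
for eta in (-10.0, -1.0, 0.0, 1.0, 10.0):
    pi = inv_logit(eta)
    assert 0 < pi < 1
    assert abs(logit(pi) - eta) < 1e-9  # the two functions are inverses
```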

Other links are possible such as the probit and complementary log-log. These will be discussed later.

Read Section 3.1.4 about the normal GLM on p.67-8.


3.2 Generalized linear models for binary data

Binary data means observations obtained from a random variable with only two possible values. Typically, these two possible values are called a “success” and a “failure”.

From Chapter 1:

Bernoulli distribution: P(Y=y) = πy(1 − π)1−y for y = 0 or 1

This is a special case of the binomial with n=1. The expected value of Y is E(Y) = π and the variance of Y is Var(Y) = π(1 − π).
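These two moments can be verified by direct enumeration over the support {0, 1}; a quick sketch in Python (π = 0.3 is an arbitrary choice):

```python
# Verify E(Y) = pi and Var(Y) = pi*(1 - pi) by enumerating the
# two support points y = 0, 1 (pi = 0.3 is an arbitrary choice).
pi = 0.3
pmf = {0: 1 - pi, 1: pi}            # P(Y = y) = pi^y * (1 - pi)^(1 - y)
mean = sum(y * p for y, p in pmf.items())
var = sum((y - mean) ** 2 * p for y, p in pmf.items())
assert abs(mean - pi) < 1e-12
assert abs(var - pi * (1 - pi)) < 1e-12
```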

The goal in this section is to find a GLM to model π at specific values of the explanatory variables (x’s).

For example, suppose you want to estimate the probability of success, π, of a field goal. The value of π will probably be different for a 20 yard field goal than for a 50 yard field goal. Thus, it would be of interest to incorporate the length of a field goal in a model for π.

Notation: Agresti (2007) uses π(x) to denote π here. The reason is that explanatory variables (x’s) will be used to try to predict the value of π. Thus, π “depends” on the level of the explanatory variables.


To simplify the upcoming discussion, only one explanatory variable, x, will be used to model the probability of success, π(x).

Linear probability model

Suppose an ordinary regression model was used to model the probability of success. Thus,

E(Y) = π(x) = α + βx

with Y = α + βx + ε and ε ~ N(0, σ²). This is called a linear probability model because the probability of success changes in a linear manner.

Problems with this model:

o Violates the distributional assumptions for Y. Y is Bernoulli, not normal.
o Probabilities can be less than 0 or greater than 1!
o Non-constant variance: Var(Y) = π(x)[1 − π(x)], so the variance changes as a function of x.

Therefore, do not use this model!
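A quick illustration of the second problem, with made-up coefficients chosen only to show the defect (Python):

```python
# A linear probability model pi(x) = alpha + beta*x can leave [0, 1].
# Coefficients here are made up purely for illustration.
alpha, beta = 0.9, 0.05
probs = [alpha + beta * x for x in range(-25, 6, 5)]
# Some "probabilities" are negative, some exceed 1:
assert min(probs) < 0
assert max(probs) > 1
```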

Logistic regression model

A great introductory reference on logistic regression is


Hosmer, D. W. and Lemeshow, S. (2000). Applied Logistic Regression, 2nd edition. New York: Wiley.

Many STAT 870 books also will include a chapter on logistic regression. For example, see Chapter 14 of Kutner, Nachtsheim, and Neter (2004).

The model is

log[π(x)/(1 − π(x))] = logit[π(x)] = α + βx

The random component is Bernoulli. The logit transformation is the link function. The model can be equivalently written as

π(x) = exp(α + βx) / [1 + exp(α + βx)]

What does a plot of π(x) vs. x look like?

Example: Plot of π(x) vs. x (pi_plot.R)

When there is only one explanatory variable with α = 1 and β = 0.5, a plot of π(x) vs. x looks like the following:

2010 Christopher R. Bilder

3.8

Page 9: 4€¦  · Web viewUse models to investigate the relationships (associations) among categorical and continuous variables. Reason for using models (p. 65): Helps describe the pattern

[Figure: plot of π(x) = e^(α+βx)/(1+e^(α+βx)) vs. x for α = 1, β = 0.5; x from −15 to 15, π(x) from 0.0 to 1.0]

When α = 1 and β = −0.5, the plot of π(x) vs. x looks like the following:

[Figure: plot of π(x) vs. x for α = 1, β = −0.5; x from −15 to 15, π(x) from 0.0 to 1.0]

R code:

alpha<-1
beta1<-0.5
par(pty="s")
curve(expr = exp(alpha+beta1*x)/(1+exp(alpha+beta1*x)),
  from = -15, to = 15, col = "red",
  main = expression(pi(x) == frac(e^{alpha+beta*x}, 1+e^{alpha+beta*x})),
  xlab = "x", ylab = expression(pi(x)),
  panel.first = grid(nx = NULL, ny = NULL, col = "gray", lty = "dotted"))
#See help(plotmath) for more on the expression function and see demo(plotmath)

Notes:

- When β > 0, there is a positive relationship between x and π(x). When β < 0, there is a negative relationship between x and π(x).
- The shape of the function is similar to an “s”.
- Notice the symmetric shape about π(x) = 0.5.
- 0 < π(x) < 1

Questions:

- What happens to the β = 0.5 plot when α is increased?
- What happens to the β = 0.5 plot when β is decreased to be close to 0?
- Suppose a plot of logit[π(x)] vs. x was made. What would the plot look like?

Parameter estimation

Suppose there is a random sample of size n providing (y1, x1), (y2, x2), …, (yn, xn) where the yi’s are 0’s or 1’s. The probability of observing a 1 for yi is denoted by π(xi).

The logistic regression model is


logit[π(xi)] = α + βxi for i = 1,…,n

This is the assumed relationship between the xi and π(xi). The model can be rewritten as

π(xi) = exp(α + βxi) / [1 + exp(α + βxi)].

Parameter estimates can be found from maximum likelihood estimation – see Chapter 1’s discussion.

Then the log likelihood function is

log L(α, β) = Σi {yi log[π(xi)] + (1 − yi) log[1 − π(xi)]}

where the sum is over i = 1, …, n. Since π(xi) = exp(α + βxi) / [1 + exp(α + βxi)], this implies

log L(α, β) = Σi {yi(α + βxi) − log[1 + exp(α + βxi)]}


The maximum likelihood estimates of α and β are the values which maximize the above quantity. Since these estimates can only be found using numerical methods (except in special cases), many software packages find the parameter estimates using iteratively reweighted least squares, which yields the maximum likelihood estimates. See p. 88 of Agresti (2007) and p. 143-149 of Agresti (2002) for more information. The R function glm() finds the parameter estimates this way (through its workhorse function glm.fit()).

By using the model, the complexity of estimating π has been reduced from estimating n different parameters (one π(xi) for each i = 1,…,n) to only 2: α and β!
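The maximization itself can be sketched in a few lines. The following Python toy (not the R glm() routine, and with a tiny made-up data set) maximizes the simplified log likelihood above by a crude grid search; real software uses iteratively reweighted least squares instead:

```python
import math

# Tiny made-up (y, x) data set, purely for illustration.
data = [(1, 20), (1, 25), (0, 30), (1, 35), (0, 45), (0, 50)]

def log_lik(alpha, beta):
    # log L = sum_i [ y_i*(alpha + beta*x_i) - log(1 + exp(alpha + beta*x_i)) ]
    return sum(y * (alpha + beta * x) - math.log(1 + math.exp(alpha + beta * x))
               for y, x in data)

# Crude grid search for the maximizing (alpha, beta).
best = max(((a / 10, b / 100) for a in range(-100, 101) for b in range(-50, 51)),
           key=lambda p: log_lik(*p))
alpha_hat, beta_hat = best
assert beta_hat < 0          # success probability falls with distance here
assert log_lik(alpha_hat, beta_hat) >= log_lik(0.0, 0.0)
```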

Example: Placekicking (placekick_ch3.R, place.s.csv)

See Bilder and Loughin (Chance, 1998) and the video!

The purpose of this example is to estimate the probability of success for a placekick in football. The place.s.csv data file contains a sample of 1,425


placekicks attempted during the 1995 National Football League season. Below is a brief description of the variables in the data set:

week = week of the season
dist = distance of the placekick in yards
change = binary variable denoting lead-change placekicks (1) vs. non-lead-change placekicks (0)
elap30 = continuous variable denoting the number of minutes left in a half, with overtime placekicks assigned a value of 0
pat1 = binary variable for whether the placekick is a point after touchdown (1) or a field goal (0)
type1 = binary variable for placekicks in a dome (0) or outdoors (1)
field1 = binary variable for placekicks on grass (1) or artificial turf (0)
good1 = binary variable for placekicks which are successes (1) or failures (0)
wind = binary variable for placekicks attempted in “windy” conditions (1) at kickoff versus non-windy conditions (0), using a 15 mph cutoff for non-windy

The data was actually first stored in an Excel file. While one can use the xlsReadWrite or RODBC packages to read in an Excel file (see R introduction lecture), I used a different method to read in the file. First, I re-saved the Excel file as a .csv format. To do this, select FILE > SAVE AS in Excel. Then select the .csv format in the


SAVE AS TYPE box. Choose a file name and then select SAVE.

This creates an ASCII text file which has commas separating each variable.

In order to get the data into R, I used the read.table() function as shown below:


> place.s<-read.table(file = "C:\\chris\\UNL\\STAT875\\chapter3_new\\place.s.csv",
    header = TRUE, sep = ",")
> head(place.s)
  week dist change  elap30 pat1 type1 field1 good1 wind
1    1   21      1 24.7167    0     1      1     1    0
2    1   21      0 15.8500    0     1      1     1    0
3    1   20      0  0.4500    1     1      1     1    0
4    1   28      0 13.5500    0     1      1     1    0
5    1   20      0 21.8667    1     0      0     1    0
6    1   25      0 17.6833    0     0      0     1    0

For now, only distance (dist) is going to be used to predict the probability of a successful placekick. The good1 variable contains the Bernoulli observations denoting the success or failure of a placekick.

The logistic regression model of interest is

log[π(x)/(1 − π(x))] = logit[π(x)] = α + βx = α + β(Distance)

where x = distance of the placekick, π(x) = E(Y), and Y = 1 for success or 0 for failure.

This particular GLM is used since the response variable (good1) is binary. To find the estimated model in R, the glm() function is used. Below is the code.

> mod.fit <- glm(formula = good1 ~ dist, data = place.s, family = binomial(link = logit), na.action = na.exclude, control = list(epsilon = 0.0001, maxit = 50, trace = T))

Deviance = 836.7715 Iterations - 1
Deviance = 781.1072 Iterations - 2
Deviance = 775.8357 Iterations - 3


Deviance = 775.7451 Iterations - 4
Deviance = 775.745 Iterations - 5

> names(mod.fit)
 [1] "coefficients" "residuals" "fitted.values" "effects" "R" "rank" "qr" "family" "linear.predictors" "deviance" "aic" "null.deviance" "iter" "weights" "prior.weights" "df.residual" "df.null" "y" "converged" "boundary" "model" "call" "formula" "terms" "data" "offset" "control" "method" "contrasts" "xlevels"

> mod.fit$coefficients
(Intercept)        dist
   5.812045  -0.1150259

> mod.fit

Call: glm(formula = good1 ~ dist, family = binomial(link = logit), data = place.s, na.action = na.exclude, control = list(epsilon = 1e-04, maxit = 50, trace = T))

Coefficients:
(Intercept)         dist
     5.8121      -0.1150

Degrees of Freedom: 1424 Total (i.e. Null);  1423 Residual
Null Deviance:     1013
Residual Deviance: 775.7        AIC: 779.7

> summary(mod.fit)

Call:
glm(formula = good1 ~ dist, family = binomial(link = logit), data = place.s, na.action = na.exclude, control = list(epsilon = 1e-04, maxit = 50, trace = T))

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-2.7441  0.2425  0.2425  0.3801  1.6091


Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  5.812079   0.326158   17.82   <2e-16 ***
dist        -0.115027   0.008337  -13.80   <2e-16 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1013.43  on 1424  degrees of freedom
Residual deviance:  775.75  on 1423  degrees of freedom
AIC: 779.75

Number of Fisher Scoring iterations: 5

There are many different things that need to be discussed about the code and output. Only a few of them will be discussed here. More will be discussed later in this chapter and in Chapter 5.

Notice the syntax used with the glm() function.

The names(mod.fit) call shows the different components of the mod.fit object.

The estimated logistic regression model is

logit[π̂(x)] = 5.8121 − 0.1150x

What happens to the probability of success as the distance increases?

The estimated probability of success for a particular distance can be found from

π̂(x) = exp(5.8121 − 0.1150x) / [1 + exp(5.8121 − 0.1150x)]


For example, the estimated probability of success for a 20 yard placekick is

π̂(20) = exp(5.8121 − 0.1150×20) / [1 + exp(5.8121 − 0.1150×20)] = 0.9710

The estimated probability of success for a 50 yard placekick is

π̂(50) = exp(5.8121 − 0.1150×50) / [1 + exp(5.8121 − 0.1150×50)] = 0.5152

> #Estimated probability of success for a 20 yard field goal
> lin.pred<-mod.fit$coefficients[1]+mod.fit$coefficients[2]*20
> exp(lin.pred)/(1 + exp(lin.pred))
0.971014

> #Estimated probability of success for a 50 yard field goal
> lin.pred<-mod.fit$coefficients[1]+mod.fit$coefficients[2]*50
> exp(lin.pred)/(1 + exp(lin.pred))
0.5151829
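As a cross-check of these two calculations outside of R, a Python sketch using the coefficient estimates reported above:

```python
import math

# Reproduce the fitted probabilities using the estimated coefficients
# reported by glm() above.
alpha_hat, beta_hat = 5.812045, -0.1150259

def pi_hat(dist):
    eta = alpha_hat + beta_hat * dist
    return math.exp(eta) / (1 + math.exp(eta))

assert abs(pi_hat(20) - 0.971014) < 1e-5
assert abs(pi_hat(50) - 0.515183) < 1e-5
```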

The z value in the output is a test statistic for whether the corresponding parameter is 0 or not. This test statistic can be compared to a standard normal distribution. Is distance important for predicting the probability of success of a placekick?

A simple plot of the “fitted values” versus distance is:


#Simple plot
plot(x = place.s$dist, y = mod.fit$fitted.values,
  xlab = "Distance (yards)", ylab = "Estimated probability",
  main = "Estimated probability of success of a placekick")

Note that this plot would not be appropriate to hand in for a project. Much better plots will be shown soon.

There are often many observations for the same distance. For example, there are 20 placekicks from 21 yards and 19 of them are successful. This information can be found using the table() or xtabs() functions.

> #Summary of the placekicks by distance
> dist.good <- table(place.s$dist, place.s$good1)
> dist.good
integer matrix: 43 rows, 2 columns.
     0   1


18   1   2
19   0   7
20  13 776
21   1  19
22   2  12
23   1  26
24   0   7
25   1  12

EDITED

55   1   2
56   0   1
59   0   1
62   1   0
63   1   0
66   1   0

Another way to put the data into this format is to use the gsummary() function.

> library(nlme)
> place.small<-data.frame(good = place.s$good1, dist = place.s$dist)
> place.sum<-gsummary(object = place.small, FUN = sum, groups = place.small$dist)
> place.length<-gsummary(object = place.small, FUN = length, groups = place.small$dist)
> prop<-place.sum$good/place.length$good
> place.pattern<-data.frame(sum.y = place.sum$good, n = place.length$good, prop = prop, distance = place.sum$dist)
> head(place.pattern)
  sum.y   n      prop distance
1     2   3 0.6666667       18
2     7   7 1.0000000       19
3   776 789 0.9835234       20
4    19  20 0.9500000       21
5    12  14 0.8571429       22
6    26  27 0.9629630       23
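The same group-and-summarize idea can be sketched without R; here is a Python version on a small made-up subset of placekicks:

```python
from collections import defaultdict

# Group binary outcomes by distance and compute observed proportions,
# mimicking what gsummary() does above; the data are a made-up subset.
kicks = [(18, 1), (18, 1), (18, 0), (19, 1), (19, 1), (20, 1), (20, 0)]

totals = defaultdict(lambda: [0, 0])     # dist -> [successes, attempts]
for dist, good in kicks:
    totals[dist][0] += good
    totals[dist][1] += 1

prop = {d: s / n for d, (s, n) in totals.items()}
assert abs(prop[18] - 2 / 3) < 1e-12
assert prop[19] == 1.0
assert prop[20] == 0.5
```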

(Note: tapply() is another way to do this type of summary - see Chapter 3 of my STAT 950 (bootstrap) notes.)

Below is a plot of the estimated probability of success using the estimated logistic regression model. The observed proportions of successes are the plotting points. For example, the point at 21 yards is 19/20 = 0.95. This type of plot can be used as a measure of how well the model fits the data.

What do you think about the fit of the model?

[Figure: estimated logistic regression curve with observed proportions as plotting points; x-axis: Distance (yards), 20 to 60; y-axis: Estimated probability, 0.0 to 1.0; title: "Estimated probability of success of a placekick with observed proportions"]

> #Find plot of the observed proportions
> plot(x = place.pattern$distance, y = place.pattern$prop,
    xlab = "Distance (yards)", ylab = "Estimated probability",
    main = "Estimated probability of success of a placekick \n with observed proportions",
    panel.first = grid(col = "gray", lty = "dotted"))
> curve(expr = exp(mod.fit$coefficients[1] + mod.fit$coefficients[2]*x) /
    (1+exp(mod.fit$coefficients[1] + mod.fit$coefficients[2]*x)),
    col = "red", add = TRUE)

#Quicker way to do curve() here - will learn about later
#curve(plogis(mod.fit$coefficients[1] + mod.fit$coefficients[2]*x), col = "red", add = TRUE)

You may think the model fits poorly at the larger distances. This is not necessarily true! The binary nature of the data can distort the perceived fit. At most of the larger distances, there are very few placekicks. For example, there was only one 59 yard placekick attempted and it was a success. Thus, the proportion of successful placekicks at this distance is 1/1 =1.

To help make a judgment about the fit of the model, I created the bubble plot below. A bubble plot is a scatter plot with the plotting point proportional to another variable. The other variable in this case is the number of placekicks at each distance. Notice how the extreme proportions are the placekicks at distances without many observations.

The circles = __ option specifies the third variable displayed in the plot, shown as the size of the plotting point.


[Figure: bubble plot of observed proportions with point size proportional to the number of placekicks at each distance; x-axis: Distance (yards), 10 to 60; y-axis: Estimated probability, 0.0 to 1.5; title: "Estimated probability of success of a placekick with observed proportions"]

#plots the plotting points
symbols(x = place.pattern$distance, y = place.pattern$prop,
  circles = sqrt(place.pattern$n), inches = 1,
  xlab = "Distance (yards)", ylab = "Estimated probability",
  xlim = c(10,65), ylim = c(0, 1.5),
  main = "Estimated probability of success of a placekick \n with observed proportions",
  panel.first = grid(col = "gray", lty = "dotted"))

#Puts the estimated logistic regression model on the plot
curve(expr = exp(mod.fit$coefficients[1] + mod.fit$coefficients[2]*x) /
  (1+exp(mod.fit$coefficients[1]+mod.fit$coefficients[2]*x)),
  col = "red", add = TRUE)

Questions:

Which placekicks does the largest bubble represent?

Suppose the plot looked like this (this plot was edited in PowerPoint; note the different scale):

[Figure: edited bubble plot of observed proportions with the logistic curve; x-axis: Distance (yards), 20 to 60; y-axis: Estimated probability, 0.0 to 1.2; title: "Estimated probability of success of a placekick with observed proportions"]

What do you think about the fit of the model?

Note:

The inches = __ option in the symbols() function controls the size of the largest circle. The default is 1” in height. You may need to change this to help make the plot more informative for a particular problem. Also, I used the sqrt() function here with the circles = ___ option since the disparity between the largest place.pattern$n value and the others is so large. Other functions could have been used as well. Examine what the plot looks like on your own without the sqrt() function to see how much it helped.


Alternative binary links

Many other link functions could be used to model binary data. These link functions use the “cumulative distribution function”, or CDF. Below is a formal definition.

Let X be a continuous random variable with probability density function f(x). An observed value of X is denoted by x. The cumulative distribution function of X is

F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(u) du

Note that u is substituted into the probability density function to avoid confusion with the upper limit of integration.

If X is a discrete random variable, the cumulative distribution function of X is

F(x) = P(X ≤ x) = Σ_{u ≤ x} P(X = u)

where the sum is over all values u ≤ x.

An informal definition is the cumulative distribution function “cumulates” probabilities as a function of x. See the Chapter 3 additional notes for examples of a CDF involving the binomial distribution and the uniform distribution.
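A sketch of the “cumulating” idea for a discrete random variable, using a Binomial(n = 2, π = 0.5) chosen only for illustration (Python):

```python
from math import comb

# CDF of a discrete random variable "cumulates" the pmf.
# Here: Y ~ Binomial(n = 2, pi = 0.5), an arbitrary illustrative choice.
n, pi = 2, 0.5

def pmf(y):
    return comb(n, y) * pi**y * (1 - pi)**(n - y)

def cdf(y):
    # F(y) = P(Y <= y) = sum of P(Y = u) over u <= y
    return sum(pmf(u) for u in range(0, y + 1))

assert abs(cdf(0) - 0.25) < 1e-12
assert abs(cdf(1) - 0.75) < 1e-12
assert abs(cdf(2) - 1.0) < 1e-12
```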

The reason why CDFs are used as link functions for binary data is because the CDF is always between 0 and 1.


Example: Logistic distribution (logistic_distribution.R)

Let X have a logistic probability distribution. The probability density function for X is

f(x) = (1/σ) e^{−(x−μ)/σ} / [1 + e^{−(x−μ)/σ}]²

for −∞ < x < ∞ and parameters −∞ < μ < ∞ and σ > 0. Note that E(X) = μ and Var(X) = σ²π²/3 (here π denotes the mathematical constant).

Below is a plot of the distribution for μ = −2 and σ = 2.

[Figure: logistic PDF with μ = −2 and σ = 2; x from −15 to 15, f(x) from 0.00 to 0.12]

mu<--2
sigma<-2
curve(expr = 1/sigma * exp(-(x-mu)/sigma) / (1+exp(-(x-mu)/sigma))^2,
  ylab = "f(x)", xlab = "x", from = -15, to = 15,
  main = expression(paste("Logistic PDF with ", mu==-2, " and ", sigma==2)),
  col = "red")
#Note that expr = dlogis(x, location=mu, scale=sigma) could also be used
abline(h = 0)

The cumulative distribution function can be found by finding P(X ≤ x):

F(x) = 1 / [1 + e^{−(x−μ)/σ}]

Below is a plot of the CDF for μ = −2 and σ = 2.

[Figure: logistic CDF with μ = −2 and σ = 2; x from −15 to 15, F(x) from 0.0 to 1.0]

curve(expr = 1/(1+exp(-(x-mu)/sigma)), ylab = "F(x)", xlab = "x",
  from = -15, to = 15, lwd = 2,
  main = expression(paste("Logistic CDF with ", mu==-2, " and ", sigma==2)),
  col = "red", panel.first = grid(col = "gray", lty = "dotted"))
#Note that expr = plogis(x, location=mu, scale=sigma) could also be used


Does this plot look familiar? See p. 3.8. This is the same function being plotted! Note that

F(x) = e^(α+βx) / [1 + e^(α+βx)]

where α = 1 and β = 1/2 (in general, α = −μ/σ and β = 1/σ). Then F(x) has exactly the form of π(x) in the logistic regression model. Also notice that log[F(x)/(1−F(x))] = α + βx. Therefore, the logistic cumulative distribution function is used for “logistic” regression! (Note: equivalently, the inverse CDF of the standard logistic distribution is the logit function, F⁻¹(π) = log[π/(1−π)].)
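A numeric check of this identity (Python; μ = −2 and σ = 2 as above, so α = 1 and β = 0.5):

```python
import math

# Logistic CDF with the mu = -2, sigma = 2 values used in the plot above.
mu, sigma = -2.0, 2.0

def F(x):
    return 1 / (1 + math.exp(-(x - mu) / sigma))

def pi(x, alpha=1.0, beta=0.5):
    return math.exp(alpha + beta * x) / (1 + math.exp(alpha + beta * x))

# F(x) matches pi(x) with alpha = -mu/sigma = 1 and beta = 1/sigma = 0.5,
# and the logit of F(x) recovers the linear predictor alpha + beta*x.
for x in (-5.0, 0.0, 3.0):
    assert abs(F(x) - pi(x)) < 1e-12
    assert abs(math.log(F(x) / (1 - F(x))) - (1.0 + 0.5 * x)) < 1e-9
```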

Example: Normal probability distribution

Let X have a normal probability distribution. The probability density function for X is

f(x) = [1/√(2πσ²)] exp[−(x−μ)²/(2σ²)]

for −∞ < x < ∞, with parameters −∞ < μ < ∞ and σ > 0.

The cumulative distribution function can be found by finding P(X ≤ x):

F(x) = ∫_{−∞}^{x} [1/√(2πσ²)] exp[−(u−μ)²/(2σ²)] du

This integral has no closed form and is evaluated numerically.

Suppose μ = 0 and σ² = 1. Then F(1.645) = 0.95, F(1.96) = 0.975, and F(2.576) = 0.995. Many textbooks will use Φ(·) to denote the CDF of a standard normal distribution. Thus, Φ(1.645) = 0.95.

In more familiar notation, Z1−α = Z(1−α) = Z(0.95) = 1.645 where α = 0.05. 1−α represents the area to the left of 1.645 (for this example) under the probability density function. Note that other books may use Zα where α is the area in the “right” tail of the probability density function.

CDFs are nice to use for link functions with binary data since the CDF is always between 0 and 1. Two other commonly used link functions based on CDFs are:

Probit – based on the CDF of the standard normal distribution; the name comes from probit being a shortened version of “probability unit” (Hubert, 1992).

Random component: Y ~ Bernoulli(π)


Systematic component: α + βx

Link function: probit transformation

π(x) = Φ(α + βx)

where Φ(·) is the CDF of a standard normal distribution.

Then Φ⁻¹[π(x)] = α + βx

Φ⁻¹[·] is often called the “probit” transformation and denoted by probit(·). It is the inverse of the standard normal CDF. Thus,

probit[π(x)] = α + βx (Note: similar to “logit”)

What does Φ⁻¹[·] or probit[·] represent? Here are a few examples: Φ⁻¹(0.95) = probit(0.95) = 1.645, Φ⁻¹(0.975) = 1.96, and Φ⁻¹(0.995) = 2.576.
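These probit values can be checked numerically; the standard normal CDF can be written with the error function, Φ(z) = (1/2)[1 + erf(z/√2)] (Python sketch; the α and β in the second half are arbitrary illustrative values):

```python
import math

def Phi(z):
    # standard normal CDF via the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# The probit examples from the text:
assert abs(Phi(1.645) - 0.95) < 1e-3
assert abs(Phi(1.96) - 0.975) < 1e-3
assert abs(Phi(2.576) - 0.995) < 1e-3

# A probit model pi(x) = Phi(alpha + beta*x) always yields values in [0, 1];
# alpha and beta are arbitrary illustrative values.
alpha, beta = 1.0, 0.5
for x in (-20.0, 0.0, 20.0):
    assert 0 <= Phi(alpha + beta * x) <= 1
```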

Compare the probit transformation to the logit transformation. Remember the main purpose is to get a value of the function between 0 and 1 in order to model the probability of success.

Complementary log-log – based on 1 - CDF of the Gumbel (extreme value) distribution


Random component: Y ~ Bernoulli(π)

Systematic component: α + βx

Link function: Complementary log-log transformation

The CDF of a Gumbel distribution is

F(x) = exp{−exp[−(x−μ)/σ]}

for parameters −∞ < μ < ∞ and σ > 0. Notice that 1 − F(x) is still between 0 and 1. Also, note that E(X) = μ + γσ where γ ≈ 0.577216 (Euler’s constant) and Var(X) = σ²π²/6.

Let β = −1/σ and α = μ/σ. Through the use of some algebra, the 1 − CDF becomes

1 − F(x) = 1 − exp[−exp(α + βx)]

Thus,

π(x) = 1 − exp[−exp(α + βx)]

Solving for the systematic component produces:

log{−log[1 − π(x)]} = α + βx

The “complementary” part of the name comes from 1 − F(x), instead of F(x), being used.
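A quick numeric check that the complementary log-log link inverts the relationship above (Python; α and β are arbitrary illustrative values):

```python
import math

# Complementary log-log link: pi(x) = 1 - exp(-exp(alpha + beta*x)),
# so log(-log(1 - pi(x))) = alpha + beta*x. Illustrative alpha, beta.
alpha, beta = 1.0, 0.5

def cloglog_pi(x):
    return 1 - math.exp(-math.exp(alpha + beta * x))

for x in (-6.0, 0.0, 2.0):
    p = cloglog_pi(x)
    assert 0 < p < 1
    # the link applied to pi(x) recovers the linear predictor
    assert abs(math.log(-math.log(1 - p)) - (alpha + beta * x)) < 1e-9
```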

Example: Compare the logistic, probit, and complementary log-log GLMs (pi_plot.R)


[Figure: π(x) vs. x for -15 ≤ x ≤ 15, comparing the logit, probit, and complementary log-log (cloglog) models; left panel for α = 1 and β = 0.5, right panel for α = 1 and β = -0.5.]


The R code used to create the data for the plots is below.

alpha<-1
beta<-0.5

par(pty = "s")
curve(expr = plogis(alpha+beta*x), from = -15, to = 15, col = "red",
  lwd = 2, lty = 1, main = expression(paste(pi(x), " vs. x for ",
  alpha, " = 1 and ", beta, " = 0.5")), xlab = "x",
  ylab = expression(pi(x)), panel.first = grid(nx = NULL, ny = NULL,
  col = "gray", lty = "dotted"))
curve(expr = pnorm(alpha+beta*x, mean = 0, sd = 1), from = -15,
  to = 15, col = "blue", add = TRUE, lty = 2, lwd = 2)
curve(expr = 1-exp(-exp(alpha+beta*x)), from = -15, to = 15,
  col = "green", add = TRUE, lty = 4, lwd = 2)

legend(locator(1), legend = c("Logit", "Probit", "Cloglog"),
  lty = c(1,2,4), lwd = c(2,2,2), col = c("red", "blue", "green"),
  bty = "n")

#There is a pgumbel(q, loc=0, scale=1, lower.tail = TRUE) function
#  in the evd and VGAM packages

Notes:

The logistic model corresponds to the model plotted on p. 4.9.

The logistic and probit curves intersect at π(x) = 0.5.

Notice the logistic and probit curves are both symmetric. This means that the curve for π(x) < 0.5 is the mirror image of the curve for π(x) > 0.5. The complementary log-log curve does not have this property.

When you fit these models to a data set, you should not expect all of the estimated α's and β's to be the same. Thus, these plots are a little misleading in some respects.


Which model should you use???

This is not an easy question to answer.

The logit link provides a convenient way to interpret the model through the use of odds and odds ratios. Notice that the logit transformation is the log of an odds! Because of this, the logit link is often used over the other two. Chapter 5 focuses on the logit link.

The logit and probit links provide models that are often not too different. See the upcoming examples.

I have not seen the complementary log-log link used often; however, this does not mean it is not used in practice.

One way to decide between the three link functions is to use all three and see which one gives the best "fit": which fits the data best graphically (like on p. 3.42), which has the smallest residuals in absolute value, and which best satisfies goodness-of-fit statistics.

Goodness-of-link tests can be used to help determine which link function to use. These tests usually embed the link functions within one family of functions. For example, Aranda-Ordaz (1981) incorporated the probit and logit transformations into one family of transformations and gives a hypothesis test to help choose between them. A short discussion of these tests is available on p. 301 and pp. 257-258 of Agresti (2002).


Example: Placekicking (placekick_ch3.R, place.s.csv)

Probit model:

> mod.fit.probit<-glm(formula = good1 ~ dist, data = place.s,
    family = binomial(link = probit), na.action = na.exclude,
    control = list(epsilon = 0.0001, maxit = 50, trace = T))
Deviance = 825.0748 Iterations - 1
Deviance = 776.0735 Iterations - 2
Deviance = 772.0135 Iterations - 3
Deviance = 771.9512 Iterations - 4

> summary(mod.fit.probit)

Call:
glm(formula = good1 ~ dist, family = binomial(link = probit),
    data = place.s, na.action = na.exclude, control = list(epsilon =
    1e-04, maxit = 50, trace = T))

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-2.8166  0.2275  0.2275  0.3914  1.5316

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  3.205985   0.155195   20.66   <2e-16 ***
dist        -0.062768   0.004284  -14.65   <2e-16 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1013.43  on 1424  degrees of freedom
Residual deviance:  771.95  on 1423  degrees of freedom
AIC: 775.95

> #Estimated probability of success for a 20 yard field goal


> lin.pred<-mod.fit.probit$coefficients[1] +
    mod.fit.probit$coefficients[2]*20
> pnorm(q = lin.pred, mean = 0, sd = 1)
0.9744488

> #Estimated probability of success for a 50 yard field goal
> lin.pred<-mod.fit.probit$coefficients[1] +
    mod.fit.probit$coefficients[2]*50
> pnorm(q = lin.pred, mean = 0, sd = 1)
0.526936

Complementary log-log model:

> mod.fit.cloglog<-glm(formula = good1 ~ dist, data = place.s,
    family = binomial(link = cloglog), na.action = na.exclude,
    control = list(epsilon = 0.0001, maxit = 50, trace = T))
Deviance = 836.9174 Iterations - 1
Deviance = 771.2283 Iterations - 2
Deviance = 769.4893 Iterations - 3
Deviance = 769.4776 Iterations - 4

> summary(mod.fit.cloglog)

Call:
glm(formula = good1 ~ dist, family = binomial(link = cloglog),
    data = place.s, na.action = na.exclude, control = list(epsilon =
    1e-04, maxit = 50, trace = T))

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-2.9052  0.2126  0.2126  0.4132  1.3705

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  2.379921   0.117955   20.18   <2e-16 ***
dist        -0.052226   0.003702  -14.11   <2e-16 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

(Dispersion parameter for binomial family taken to be 1)


    Null deviance: 1013.43  on 1424  degrees of freedom
Residual deviance:  769.48  on 1423  degrees of freedom
AIC: 773.48

Number of Fisher Scoring iterations: 4

> #Estimated probability of success for a 20 yard field goal
> lin.pred<-mod.fit.cloglog$coefficients[1] +
    mod.fit.cloglog$coefficients[2]*20
> 1-exp(-exp(lin.pred))
0.977664

> #Estimated probability of success for a 50 yard field goal
> lin.pred<-mod.fit.cloglog$coefficients[1] +
    mod.fit.cloglog$coefficients[2]*50
> 1-exp(-exp(lin.pred))
0.5477212

Notes:

Models:

  Model          Estimated model
  Logistic       logit(π̂) = 5.8121 – 0.1150x
  Probit         probit(π̂) = 3.2060 – 0.0628x
  Comp. log-log  log{-log[1 – π̂]} = 2.3799 – 0.0522x

  Model          Estimated model
  Logistic       π̂ = e^(5.8121 – 0.1150x)/[1 + e^(5.8121 – 0.1150x)]
  Probit         π̂ = Φ(3.2060 – 0.0628x)
  Comp. log-log  π̂ = 1 – exp[-exp(2.3799 – 0.0522x)]


Estimate probabilities:

Suppose you want to predict the estimated probability of success for a distance of 20 yards. For the probit model,

π̂ = Φ(3.2060 – 0.0628·20) = Φ(1.95) = 0.9744

For the complementary log-log model:

π̂ = 1 – exp[-exp(2.3799 – 0.0522·20)] = 0.9777

To summarize,

  Model          Distance    π̂
  Logistic          20     0.9710
  Probit            20     0.9744
  Comp. log-log     20     0.9777

  Model          Distance    π̂
  Logistic          50     0.5152
  Probit            50     0.5269
  Comp. log-log     50     0.5477

An easier way to find the estimated probabilities is to use the predict() function. Suppose the complementary log-log model is fit and the model fit summary information is stored in the mod.fit.cloglog object. Then the predict() function can be used the following way to predict the probability of success at x = 20:

> predict.data<-data.frame(dist = 20)
> predict(object = mod.fit.cloglog, newdata = predict.data,
    type = "response")
[1] 0.977664

The type = "response" option is used to tell R that you want to predict π̂. If you want to predict the linear predictor, use the type = "link" option.

> #Predict the linear predictor
> predict(object = mod.fit.cloglog, newdata = predict.data,
    type = "link")
[1] 1.335410

To predict for more than one distance, create a data set with extra rows:

> #Predict for 20 and 50 yards
> predict.data<-data.frame(dist = c(20, 50))
> save.pi.hat<-predict(object = mod.fit, newdata = predict.data,
    type = "response")
> data.frame(predict.data, pi.hat = round(save.pi.hat,4))
  dist pi.hat
1   20 0.9777
2   50 0.5477

Finally, one could also use the predict() function to find the standard error of π̂. This information can be used to find approximate (1-α)100% Wald confidence intervals for π. The actual formulas will be discussed in Chapter 4.


> #Prediction with C.I.s
> predict.data<-data.frame(dist = c(20, 50))
> alpha<-0.05

> save.pi.hat<-predict(object = mod.fit, newdata = predict.data,
    type = "response", se.fit = TRUE)
> lower<-save.pi.hat$fit-qnorm(1-alpha/2)*save.pi.hat$se.fit
> upper<-save.pi.hat$fit+qnorm(1-alpha/2)*save.pi.hat$se.fit

> data.frame(predict.data, pi.hat = round(save.pi.hat$fit,4),
    se = round(save.pi.hat$se.fit,4), lower = round(lower,4),
    upper = round(upper,4))
  dist pi.hat     se  lower  upper
1   20 0.9777 0.0046 0.9686 0.9867
2   50 0.5477 0.0303 0.4884 0.6070

Below is a plot of the estimated probabilities from all three of the models.


[Figure: "Estimated probability of success of a placekick with observed proportions": observed proportions plotted against distance (yards), with the estimated complementary log-log, logit, and probit curves overlaid; y-axis is the estimated probability from 0 to 1.]

R code:

par(pty = "m")  #plots over all of graph - not square
plot(x = place.pattern$distance, y = place.pattern$prop,
  xlab = "Distance (yards)", ylab = "Estimated probability",
  main = "Estimated probability of success of a placekick \n with
    observed proportions",
  panel.first = grid(col = "gray", lty = "dotted"))

curve(expr = plogis(mod.fit$coefficients[1] +
  mod.fit$coefficients[2]*x), col = "red", add = TRUE, lwd = 2,
  lty = 1)
curve(expr = pnorm(mod.fit.probit$coefficients[1] +
  mod.fit.probit$coefficients[2]*x), col = "blue", add = TRUE,
  lty = 2, lwd = 2)
curve(expr = 1-exp(-exp(mod.fit.cloglog$coefficients[1] +
  mod.fit.cloglog$coefficients[2]*x)), col = "green", add = TRUE,
  lty = 4, lwd = 2)
legend(locator(1), legend = c("Complementary log-log", "Logit",
  "Probit"), lty = c(4, 1, 2), bty = "n", col = c("green", "red",
  "blue"), cex = 0.75)

[Figure: bubble plot version of the previous plot, with circle size proportional to the sample size at each distance; x-axis 10 to 60 yards, y-axis 0 to 1.2, with the estimated complementary log-log, logit, and probit curves overlaid.]

R code:

# Bubble plot version with bubble size proportional to sample size
symbols(x = place.pattern$distance, y = place.pattern$prop,
  circles = sqrt(place.pattern$n), xlab = "Distance (yards)",
  ylab = "Estimated probability", xlim = c(10,65), ylim = c(0, 1.2),
  main = "Estimated probability of success of a placekick \n with
    observed proportions",
  panel.first = grid(lty = "dotted"))

curve(expr = plogis(mod.fit$coefficients[1] +
  mod.fit$coefficients[2]*x), col = "red", add = TRUE, lwd = 2,
  lty = 1)
curve(expr = pnorm(mod.fit.probit$coefficients[1] +
  mod.fit.probit$coefficients[2]*x), col = "blue", add = TRUE,
  lty = 2, lwd = 2)
curve(expr = 1-exp(-exp(mod.fit.cloglog$coefficients[1] +
  mod.fit.cloglog$coefficients[2]*x)), col = "green", add = TRUE,
  lty = 4, lwd = 2)
legend(locator(1), legend = c("Complementary log-log", "Logit",
  "Probit"), lty = c(4, 1, 2), lwd = c(2,2,2), bty = "n",
  col = c("green", "red", "blue"), cex = 0.75)


3.3 Generalized linear models for count data

Counts (for example, counts in a contingency table) of possible outcomes are non-negative integers. These are often modeled as Poisson random variables. Chapter 7 focuses on counts from a contingency table for multiple categorical variables. This section focuses on counts for a single categorical variable that do not necessarily appear in a contingency table.

Review:

Poisson distribution: P(Y = y) = e^(-μ)μ^y/y! for y = 0, 1, 2, …

where Y is a random variable, y denotes the possible outcomes of Y, and μ is a parameter.

E(Y) = μ and Var(Y) = μ – this can be too restrictive
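A quick simulation (using an arbitrary μ = 2.5) illustrates the mean-variance equality:

```r
# For Poisson data, the sample mean and sample variance should be
# close to each other, since both estimate mu. Arbitrary mu = 2.5:
set.seed(4212)
y <- rpois(n = 100000, lambda = 2.5)
mean(y)   # close to 2.5
var(y)    # also close to 2.5
```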

Poisson regression

To make the introduction easier, assume there is only one explanatory variable.

Random component: Y ~ Poisson(μ)

Systematic component: α + βx


Link function: log transformation

log[E(Y)] = log(μ) = α + βx

μ = e^(α + βx) = e^α·e^(βx) = e^α·(e^β)^x

Notice the effect that a change in x has on μ: each one-unit increase in x multiplies the mean by e^β. One could call μ here "μ(x)", similar to what was done with π(x) in the previous section.

Question: Why do you think the log link is preferred over the identity link for count data? (Hint: with the log link, you cannot get negative estimates of the mean.)
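A sketch of one answer, using made-up parameter values: with the log link the mean e^(α + βx) is positive for every x, while the identity link mean α + βx can go negative.

```r
# Compare the log link and identity link for a negative slope;
# alpha and beta are illustrative values, not from a fitted model.
alpha <- 1; beta <- -0.5
x <- 0:6
exp(alpha + beta*x)   # log link means: all positive
alpha + beta*x        # identity link means: negative once x > 2
```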

Examples: Possible Y and X variables

Y = # of credit cards you have
Y = # of arrests for a city per year
Y = # of airplane crashes per year
Y = # of cars stopped at the 33rd and Holdrege streets intersection

What variables could have an effect on Y? Suppose Y is # of credit cards:

X = income level, gender, where you live,…


Example: Horseshoe crabs and satellites (horseshoe.R, horseshoe.txt)

See the video! Also, please see the description on p. 75 of Agresti (2007). Page 76-77 shows the entire data set.


More on the crabs:
www.npr.org/templates/story/story.php?storyId=106489695
http://www.ceoe.udel.edu/horseshoecrab

For each ith female, assume the number of satellites, Yi, has a Poisson distribution with mean μi dependent on female shell width. We will model the expected number of satellites with the following model:

log(μi) = α + βxi

where xi is the width of the ith female crab.

> #Read in data
> crab<-read.table(file = "c:\\Chris\\UNL\\STAT875\\chapter4
    \\horseshoe.txt", header = FALSE, col.names = c("satellite",
    "width"))

Notice how data was read in


> mod.fit<-glm(formula = satellite ~ width, data = crab,
    family = poisson(link = log), na.action = na.exclude,
    control = list(epsilon = 0.0001, maxit = 50, trace = T))
Deviance = 759.6346 Iterations - 1
Deviance = 580.078 Iterations - 2
Deviance = 567.9793 Iterations - 3
Deviance = 567.8786 Iterations - 4
Deviance = 567.8786 Iterations - 5

> summary(mod.fit)

Call:
glm(formula = satellite ~ width, family = poisson(link = log),
    data = crab, na.action = na.exclude, control = list(epsilon =
    1e-04, maxit = 50, trace = T))

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.8526  -1.9884  -0.4933   1.0970   4.9221

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.30476    0.54222  -6.095 1.10e-09 ***
width        0.16405    0.01996   8.217  < 2e-16 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 632.79  on 172  degrees of freedom
Residual deviance: 567.88  on 171  degrees of freedom
AIC: 927.18

Number of Fisher Scoring iterations: 5

> #Predict for 23 and 30 widths
> predict.data<-data.frame(width = c(23, 30))
> alpha<-0.05
> save.mu.hat<-predict(object = mod.fit, newdata = predict.data,
    type = "response", se = TRUE)
> lower<-save.mu.hat$fit-qnorm(1-alpha/2)*save.mu.hat$se
> upper<-save.mu.hat$fit+qnorm(1-alpha/2)*save.mu.hat$se
> data.frame(predict.data, mu.hat = round(save.mu.hat$fit,4),
    lower = round(lower,4), upper = round(upper,4))
  width mu.hat  lower  upper
1    23 1.5972 1.3074 1.8871
2    30 5.0359 4.3101 5.7618

> #Plot of data and estimated model
> plot(x = crab$width, y = crab$satellite, xlab = "Width (cm)",
    ylab = "Number of satellites", main = "Horseshoe crab data set
    \n with poisson regression model fit", panel.first = grid(col =
    "gray", lty = "dotted"))
> curve(expr = exp(mod.fit$coefficients[1] +
    mod.fit$coefficients[2]*x), col = "red", add = TRUE, lty = 1)

> #This is part of Table 3.3 on p. 80 of Agresti (2007). The last
  # two "columns" are the number of cases and the number of
  # satellites. The first "column" is the group width mean
  # corresponding to the width categories given in Table 3.3. These
  # means are stated on p. 90 of Agresti (1996). In the 2007
  # edition, he did not state them. However, these can be found as
  # shown in my table3.3.R program.

> crab.tab3.3<-data.frame(width = c(22.69, 23.84, 24.77, 25.84,
    26.79, 27.74, 28.67, 30.41), cases = c(14, 14, 28, 39, 22, 24,
    18, 14), satell = c(14, 20, 67, 105, 63, 93, 71, 72))

> temp3<-matrix(data = temp2, nrow = 8, ncol = 3, byrow = T)
> crab.tab4.3<-data.frame(width = temp3[,1], cases = temp3[,2],
    satell = temp3[,3])

> #Average number of satellites per group
> mu.obs<-crab.tab4.3$satell/crab.tab4.3$cases

> points(x = crab.tab4.3$width, y = mu.obs, pch = 18,
    col = "darkgreen", cex = 2)

> legend(locator(1), legend = "Diamonds are group mean", cex = 0.75)


[Figure: "Horseshoe crab data set with poisson regression model fit": number of satellites vs. width (cm), with the fitted Poisson curve in red and dark green diamonds marking the group means; legend: "Diamonds are group mean".]

Notes:

First examine the plot of the data above, ignoring the estimated model plotted in red. The data show an upward trend: as the width increases, the number of satellites tends to increase. This is easier to see with the group means (the grouping of the data comes from Table 4.3 of Agresti (1996, p. 90)). Remember that the Poisson regression model is modeling the MEAN response!


The glm() function fits the Poisson regression model to the data. Notice the use of the family = poisson(link = log) option.

The estimated Poisson regression model is

log(μ̂) = -3.3048 + 0.1641x

where x = width and μ̂ is the estimated mean number of satellites. The model could also be written as:

μ̂ = e^(-3.3048 + 0.1641x)

What happens to the estimated mean number of satellites as the width increases?

The estimated mean number of satellites for a particular width can be found from the model. For example, the estimated mean number of satellites for a width of 23 is

μ̂ = e^(-3.3048 + 0.1641·23) = 1.5972

The estimated mean number of satellites for a width of 30 is 5.0359. See how the predict() function was used here.

The z value in the output gives a test for whether the corresponding parameter is 0 or not. This test statistic can be compared to a standard normal distribution. Is width important to predicting the mean number of satellites?


See the R code used to create the plot. I had difficulty creating a legend with the diamond plotting character.

It is very important to make this plot in order to determine whether the model works for the data!

Table3.3.R provides a general way to find tables like Table 3.3 on p. 80 in Agresti (2007). The program also provides a general way to find categories (not the same as those in Table 3.3). This program code can be incorporated into your own program for future projects!!!

Negative binomial regression

A limiting assumption for a Poisson distribution is that E(Y) = Var(Y) = μ. Sometimes, the variance of Y appears to be greater than μ for a data set. Evidence of this occurs in the horseshoe crab example. See Table 3.3 on p. 80, or the part of it produced below from my table3.3.R program.

> table3.3[,1:5]
  width.group number.cases number.sat mean.per.group var.per.group
1    22.69286           14         14       1.000000      2.769231
2    23.84286           14         20       1.428571      8.879121
3    24.77500           28         67       2.392857      6.543651
4    25.83846           39        105       2.692308     11.376518
5    26.79091           22         63       2.863636      6.885281
6    27.73750           24         93       3.875000      8.809783
7    28.66667           18         71       3.944444     16.879085
8    30.40714           14         72       5.142857      8.285714

If the Poisson assumptions were satisfied, we would expect the mean.per.group column to be approximately the same as the var.per.group column. Obviously, this is not happening here. Note that this is an "ad hoc" way to show the variance is larger than the mean (due to the artificial grouping of the data), but it still shows evidence of possible problems.

When the variance is larger than the mean, this is called overdispersion, and it is a violation of our model. Thus, inferences made using the model may be incorrect. What can you do when this occurs?

1. Find more explanatory variables that help explain the variability in the response variable! The additional variability could be due to not accounting for other explanatory variables. For example, perhaps crab weight plays an important role in estimating the mean number of satellites. Without accounting for weight and using width only, there can be more satellite variability than expected at individual widths. See Agresti (2007), p. 80-1, for a further explanation.

2. Page 151 of Agresti (2002) discusses quasi-Poisson regression models. These models do not assume a full parametric form for the model and can be estimated with the glm() function by using the family = quasipoisson(link = log) option. See the additional Chapter 3 notes for more information. Agresti (2007) does not discuss these models in Chapter 3 (only briefly on p. 280), so they will not be discussed here.


3. Poisson generalized linear mixed models, which are explained in Section 13.5 of Agresti (2002).

4. Agresti (2007) discusses negative binomial models, so these will be presented next. (Section 13.5.3 of Agresti (2002) compares these approaches, with Agresti favoring the negative binomial models.)

One way to write the negative binomial distribution is

P(Y = y) = [Γ(y+k)/(Γ(k)y!)] p^k (1-p)^y for y = 0, 1, …

This distribution occurs when one is interested in the probability of y failures before the kth success (see Casella and Berger (2002, p. 95) if you are interested in more detail). For us, there are two important aspects to this distribution. First, the values of Y are non-negative integers just like a Poisson random variable. Second, the distribution can be rewritten as

P(Y = y) = [Γ(y+k)/(Γ(k)y!)] (μ/(μ+k))^y (k/(μ+k))^k for y = 0, 1, …, and k > 0

where E(Y) = μ and Var(Y) = μ + μ²/k. Notice that this is very similar to what we had for a Poisson random variable, but now we have a larger variance for Y! The parameter k is a measure of the "over" dispersion. Note that Agresti (2007) officially defines D = 1/k as the "dispersion parameter". As 1/k goes to 0, the negative binomial distribution approaches the Poisson distribution. In fact, this distribution is the marginal distribution of Y when the usual Poisson parameter is itself assumed to have a gamma distribution: Y|λ ~ Poisson(λ) with λ ~ gamma such that E(λ) = μ and Var(λ) = μ²/k. As 1/k goes to 0, Var(λ) goes to 0, so λ has a degenerate distribution at μ and the Poisson model is recovered. More in-depth information on how this distribution comes about is available on p. 559-561 of Agresti (2002) if you are interested.

Example: Horseshoe crabs and satellites (horseshoe.R, horseshoe.txt)

The glm() function cannot fit this specific model, so we will need to use the glm.nb() function in the MASS package. This package comes with the standard installation of R, but you will still need to tell R that you want to use it.

> library(MASS)
> mod.fit.nb<-glm.nb(formula = satellite ~ width, data = crab,
    link = log)
> summary(mod.fit.nb)

Call:
glm.nb(formula = satellite ~ width, data = crab, link = log,
    init.theta = 0.904568080033865)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.7798  -1.4110  -0.2502   0.4770   2.0177

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.05251    1.17143  -3.459 0.000541 ***
width        0.19207    0.04406   4.360 1.30e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(0.9046) family taken to be 1)


    Null deviance: 213.05  on 172  degrees of freedom
Residual deviance: 195.81  on 171  degrees of freedom
AIC: 757.29

Number of Fisher Scoring iterations: 1

Correlation of Coefficients:
      (Intercept)
width -1.00

          Theta:  0.905
      Std. Err.:  0.161

2 x log-likelihood: -751.291

The estimated negative binomial regression model is

log(μ̂) = -4.0525 + 0.1921x

(compare to the Poisson fit, which had intercept -3.3048 and slope 0.1641), with k̂ = 0.905 and D̂ = 1/0.905 ≈ 1.1. Thus, the estimated variance of Y is now μ̂ + μ̂²/0.905 (remember that μ̂ depends on the value of x).

Where does this larger variance show up in the analysis? One place is in the standard error of β̂ (without going into the formula details). Looking in the usual spot in the output, we obtain a value of 0.04406. The corresponding value in the Poisson regression model was 0.01996. Why does this larger variance matter? For one, it changes the Wald test of β = 0.

How could one test if there was evidence of overdispersion?
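One common approach, sketched below with simulated data rather than the crab data, is a likelihood ratio test comparing the Poisson fit (1/k = 0) to the negative binomial fit (1/k > 0); the parameter values and seed are arbitrary.

```r
library(MASS)  # glm.nb() and rnegbin()

# Simulate overdispersed counts with mean exp(1 + 0.2*x) and k = 1
set.seed(8871)
x <- runif(n = 200, min = 0, max = 5)
y <- rnegbin(n = 200, mu = exp(1 + 0.2*x), theta = 1)

fit.pois <- glm(y ~ x, family = poisson(link = log))
fit.nb   <- glm.nb(y ~ x, link = log)

# LRT statistic for H0: 1/k = 0 vs. Ha: 1/k > 0. Because 1/k = 0
# lies on the boundary of the parameter space, halving the usual
# chi-square(1) p-value is a common adjustment.
lrt <- as.numeric(2*(logLik(fit.nb) - logLik(fit.pois)))
p.val <- pchisq(lrt, df = 1, lower.tail = FALSE)/2
c(lrt = lrt, p.value = p.val)
```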


Poisson regression for rate data

Rate data consist of the rate at which events occur over some time period or other baseline measure. Examples include: the number of times a computer crashes during a time period, the number of melanoma cases per city size, and the number of arrivals at an airport during a particular time period. The baseline need not be the same for each observation; for example, Omaha's Eppley Airfield may have 30 arrivals during a 60-minute period while OKC's airport has 25 arrivals during a 100-minute period.

The time period or baseline measure needs to be incorporated into the analysis. One way to do this is to model Y/t instead of just Y where Y is the number of events and t is the time period or baseline measure. Thus, the Poisson regression model becomes:

log(μ/t) = α + βx

where μ = E(Y). This expression can be simplified to

log(μ) – log(t) = α + βx, which gives log(μ) = α + βx + log(t).

log(t) is called an "offset". Notice the effect the offset has on μ:

μ = e^(α + βx + log(t)) = t·e^(α + βx)

Thus, t helps to adjust the "usual" mean e^(α + βx) by the time period or baseline measure.
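The algebra above can be checked numerically (the α, β, x, and t values below are made up):

```r
# mu = exp(alpha + beta*x + log(t)) equals t*exp(alpha + beta*x),
# so the offset scales the "usual" mean by the baseline t.
alpha <- -1; beta <- 0.1; x <- 25; t <- c(1, 3)
exp(alpha + beta*x + log(t))
t * exp(alpha + beta*x)        # identical values
```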


Example: Horseshoe crabs and satellites (horseshoe.R, horseshoe.txt)

This is not necessarily the best example where one would want to use Poisson regression for rate data, but it gives a nice illustration of the relationship between a Poisson model for rate data and “regular” data. Please see p. 83 of Agresti (2007) for another example where using rate data is more appropriate.

Suppose the data was given in the form of the number of satellites per distinct width. Let Y be the number of satellites for a distinct width. Let t be the number of female crabs observed for a distinct width. For example, there are t=3 crabs with a width of 22.9 cm and they have a total of Y=4+0+0 = 4 satellites.

Before, the data set looked like this:

  Crab ID  Satellites  Width
     1          8       28.3
     2          0       22.5
     3          9       26.0
     4          0       24.8

Now the data set looks like this:

  # of crabs (t)  Total satellites (Y)  Width
        1                  0             21.0
        1                  0             22.0
        3                  5             22.5
        3                  4             22.9

> library(nlme)  #gsummary function is located here
Loading required package: lattice

> sum.rate.data<-gsummary(object = crab, FUN = sum,
    groups = crab$width)
> length.rate.data<-gsummary(object = crab, FUN = length,
    groups = crab$width)

> rate.data<-data.frame(y = sum.rate.data$satellite,
    t = length.rate.data$satellite, width = length.rate.data$width)
> mod.fit.rate<-glm(formula = y ~ width + offset(log(t)),
    data = rate.data, family = poisson(link = log),
    na.action = na.exclude, control = list(epsilon = 0.0001,
    maxit = 50, trace = T))
Deviance = 211.7379 Iterations - 1
Deviance = 190.2969 Iterations - 2
Deviance = 190.0273 Iterations - 3
Deviance = 190.0272 Iterations - 4
Deviance = 286.3955 Iterations - 1
Deviance = 255.2993 Iterations - 2
Deviance = 254.9404 Iterations - 3
Deviance = 254.9403 Iterations - 4

> summary(mod.fit.rate)

Call:
glm(formula = y ~ width + offset(log(t)), family = poisson(link = log), data = rate.data, na.action = na.exclude, control = list(epsilon = 1e-04, maxit = 50, trace = T))

Deviance Residuals: Min 1Q Median 3Q Max


-3.8003 -1.4515 -0.3788 0.6619 4.7586

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.30476    0.54218  -6.095 1.09e-09 ***
width        0.16405    0.01996   8.217  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 254.94 on 65 degrees of freedom
Residual deviance: 190.03 on 64 degrees of freedom
AIC: 402.52

Number of Fisher Scoring iterations: 4

> #Plot of data with estimated mu's; notice the use of the panel.first option to put grid lines behind plotting points
> plot(x = crab$width, y = crab$satellite, xlab = "Width (cm)", ylab = "Number of satellites", panel.first = grid(col = "gray", lty = "dotted"), main = "Horseshoe crab data set \n with poisson regression model fit (rate data)")
> points(x = rate.data$width, y = mod.fit.rate$fitted.values, pch = 18, col = "darkgreen", cex = 1)
> legend(locator(1), legend = "Diamonds are predicted values", cex = 0.75)


[Figure: "Horseshoe crab data set with poisson regression model fit (rate data)": number of satellites vs. width (cm), with dark green diamonds marking the predicted values.]

Notes: The gsummary() function allows one to summarize a

data set by a grouping variable. This is similar to using a SAS procedure with a BY statement. In this case, I sum the satellites over the different crab widths. Also, I find the number of satellites per crab width. These are combined into the data.frame called rate.data. Note that the gsummary() function is in the nlme package so this package needs to be loaded first.

The glm() function is used to fit the Poisson regression model with an offset. Notice the parameter estimates


are the same as before! Below is the estimated model with the offset:

    log(μ̂) = log(t) − 3.3048 + 0.1640·width

where t = number of crabs per distinct width. Generally, the parameter estimates will be displayed

as being exactly the same. Through some statistical research that I have worked on, I have found some situations where there are some differences.
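This equivalence can be checked on simulated data. The sketch below is not from the notes (the data and variable names are made up): fitting the individual-level Poisson model and fitting the aggregated rate data with a log(t) offset give the same parameter estimates, since the two likelihoods have the same maximizer.

```r
# Sketch on simulated data: individual-level Poisson fit vs. aggregated
# rate-data fit with a log(t) offset give the same parameter estimates.
set.seed(8881)
indiv <- data.frame(width = rep(seq(21, 34, by = 0.5), each = 5))
indiv$y <- rpois(n = nrow(indiv), lambda = exp(-3.3 + 0.16 * indiv$width))

fit.indiv <- glm(y ~ width, family = poisson(link = "log"), data = indiv)

# Aggregate: total count y and number of observations t per distinct width
rate <- aggregate(y ~ width, data = indiv, FUN = sum)
rate$t <- aggregate(y ~ width, data = indiv, FUN = length)$y

fit.rate <- glm(y ~ width + offset(log(t)),
                family = poisson(link = "log"), data = rate)

max(abs(coef(fit.indiv) - coef(fit.rate)))  # essentially 0
```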

Why are there two sets of iterations here for glm()? In the glm() function code, the model with an intercept ONLY is fit once and then the whole model is fit. This occurs only when an offset is used. I think the reason is due to what an intercept only model represents with rate data.

Notice that a smooth curve can not be plotted because of the different number of crabs per width.

A better version of the plot would include different colors for the plotting characters (corresponding to each t value) for each observed Y and predicted μ̂.

More plots of the model are discussed in the Chapter 3 additional notes.



3.4 Statistical inference and model checking

One of the best things about GLMs is that they provide a unified approach to testing model parameters, checking goodness-of-fit, examining residuals, estimating parameters, and so on. Thus, one can use the same basic methods for logistic, probit, complementary log-log, and Poisson regression.

The Wald and likelihood ratio tests

A hypothesis test commonly of interest is

Ho: β = 0 vs. Ha: β ≠ 0

Below are two different ways this test can be conducted:

Wald - The test statistic is

    Z = β̂ / SE(β̂)

where SE stands for "standard error". Actually, this standard error is an estimate of the "asymptotic" standard error. Often, you will see the standard error here denoted as ASE(β̂) or √(V̂ar(β̂)). For large n, remember that an MLE (β̂ here) has an approximate normal distribution. Thus, Z has an approximate



standard normal distribution and this distribution can be used to perform the test.

On p. 1.28 of the notes, the "large sample variance" was introduced for π̂, where y is 0 or 1 and n is the number of trials. The formula given was:

    V̂ar(π̂) = [ −E( ∂²log L(π)/∂π² ) ]⁻¹

This formula can be used here also by using the likelihood function for α and β instead. Since there are two parameters, a matrix of the second partial derivatives is found:

    Cov(α̂, β̂) = { −E [ ∂²log L/∂α²     ∂²log L/∂α∂β
                         ∂²log L/∂β∂α    ∂²log L/∂β²  ] }⁻¹

The "large sample variance" for β̂ is the (2,2) element of the above matrix. The square root of this quantity is the SE that we are using in the denominator of Z.
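A minimal sketch on simulated data (not from the notes; the data are made up) showing the Wald statistic computed by hand. In R, `vcov()` returns the estimated large-sample covariance matrix of (α̂, β̂), so the SE is the square root of its (2,2) element:

```r
# Sketch (simulated data): compute the Wald statistic Z = beta.hat/SE(beta.hat).
set.seed(1218)
x <- rnorm(200)
y <- rbinom(n = 200, size = 1, prob = plogis(0.5 + 1.0 * x))
fit <- glm(y ~ x, family = binomial(link = "logit"))

beta.hat <- unname(coef(fit)["x"])
se.beta  <- sqrt(vcov(fit)["x", "x"])   # sqrt of the (2,2) element
z        <- beta.hat / se.beta
p.value  <- 2 * (1 - pnorm(abs(z)))

# Agrees with the z value column of summary(fit)
z - summary(fit)$coefficients["x", "z value"]  # essentially 0
```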


Notes: Try to write out the likelihood function for a logistic

regression or Poisson model on your own. Then try to write out the matrix of second partial derivatives.

You will never need to evaluate the formula for the large sample variance by hand; R will do it routinely for us!

The same problems that we have had before using Wald confidence intervals happen here. Therefore, we need to make sure the sample size is large. The next method is a little better to use when the sample size is not large.

Likelihood ratio test (LRT) - We have discussed the LRT before in Chapters 1-2. This procedure can also be used here.

Review from p. 1.29 and 2.58: The LRT statistic is

    λ = (maximum of the likelihood function under Ho) / (maximum of the likelihood function under Ho ∪ Ha)

Remember that the ratio is between 0 and 1 since the numerator can not exceed the denominator.

For the test of β = 0 vs. β ≠ 0, the numerator is calculated assuming β = 0. Thus, the model fit to the data is only g(μ) = α (where g() denotes the link function). The denominator is calculated without the


assumption that β = 0. Thus, the model fit to the data is g(μ) = α + βx. The likelihood functions are found using the fit of both models and the ratio is found. For example, the ratio becomes for logistic regression:

    λ = [ ∏_{i=1}^n (π̂_i^(0))^{y_i} (1 − π̂_i^(0))^{1−y_i} ] / [ ∏_{i=1}^n π̂_i^{y_i} (1 − π̂_i)^{1−y_i} ]

where π̂_i^(0) = e^{α̂^(0)} / (1 + e^{α̂^(0)}) comes from the model fit under Ho and π̂_i = e^{α̂+β̂x_i} / (1 + e^{α̂+β̂x_i}) comes from the model fit under Ho ∪ Ha.

The actual test statistic used for a LRT is −2log(λ). The reason is because this statistic has an approximate χ² distribution for large n. The degrees of freedom are found the same way as before. In this case, notice the difference between Ho and Ha is whether or not β = 0. Thus, the χ² distribution has 1 degree of freedom. Note that −2log(λ) is often denoted in categorical data analysis as G².

Often in computer output, −2log(λ) is not given directly. Instead, what is often given is the "null deviance" and the "residual deviance". These are −2log(λ) statistics themselves, but for testing a different set of hypotheses.


Simply put, the −2log(λ) for a test of Ho: β = 0 vs. Ha: β ≠ 0 is:

null deviance – residual deviance

Below is a further explanation of the two deviances. The null deviance tests:

Ho: Model with α only
Ha: Model using the observed values (saturated model)

The test statistic for Poisson regression is

    G² = 2 Σ_{i=1}^n y_i log( y_i / μ̂_i^(0) )

where μ̂_i^(0) = e^{α̂^(0)}. Compare the above form to what we saw on p. 2.59. The test statistic for logistic regression is

    G² = 2 Σ_{i=1}^n [ y_i log( y_i / π̂_i^(0) ) + (1 − y_i) log( (1 − y_i) / (1 − π̂_i^(0)) ) ]

where π̂_i^(0) = e^{α̂^(0)} / (1 + e^{α̂^(0)}).

Questions:
o What is μ̂_i^(0) for the Poisson regression model?
o What is π̂_i^(0) for the logistic regression model?



The residual deviance tests:

Ho: Model with α and β only
Ha: Model using the observed values (saturated model)

The test statistic for Poisson regression is

    G² = 2 Σ_{i=1}^n y_i log( y_i / μ̂_i )

where μ̂_i = e^{α̂ + β̂x_i}. The test statistic for logistic regression is

    G² = 2 Σ_{i=1}^n [ y_i log( y_i / π̂_i ) + (1 − y_i) log( (1 − y_i) / (1 − π̂_i) ) ]

where π̂_i = e^{α̂ + β̂x_i} / (1 + e^{α̂ + β̂x_i}).

Notice that the null and residual deviance statistics have a few things in common. When the residual deviance is subtracted from the null deviance, the resulting statistic for Poisson regression is:

    −2log(λ) = 2 Σ_{i=1}^n y_i log( μ̂_i / μ̂_i^(0) )



For logistic regression, the statistic becomes

    −2log(λ) = 2 Σ_{i=1}^n [ y_i log( π̂_i / π̂_i^(0) ) + (1 − y_i) log( (1 − π̂_i) / (1 − π̂_i^(0)) ) ].

Without going into the details, these are the correct −2log(λ) statistics for the test of Ho: β = 0 vs. Ha: β ≠ 0!

The word “deviance” is used because the statistics give a measurement of how much the observed data “deviates” from the model’s fit.
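As a quick numerical check (a sketch on simulated data, not part of the notes), the null deviance minus the residual deviance equals −2log(λ) computed directly from the two maximized log-likelihoods:

```r
# Sketch (simulated data): -2*log(lambda) two ways for Ho: beta = 0.
set.seed(4131)
x <- rnorm(150)
y <- rbinom(n = 150, size = 1, prob = plogis(-0.2 + 0.8 * x))

fit.full <- glm(y ~ x, family = binomial(link = "logit"))  # Ho union Ha
fit.null <- glm(y ~ 1, family = binomial(link = "logit"))  # Ho: beta = 0

G2.dev <- fit.full$null.deviance - fit.full$deviance
G2.lik <- as.numeric(2 * (logLik(fit.full) - logLik(fit.null)))

c(G2.dev, G2.lik)                # identical statistics
1 - pchisq(q = G2.dev, df = 1)   # p-value, df = 1
```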

Example: Placekicking (placekick_ch3.R, place.s.csv)

Perform the test of Ho: β = 0 vs. Ha: β ≠ 0. The output from glm() is reproduced below.

> summary(mod.fit)

Call:
glm(formula = good1 ~ dist, family = binomial(link = logit), data = place.s, na.action = na.exclude, control = list(epsilon = 1e-04, maxit = 50, trace = T))


Deviance Residuals:
    Min      1Q  Median      3Q     Max
-2.7441  0.2425  0.2425  0.3801  1.6091

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  5.812079   0.326158   17.82   <2e-16 ***
dist        -0.115027   0.008337  -13.80   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 1013.43 on 1424 degrees of freedom
Residual deviance: 775.75 on 1423 degrees of freedom
AIC: 779.75

Number of Fisher Scoring iterations: 5

The Wald test statistic is Z = −13.80. Since |Z| > Z0.975 = 1.96, conclude β ≠ 0 with 95% confidence. Also, the p-value is very small. Therefore, distance is important for predicting the probability of success.

To find the −2log(λ), use the null and residual deviance:

−2log(λ) = 1013.43 − 775.75 = 237.68

The degrees of freedom given from the output for the null and residual deviance can also be subtracted in the same way to find the degrees of freedom for the test:

1424 – 1423 = 1

Below is the R code and output to perform the LRT:

> #LRT: -2log(lambda)


> mod.fit$null.deviance - mod.fit$deviance
[1] 237.6811
> #DF
> mod.fit$df.null - mod.fit$df.residual
[1] 1
> #p-value
> 1 - pchisq(q = mod.fit$null.deviance - mod.fit$deviance, df = mod.fit$df.null - mod.fit$df.residual)
[1] 0

Since the p-value is very small, conclude β ≠ 0. Therefore, distance is important for predicting the probability of success. In Chapter 5, we will see that it is not appropriate to perform the test as done here. More will be discussed about it at that time.

Here are some additional details showing how R calculates the residual deviance:

> y<-place.s$good
> pi.hat<-mod.fit$fitted.values
> pi.tilde<-y
> 2*(sum(log(y^y)) - sum(y*log(pi.hat)) + sum(log((1-y)^(1-y))) - sum((1-y)*log(1-pi.hat)))  #Need to do second part with pi^y due to 0 pi values
[1] 775.745

> #Discussed in next chapter
> dev.resid<-resid(mod.fit, type="deviance")^2
> sum(dev.resid)
[1] 775.745


Question: Suppose you wanted to test Ho: β = 0 vs. Ha: β ≠ 0 for models with a probit or complementary log-log link. How would you do it?
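One sketch of an answer (simulated data, not from the notes): nothing in the Wald or LRT machinery is specific to the logit link, so only the link specification in glm() changes.

```r
# Sketch: Wald and LRT for Ho: beta = 0 with probit and cloglog links.
set.seed(8912)
x <- rnorm(300)
y <- rbinom(n = 300, size = 1, prob = pnorm(0.3 + 0.7 * x))  # probit-style data

fit.probit <- glm(y ~ x, family = binomial(link = "probit"))
summary(fit.probit)$coefficients["x", c("z value", "Pr(>|z|)")]  # Wald
fit.probit$null.deviance - fit.probit$deviance                   # -2*log(lambda), df = 1

fit.cll <- glm(y ~ x, family = binomial(link = "cloglog"))
fit.cll$null.deviance - fit.cll$deviance
```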

Example: Horseshoe crabs and satellites (horseshoe.R, horseshoe.txt)

Perform the test of Ho: β = 0 vs. Ha: β ≠ 0. The output from glm() is reproduced below. Note that the model with the offset is used here!

> summary(mod.fit)

Call:
glm(formula = y ~ width + offset(log(t)), family = poisson(link = log), data = rate.data, na.action = na.exclude, control = list(epsilon = 1e-04, maxit = 50, trace = T))

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-3.8003 -1.4515 -0.3788  0.6619  4.7586

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.30476    0.54218  -6.095 1.09e-09 ***
width        0.16405    0.01996   8.217  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 254.94 on 65 degrees of freedom
Residual deviance: 190.03 on 64 degrees of freedom
AIC: 402.52

Number of Fisher Scoring iterations: 4


The Wald test statistic is Z = 8.22. Since Z > Z0.975 = 1.96 and the p-value < 2×10⁻¹⁶, conclude β ≠ 0. Therefore, width is important for predicting the mean number of satellites.

−2log(λ) = null deviance − residual deviance = 254.9403 − 190.0272 = 64.91

> #LRT: -2log(lambda)
> mod.fit$null.deviance - mod.fit$deviance
[1] 64.91309

> #p-value
> 1-pchisq(q = mod.fit$null.deviance-mod.fit$deviance, df = mod.fit$df.null-mod.fit$df.residual)
[1] 7.771561e-16

Since the p-value is very small, conclude β ≠ 0. Therefore, width is important for predicting the mean number of satellites.


NOTE!

In Section 3.4.5, Agresti (2007) discusses "goodness-of-fit" statistics and model residuals mainly in the context of Poisson regression models. In Chapter 5, these items are discussed for logistic regression models in much more detail. The reason for the separation is that there are a few things one needs to watch out for in logistic regression that do not arise as often in Poisson regression. Thus, the rest of the discussion in this section will only be for Poisson regression.


Model residuals

Pearson residuals can be calculated in a similar manner as described in Chapter 2. The Pearson residual in Chapter 2 was

    e_ij = (n_ij − μ̂_ij) / √μ̂_ij

where n_ij was the cell count for row i and column j, μ̂_ij was its estimated value under the hypothesis of independence, and √μ̂_ij is the square root of the estimated variance (remember for a Poisson random variable, mean = variance). A Pearson residual has an approximate standard normal distribution provided the μ̂_ij is not small (>2 or 5).

The same set-up can be used here for the Pearson residual from a Poisson regression model. For the Poisson regression model:

    e_i = (y_i − μ̂_i) / √μ̂_i

where y_i is the ith observed value for the dependent variable and μ̂_i is its predicted value.


In Chapter 2, we also learned about a standardized residual. The standardized residual has a distribution that is closer to a standard normal distribution than the Pearson residual. The standardized residual is

    r_i = e_i / √(1 − h_i)

where h_i is the ith diagonal value of the hat matrix. What is the hat matrix?

With respect to regular regression analysis, you can see my Chapters 5 and 10 STAT 870 notes at www.chrisbilder.com/stat870/schedule.htm. With respect to Poisson regression, let X be a n×2 matrix with 1's in the first column and the explanatory variable values in the second column. Create a diagonal matrix, W, with diagonal elements of μ̂_i in the same order as the corresponding explanatory variable values listed in X. The hat matrix is

    H = W^(1/2) X (X′WX)⁻¹ X′ W^(1/2).

Note that this is similar to the hat matrix used when fitting a regression model by weighted least squares.

Note that the “standardized” residual may also be called elsewhere an “adjusted Pearson residual”, “adjusted residual” (Agresti, 1996, uses this term), or “studentized residual”.


The standardized residual can be calculated in R using

h<-lm.influence(model = mod.fit)$h
Pearson<-residuals(object = mod.fit, type="pearson")
standard.pearson<-Pearson/sqrt(1-h)

assuming mod.fit contains the model fit from glm().

We can use a standard normal approximation for both the Pearson and standardized residuals. Of course, the approximation works better with the standardized.

Question: Suppose the standardized residuals are greater than 2.576 or less than -2.576. What does this mean about the model?

Example: Horseshoe crabs and satellites (horseshoe.R, horseshoe.txt)

> pearson1<-residuals(object = mod.fit, type="pearson")
> #Standardized Pearson residuals
> h<-lm.influence(model = mod.fit)$h
> head(h)
          1           2           3           4           5           6
0.009852678 0.015152453 0.006360592 0.008647581 0.006360592 0.011358140

> standard.pearson<-pearson1/sqrt(1-h)
> head(standard.pearson)
        1         2         3         4         5         6
2.1569835 -1.2223348 3.9641123 -1.4712657 0.8609526 -1.3572621

> X<-model.matrix(mod.fit)
> #Also could use mu.hat<-mod.fit$fitted.values here
> mu.hat<-predict(object = mod.fit, type = "response")



> H<-diag(sqrt(mu.hat))%*%X%*%solve(t(X)%*%diag(mu.hat)%*%X)%*%t(X)%*%diag(sqrt(mu.hat))
> diag(H)[1:5]
[1] 0.009852370 0.015150506 0.006360719 0.008647445 0.006360719

Notes:
The residuals() function finds the residuals.
There are a few functions that help you find the hat matrix diagonal values. One is the lm.influence() function. There are no direct functions for the standardized residuals.

See how the matrix calculations are done in R. You are not responsible for this content.

> par(mfrow = c(2,1))  #2x1 grid of plots
> #Pearson residual vs observation number plot
> plot(x = 1:length(pearson1), y = pearson1, xlab = "Observation number", ylab = "Pearson residuals", main = "Pearson residuals vs. observation number")
> abline(h = qnorm(c(0.975, 0.995, 0.025, 0.005)), lty = 3, col = "red")

> #Standardized residual vs observation number plot
> plot(x = 1:length(standard.pearson), y = standard.pearson, xlab = "Observation number", ylab = "Standardized residuals", main = "Standardized residuals vs. observation number")
> abline(h = qnorm(c(0.975, 0.995, 0.025, 0.005)), lty = 3, col = "red")


[Figure: two stacked plots, "Pearson residuals vs. observation number" and "Standardized residuals vs. observation number", each with dotted red reference lines at the 0.005, 0.025, 0.975, and 0.995 standard normal quantiles.]

Notes:
The abline() function was used to draw lines on the plot at the 0.005, 0.025, 0.975, and 0.995 standard normal quantiles. Notice it takes only one call to the function to draw all of the lines.

Both plots are quite similar. Since we have only one explanatory variable, it is often helpful to plot these residuals vs. the explanatory variable.


> par(mfrow = c(1,1))
> #Residual vs width plot
> plot(x = crab$width, y = standard.pearson, xlab = "Width", ylab = "Standardized Pearson residuals", main = "Standardized Pearson residuals vs. width")
> abline(h = qnorm(c(0.975, 0.995, 0.025, 0.005)), lty = 3, col = "red")

[Figure: "Standardized Pearson residuals vs. width", with dotted red reference lines at the 0.005, 0.025, 0.975, and 0.995 standard normal quantiles.]

Notice the patterns among the plotting points. It is not unusual to see these types of patterns when one is modeling a discrete response variable. The plot below shows you why these patterns are occurring.


> plot(x = crab$width, y = standard.pearson, xlab = "Width", ylab = "Standardized Pearson residuals", main = "Standardized Pearson residuals vs. width", type = "n")
> text(x = crab$width, y = standard.pearson, labels = crab$satellite, cex = 0.75)
> abline(h = qnorm(c(0.975, 0.995, 0.025, 0.005)), lty = 3, col = "red")

[Figure: "Standardized Pearson residuals vs. width", with each observation plotted as its observed number of satellites.]

The model appears to have fit problems when there is a larger number of satellites than expected at lower widths (relative to the other observations with a particular number of satellites). This may be a result of the overdispersion that we saw earlier.


The negative binomial model could also be fit to the data. The same types of residuals can be found with the corresponding adjustments to reflect the new model.

> pearson.nb<-residuals(object = mod.fit.nb, type="pearson")
> h.nb<-lm.influence(model = mod.fit.nb)$h
> standard.pearson.nb<-pearson.nb/sqrt(1-h.nb)
> par(mfrow = c(1,2))
> plot(x = 1:length(standard.pearson.nb), y = standard.pearson.nb, xlab = "Obs. number", ylab = "Standardized residuals", main = "Stand. residuals (NB model) vs. obs. number")
> abline(h = qnorm(c(0.975, 0.995, 0.025, 0.005)), lty = 3, col = "red")

> plot(x = crab$width, y = standard.pearson.nb, xlab = "Width", ylab = "Standardized residuals", main = "Stand. residuals (NB model) vs. width", type = "n")
> text(x = crab$width, y = standard.pearson.nb, labels = crab$satellite, cex = 0.75)
> abline(h = qnorm(c(0.975, 0.995, 0.025, 0.005)), lty = 3, col = "red")


[Figure: "Stand. residuals (NB model) vs. obs. number" (left) and "Stand. residuals (NB model) vs. width" (right), with each observation in the right panel plotted as its observed number of satellites.]

As we can see, there are not as many standardized residuals outside of the ±2.576 borderlines. How many standardized residuals would you expect outside of these borderlines with n = 173?
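A quick calculation under the standard normal approximation (a sketch; ±2.576 bounds roughly 99% of a standard normal):

```r
# About 1% of standardized residuals should fall outside +/-2.576,
# so with n = 173 observations we expect roughly 1.7 of them out there.
n <- 173
prob.outside <- 2 * (1 - pnorm(2.576))  # P(|Z| > 2.576), about 0.01
n * prob.outside                        # about 1.73
```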

Comments:


I am a little concerned about how large these two standardized residuals are. One could examine these observations more closely, as you would do in a STAT 870 class. Due to time considerations, I am not going to do this here.

I am also a little concerned with there being no standardized residuals less than -1. Remember that a normal distribution is being used here. Do you think a normal approximation will work for these observations toward the bottom of these plots?

One possible solution to the normal approximation problem is to work with the rate data formulation of the model. Why? See the additional Chapter 3 notes for details.

Goodness-of-fit

The Pearson statistic and the LRT can both be used to assess how well (good) the model fits the data versus using just the "observed" values at the explanatory variable levels. This model is often called the "saturated" model since it has the most possible parameters. The saturated model estimates a parameter for every observation. For example, the saturated model for Poisson regression is

    log(μ_i) = α + β_i

for i = 1, …, n, which results in μ̂_i = y_i. (Note: A restriction on the β_i's is needed, such as you would see in



STAT 802 or 870). Also, see the previous LRT work with saturated models.

Pearson statistic:

For Poisson regression, the statistic is:

    X² = Σ_{i=1}^n (y_i − μ̂_i)² / μ̂_i

The statistic can be approximated by a χ² distribution with n − (# of model parameters) = n − 2 degrees of freedom for large n. In order for the χ² approximation to work well, μ̂_i should not be small.

LRT statistic:

For Poisson regression, the statistic simplifies from −2log(λ) to

    G² = 2 Σ_{i=1}^n y_i log( y_i / μ̂_i )

where μ̂_i = e^{α̂ + β̂x_i}. This statistic is often denoted by G² and was already introduced on p. 3.68. It can be approximated by the same distribution as used with the Pearson statistic, and it has the same potential problems.


Example: Horseshoe crabs and satellites (horseshoe.R, Table3.3.R, horseshoe.txt)

> summary(mod.fit)

Call:
glm(formula = satellite ~ width, family = poisson(link = log), data = crab, na.action = na.exclude, control = list(epsilon = 1e-04, maxit = 50, trace = T))

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-2.8526 -1.9884 -0.4933  1.0970  4.9221

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.30476    0.54222  -6.095 1.10e-09 ***
width        0.16405    0.01996   8.217  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 632.79 on 172 degrees of freedom
Residual deviance: 567.88 on 171 degrees of freedom
AIC: 927.18

Number of Fisher Scoring iterations: 5

> #LRT: -2log(lambda)
> mod.fit$deviance
[1] 567.88
> #p-value
> 1-pchisq(q = mod.fit$deviance, df = mod.fit$df.residual)
[1] 0
> #Pearson statistic
> sum(pearson1^2)


[1] 544.157
> 1-pchisq(q = sum(pearson1^2), df = mod.fit$df.residual)
[1] 0

The p-values for the LRT and the Pearson statistic test are quite small, indicating evidence of lack of fit. However, one should be concerned about whether the chi-square approximation works here. What can be done then? There are no choices that always work. Here are two possibilities.

1) Convert the data to a rate data format and perform the same tests.

Note that there are still a number of times where μ̂_i < 5. Therefore, the χ² distribution approximation may be poor here as well. Below is part of the output given previously from the glm() function.

> summary(mod.fit.rate)

Call:
glm(formula = y ~ width + offset(log(t)), family = poisson(link = log), data = rate.data, na.action = na.exclude, control = list(epsilon = 1e-04, maxit = 50, trace = T))

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-3.8003 -1.4515 -0.3788  0.6619  4.7586

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.30476    0.54218  -6.095 1.09e-09 ***


width        0.16405    0.01996   8.217  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 254.94 on 65 degrees of freedom
Residual deviance: 190.03 on 64 degrees of freedom
AIC: 402.52

Number of Fisher Scoring iterations: 4

The “residual deviance” given in the output is G2. To find the Pearson statistic, sum the squared Pearson residuals. Below is the R code for both the goodness-of-fit tests.

> #LRT p-value
> 1-pchisq(q = mod.fit.rate$deviance, mod.fit.rate$df.residual)
[1] 1.998401e-14
> #Pearson statistic and p-value
> pearson.rate<-resid(object = mod.fit.rate, type="pearson")
> sum(pearson.rate^2)
[1] 174.2737
> 1-pchisq(q = sum(pearson.rate^2), mod.fit.rate$df.residual)
[1] 3.759215e-12

Both statistics indicate the model does not fit the data well, if we believe the χ² distribution approximation.

2) Form artificial groups (like in Table 3.3) and compute ad-hoc versions of these tests.


The purpose of forming these groups is to have each group have a mean value larger than 5 or so, so that we avoid the previous problems. In order to form a Pearson statistic, one can fit the model as usual and compute Pearson residuals for groups of size n_k containing "alike" observations to result in

    e_k = ( Σ_{i in group k} y_i − Σ_{i in group k} μ̂_i ) / √( Σ_{i in group k} μ̂_i ).

The sum of these squared Pearson residuals then forms a Pearson statistic. A LRT statistic can be found in the corresponding manner.

There are two problems with this approach:
a) There are many different ways to form the groups, and one could choose a variety of different numbers of groups. Your answers could change due to these choices.

b) The usual type of distributional approximation is chi-square with g – 2 degrees of freedom, where g is the number of groups and two parameters are being estimated (α and β). The statistics, though, do not have this same type of chi-square distribution, so formal hypothesis tests should not be done.

While this approach does have its problems, I like to use it as an informal way to assess the model overall, along with graphical approaches if possible. Table3.3.R shows a few different ways to evaluate the model in this manner. Using the Table 3.3 categories, below are the Pearson and LRT statistics along with a graphical assessment. The code for these results is available in the program.

[Figure: “Horseshoe crab data set with poisson regression model fit” – Width (cm) vs. number of satellites, with the Table 3.3 observed means and Table 3.3 predicted means (using my interpretation) marked on the plot.]

Ad-hoc statistic   Value   Degrees of freedom   p-value
X²                  6.48                    6      0.37
G²                  6.89                    6      0.33
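As a quick sanity check on these p-values: for even degrees of freedom, the chi-square upper-tail probability has the closed form P(X > x) = exp(−x/2) Σ_{k=0}^{df/2−1} (x/2)^k/k!, so they can be verified without any statistical software. A minimal Python sketch (the helper name chi2_sf_even is my own, not from the notes):

```python
import math

def chi2_sf_even(x, df):
    """Chi-square upper-tail probability; closed form valid for even df only."""
    assert df % 2 == 0 and df > 0
    half = x / 2.0
    # P(X > x) = exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
    return math.exp(-half) * sum(half**k / math.factorial(k)
                                 for k in range(df // 2))

print(round(chi2_sf_even(6.48, 6), 2))  # p-value for X^2
print(round(chi2_sf_even(6.89, 6), 2))  # p-value for G^2
```

These reproduce the 0.37 and 0.33 obtained from 1-pchisq() in R.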

Below is a more general way (also in the program) to assess the fit of the model using different groups.

> ###########################################################################
> # More general way to put observations into classes
> #  Find 8 groups (9 quantiles) (why 8? Since Agresti had chosen 8 -
> #  other choices could have been made)
> cutoff<-quantile(crab$width, probs = 0:8/8, na.rm = F)
> cutoff
   0%  12.5%    25%  37.5%    50%  62.5%    75%  87.5%   100%
21.00  23.85  24.90  25.65  26.10  26.90  27.70  28.70  33.50

> # Use midpoint for the width group designation; note that I could have used
> #  the mean width among all crabs within the group as well - there is not
> #  one correct way to do this.
> groups<-ifelse(crab$width<cutoff[2], (cutoff[2]+cutoff[1])/2,
         ifelse(crab$width<cutoff[3], (cutoff[3]+cutoff[2])/2,
         ifelse(crab$width<cutoff[4], (cutoff[4]+cutoff[3])/2,
         ifelse(crab$width<cutoff[5], (cutoff[5]+cutoff[4])/2,
         ifelse(crab$width<cutoff[6], (cutoff[6]+cutoff[5])/2,
         ifelse(crab$width<cutoff[7], (cutoff[7]+cutoff[6])/2,
         ifelse(crab$width<cutoff[8], (cutoff[8]+cutoff[7])/2,
                (cutoff[9]+cutoff[8])/2)))))))

> library(nlme)  # Need package for the gsummary() function - don't need to
                 #  rerun if already did before
> crab.group<-data.frame(crab2, groups)
> sat.count<-gsummary(object = crab.group, FUN = length, groups = groups)
> sat.sum<-gsummary(object = crab.group, FUN = sum, groups = groups)

> new.table3.3<-data.frame(width.group = sat.count$groups,
    number.cases = sat.count$satellite, number.sat = sat.sum$satellite,
    mean.per.group = sat.sum$satellite/sat.count$satellite,
    fitted.count = round(sat.sum$predicted,1),
    Pearson.residual = round((sat.sum$satellite -
      sat.sum$predicted)/sqrt(sat.sum$predicted),2))

> new.table3.3
  width.group number.cases number.sat mean.per.group fitted.count Pearson.residual
1      22.425           22         20      0.9090909         35.6            -2.62
2      24.375           21         40      1.9047619         42.4            -0.36
3      25.275           22         60      2.7272727         50.5             1.34
4      25.875           20         68      3.4000000         50.9             2.40
5      26.500           23         47      2.0434783         64.4            -2.17
6      27.300           20         69      3.4500000         64.6             0.55
7      28.200           22        102      4.6363636         81.9             2.23
8      31.100           23         99      4.3043478        114.8            -1.48

> # Pearson statistic
> cat("Ad-hoc Pearson statistic:", round(sum(new.table3.3$Pearson.residual^2),2),
    "with 6 DF results in a p-value of",
    round(1-pchisq(sum(new.table3.3$Pearson.residual^2), 6),2),
    "using a chi-square distribution approximation \n")
Ad-hoc Pearson statistic: 26.72 with 6 DF results in a p-value of 0 using a
  chi-square distribution approximation

> # G^2
> G.sq2<-2*sum(new.table3.3$number.sat *
    log(new.table3.3$number.sat/new.table3.3$fitted.count))
> cat("Ad-hoc G^2 statistic:", round(G.sq2,2), "with 6 DF results in a p-value
    of", round(1-pchisq(G.sq2, 6),2), "using a chi-square distribution
    approximation \n")
Ad-hoc G^2 statistic: 27.29 with 6 DF results in a p-value of 0 using a
  chi-square distribution approximation

> # It is interesting that these two measures suggest the model does not fit
> #  well! I would hope that goodness-of-fit conclusions would be invariant to
> #  the way one chooses to group the observations. Possibly, this is an
> #  example of why ad-hoc procedures can not always be trusted.

> # Visual assessment
> win.graph(width = 6, height = 6, pointsize = 10)
> plot(x = crab$width, y = crab$satellite, xlab = "Width (cm)",
    ylab = "Number of satellites", main = "Horseshoe crab data set \n with
    poisson regression model fit", panel.first = grid(col = "gray",
    lty = "dotted"))
> curve(expr = exp(mod.fit$coefficients[1] + mod.fit$coefficients[2]*x),
    lty = 1, col = "red", add = TRUE)
> points(x = new.table3.3$width.group, y = new.table3.3$mean.per.group,
    pch = 18, col = "darkgreen", cex = 2)
> # Notice these points are not on the estimated model line; probably due to
> #  using the group average value for the x-axis instead of a weighted mean
> #  like for the previous plot
> points(x = new.table3.3$width.group,
    y = new.table3.3$fitted.count/new.table3.3$number.cases, pch = 17,
    col = "darkblue", cex = 2)
> # Put group breaks on plot
> for (i in (2:8)) {
    abline(v = cutoff[i], lty = 1, col = "lightgreen")
  }
> legend(locator(1), legend = c("Obs. group means",
    "Predicted group means (using my interpret)"), pch = c(18,17),
    col = c("darkgreen","darkblue"), cex = 0.75, bg = "white")
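The grouped statistics above depend only on the observed and fitted group totals, so they are easy to recompute by hand. A minimal Python sketch using the eight group totals printed above (note these use the unrounded residuals, so X² comes out slightly below the 26.72 that results from first rounding the residuals to two decimals):

```python
import math

# Observed and fitted satellite totals for the eight width groups above
obs    = [20, 40, 60, 68, 47, 69, 102, 99]
fitted = [35.6, 42.4, 50.5, 50.9, 64.4, 64.6, 81.9, 114.8]

# Ad-hoc Pearson statistic: sum of squared grouped Pearson residuals
X2 = sum((o - f)**2 / f for o, f in zip(obs, fitted))

# Ad-hoc LRT statistic, as in the G.sq2 computation above
G2 = 2 * sum(o * math.log(o / f) for o, f in zip(obs, fitted))
```

Both come out near 27, far in the right tail of a chi-square distribution with 6 degrees of freedom, matching the p-values of essentially 0 reported above.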


[Figure: “Horseshoe crab data set with poisson regression model fit” – Width (cm) vs. number of satellites, with the observed group means, predicted group means, and group break lines marked on the plot.]

With the negative binomial regression model, note that G² = 195.81. Using a χ² distribution approximation, we obtain a p-value of 0.0939.


3.5 Fitting generalized linear models

One of the best things about GLMs is that they provide a unified approach to testing model parameters and checking model fit.

GLMs are fit (i.e., parameter estimates found) using maximum likelihood estimation. Except in simple cases, there is no closed-form formula that gives the parameter estimates for the Chapter 3 models.

For Poisson regression, the likelihood function is

	L(μ_1, …, μ_n | y_1, …, y_n) = ∏_{i=1}^{n} exp(−μ_i) μ_i^{y_i} / y_i!	← n different parameters

Then the log likelihood function is

	log(L) ∝ ∑_{i=1}^{n} [y_i log(μ_i) − μ_i]

where ∝ means “proportional to” (the last term, −∑ log(y_i!), does not depend on the parameters).

Since μ_i(x_i) = exp(α + βx_i), this implies

	log(L) ∝ ∑_{i=1}^{n} [y_i(α + βx_i) − exp(α + βx_i)]	← now only two parameters!

The Chapter 3 additional lecture notes give additional general information about one common procedure, the Newton-Raphson method, and how it can be used to find the maximum likelihood estimates in an iterative manner. Pay special attention to how “convergence” is determined. Note that the glm() function carries out the maximization through iteratively reweighted least squares – these are the “Fisher Scoring iterations” reported in its output. More generally, R's optim() function offers a few different iterative procedures (some of which do not need derivatives) that can be used to maximize a log likelihood (equivalently, minimize the negative log likelihood).
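As an illustration of the iterative idea, here is a minimal Python sketch of Newton-Raphson for the two-parameter Poisson model log(μ_i) = a + b·x_i (with the canonical log link, Newton-Raphson and Fisher scoring coincide). The small data set is made up for demonstration, and real software adds safeguards (step halving, better starting values) that this sketch omits:

```python
import math

def fit_poisson(x, y, max_iter=25, tol=1e-10):
    """Newton-Raphson / Fisher scoring for the model log(mu_i) = a + b*x_i.

    Score vector:       U = X'(y - mu)
    Fisher information: I = X'WX with W = diag(mu)
    Update:             (a, b) <- (a, b) + I^{-1} U
    """
    a, b = math.log(sum(y) / len(y)), 0.0    # start at the null model
    for _ in range(max_iter):
        mu = [math.exp(a + b*xi) for xi in x]
        # score vector U
        u0 = sum(yi - mi for yi, mi in zip(y, mu))
        u1 = sum(xi*(yi - mi) for xi, yi, mi in zip(x, y, mu))
        # 2x2 Fisher information matrix I
        i00 = sum(mu)
        i01 = sum(xi*mi for xi, mi in zip(x, mu))
        i11 = sum(xi*xi*mi for xi, mi in zip(x, mu))
        det = i00*i11 - i01*i01
        da = (i11*u0 - i01*u1) / det         # solve the 2x2 system I*step = U
        db = (i00*u1 - i01*u0) / det
        a, b = a + da, b + db
        if abs(da) < tol and abs(db) < tol:  # "convergence": steps are tiny
            break
    return a, b

# Made-up demonstration data (not the crab data)
x = [1, 2, 3, 4, 5]
y = [1, 2, 4, 7, 12]
a, b = fit_poisson(x, y)
mu_hat = [math.exp(a + b*xi) for xi in x]
# At the MLE the score equations hold: sum(y - mu) = 0 and sum(x*(y - mu)) = 0
```

The convergence check on the step size mirrors the role of the epsilon and maxit components of glm()'s control argument shown earlier in the chapter.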
