TAMS38 - Lecture 11: Linear models & Logistic regression
Lecturer: Jolanta Pielaszkiewicz
Matematisk statistik - Matematiska institutionen
Linköpings universitet
"When you reach the end of your rope, tie a knot in it and hang on." - Thomas Jefferson
13 December, 2016
Singull, Pielaszkiewicz, MAI - LiU TAMS38 - Lecture 11
Contents
Linear models
Factorial design and regression analysis
Logistic regression
Deviance
Two examples
(Poisson regression)
Linear models

The models of our different factorial designs and the models in regression analysis are included in the class of linear models. In particular, the models in factorial design can be written as regression models by using dummy variables.
The linear model can be written as
Y = Xβ + ε : n × 1,

where β : (k + 1) × 1 are unknown parameters, X : n × (k + 1) is a known design matrix, and

cov(Y) = cov(ε) = σ²I.
One-Way ANOVA with dummy variables
Let
y1, . . . , y4 be observations from N(µ1, σ),
y5, . . . , y7 be observations from N(µ2, σ),
y8, . . . , y10 be observations from N(µ3, σ),
and
y = (y1, . . . , y4, y5, . . . , y7, y8, . . . , y10)′.
One-Way ANOVA, cont.

We have

Y = Xµ + ε,

where Y = (Y1, . . . , Y10)′, µ = (µ1, µ2, µ3)′, ε = (ε1, . . . , ε10)′, and the design matrix X : 10 × 3 has rows (1, 0, 0) for observations 1-4, rows (0, 1, 0) for observations 5-7 and rows (0, 0, 1) for observations 8-10,

i.e., a regression model with no constant term, and we get the estimates

µ̂ = (X′X)⁻¹X′y.

Exercise: Show that this equation gives the ordinary µ̂-estimator.
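A quick numerical sketch (with hypothetical measurements, not tied to any particular data set): because each row of X contains a single 1, X′X is diagonal, and µ̂ = (X′X)⁻¹X′y reduces to the three group means.

```python
# Hypothetical data: 4 + 3 + 3 observations from three groups.
y = [0.25, 0.27, 0.22, 0.30,   # sample 1 (n1 = 4)
     0.18, 0.28, 0.21,         # sample 2 (n2 = 3)
     0.19, 0.25, 0.27]         # sample 3 (n3 = 3)

# Design matrix: row j has a single 1 in the column of j's group.
groups = [0]*4 + [1]*3 + [2]*3
X = [[1 if groups[j] == k else 0 for k in range(3)] for j in range(10)]

XtX = [[sum(X[j][a]*X[j][b] for j in range(10)) for b in range(3)] for a in range(3)]
Xty = [sum(X[j][a]*y[j] for j in range(10)) for a in range(3)]

# X'X = diag(n1, n2, n3), so the inversion is elementwise.
mu_hat = [Xty[a] / XtX[a][a] for a in range(3)]

group_means = [sum(y[:4])/4, sum(y[4:7])/3, sum(y[7:])/3]
print(mu_hat)        # identical to the ordinary group means
print(group_means)
```

This is exactly the content of the exercise: the regression estimator reproduces the ordinary µ̂-estimator, i.e. the group means.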
One-Way ANOVA, cont.

A parameterization which is common in regression analysis is to let

Yj = β0 + β1zj1 + β2zj2 + εj,

where

zj1 = 1 for sample 1, 0 otherwise,
zj2 = 1 for sample 2, 0 otherwise.

Exercise: Write the X-matrix.
One-Way ANOVA, cont.

Note that

E(Yj) = β0 + β1  for sample 1,
        β0 + β2  for sample 2,
        β0       for sample 3,

where β1 describes the difference between the expectations of sample 1 and sample 3, and β2 describes the difference between the expectations of sample 2 and sample 3.

If we want to compare samples 1 and 2 we should study β1 − β2.
Example - One-Way ANOVA

Measurements for the four laboratories from the example in Lecture 1 and Lecture 3.

  A     B     C     D
0.25  0.18  0.19  0.23
0.27  0.28  0.25  0.30
0.22  0.21  0.27  0.28
0.30  0.23  0.24  0.28
0.27  0.25  0.18  0.24
0.28  0.20  0.26  0.34
0.32  0.27  0.28  0.20
0.24  0.19  0.24  0.18
0.31  0.24  0.25  0.24
0.26  0.22  0.20  0.28
0.21  0.29  0.21  0.22
0.28  0.16  0.19  0.21
Example, cont.

Model:

Yj = β0 + β1zj1 + β2zj2 + β3zj3 + εj,

where

zjk = 1 for laboratory no. k, 0 otherwise,

for k = 1, 2, 3. Now, we have the expectations

E(Yj) = β0 + β1  for sample 1,
        β0 + β2  for sample 2,
        β0 + β3  for sample 3,
        β0       for sample 4.
Example, cont.

Regression Analysis: y versus z1, z2, z3

The regression equation is
y = 0.250 + 0.0175 z1 - 0.0233 z2 - 0.0200 z3

Predictor      Coef  SE Coef      T      P
Constant    0.25000  0.01134  22.05  0.000
z1          0.01750  0.01604   1.09  0.281
z2         -0.02333  0.01604  -1.46  0.153
z3         -0.02000  0.01604  -1.25  0.219

S = 0.0392809   R-Sq = 16.1%   R-Sq(adj) = 10.4%

Analysis of Variance

Source          DF        SS        MS     F      P
Regression       3  0.013006  0.004335  2.81  0.050
Residual Error  44  0.067892  0.001543
Total           47  0.080898
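The coefficients in the MINITAB output can be checked by hand: with the reference coding above (laboratory D as the baseline), the least-squares estimates are simply group-mean contrasts. A sketch in Python using the laboratory table:

```python
# Measurements from the laboratory table above.
lab = {
    "A": [0.25, 0.27, 0.22, 0.30, 0.27, 0.28, 0.32, 0.24, 0.31, 0.26, 0.21, 0.28],
    "B": [0.18, 0.28, 0.21, 0.23, 0.25, 0.20, 0.27, 0.19, 0.24, 0.22, 0.29, 0.16],
    "C": [0.19, 0.25, 0.27, 0.24, 0.18, 0.26, 0.28, 0.24, 0.25, 0.20, 0.21, 0.19],
    "D": [0.23, 0.30, 0.28, 0.28, 0.24, 0.34, 0.20, 0.18, 0.24, 0.28, 0.22, 0.21],
}
mean = {k: sum(v)/len(v) for k, v in lab.items()}

b0 = mean["D"]                # constant term: baseline laboratory D
b1 = mean["A"] - mean["D"]    # z1 coefficient
b2 = mean["B"] - mean["D"]    # z2 coefficient
b3 = mean["C"] - mean["D"]    # z3 coefficient
print(round(b0, 4), round(b1, 4), round(b2, 4), round(b3, 4))
# matches y = 0.250 + 0.0175 z1 - 0.0233 z2 - 0.0200 z3
```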
Example, cont.

MTB > print m1

Data Display

Matrix XPXI1

 0.0833333  -0.083333  -0.083333  -0.083333
-0.0833333   0.166667   0.083333   0.083333
-0.0833333   0.083333   0.166667   0.083333
-0.0833333   0.083333   0.083333   0.166667
Example, cont.

One-way ANOVA: C5 versus C6

Source  DF       SS       MS     F      P
C6       3  0.01301  0.00434  2.81  0.050
Error   44  0.06789  0.00154
Total   47  0.08090

S = 0.03928   R-Sq = 16.08%   R-Sq(adj) = 10.36%

Individual 95% CIs for the means, based on the pooled StDev:

Level   N     Mean    StDev
A      12  0.26750  0.03388
B      12  0.22667  0.04097
C      12  0.23000  0.03438
D      12  0.25000  0.04651

Pooled StDev = 0.03928
Two-Way ANOVA

Let us now have two factors with two observations per cell:

       B1       B2       B3
A1  y1, y2   y3, y4   y5, y6
A2  y7, y8   y9, y10  y11, y12

Let

z1 = 1 for A-level 1, 0 otherwise,
u1 = 1 for B-level 1, 0 otherwise,
u2 = 1 for B-level 2, 0 otherwise.

The two-factor model can then be written as

Yj = β0 + α1zj1 + γ1uj1 + γ2uj2 + δ11zj1·uj1 + δ12zj1·uj2 + εj,

which is equivalent to the usual two-factor model.

Exercise: Write the X-matrix.
Two-Way ANOVA, cont.

Here, δ11 and δ12 are our parameters for the interactions. Observe that only dummy variables that are related to different factors should be multiplied. We obtain (a − 1)(b − 1) parameters that correspond to the interaction between each pair of factors.

The matrix of expectations for the cells is given by

       B1                    B2                    B3
A1  β0 + α1 + γ1 + δ11   β0 + α1 + γ2 + δ12   β0 + α1
A2  β0 + γ1              β0 + γ2              β0

We have the additive model if and only if δ11 = δ12 = 0.
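As a sketch of how the X-matrix looks for this model (the observation order follows the table above), the columns 1, z1, u1, u2, z1·u1, z1·u2 can be built mechanically:

```python
# Cell (a, b) for each of the 12 observations y1, ..., y12.
cells = [(1, 1), (1, 1), (1, 2), (1, 2), (1, 3), (1, 3),   # row A1
         (2, 1), (2, 1), (2, 2), (2, 2), (2, 3), (2, 3)]   # row A2

X = []
for a, b in cells:
    z1 = 1 if a == 1 else 0   # A-level 1 indicator
    u1 = 1 if b == 1 else 0   # B-level 1 indicator
    u2 = 1 if b == 2 else 0   # B-level 2 indicator
    # columns: 1, z1, u1, u2, z1*u1, z1*u2
    X.append([1, z1, u1, u2, z1*u1, z1*u2])

for row in X:
    print(row)
```

The first row, [1, 1, 1, 0, 1, 0], picks out β0 + α1 + γ1 + δ11, in agreement with the cell-expectation table, and the last row, [1, 0, 0, 0, 0, 0], gives β0.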
The regression model in the example above can be used even if some y-observations are missing; one then obtains a method for analyzing an incomplete factorial design.

When one builds a model with three factors, information about the three-factor interaction is carried by the coefficient of the product of the three dummy variables corresponding to those factors.

Although one can analyze a factorial design as a regression model, the results are often more difficult to interpret than in the standard analysis; the usual hypotheses must be translated into the new parameters, etc.
Example 1 – Beetle mortality

The table below shows the number of beetles dead after five hours of exposure to gaseous carbon disulphide at various concentrations (data from Bliss, 1935).

Dose, xi (log10 CS2 mg l⁻¹)   Number of beetles, ni   Number killed, yi
1.6907                         59                       6
1.7242                         60                      13
1.7552                         62                      18
1.7842                         56                      28
1.8113                         63                      52
1.8369                         59                      53
1.8610                         62                      61
1.8839                         60                      60
Binomial distribution

A random variable Y follows the Binomial distribution, Y ∼ Bin(n, p), if its probability function is given by

pY(y) = C(n, y) p^y (1 − p)^(n−y),   y = 0, 1, . . . , n,

where C(n, y) is the binomial coefficient.

Assume that we have random variables Yi ∼ Bin(ni, pi), where Yi is the number of successes among ni trials, i = 1, . . . , m. Then one has m different parameters.
Log-likelihood function

The log-likelihood function (see Appendix) for the maximal model with m parameters is

l(p1, . . . , pm; y1, . . . , ym) = Σ_{i=1}^{m} [ yi log(pi/(1 − pi)) + ni log(1 − pi) + log C(ni, yi) ].
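The log-likelihood can be evaluated directly. A sketch in Python, using the beetle data from Example 1 and the convention 0 · log 0 = 0; the saturated-model MLE p̂i = yi/ni should give a larger value than any other probability vector:

```python
import math

def binom_loglik(p, y, n):
    """Binomial log-likelihood l(p; y) for grouped data."""
    total = 0.0
    for pi, yi, ni in zip(p, y, n):
        term = math.log(math.comb(ni, yi))
        if 0 < pi < 1:
            term += yi*math.log(pi/(1 - pi)) + ni*math.log(1 - pi)
        elif not ((pi == 0 and yi == 0) or (pi == 1 and yi == ni)):
            return float("-inf")   # outcome impossible under this p
        total += term
    return total

# Beetle mortality data (Bliss, 1935).
y = [6, 13, 18, 28, 52, 53, 61, 60]
n = [59, 60, 62, 56, 63, 59, 62, 60]
p_hat = [yi/ni for yi, ni in zip(y, n)]   # saturated-model MLE
print(binom_loglik(p_hat, y, n))
```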
Logistic regression

We want to explain the proportion of successes in each group, for which the maximum-likelihood estimator is

P̂i = Yi/ni,

with the help of a number of explanatory variables. As the expectations are

E(Yi) = ni pi  and  E(P̂i) = pi,

we can use the following model for the probabilities pi:

g(pi) = x′iβ.
Link function

The simplest case is the linear model

p = x′β.

The problem here is that x′β can become negative or larger than 1, while obviously 0 ≤ p ≤ 1.

If we let

p = g⁻¹(x′β) = ∫_{−∞}^{x′β} f(z) dz,

where f(z) is a probability density function, the so-called tolerance distribution, we ensure that p ∈ [0, 1].
Model: Linear

Tolerance function: Re[a, b] (uniform on [a, b])

p = (x − a)/(b − a),   a ≤ x ≤ b.

Link function:

g(p) = p = (x − a)/(b − a) = β1 + β2x,

where β1 = −a/(b − a) and β2 = 1/(b − a).
Model: Probit

Tolerance function: N(µ, σ)

p = (1/(σ√(2π))) ∫_{−∞}^{x} e^{−(z−µ)²/(2σ²)} dz = Φ((x − µ)/σ).

Link function:

g(p) = Φ⁻¹(p) = (x − µ)/σ = β1 + β2x   (Probit / Normit),

where β1 = −µ/σ and β2 = 1/σ.
Model: Logistic

Tolerance function: f(z) = β2 e^{β1+β2z} / (1 + e^{β1+β2z})²

p = e^{β1+β2x} / (1 + e^{β1+β2x}).

Link function:

g(p) = log(p/(1 − p)) = β1 + β2x   (Logit).
Model: Extreme value

Tolerance function: f(z) = β2 exp{β1 + β2z − e^{β1+β2z}}

p = 1 − exp{−exp(β1 + β2x)}.

Link function:

g(p) = log(−log(1 − p)) = β1 + β2x   (Complementary log-log, Gompit).
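A small sketch of the logit, probit and complementary log-log links and their inverses (the linear link needs no code); each g maps (0, 1) onto the real line, and g⁻¹ maps x′β back into (0, 1):

```python
import math
from statistics import NormalDist

logit       = lambda p: math.log(p/(1 - p))
inv_logit   = lambda t: math.exp(t)/(1 + math.exp(t))
probit      = lambda p: NormalDist().inv_cdf(p)        # Phi^{-1}(p)
inv_probit  = lambda t: NormalDist().cdf(t)            # Phi(t)
cloglog     = lambda p: math.log(-math.log(1 - p))     # complementary log-log
inv_cloglog = lambda t: 1 - math.exp(-math.exp(t))

# Round trips: g^{-1}(g(p)) = p for all three links.
for p in (0.1, 0.5, 0.9):
    for g, ginv in ((logit, inv_logit), (probit, inv_probit), (cloglog, inv_cloglog)):
        assert abs(ginv(g(p)) - p) < 1e-9
print("round trips ok")
```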
Deviance

Assume that we have two models, one with p parameters and one with the maximal number m of parameters, where m > p. Let the parameter vectors be β0 : p × 1 and β1 : m × 1, and assume that the smaller model is a special case of the bigger one. Then we want to test the hypothesis

H0: the smaller model with p parameters fits as well as the maximal model with m parameters,
versus
H1: the maximal model is better,

and we do it using the analysis of deviance.
Deviance

Definition. The deviance is defined as

D = 2( l(β̂1; y) − l(β̂0; y) ).

One can show that under H0 it holds that

D ≈ χ²(m − p),

and we reject H0 in favor of H1 for large values of the deviance D.
Deviance - Binomial distribution

We have random variables Yi ∼ Bin(ni, pi). The maximal model has m different parameters p1, . . . , pm with ML-estimates

P̂1 = (p̂1, . . . , p̂m)′,  where  p̂i = yi/ni.
Deviance - Binomial distribution, cont.

Let P̂0 be the ML-estimator for some other model (with fewer parameters), with fitted probabilities p̂0i and fitted values ŷi = ni p̂0i. Then the deviance is

D = 2( l(P̂1; y) − l(P̂0; y) )

  = 2 Σ_{i=1}^{m} [ yi log(p̂i/p̂0i) + (ni − yi) log((1 − p̂i)/(1 − p̂0i)) ]

  = 2 Σ_{i=1}^{m} [ yi log(yi/(ni p̂0i)) + (ni − yi) log((ni − yi)/(ni(1 − p̂0i))) ]

  = 2 Σ_{i=1}^{m} [ yi log(yi/ŷi) + (ni − yi) log((ni − yi)/(ni − ŷi)) ].
Deviance - Binomial distribution, cont.

Again, the deviance

D = 2 Σ_{i=1}^{m} [ yi log(yi/ŷi) + (ni − yi) log((ni − yi)/(ni − ŷi)) ]

has the form

D = 2 Σ oi log(oi/ei),

where the oi are the observed values (yi and ni − yi) and the ei are the fitted values (ŷi and ni − ŷi).
Example 1 – Beetle mortality, cont.

We now return to the beetle mortality data (Bliss, 1935) given in the table above.
Example, cont.

We will analyze the data using the different link functions given above. We start with the logit link function

log(p/(1 − p)) = β1 + β2x.

The log-likelihood function with the logit link function is

l = Σ_{i=1}^{m} [ yi(β1 + β2xi) − ni log(1 + e^{β1+β2xi}) + log C(ni, yi) ].

We use MINITAB.
Example, cont.

Binary Logistic Regression: y_i, n_i versus x_i

Link Function: Logit

Response Information

Variable  Value      Count
y_i       Event        291
          Non-event    190
n_i       Total        481

Logistic Regression Table
Predictor      Coef  SE Coef       Z      P
Constant   -60.7175  5.18071  -11.72  0.000
x_i         34.2703  2.91214   11.77  0.000

Log-Likelihood = -186.235
Test that all slopes are zero: G = 272.970, DF = 1, P-Value = 0.000

Goodness-of-Fit Tests
Method           Chi-Square  DF      P
Pearson             10.0268   6  0.124
Deviance            11.2322   6  0.081
Hosmer-Lemeshow     10.0268   6  0.124
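As a cross-check of the MINITAB output (a sketch, not the course workflow), the logit model can be fitted in pure Python with Newton-Raphson; the coefficients and the deviance should agree with the values above.

```python
import math

# Beetle mortality data (Bliss, 1935).
x = [1.6907, 1.7242, 1.7552, 1.7842, 1.8113, 1.8369, 1.8610, 1.8839]
n = [59, 60, 62, 56, 63, 59, 62, 60]
y = [6, 13, 18, 28, 52, 53, 61, 60]

b1, b2 = 0.0, 0.0
for _ in range(25):
    # Score vector (g1, g2) and Fisher information (h11, h12, h22).
    g1 = g2 = h11 = h12 = h22 = 0.0
    for xi, ni, yi in zip(x, n, y):
        pi = 1/(1 + math.exp(-(b1 + b2*xi)))
        w = ni*pi*(1 - pi)
        g1 += yi - ni*pi
        g2 += (yi - ni*pi)*xi
        h11 += w; h12 += w*xi; h22 += w*xi*xi
    det = h11*h22 - h12*h12
    b1 += ( h22*g1 - h12*g2)/det      # Newton step
    b2 += (-h12*g1 + h11*g2)/det

# Deviance against the saturated model, D = 2*sum o*log(o/e).
D = 0.0
for xi, ni, yi in zip(x, n, y):
    e = ni/(1 + math.exp(-(b1 + b2*xi)))   # fitted number killed
    if yi > 0:
        D += 2*yi*math.log(yi/e)
    if ni - yi > 0:
        D += 2*(ni - yi)*math.log((ni - yi)/(ni - e))
print(b1, b2, D)   # approx -60.72, 34.27 and deviance approx 11.23
```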
ML-estimators

Maximum-likelihood estimators (MLE) have many good properties. For example, for large n we have

β̂ ≈ N(β, I⁻¹),

where the information matrix I is given by

I = (Ijk)_{j,k} = ( E(UjUk) )_{j,k},  with  Ui = ∂l/∂βi.

One can also prove that

I = ( −E( ∂²l/(∂βj∂βk) ) )_{j,k}.

Let the elements of the covariance matrix be denoted by I⁻¹ = (I^{jk})_{j,k}.
Example - Embryogenic anthers

The data in the table are taken from Sangwan-Norrell (1977). They are the numbers yjk of embryogenic anthers of the plant species Datura innoxia Mill. obtained when numbers njk of anthers were prepared under several different conditions.

                        Centrifuging force (g)
Storage condition         40    150    350
Control      y1k          55     52     57
             n1k         102     99    108
Treatment    y2k          55     50     50
             n2k          76     81     90
Example, cont.

We have one factor with two levels: storage at 3°C for 48 hours (treatment) and a control type of storage. There is also a continuous explanatory variable corresponding to the different centrifuging forces. We will investigate how the storage and the centrifuging force affect the number of embryogenic anthers.

One can plot p̂jk = yjk/njk against the logarithm xk of the different centrifuging forces to compare the two groups.
Example, cont.

We now fit two logistic models for πjk, the probability that an anther is embryogenic. The first model has different constant terms and different slopes for the two groups:

logit πjk = β0 + α0zj + β1xk + α1zjxk = β0 + α0zj + (β1 + α1zj)xk,

where zj = 0 for the control group and zj = 1 for the treatment group.

The second model has different constant terms but the same slope for the two groups:

logit πjk = β0 + α0zj + β1xk.

We use MINITAB.
Example, cont. - Model 1

Binary Logistic Regression: y, n versus z, x = logcf, zx

Link Function: Logit

Logistic Regression Table
Predictor        Coef   SE Coef      Z      P
Constant     0.233910  0.628418   0.37  0.710
z             1.97721  0.998079   1.98  0.048
x = logcf  -0.0227412  0.126851  -0.18  0.858
zx          -0.318628  0.198881  -1.60  0.109

Log-Likelihood = -374.109
Test that all slopes are zero: G = 10.424, DF = 3, P-Value = 0.015

Goodness-of-Fit Tests
Method           Chi-Square  DF      P
Pearson           0.0276564   2  0.986
Deviance          0.0276407   2  0.986
Hosmer-Lemeshow   0.0276564   4  1.000
Example, cont. - Model 2

Binary Logistic Regression: y, n versus z, x = logcf

Link Function: Logit

Logistic Regression Table
Predictor       Coef    SE Coef      Z      P
Constant    0.876775   0.487037   1.80  0.072
z           0.406841   0.174624   2.33  0.020
x = logcf  -0.154596  0.0970260  -1.59  0.111

Log-Likelihood = -375.404
Test that all slopes are zero: G = 7.833, DF = 2, P-Value = 0.020

Goodness-of-Fit Tests
Method           Chi-Square  DF      P
Pearson             2.59800   3  0.458
Deviance            2.61878   3  0.454
Hosmer-Lemeshow     2.59800   4  0.627
Example - Poisson regression

Assume that we have the following observations

yi    2   3   6   7   8   9  10  12  15
xi   -1  -1   0   0   0   0   1   1   1

and that we want to fit a Poisson regression. Then we assume that the data are Poisson distributed.

A random variable Y is Poisson distributed with parameter µ > 0, Y ∼ Po(µ), if the probability function is given by

pY(y) = e^{−µ} µ^y / y!,   y = 0, 1, . . . .
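As a quick sketch (with a hypothetical µ), the probability function indeed sums to one and has mean µ:

```python
import math

def poisson_pmf(y, mu):
    """P(Y = y) for Y ~ Po(mu)."""
    return math.exp(-mu) * mu**y / math.factorial(y)

mu = 4.2
probs = [poisson_pmf(y, mu) for y in range(60)]   # tail beyond 60 is negligible here
print(sum(probs))                                 # close to 1
print(sum(y*p for y, p in enumerate(probs)))      # close to mu
```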
Example - Poisson regression, cont.

We assume the following model:

E Yi = µi = β1 + β2xi = x′iβ,

where xi = (1, xi)′ and β = (β1, β2)′. We take the link function g(µi) to be the identity function,

g(µi) = µi.

If we try to maximize the likelihood function we have to deal with

l(β1, β2) = Σ yi log(β1 + β2xi) − Σ log(yi!) − Nβ1 − β2 Σ xi,

which is difficult to maximize analytically. We use, for example, MATLAB.
Example - Poisson regression, cont.

The following MATLAB code can solve the problem (the constant term Σ log(yi!) is dropped):

y = [2 3 6 7 8 9 10 12 15]';
x = [-1 -1 0 0 0 0 1 1 1]';
m = 9;
lnL = @(b) -(y'*log(b(1) + b(2)*x) - m*b(1) - b(2)*sum(x));
[b, value] = fminsearch(lnL, [7 5]);

so the solution is

>> b

b =

    7.4516    4.9353
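A Python analogue of the MATLAB call (a sketch: it uses Newton-Raphson on the score equations instead of fminsearch, and drops the constant Σ log(yi!) just as the MATLAB code does) reaches the same solution:

```python
y = [2, 3, 6, 7, 8, 9, 10, 12, 15]
x = [-1, -1, 0, 0, 0, 0, 1, 1, 1]

b1, b2 = 7.0, 5.0                       # same starting point as fminsearch
for _ in range(50):
    # Gradient of l and entries of the negated Hessian for the identity-link model.
    g1 = sum(yi/(b1 + b2*xi) for yi, xi in zip(y, x)) - len(y)
    g2 = sum(yi*xi/(b1 + b2*xi) for yi, xi in zip(y, x)) - sum(x)
    h11 = sum(yi/(b1 + b2*xi)**2 for yi, xi in zip(y, x))
    h12 = sum(yi*xi/(b1 + b2*xi)**2 for yi, xi in zip(y, x))
    h22 = sum(yi*xi*xi/(b1 + b2*xi)**2 for yi, xi in zip(y, x))
    det = h11*h22 - h12*h12
    b1 += ( h22*g1 - h12*g2)/det        # Newton step
    b2 += (-h12*g1 + h11*g2)/det

print(b1, b2)   # approx 7.4516 and 4.9353, as in the MATLAB output
```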
Deviance - Poisson distribution

Assume that we have response variables Y1, . . . , Ym with Yi ∼ Po(µi). Assume that the big model we would like to test against is the one in which all µi, i = 1, . . . , m, are different. Then we have β1 = (µ1, . . . , µm)′ and the log-likelihood function is

l(β1; y) = Σ yi log µi − Σ µi − Σ log yi!,

and the MLE is µ̂i = yi, with the value

l(β̂1; y) = Σ yi log yi − Σ yi − Σ log yi!.
Deviance - Poisson distribution, cont.

Assume that the smaller model has p < m parameters, with MLEs λ̂i and the value

l(β̂0; y) = Σ yi log λ̂i − Σ λ̂i − Σ log yi!

of the log-likelihood function.

Now, the deviance is

D = 2( l(β̂1) − l(β̂0) ) = 2( Σ yi log(yi/λ̂i) − Σ (yi − λ̂i) ).
Deviance - Poisson distribution, cont.

The estimated means ŷi = λ̂i are called the fitted values, and one can show that Σ(yi − ŷi) = 0 in many cases.

The deviance is then

D = 2 Σ yi log(yi/ŷi) = 2 Σ oi log(oi/ei),

where oi is the observed value (yi) and ei is the estimated expected value (ŷi).
Deviance - Pearson's χ²-test

If one does a Taylor series expansion of the terms in the deviance, i.e.,

oi log(oi/ei) = (oi − ei) + (1/2)(oi − ei)²/ei + . . . ,

the deviance is approximately given by

D ≈ 2 Σ [ (oi − ei) + (1/2)(oi − ei)²/ei − (oi − ei) ] = Σ (oi − ei)²/ei = X².

Hence, the deviance is closely related to Pearson's χ²-test.
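A small numerical sketch (with hypothetical observed and expected counts chosen so that Σ(oi − ei) = 0) illustrates how close the two statistics are:

```python
import math

o = [18, 22, 31, 29]   # observed counts (hypothetical)
e = [20, 20, 30, 30]   # expected counts; note sum(o) == sum(e)

D  = 2*sum(oi*math.log(oi/ei) for oi, ei in zip(o, e))   # deviance
X2 = sum((oi - ei)**2/ei for oi, ei in zip(o, e))        # Pearson X^2
print(D, X2)   # the two statistics nearly coincide when o is close to e
```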
Example - Poisson regression, cont.

The fitted values in the numerical example above are ŷi = β̂1 + β̂2xi, with β̂1 = 7.4516 and β̂2 = 4.9353.

xi   yi    ŷi        yi log(yi/ŷi)
-1    2    2.5164   -0.4593
-1    3    2.5164    0.5274
 0    6    7.4516   -1.3000
 0    7    7.4516   -0.4377
 0    8    7.4516    0.5681
 0    9    7.4516    1.6991
 1   10   12.3869   -2.1406
 1   12   12.3869   -0.3808
 1   15   12.3869    2.8711
Σ    72   72         0.9473

The deviance is D = 2 · 0.9473 = 1.8946. If the small model fits as well as the maximal model, then approximately D ∼ χ²(9 − 2) = χ²(7). We choose the big model if D > χ²0.95(7) = 14.07. Hence, we cannot reject the small model!
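The deviance in the table can be reproduced directly from the observations and the fitted values ŷi = β̂1 + β̂2xi (a sketch using the estimates above):

```python
import math

y = [2, 3, 6, 7, 8, 9, 10, 12, 15]
x = [-1, -1, 0, 0, 0, 0, 1, 1, 1]
b1, b2 = 7.4516, 4.9353                  # ML-estimates from above

fitted = [b1 + b2*xi for xi in x]        # y_hat_i
D = 2*sum(yi*math.log(yi/fi) for yi, fi in zip(y, fitted))
print(D)   # approx 1.8946, well below the critical value 14.07
```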
Appendix: Estimators

There are several ways to estimate the parameters of a probability model:

- the method of moments,
- the least squares method,
- the maximum-likelihood method.

We now want to look more closely at the maximum-likelihood method, since it is the one we use most often.
Likelihood function

Let x1, . . . , xn be a random sample with independent observations from a distribution f(x; θ) that depends on the unknown parameters θ.

Definition. The function

L(θ) = ∏_{i=1}^{n} f(xi; θ) = f(x1; θ) · . . . · f(xn; θ)

is called the likelihood function.
ML-estimator

Definition. A value θ̂ for which the likelihood function L(θ) attains its highest value is called the maximum-likelihood estimate (ML-estimate) of θ.

Before one maximizes, it is often convenient to take the logarithm of the likelihood function,

l(θ) = log L(θ) = Σ_{i=1}^{n} log f(xi; θ),

and then differentiate with respect to the parameters over which one maximizes.
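As a small sketch (with a hypothetical Exp(λ) sample, density f(x; λ) = λe^{−λx}), one can compare a crude numerical maximization of l(λ) with the analytic ML-estimate λ̂ = 1/x̄:

```python
import math

xs = [0.8, 1.3, 0.4, 2.1, 0.9]           # hypothetical Exp(lambda) sample

def loglik(lam):
    # l(lambda) = n*log(lambda) - lambda*sum(x)
    return len(xs)*math.log(lam) - lam*sum(xs)

# Crude grid search over a range containing the maximum.
grid = [0.001*k for k in range(1, 5000)]
lam_numeric = max(grid, key=loglik)

lam_analytic = 1/(sum(xs)/len(xs))       # from dl/dlambda = 0
print(lam_numeric, lam_analytic)
```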
ML-estimator, cont.

Some properties of maximum-likelihood estimators (MLE) are given below.

If θ̂ is the MLE of θ then, under certain (rather mild) conditions, for large n we have

(θ̂ − E θ̂)/√(Var θ̂) ≈ N(0, 1).
ML-estimator, cont.

This can be generalized to the multidimensional case, where one can show that for large n we have

θ̂ ≈ N(θ, I⁻¹),

where the information matrix I is given by

I = (Ijk) = ( E(UjUk) ),  with  Ui = ∂l/∂θi.

One can also show that

I = ( −E( ∂²l/(∂θj∂θk) ) ).