
07/01/16


Multinomial logit regression

Introduction We now turn our attention to regression models for the analysis of categorical dependent variables with more than two response categories: Y = car owned (many possible makes, e.g. Skoda, Fiat, Citroen), Y = socioeconomic status (good, average, bad), Y = mobile phone provider (e.g. Virgin, Orange, T-Mobile). The dependent variable can be ordered (e.g. socioeconomic status) or unordered (e.g. car owned, mobile phone provider). Several of the models that we will study may be considered generalizations of logistic regression analysis to polychotomous data.


Introduction The multinomial logit model assumes that data are case specific; that is, each independent variable has a single value for each case. The multinomial logit model also assumes that the dependent variable cannot be perfectly predicted from the independent variables for any case. If the multinomial logit is used to model choices, it relies on the assumption of independence of irrelevant alternatives (IIA), which is not always desirable. This assumption states that the odds of preferring one class over another do not depend on the presence or absence of other "irrelevant" alternatives. For example, the relative probabilities of taking a car or bus to work do not change if a bicycle is added as an additional possibility.

Introduction Multinomial logit regression models, the multiclass extension of binary logistic regression, have long been used in econometrics in the context of modeling discrete choice (McFadden 1974; Bhat 1995; Train 2003) and in machine learning as a linear classification technique (Hastie, Tibshirani, and Friedman 2009) for tasks such as text classification (Nigam, Lafferty, and McCallum 1999).


Generalized odds ratio (OR) Agresti, Table 2.1, page 37

For the 2 x 2 table, a single measure can summarize the association. For the general I x J case, a single measure cannot summarize the association without loss of information. We want to estimate the association of aspirin use with myocardial infarction (MI, heart attack):

Group     Fatal attack   Nonfatal attack   No attack
Placebo   18             171               10,845
Aspirin   5              99                10,933

Generalized odds ratio (OR) We could collapse the fatal attack and nonfatal attack categories together to get:

Group     Fatal or nonfatal attack   No attack
Placebo   189                        10,845
Aspirin   104                        10,933

Then, the odds ratio of having a myocardial infarction is:

OR_MI = (189 × 10,933) / (104 × 10,845) = 1.83

Thus, the odds of an MI are 1.83 times higher when taking placebo compared to aspirin.
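The arithmetic can be double-checked in a few lines (Python here, purely for the calculation; the counts are those from the collapsed table):

```python
# Odds ratio from the collapsed 2x2 table (fatal or nonfatal attack vs. no attack).
placebo_attack, placebo_none = 189, 10845
aspirin_attack, aspirin_none = 104, 10933

odds_ratio = (placebo_attack * aspirin_none) / (aspirin_attack * placebo_none)
print(round(odds_ratio, 2))  # -> 1.83
```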


Generalized odds ratio (OR) For the general I x J case:

There are C(I,2) = I(I − 1)/2 pairs of rows and C(J,2) = J(J − 1)/2 pairs of columns. This can produce C(I,2) × C(J,2) estimates of the odds ratio. We are going to consider three cases of the generalized odds ratio.

Generalized odds ratio (OR) For rows a and b, and columns c and d, the odds ratio

OR = (π_ac π_bd) / (π_ad π_bc)

is the most loosely defined set of generalized odds ratios; there are C(I,2) × C(J,2) of this type. For our heart attack example, let's compare fatal heart attack to no heart attack:

OR_(fatal vs. no MI) = (18 × 10,933) / (5 × 10,845) = 3.63

That is, the odds of having a fatal heart attack vs. no heart attack are 3.63 times higher for the placebo group compared to the group taking aspirin.


Generalized odds ratio (OR) The local odds ratios are obtained by comparing adjacent rows and columns. That is:

OR_ij = (π_ij π_{i+1,j+1}) / (π_{i,j+1} π_{i+1,j})

For our heart attack example:
1. Fatal heart attack vs. nonfatal heart attack: OR = (18 × 99) / (171 × 5) = 2.08
2. Nonfatal heart attack vs. no heart attack: OR = (171 × 10,933) / (99 × 10,845) = 1.74

There are (I − 1)(J − 1) local odds ratios.
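Both local odds ratios can be computed directly from the 2 x 3 table; the short helper below is a sketch of the adjacent-cells formula:

```python
# 2x3 table: rows = (placebo, aspirin), columns = (fatal, nonfatal, no attack)
n = [[18, 171, 10845],
     [5, 99, 10933]]

def local_or(n, i, j):
    """Local odds ratio comparing rows i, i+1 and columns j, j+1."""
    return (n[i][j] * n[i + 1][j + 1]) / (n[i][j + 1] * n[i + 1][j])

print(round(local_or(n, 0, 0), 2))  # fatal vs. nonfatal -> 2.08
print(round(local_or(n, 0, 1), 2))  # nonfatal vs. no attack -> 1.74
```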

Generalized odds ratio (OR) For the I x J table, with I representing the last row and J representing the last column:

α_ij = (π_ij π_IJ) / (π_iJ π_Ij),   i = 1, …, I − 1,   j = 1, …, J − 1

represents the odds ratio obtained by referencing the last column and last row. For our heart attack example:

α_11 = (18 × 10,933) / (5 × 10,845) = 3.63
α_12 = (171 × 10,933) / (99 × 10,845) = 1.74
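The last-row/last-column ratios can also be computed directly from the table; the helper below sketches the formula:

```python
# 2x3 table: rows = (placebo, aspirin), columns = (fatal, nonfatal, no attack)
n = [[18, 171, 10845],
     [5, 99, 10933]]

def ref_or(n, i, j):
    """Odds ratio for cell (i, j) referencing the last row and last column."""
    I, J = len(n) - 1, len(n[0]) - 1
    return (n[i][j] * n[I][J]) / (n[i][J] * n[I][j])

print(round(ref_or(n, 0, 0), 2))  # alpha_11: fatal vs. no attack -> 3.63
print(round(ref_or(n, 0, 1), 2))  # alpha_12: nonfatal vs. no attack -> 1.74
```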


Generalized odds ratio (OR) Summary of the generalized methods:

1. We have focused on an arbitrary I x J table.
2. Just as logistic regression extended the odds ratio for a binary outcome to several predictors,
3. multinomial logistic regression will extend the OR estimation for the three cases presented previously to multiple predictors.

Multinomial regression In general, suppose the response for individual i is discrete with J levels:

Y_i = 1 with probability p_i1
Y_i = 2 with probability p_i2
⋮
Y_i = J with probability p_iJ

Let x_i be the covariates for individual i. If Y_i is binary (J = 2), we usually use the logistic regression model.


Multinomial regression When J = 2, we form J − 1 = 1 non-redundant logit. When J > 2, we often use polytomous (or multinomial) logistic regression, forming J − 1 non-redundant logits:

log[ P(Y_i = 1 | x_i1, …, x_iK) / P(Y_i = J | x_i1, …, x_iK) ] = β_10 + β_11 x_i1 + … + β_1K x_iK
log[ P(Y_i = 2 | x_i1, …, x_iK) / P(Y_i = J | x_i1, …, x_iK) ] = β_20 + β_21 x_i1 + … + β_2K x_iK
⋮
log[ P(Y_i = J−1 | x_i1, …, x_iK) / P(Y_i = J | x_i1, …, x_iK) ] = β_{J−1,0} + β_{J−1,1} x_i1 + … + β_{J−1,K} x_iK

Multinomial regression Each one of these logits can have a different set of parameters β_j. Basically, we can think of the j-th logit

log[ P(Y_i = j | x_i1, …, x_iK) / P(Y_i = J | x_i1, …, x_iK) ] = β_j0 + β_j1 x_i1 + … + β_jK x_iK

as a usual logistic regression model when restricting ourselves to categories j and J. Here we have used the last-column (reference) definition of the generalized odds ratio.


Multinomial regression Let's consider the probabilities when J > 2. Writing x_i' β_j = β_j0 + β_j1 x_i1 + … + β_jK x_iK:

p_ij = exp(x_i' β_j) / [ 1 + Σ_{j'=1}^{J−1} exp(x_i' β_{j'}) ]   when j < J

p_iJ = 1 / [ 1 + Σ_{j=1}^{J−1} exp(x_i' β_j) ]   when j = J

We know that Σ_{j=1}^{J} p_ij = 1.
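These two formulas can be sketched numerically; the coefficients below are made-up values for a J = 3 model with a single covariate:

```python
import math

# Category probabilities from the J - 1 baseline-category logits.
# beta[j] = (intercept, slope) for categories j = 1, ..., J - 1;
# category J is the reference. Coefficient values are made up.
beta = [(0.5, -1.0), (0.2, 0.3)]   # J = 3, one covariate
x = 1.5

def probabilities(beta, x):
    linpred = [b0 + b1 * x for b0, b1 in beta]
    denom = 1.0 + sum(math.exp(e) for e in linpred)
    p = [math.exp(e) / denom for e in linpred]  # p_ij for j < J
    p.append(1.0 / denom)                       # p_iJ, the reference
    return p

p = probabilities(beta, x)
print([round(v, 3) for v in p])
print(round(sum(p), 10))  # -> 1.0: the probabilities sum to one
```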

Multinomial regression Proof:

Σ_{j=1}^{J} p_ij = Σ_{j=1}^{J−1} p_ij + p_iJ

= [ Σ_{j=1}^{J−1} exp(x_i' β_j) ] / [ 1 + Σ_{j'=1}^{J−1} exp(x_i' β_{j'}) ] + 1 / [ 1 + Σ_{j'=1}^{J−1} exp(x_i' β_{j'}) ]

= [ 1 + Σ_{j=1}^{J−1} exp(x_i' β_j) ] / [ 1 + Σ_{j'=1}^{J−1} exp(x_i' β_{j'}) ]

= 1


Multinomial regression The log-odds for category j versus J given covariates (x_i1, …, x_iK) is:

log(p_ij / p_iJ) = β_j0 + β_j1 x_i1 + … + β_jk x_ik + … + β_jK x_iK

We want to know the interpretation of the β_jk's: β_jk is the log-odds ratio for response j versus J for a one-unit increase in covariate x_ik.

Multinomial regression So far we have looked at response j versus J. Using our previous heart attack example, β_11 would be the log-odds of having a fatal heart attack versus no heart attack for subjects on placebo compared to subjects on aspirin. Similarly, β_12 is the log-odds of having a non-fatal heart attack versus no heart attack. Previously, we stated that this model sufficiently describes all (I − 1)(J − 1) possible odds ratios. Therefore, we should be able to estimate the odds ratio for an arbitrary response j versus j′.


Multinomial regression Suppose we want the log-odds ratio for response j versus j′ for a one-unit increase in covariate x_ik. Then:

log(p_ij / p_ij′) = log[ (p_ij / p_iJ) / (p_ij′ / p_iJ) ] = log(p_ij / p_iJ) − log(p_ij′ / p_iJ)

so β_jk − β_j′k is the log-odds ratio for response j versus j′ for a one-unit increase in covariate x_ik.
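A quick numeric check of this identity, using made-up coefficients for a J = 3 model with one covariate:

```python
import math

# Verify that log(p_i1 / p_i2) equals the difference of the two
# linear predictors. Coefficient values are made up for illustration.
beta = [(0.5, -1.0), (0.2, 0.3)]   # (intercept, slope) for j = 1, 2
x = 1.5                            # a single covariate value

linpred = [b0 + b1 * x for b0, b1 in beta]
denom = 1.0 + sum(math.exp(e) for e in linpred)
p1, p2 = (math.exp(e) / denom for e in linpred)

lhs = math.log(p1 / p2)            # log-odds of category 1 vs. 2
rhs = linpred[0] - linpred[1]      # difference of linear predictors
print(abs(lhs - rhs) < 1e-9)       # -> True
```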

Maximum likelihood using the multinomial To write down the multinomial likelihood, we form J indicator random variables (J − 1 of which are non-redundant):

y_ij = 1 if Y_i = j, and y_ij = 0 otherwise.

Maximum likelihood can be used to estimate the parameters of these models, i.e. maximize:

L(β) = Π_{i=1}^{n} Π_{j=1}^{J} p_ij^{y_ij}

as a function of β = [β_1, β_2, …, β_{J−1}].
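As a numeric sketch of this formula (on the log scale), with toy probabilities and outcomes that are not output from any fitted model:

```python
import math

# Multinomial log-likelihood, log L = sum_i sum_j y_ij * log(p_ij),
# evaluated on toy values (not estimates from any fitted model).
p = [[0.2, 0.3, 0.5],   # category probabilities p_ij for case 1
     [0.6, 0.1, 0.3]]   # ... and for case 2
y = [[0, 0, 1],         # indicator coding: case 1 fell in category 3
     [1, 0, 0]]         # case 2 fell in category 1

loglik = sum(y[i][j] * math.log(p[i][j])
             for i in range(len(p)) for j in range(len(p[0])))
print(round(loglik, 4))  # log(0.5) + log(0.6) -> -1.204
```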


Example 1 For the following example a fictitious data set will be used. The data include a single categorical dependent variable with three categories, three continuous predictors, and 600 cases. First, we will import the data using the foreign package and get a summary.

Example 1 Next, we need to identify the outcome variable as a factor (i.e. categorical): mdata1


Example 1 Next, we need to load the mlogit package (Croissant, 2011), which contains the functions for conducting the multinomial logistic regression. Note that the mlogit package requires six other packages. Next, we need to modify the data so that the multinomial logistic regression function can process it. To do this, we need to expand the outcome variable (y), much as we would for dummy coding a categorical variable for inclusion in standard multiple regression.

Example 1 mdata2


Example 1 Now we can proceed with the multinomial logistic regression analysis using the mlogit function, followed by the ubiquitous summary function on the results. Note that the reference category is specified as 1. model.1


Example 1 The results show the logistic coefficient (B) for each predictor variable for each alternative category of the outcome variable ("alternative category" meaning any category other than the reference). The logistic coefficient is the expected amount of change in the logit for each one-unit change in the predictor. The logit is what is being predicted; it is the log-odds of membership in the specified category of the outcome variable (here the first value, 1, was specified, rather than the alternative values 2 or 3). The closer a logistic coefficient is to zero, the less influence the predictor has in predicting the logit.

Example 1 The table also displays the standard error, t statistic, and p-value. The t test for each coefficient is used to determine whether the coefficient is significantly different from zero. The pseudo R-squared (McFadden's R²) is treated as a measure of effect size, similar to how R² is treated in standard multiple regression. However, these types of metrics do not represent the amount of variance in the outcome variable accounted for by the predictor variables. Higher values indicate better fit, but they should be interpreted with caution.
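McFadden's pseudo R-squared can be written as 1 − (logL_full / logL_null); a toy computation, with assumed log-likelihood values, illustrates it:

```python
# McFadden's pseudo R-squared: 1 - (logL_full / logL_null).
# The two log-likelihood values below are assumed, for illustration only.
loglik_full = -800.0   # log-likelihood of the fitted (full) model
loglik_null = -1000.0  # log-likelihood of the intercept-only (null) model

mcfadden_r2 = 1 - loglik_full / loglik_null
print(round(mcfadden_r2, 3))  # -> 0.2
```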


Example 1 The likelihood-ratio chi-square test is an alternative test of goodness-of-fit. As with most chi-square based tests, however, it is prone to inflation as sample size increases. Here, we see model fit is significant, χ² = 1291.40, p < .001, which indicates our full model predicts significantly better, or more accurately, than the null model. To be clear, you want the p-value to be less than your established cutoff (generally 0.05) to indicate good fit. To get the Exp(B) values, we can use the exp function applied to the coefficients.

Example 1 print(exp(coef(model.1)))

The Exp(B) is the odds ratio associated with each predictor. We expect predictors which increase the logit to display Exp(B) values greater than 1.0, predictors which have no effect on the logit to display an Exp(B) of 1.0, and predictors which decrease the logit to have Exp(B) values less than 1.0. Keep in mind, the first two listed (alt2, alt3) are the intercepts.