Class 6
Qualitative Dependent Variable Models
SKEMA Ph.D programme 2010-2011
Lionel Nesta
Observatoire Français des Conjonctures Economiques
Structure of the class
1. The linear probability model
2. Maximum likelihood estimations
3. Binary logit models and some other models
4. Multinomial models
5. Ordered multinomial models
6. Count data models
The Linear Probability Model
The linear probability model
When the dependent variable is binary (0/1, for example, Y=1 if the firm innovates, 0 otherwise), OLS is called the linear probability model.
$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u$
How should one interpret βj? Provided that OLS4 – E(u|X)=0 – holds true, then:
$E(Y \mid X) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$
Y follows a Bernoulli distribution with expected value P. This model is called the linear probability model because its expected value, conditional on X, and written E(Y|X), can be interpreted as the conditional probability of the occurrence of Y given values of X.
$E(Y \mid X) = \Pr(Y=1 \mid X)$
$1 - E(Y \mid X) = \Pr(Y=0 \mid X)$
β measures the variation of the probability of success for a one-unit variation of X (ΔX=1)
$\beta = \frac{\partial E(Y \mid X)}{\partial X} = \frac{\partial \Pr(Y=1 \mid X)}{\partial X}$
The linear probability model
Non normality of errors
OLS6 : The error term is independent of all RHS and follows a normal distribution with zero mean and variance σ²
Since the error term equals either 1 − P or −P (the complement to unity of the conditional probability), it follows a Bernoulli-type distribution, not a normal distribution:

$u \nsim \text{Normal}(0, \sigma^2)$
Limits of the linear probability model (1)
Non normality of errors
[Figure: density of the residuals (horizontal axis: residuals from -1 to .5); the distribution is visibly non-normal]
Limits of the linear probability model (2)
Heteroskedastic errors
OLS5 : The variance of the error term, u, conditional on RHS, is the same for all values of RHS
The error term is itself distributed Bernoulli, and its variance depends on X. Hence it is heteroskedastic
$\text{Var}(u \mid x_1, x_2, \ldots, x_k) = \sigma^2$ fails, because

$\text{Var}(u) = P(1-P) = E(Y \mid X)\,\big(1 - E(Y \mid X)\big)$, which depends on X.
Limits of the linear probability model (2)
Heteroskedastic errors
[Figure: residuals plotted against fitted values (.4 to 1.2), showing the systematic pattern typical of heteroskedasticity]
Limits of the linear probability model (3)
Fallacious predictions
By definition, a probability is always in the unit interval [0;1]
But OLS does not guarantee this condition: predictions may lie outside the bounds [0;1]. Moreover, the marginal effect is constant, since P = E(Y|X) grows linearly with X. This is not very realistic (e.g. the probability of giving birth conditional on the number of children already born).

$0 \le E(Y \mid X) \le 1$
Limits of the linear probability model (3)
Fallacious predictions
[Figure: density of the fitted values (.4 to 1.2); part of the distribution lies above 1]
Fallacious predictions
Limits of the linear probability model (4)
A downward bias in the coefficient of determination R²
Observed values are 1 or 0, whereas predictions should lie between 0 and 1: [0;1].
Comparing predicted with observed values, the goodness of fit as assessed by the R² is systematically low.
Limits of the linear probability model (4)
Fallacious predictions

[Figure: the observed innovation dummy (0/1) plotted against the fitted values (.4 to 1.2)]
Fallacious predictions which lower the R2
Limits of the linear probability model (4)
1. Non-normality of errors: $u \nsim \text{Normal}(0, \sigma^2)$
2. Heteroskedastic errors: $\text{Var}(u \mid x_1, x_2, \ldots, x_k) \neq \sigma^2$
3. Fallacious predictions: $0 \le E(Y \mid X) \le 1$ is not guaranteed
4. A downward bias in the R²
Overcoming the limits of the LPM
1. Non-normality of errors: increase sample size
2. Heteroskedastic errors: use robust estimators (see the sketch below)
3. Fallacious predictions: perform non-linear or constrained regressions
4. A downward bias in the R²: do not use it as a measure of goodness of fit
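As a quick illustration, a minimal Stata sketch of the LPM, assuming the inno, lrdi, lassets, spe and biotech variables used later in this class:

* LPM: OLS on a binary dependent variable, with robust standard errors
regress inno lrdi lassets spe biotech, robust
* fitted values, and a count of the fallacious predictions outside [0;1]
predict p_lpm, xb
count if p_lpm < 0 | p_lpm > 1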
Persistent use of LPM
Although it has limits, the LPM is still used
1. In the process of data exploration (early stages of the research)
2. It is a good indicator of the marginal effect of the representative observation (at the mean)
3. When dealing with very large samples, least squares can overcome the complications imposed by maximum likelihood techniques: time of computation, endogeneity and panel data problems.
The LOGIT Model
Probability, odds and logit

We need to explain the occurrence of an event: the LHS variable takes two values: y = {0;1}.
In fact, we need to explain the probability of occurrence of the event, conditional on X: P(Y=y | X) ∈ [0;1].
OLS estimations are not adequate, because predictions can lie outside the interval [0;1].
We need a transformation that maps a real number z ∈ ]-∞;+∞[ into P(Y=y | X) ∈ [0;1].
The logistic transformation links a real number z ∈ ]-∞;+∞[ to P(Y=y | X) ∈ [0;1]. It is also called the link function.
The logit link function
$\Lambda(z) = \frac{e^z}{1+e^z}$ is called the logit link function.

Let us make sure that the transformation of z lies between 0 and 1: for $z \in \left]-\infty;+\infty\right[$ we have $e^z \in \left]0;+\infty\right[$, and since $e^z < 1 + e^z$, it follows that $\frac{e^z}{1+e^z} \in \left]0;1\right[$.
The logit model
Hence the probability of any event to occur is:

$P(y=1 \mid z) = \frac{e^z}{1+e^z}$

$P(y=0 \mid z) = 1 - P(y=1 \mid z) = 1 - \frac{e^z}{1+e^z} = \frac{1}{1+e^z}$
But what is z?
$\frac{P}{1-P} = \frac{e^z/(1+e^z)}{1/(1+e^z)} = e^z \quad\Rightarrow\quad \ln\!\left(\frac{P}{1-P}\right) = z$
The odds ratio is defined as the ratio of the probability and its complement. Taking the log yields z. Hence z is the log transform of the odds ratio.
This has two important characteristics:
1. $z \in \left]-\infty;+\infty\right[$ and $P(Y=1) \in [0;1]$
2. The probability is not linear in z (the plot linking z with P(Y=1) is not a straight line)
The odds ratio
Probability, odds and logit

P(Y=1)   Odds p/(1-p)   Odds value   Ln(odds)
0.01     1/99           0.01         -4.60
0.03     3/97           0.03         -3.48
0.05     5/95           0.05         -2.94
0.20     20/80          0.25         -1.39
0.30     30/70          0.43         -0.85
0.40     40/60          0.67         -0.41
0.50     50/50          1.00          0.00
0.60     60/40          1.50          0.41
0.70     70/30          2.33          0.85
0.80     80/20          4.00          1.39
0.95     95/5           19.0          2.94
0.97     97/3           32.3          3.48
0.99     99/1           99.0          4.60
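Any row of this table can be checked directly in Stata; for instance, for P(y=1) = 0.80:

display 0.80/(1 - 0.80)        // odds = 4.00
display ln(0.80/(1 - 0.80))    // log odds = 1.39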
The logit transformation
The preceding table matches levels of probability with the odds ratio.
The probability varies between 0 and 1; the odds varies between 0 and +∞; the log of the odds varies between -∞ and +∞.
Notice that the distribution of the log of the odds is symmetrical.
Logistic probability density distribution
[Figure: logistic probability density of the log odds ratio (horizontal axis from -10 to 10), symmetric around zero]
“The probability is not linear in z”
[Figure: P(y=1|z) against z (from -4 to 4): the S-shaped logistic curve, not a straight line]
The logit link function
The whole trick that can overcome the OLS problem is then to posit:

$z = X\beta = \beta_1 x_1 + \cdots + \beta_k x_k$

Hence $\frac{e^z}{1+e^z}$ is rewritten as $\frac{e^{X\beta}}{1+e^{X\beta}}$.
But how can we estimate the above equation knowing that we do not observe z?
Maximum likelihood estimations

OLS can be of no help here. We will use Maximum Likelihood Estimation (MLE) instead.
MLE is an alternative to OLS. It consists of finding the parameter values which are most consistent with the data we have.
In Statistics, the likelihood is defined as the joint probability to observe a given sample, given the parameters involved in the generating function.
One way to distinguish between OLS and MLE is as follows:
OLS adapts the model to the data you have: you only have one model derived from your data. MLE instead supposes there is an infinity of models, and chooses the model most likely to explain your data.
Let us assume that you have a sample of n random observations. Let f(yi ) be the probability that yi = 1 or yi = 0. The joint probability to observe jointly n values of yi is given by the likelihood function:
$f(y_1, y_2, \ldots, y_n) = \prod_{i=1}^{n} f(y_i)$
We need to specify the function f(.). It comes from the empirical discrete distribution of an event that can have only two outcomes: a success (yi = 1) or a failure (yi = 0). This is the binomial distribution. Hence:
$f(y_i) = \binom{n}{k}\, p^k (1-p)^{n-k}$, which for a single observation reduces to $f(y_i) = p^{y_i} (1-p)^{1-y_i}$
Likelihood functions
Knowing p (as the logit), having defined f(.), we come up with the likelihood function:
$L(y) = \prod_{i=1}^{n} f(y_i) = \prod_{i=1}^{n} p^{y_i} (1-p)^{1-y_i}$

$L(y, z) = \prod_{i=1}^{n} f(y_i, z_i) = \prod_{i=1}^{n} \left(\frac{e^{z_i}}{1+e^{z_i}}\right)^{y_i} \left(\frac{1}{1+e^{z_i}}\right)^{1-y_i}$

$L(y, x, \beta) = \prod_{i=1}^{n} f(y_i, X_i\beta) = \prod_{i=1}^{n} \left(\frac{e^{X_i\beta}}{1+e^{X_i\beta}}\right)^{y_i} \left(\frac{1}{1+e^{X_i\beta}}\right)^{1-y_i}$
The log transform of the likelihood function (the log likelihood) is much easier to manipulate, and is written:
$LL(y, z) = \sum_{i=1}^{n} y_i z_i - \sum_{i=1}^{n} \ln\left(1+e^{z_i}\right)$

$LL(y, x, \beta) = \sum_{i=1}^{n} y_i X_i\beta - \sum_{i=1}^{n} \ln\left(1+e^{X_i\beta}\right)$
Log likelihood (LL) functions
The LL function can yield an infinity of values for the parameters β.
Given the functional form of f(.) and the n observations at hand, which values of parameters β maximize the likelihood of my sample?
In other words, what are the most likely values of my unknown parameters β given the sample I have?
Maximum likelihood estimations
$\frac{\partial LL}{\partial \beta} = \sum_{i=1}^{n} \left( y_i - \Lambda_i \right) x_i = 0, \quad \text{where } \Lambda_i = \frac{e^{X_i\beta}}{1+e^{X_i\beta}}$

$\frac{\partial^2 LL}{\partial \beta\, \partial \beta'} = - \sum_{i=1}^{n} \Lambda_i \left( 1 - \Lambda_i \right) x_i x_i'$
However, there is no analytical solution to this non-linear problem. Instead, we rely on an optimization algorithm (Newton-Raphson).
The LL is globally concave and has a maximum. The gradient is used to compute the parameters of interest, and the Hessian is used to compute the variance-covariance matrix.
Maximum likelihood estimations
You need to imagine that the computer is going to generate all possible values of β, compute a likelihood value for each (vector of) values, and then choose the (vector of) β such that the likelihood is highest.
Example: Binary Dependent Variable
We want to explore the factors affecting the probability of being a successful innovator (inno = 1).
352 (81.7%) innovate and 79 (18.3%) do not.
The odds of carrying out a successful innovation are about 4 to 1 (as 352/79 = 4.45).
The log of the odds is 1.494 (z = 1.494)
For the sample (and the population?) of firms the probability of being innovative is four times higher than the probability of NOT being innovative
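These figures are easily reproduced in Stata from the raw counts:

display 352/79        // odds of innovating = 4.456
display ln(352/79)    // log odds: z = 1.494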
Instruction Stata : logit
logit y x1 x2 x3 … xk [if] [weight] [, options]
Options
noconstant : estimates the model without the constant
robust : estimates robust variances, also in case of heteroscedasticity
if : it allows to select the observations we want to include in the analysis
weight : it allows to weight different observations
Logistic Regression with STATA
Let's start and run a constant-only model: logit inno

. logit inno

Iteration 0:   log likelihood = -205.30803

Logistic regression                           Number of obs   =        431
                                              LR chi2(0)      =       0.00
                                              Prob > chi2     =          .
Log likelihood = -205.30803                   Pseudo R2       =     0.0000

        inno |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
       _cons |   1.494183   .1244955    12.00   0.000     1.250177     1.73819
Goodness of fit
Parameter estimates, Standard errors and z values
Logistic Regression with STATA
What does this simple model tell us ?
Remember that we need to use the logit formula to transform the logit into a probability :
$P(Y=1 \mid X) = \frac{e^{X\beta}}{1+e^{X\beta}}$
Interpretation of Coefficients
The constant 1.494 must be interpreted as the log of the odds ratio.
Using the logit link function, the average probability to innovate is
dis exp(_b[_cons])/(1+exp(_b[_cons]))
We find exactly the empirical sample value, 81.7%:

$P = \frac{e^{1.494}}{1+e^{1.494}} = 0.817$
Interpretation of Coefficients
A positive coefficient indicates that the probability of innovation success increases with the corresponding explanatory variable.
A negative coefficient implies that the probability to innovate decreases with the corresponding explanatory variable.
Warning! One of the problems encountered in interpreting probabilities is their non-linearity: the probabilities do not vary in the same way according to the level of the regressors.
This is why, in practice, one usually computes the probability of the event occurring at the average point of the sample.
Interpretation of Coefficients
Let's run the more complete model: logit inno lrdi lassets spe biotech

. logit inno lrdi lassets spe biotech

Iteration 0:   log likelihood = -205.30803
Iteration 1:   log likelihood = -167.71312
Iteration 2:   log likelihood = -163.57746
Iteration 3:   log likelihood = -163.45376
Iteration 4:   log likelihood = -163.45352

Logistic regression                           Number of obs   =        431
                                              LR chi2(4)      =      83.71
                                              Prob > chi2     =     0.0000
Log likelihood = -163.45352                   Pseudo R2       =     0.2039

        inno |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
        lrdi |   .7527497   .2110683     3.57   0.000     .3390634    1.166436
     lassets |    .997085   .1368534     7.29   0.000     .7288574    1.265313
         spe |   .4252844   .4204924     1.01   0.312    -.3988654    1.249434
     biotech |   3.799953    .577509     6.58   0.000     2.668056     4.93185
       _cons |  -11.63447   1.937191    -6.01   0.000    -15.43129   -7.837643
Using the sample mean values of lrdi, lassets, spe and biotech, we compute the conditional probability:

$P = \frac{e^{-11.63 + 0.75\,\overline{lrdi} + 0.99\,\overline{lassets} + 0.43\,\overline{spe} + 3.79\,\overline{biotech}}}{1 + e^{-11.63 + 0.75\,\overline{lrdi} + 0.99\,\overline{lassets} + 0.43\,\overline{spe} + 3.79\,\overline{biotech}}} = \frac{e^{1.953}}{1+e^{1.953}} = 0.8758$
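A sketch of the same computation in Stata, using the fact that for a linear index the mean of Xb equals Xb evaluated at the sample means:

quietly logit inno lrdi lassets spe biotech
predict z_hat, xb                           // linear index z = Xb for each firm
quietly summarize z_hat                     // its mean is 1.953
display exp(r(mean))/(1 + exp(r(mean)))     // 0.8758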
Interpretation of Coefficients
It is often useful to know the marginal effect of a regressor on the probability that the event occurs (innovation).
As the probability is a non-linear function of explanatory variables, the change in probability due to a change in one of the explanatory variables is not identical if the other variables are at the average, median or first quartile, etc. level.
prvalue provides the predicted probabilities of a logit model (or any other):

prvalue
prvalue, x(lassets=10) rest(mean)
prvalue, x(lassets=11) rest(mean)
prvalue, x(lassets=12) rest(mean)
prvalue, x(lassets=10) rest(median)
prvalue, x(lassets=11) rest(median)
prvalue, x(lassets=12) rest(median)
Marginal Effects
prchange provides the marginal effect of each explanatory variable, for a variety of changes in its values:
prchange [varlist] [if] [in range] ,x(variables_and_values) rest(stat) fromto
prchange
prchange, fromto
prchange , fromto x(size=10.5) rest(mean)
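Note that prvalue and prchange come from Long and Freese's user-written SPost package. If it is not installed, a rough built-in alternative for marginal effects at the sample means is mfx:

mfx, at(mean)    // marginal effects of all regressors at the sample means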
Marginal Effects
Goodness of Fit Measures
In ML estimations, there is no such measure as the R2
But the log likelihood measure can be used to assess the goodness of fit. Note the following:
The higher the number of observations, the lower the joint probability, and the more the LL measure goes towards -∞.
Given the number of observations, the better the fit, the higher the LL measure (since it is always negative, the closer to zero it is).
The philosophy is to compare two models looking at their LL values. One is meant to be the constrained model, the other one is the unconstrained model.
Goodness of Fit Measures
A model is said to be constrained when the observer sets the parameters associated with some variables to zero.
A model is said to be unconstrained when the observer releases this assumption and allows the parameters associated with some variables to be different from zero.
For example, we can compare two models: one with no explanatory variables, and one with all our explanatory variables. The one with no explanatory variables implicitly assumes that all parameters are equal to zero. Hence it is the constrained model, because we (implicitly) constrain the parameters to be nil.
The likelihood ratio test (LR test)

The most used measure of goodness of fit in ML estimations is the likelihood ratio. It compares the LL values of the unconstrained and the constrained models, and the resulting statistic is distributed χ².
If the difference in the LL values is (not) important, it is because the set of explanatory variables brings in (in)significant information. The null hypothesis H0 is that the model brings no significant information.
High LR values will lead the observer to reject H0 and accept the alternative hypothesis Ha that the set of explanatory variables does significantly explain the outcome.
$LR = 2\left(\ln L_{unc} - \ln L_{c}\right)$
The McFadden Pseudo R2
We also use the McFadden pseudo-R² (1973). Its interpretation is analogous to the OLS R²; however, it is biased downward and generally remains low.
The pseudo-R² also rests on the comparison of the unconstrained and the constrained log likelihoods, and lies between 0 and 1:

$\text{Pseudo-}R^2_{MF} = \frac{\ln L_{c} - \ln L_{unc}}{\ln L_{c}} = 1 - \frac{\ln L_{unc}}{\ln L_{c}}$
Goodness of Fit Measures
Constrained model: . logit inno (log likelihood = -205.30803, Pseudo R2 = 0.0000)

Unconstrained model: . logit inno lrdi lassets spe biotech, nolog (log likelihood = -163.45352, Pseudo R2 = 0.2039)

(Both outputs are shown in full above.)
$LR = 2\left(\ln L_{unc} - \ln L_{c}\right) = 2 \times \left(-163.5 + 205.3\right) = 83.7$

$\text{Pseudo-}R^2_{MF} = 1 - \frac{\ln L_{unc}}{\ln L_{c}} = 1 - \frac{-163.5}{-205.3} = 0.204$
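A sketch of the same computations from Stata's stored results, where e(ll) holds the log likelihood of the last estimation:

quietly logit inno
scalar ll_c = e(ll)
quietly logit inno lrdi lassets spe biotech
scalar ll_u = e(ll)
display 2*(ll_u - ll_c)    // LR statistic = 83.7
display 1 - ll_u/ll_c      // McFadden pseudo-R2 = 0.204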
Other usage of the LR test
The LR test can also be generalized to compare any two models, the constrained one being nested in the unconstrained one.
Any variable which is added to a model can be tested for its explanatory power as follows:

logit [constrained model]
est store [name1]
logit [unconstrained model]
est store [name2]
lrtest [name2] [name1]
Goodness of Fit Measures
LR test on the added variable (biotech)
$LR = 2\left(\ln L_{unc} - \ln L_{c}\right) = 2 \times \left(-163.5 + 191.8\right) = 56.8$
. logit inno lrdi lassets spe, nolog

Logistic regression                           Number of obs   =        431
                                              LR chi2(3)      =      26.93
                                              Prob > chi2     =     0.0000
Log likelihood = -191.84522                   Pseudo R2       =     0.0656

        inno |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
        lrdi |   .9275668   .1979951     4.68   0.000     .5395037     1.31563
     lassets |   .3032756   .0792032     3.83   0.000     .1480402    .4585111
         spe |   .3739987   .3800765     0.98   0.325    -.3709376    1.118935
       _cons |  -.4703812   .9313494    -0.51   0.614    -2.295793     1.35503

. est store model1

. logit inno lrdi lassets spe biotech, nolog
(output as above: log likelihood = -163.45352)

. est store model2

. lrtest model2 model1

Likelihood-ratio test                         LR chi2(1)  =      56.78
(Assumption: model1 nested in model2)         Prob > chi2 =     0.0000
Quality of predictions
Lastly, one can compare the quality of the predictions with the observed outcome variable (dummy variable).
One must assume that when the predicted probability is higher than 0.5, the prediction is that the event will occur (most likely outcome).
One can then compare how good the predictions are relative to the actual outcome variable.
STATA does this for us:
estat class
Quality of predictions
. estat class

Logistic model for inno

              -------- True --------
Classified  |      D         ~D     |   Total
------------+-----------------------+--------
     +      |    337         51     |     388
     -      |     15         28     |      43
------------+-----------------------+--------
   Total    |    352         79     |     431

Classified + if predicted Pr(D) >= .5
True D defined as inno != 0

Sensitivity                    Pr( +| D)   95.74%
Specificity                    Pr( -|~D)   35.44%
Positive predictive value      Pr( D| +)   86.86%
Negative predictive value      Pr(~D| -)   65.12%
False + rate for true ~D       Pr( +|~D)   64.56%
False - rate for true D        Pr( -| D)    4.26%
False + rate for classified +  Pr(~D| +)   13.14%
False - rate for classified -  Pr( D| -)   34.88%
Correctly classified                       84.69%
The logit model is only one way of modeling binary choice models.
The probit model is another way; it is actually more widely used than the logit model and assumes a normal distribution (not a logistic one) for the z values.
The complementary log-log model is used when the occurrence of the event is very rare, the distribution of z being asymmetric.
Other Binary Choice models
Probit model:

$\Pr(Y=1 \mid X) = \int_{-\infty}^{X\beta} \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz = \Phi(X\beta)$

Complementary log-log model:

$\Pr(Y=1 \mid X) = 1 - \exp\left(-\exp(X\beta)\right)$
Likelihood functions and Stata commands
Logit: $L(y, x, \beta) = \prod_{i=1}^{n} f(y_i, x_i, \beta) = \prod_{i=1}^{n} \left(\frac{e^{X_i\beta}}{1+e^{X_i\beta}}\right)^{y_i} \left(\frac{1}{1+e^{X_i\beta}}\right)^{1-y_i}$

Probit: $L(y, x, \beta) = \prod_{i=1}^{n} f(y_i, x_i, \beta) = \prod_{i=1}^{n} \Phi(X_i\beta)^{y_i} \left[1-\Phi(X_i\beta)\right]^{1-y_i}$

Complementary log-log: $L(y, x, \beta) = \prod_{i=1}^{n} f(y_i, x_i, \beta) = \prod_{i=1}^{n} \left[1-\exp(-\exp(X_i\beta))\right]^{y_i} \left[\exp(-\exp(X_i\beta))\right]^{1-y_i}$
Example:
logit inno rdi lassets spe pharma
probit inno rdi lassets spe pharma
cloglog inno rdi lassets spe pharma
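A sketch for collecting the three sets of estimates side by side with Stata's built-in estimates table (the OLS column of the comparison below would come from regress):

quietly logit inno rdi lassets spe pharma
estimates store m_logit
quietly probit inno rdi lassets spe pharma
estimates store m_probit
quietly cloglog inno rdi lassets spe pharma
estimates store m_cloglog
estimates table m_logit m_probit m_cloglog, b(%9.3f) t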
Probability Density Functions
[Figure: probability density functions of the probit, logit and complementary log-log transformations (x from -4 to 4)]
Cumulative Distribution Functions
[Figure: cumulative distribution functions of the probit, logit and complementary log-log transformations (x from -4 to 4)]
Comparison of models

                    OLS        Logit      Probit     C log-log
Ln(R&D intensity)   0.110      0.752      0.422      0.354
                    [3.90]***  [3.57]***  [3.46]***  [3.13]***
ln(Assets)          0.125      0.997      0.564      0.493
                    [8.58]***  [7.29]***  [7.53]***  [7.19]***
Spe                 0.056      0.425      0.224      0.151
                    [1.11]     [1.01]     [0.98]     [0.76]
Biotech Dummy       0.442      3.799      2.120      1.817
                    [7.49]***  [6.58]***  [6.77]***  [6.51]***
Constant            -0.843     -11.634    -6.576     -6.086
                    [3.91]**   [6.01]***  [6.12]***  [6.08]***
Observations        431        431        431        431
Absolute t values in brackets (OLS), z values for other models.
* significant at 10%, ** 5%, *** 1%
Comparison of marginal effects

                    OLS      Logit    Probit   C log-log
Ln(R&D intensity)   0.110    0.082    0.090    0.098
ln(Assets)          0.125    0.110    0.121    0.136
Specialisation      0.056    0.046    0.047    0.042
Biotech Dummy       0.442    0.368    0.374    0.379
For all models logit, probit and cloglog, marginal effects have been computed for a one-unit variation (around the mean) of the variable at stake, holding all other variables at the sample mean values.
Multinomial LOGIT Models
Multinomial models

Let us now focus on the case where the dependent variable has several outcomes (or is multinomial). For example, innovative firms may need to collaborate with other organizations. One can code this type of interaction as follows:
Collaborate with universities (modality 1); Collaborate with large incumbent firms (modality 2); Collaborate with SMEs (modality 3); Do it alone (modality 4).
Or, studying firm survival: Survival (modality 1); Liquidation (modality 2); Mergers & acquisitions (modality 3).
One could first perform three logistic regressions as follows:

$\ln\frac{P(Y=1 \mid X)}{1-P(Y=1 \mid X)} = \beta_0^{(1)} + \beta_1^{(1)} x_1 + \cdots + \beta_m^{(1)} x_m$

$\ln\frac{P(Y=2 \mid X)}{1-P(Y=2 \mid X)} = \beta_0^{(2)} + \beta_1^{(2)} x_1 + \cdots + \beta_m^{(2)} x_m$

$\ln\frac{P(Y=3 \mid X)}{1-P(Y=3 \mid X)} = \beta_0^{(3)} + \beta_1^{(3)} x_1 + \cdots + \beta_m^{(3)} x_m$

where 1 = survival, 2 = liquidation, 3 = M&A.

Exercise:
1. Open the file mlogit.dta
2. Estimate for each type of outcome the conditional probability of the event for the representative firm, using: time (log_time), size (log_labour), firm age (entry_age), spin out (spin_out), cohort (cohort_*); see the sketch below.
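A sketch of the corresponding Stata session, using the estimation command that appears later in this class:

use mlogit.dta, clear
mlogit type_exit log_time log_labour entry_age entry_spin cohort_*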
Multinomial models
Estimating the three logits above separately, the predicted probabilities for the representative firm are:

$P(Y=1 \mid X) = 0.8771$
$P(Y=2 \mid X) = 0.0398$
$P(Y=3 \mid X) = 0.0679$

$\sum_k P(Y=k \mid X) = 0.9848 \neq 1$
The need for multinomial models
First, the sum of all conditional probabilities should add up to unity:

$\sum_{j=0}^{k} P(Y=j \mid X) = 1$

Second, for k outcomes we only need to estimate (k - 1) modalities, since the remaining probability follows from the others:

$P(Y=0 \mid X) = 1 - \sum_{j \neq 0} P(Y=j \mid X)$
Multinomial models
Third, the multinomial model is a simultaneous (as opposed to sequential) estimation model comparing the odds of each modality with respect to all others. With three outcomes, we have:

$\ln\frac{P(Y=1 \mid X)}{P(Y=0 \mid X)} = \beta_0^{(1|0)} + \beta_1^{(1|0)} x_1 + \cdots + \beta_m^{(1|0)} x_m$

$\ln\frac{P(Y=2 \mid X)}{P(Y=0 \mid X)} = \beta_0^{(2|0)} + \beta_1^{(2|0)} x_1 + \cdots + \beta_m^{(2|0)} x_m$

$\ln\frac{P(Y=1 \mid X)}{P(Y=2 \mid X)} = \beta_0^{(1|2)} + \beta_1^{(1|2)} x_1 + \cdots + \beta_m^{(1|2)} x_m$
Multinomial logit models
Note that there is redundancy, since:

$\ln\frac{P(Y=1 \mid X)}{P(Y=0 \mid X)} = \ln\frac{P(Y=1 \mid X)}{P(Y=2 \mid X)} + \ln\frac{P(Y=2 \mid X)}{P(Y=0 \mid X)}$

With $\ln\frac{P(Y=1 \mid X)}{P(Y=0 \mid X)} = x\beta^{(1|0)}$, $\ln\frac{P(Y=2 \mid X)}{P(Y=0 \mid X)} = x\beta^{(2|0)}$ and $\ln\frac{P(Y=1 \mid X)}{P(Y=2 \mid X)} = x\beta^{(1|2)}$, it follows that

$x\beta^{(1|0)} = x\beta^{(1|2)} + x\beta^{(2|0)} \quad\Rightarrow\quad \beta^{(1|0)} = \beta^{(1|2)} + \beta^{(2|0)}$
Fourth, the multinomial logit model estimates the (k - 1) outcomes subject to the following constraint.

Multinomial logit models

With k outcomes, the probability of occurrence of event j reads:

$P(Y=j \mid X) = \frac{e^{x\beta^{(j|0)}}}{\sum_{j=0}^{k} e^{x\beta^{(j|0)}}}$

By convention, outcome 0 is the base outcome.
Multinomial logit models
Note that for the base outcome,

$x\beta^{(j|j)} = \ln\frac{P(Y=j \mid X)}{P(Y=j \mid X)} = \ln(1) = 0, \quad \text{hence } \beta^{(j|j)} = 0 \ \forall j$

so that

$P(Y=0 \mid X) = \frac{1}{1 + \sum_{j=1}^{k} e^{x\beta^{(j|0)}}}$

$P(Y=j \mid X) = \frac{e^{x\beta^{(j|0)}}}{1 + \sum_{j=1}^{k} e^{x\beta^{(j|0)}}} = \frac{e^{x\beta^{(j|0)}}}{\sum_{j=0}^{k} e^{x\beta^{(j|0)}}}$
Multinomial logit models
Binomial logit as multinomial logit

Let us rewrite the probability of the event Y=1. The binomial logit is a special case of the multinomial logit where only two outcomes are being analyzed:

$P(Y=1 \mid X) = \frac{e^{x\beta^{(1|0)}}}{1 + e^{x\beta^{(1|0)}}} = \frac{e^{x\beta^{(1|0)}}}{e^{x\beta^{(0|0)}} + e^{x\beta^{(1|0)}}} = \frac{e^{x\beta^{(1|0)}}}{\sum_{k \in \{0;1\}} e^{x\beta^{(k|0)}}}$
Let us assume that you have a sample of n random observations. Let f(yi) be the probability that yi = j. The joint probability to observe jointly the n values of yi is given by the likelihood function:

$f(y_1, y_2, \ldots, y_n) = \prod_{i=1}^{n} f(y_i)$

We need to specify the function f(.). It comes from the empirical discrete distribution of an event that can have several outcomes. This is the multinomial distribution. Hence:

$f(y_i) = p_0^{d_{i0}}\, p_1^{d_{i1}} \cdots p_k^{d_{ik}} = \prod_{j=0}^{k} p_j^{d_{ij}}, \quad \text{where } d_{ij} = 1 \text{ if } y_i = j \text{ and } 0 \text{ otherwise}$
Likelihood functions
The maximum likelihood function

The maximum likelihood function reads:

$L(y) = \prod_{i=1}^{n} f(y_i) = \prod_{i=1}^{n} \prod_{j=0}^{k} p_j^{d_{ij}}$

With the logit specification of the probabilities, this becomes:

$L(y, x, \beta) = \prod_{i=1}^{n} \prod_{j=1}^{k} \left(\frac{e^{x_i\beta^{(j|0)}}}{1+\sum_{j'=1}^{k} e^{x_i\beta^{(j'|0)}}}\right)^{d_{ij}} \left(\frac{1}{1+\sum_{j'=1}^{k} e^{x_i\beta^{(j'|0)}}}\right)^{d_{i0}}$
The maximum likelihood function

The log transform of the likelihood yields:

$LL(y, x, \beta) = \sum_{i=1}^{n} \sum_{j=1}^{k} d_{ij}\, x_i\beta^{(j|0)} - \sum_{i=1}^{n} \ln\left(1 + \sum_{j=1}^{k} e^{x_i\beta^{(j|0)}}\right)$
Multinomial logit models
Stata Instruction : mlogit
mlogit y x1 x2 x3 … xk [if] [weight] [, options]
Options : noconstant : omits the constant
robust : controls for heteroskedasticity
if : select observations
weight : weights observations
use mlogit.dta, clear
mlogit type_exit log_time log_labour entry_age entry_spin cohort_*
Base outcome, chosen by STATA, with the highest empirical frequency
Goodness of fit
Parameter estimates, Standard errors and z values
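After mlogit, the predicted probability of each outcome can be recovered with predict; the three variable names below are hypothetical, one per exit type:

predict p_survival p_liquidation p_boughtout
* unlike the three separate logits, these probabilities sum to one
summarize p_survival p_liquidation p_boughtout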
Multinomial logit models
Interpretation of coefficients

The interpretation of coefficients always refers to the base category.

Does the probability of being bought out decrease over time? No! Relative to survival, the probability of being bought out decreases over time.
Is the probability of being bought out lower for spinoffs? No! Relative to survival, the probability of being bought out is lower for spinoffs.
Interpretation of coefficients
Relative to liquidation, the probability of being bought out is higher for spinoffs. Since $\beta^{(1|2)} = \beta^{(1|0)} - \beta^{(2|0)}$, this contrast can be tested directly:

lincom [boughtout]entry_spin - [death]entry_spin
Changing base outcome

mcross provides other estimates by changing the base outcome. Mind the new base outcome! Looking at being bought out relative to liquidation: relative to liquidation, the probability of being bought out is higher for spinoffs. And we observe the same results as before.
Independence of irrelevant alternatives (IIA)

The model assumes that each pair of outcomes is independent from all other alternatives. In other words, other alternatives are irrelevant.
From a statistical viewpoint, this is tantamount to assuming independence of the error terms across pairs of alternatives.
A simple way to test the IIA property is to estimate the model leaving out one modality (the restricted model), and to compare the parameters with those of the complete model:
If IIA holds, the parameters should not change significantly.
If IIA does not hold, the parameters should change significantly.

H0: The IIA property is valid
H1: The IIA property is not valid

$H = \left(\hat\beta_R - \hat\beta_C\right)' \left[\widehat{\text{Var}}\left(\hat\beta_R\right) - \widehat{\text{Var}}\left(\hat\beta_C\right)\right]^{-1} \left(\hat\beta_R - \hat\beta_C\right)$

The H statistic (H stands for Hausman) follows a χ² distribution with M degrees of freedom (M being the number of parameters).
STATA application: the IIA test
H0: The IIA property is valid
H1: The IIA property is not valid
mlogtest, hausman
Application of the IIA test

mlogtest, hausman

We compare the parameters of the model "liquidation relative to bought-out", estimated simultaneously with "survival relative to bought-out", with the parameters of the same model estimated without "survival relative to bought-out".
H0: The IIA property is valid
H1: The IIA property is not valid
Application of the IIA test

mlogtest, hausman

The conclusion is that the outcome survival significantly alters the choice between liquidation and bought-out.
In fact, for a company, being bought out must be seen as a way to remain active, at the cost of losing control over economic decisions, notably investment.
H0: The IIA property is valid
H1: The IIA property is not valid
Ordered Multinomial LOGIT Models
Ordered multinomial models
Let us now concentrate on the case where the dependent variable is a discrete integer which indicates an intensity. Opinion surveys make extensive use of such so-called Likert scales:
Obstacles to innovation (scale from 1 to 5); Intensity of collaboration (scale from 1 to 5); Marketing surveys (Dislike (1) - Like (7)); Student grades; Opinion tests; etc.
Ordered multinomial models
Such variables depict a vertical, quantitative scale, so that one can think of them as describing the interval in which an unobserved latent variable y* lies:

$y = 1 \text{ if } y^* \le \alpha_1$
$y = 2 \text{ if } \alpha_1 < y^* \le \alpha_2$
$y = 3 \text{ if } \alpha_2 < y^* \le \alpha_3$
$\vdots$
$y = k \text{ if } y^* > \alpha_{k-1}$

where the αj are unknown bounds to be estimated.
Ordered multinomial models
We assume that the latent variable y* is a linear combination of the set of all explanatory variables:

$y_i^* = x_i\beta + u_i$

where ui follows a cumulative distribution function F(.). The probabilities of each observed occurrence y (not y*) then follow the cdf F(.). Let us look at the probability that y = 1:
$P(y=1) = P(y^* \le \alpha_1)$
$P(y=1) = P(x_i\beta + u_i \le \alpha_1)$
$P(y=1) = P(u_i \le \alpha_1 - x_i\beta)$
$P(y=1) = F(\alpha_1 - x_i\beta) = \frac{e^{\alpha_1 - x_i\beta}}{1 + e^{\alpha_1 - x_i\beta}}$ (for the logistic case)
Ordered multinomial models
The probability that y = 2 is:

$P(y=2) = P(\alpha_1 < y^* \le \alpha_2) = F(\alpha_2 - x_i\beta) - F(\alpha_1 - x_i\beta) = \frac{e^{\alpha_2 - x_i\beta}}{1+e^{\alpha_2 - x_i\beta}} - \frac{e^{\alpha_1 - x_i\beta}}{1+e^{\alpha_1 - x_i\beta}}$

Altogether we have:

$P(Y=1) = F(\alpha_1 - x_i\beta)$
$P(Y=2) = F(\alpha_2 - x_i\beta) - F(\alpha_1 - x_i\beta)$
$P(Y=3) = F(\alpha_3 - x_i\beta) - F(\alpha_2 - x_i\beta)$
$\vdots$
$P(Y=k) = 1 - F(\alpha_{k-1} - x_i\beta)$
Probability in an ordered model

[Figure: density of ui partitioned by the cutoffs α1 - xiβ, α2 - xiβ, α3 - xiβ, ..., αk-1 - xiβ into the regions y=1, y=2, y=3, ..., y=k]
The likelihood function
The likelihood function is:

$L(\beta, \alpha) = \prod_{i=1}^{n} \prod_{j=1}^{k} \left[F(\alpha_j - x_i\beta) - F(\alpha_{j-1} - x_i\beta)\right]^{d_{ij}}$

with $F(\alpha_0 - x_i\beta) = 0$ and $F(\alpha_k - x_i\beta) = 1$.

If ui follows a logistic distribution, the likelihood and log likelihood functions read:

$L(\beta, \alpha) = \prod_{i=1}^{n} \prod_{j=1}^{k} \left[\frac{e^{\alpha_j - x_i\beta}}{1+e^{\alpha_j - x_i\beta}} - \frac{e^{\alpha_{j-1} - x_i\beta}}{1+e^{\alpha_{j-1} - x_i\beta}}\right]^{d_{ij}}$

$LL(\beta, \alpha) = \sum_{i=1}^{n} \sum_{j=1}^{k} d_{ij} \ln\left[\frac{e^{\alpha_j - x_i\beta}}{1+e^{\alpha_j - x_i\beta}} - \frac{e^{\alpha_{j-1} - x_i\beta}}{1+e^{\alpha_{j-1} - x_i\beta}}\right]$
Ordered multinomial logit models
Stata Instruction : ologit
ologit y x1 x2 x3 … xk [if] [weight] [, options]
Options : noconstant : omits the constant
robust : controls for heteroskedasticity
if : select observations
weight : weights observations
Ordered multinomial models

use est_var_qual.dta, clear
ologit innovativeness size rdi spe biotech
Goodness of fit
Estimated parameters
Cutoff points
Interpretation of coefficients
A positive (negative) sign indicates a positive (negative) relationship between the independent variable and the order (or rank).

How does one interpret the cutoff values? The model is: $\text{Score}_i = x_i\beta + u_i$. What, then, is the probability that Y = 1, i.e. the probability that the score is inferior to the first cutoff point?

$P(y=1) = P(x_i\beta + u_i \le \alpha_1) = P(270.5 + u_i \le 268.6) = P(u_i \le -1.95)$

$P(y=1) = \frac{e^{-1.95}}{1+e^{-1.95}} = 0.1245$
What is the probability that Y = 2: P(Y = 2)?

$P(Y=2) = F(\alpha_2 - x_i\beta) - F(\alpha_1 - x_i\beta)$

$P(y \le 2) = P(x_i\beta + u_i \le \alpha_2) = P(270.5 + u_i \le 269.3) = P(u_i \le -1.2) = \frac{e^{-1.2}}{1+e^{-1.2}} = 0.2321$

$P(Y=2) = 0.2321 - 0.1245 = 0.1076$
Interpretation of coefficients
STATA computation of predicted probabilities

prvalue computes the predicted probabilities.
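A sketch of the same computation with Stata's built-in predict, assuming innovativeness has three ordered categories (the number of new variable names must match the number of categories):

quietly ologit innovativeness size rdi spe biotech
predict pr1 pr2 pr3    // one predicted probability per category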
Count Data Models Part 1
The Poisson Model
Count data models
Let us now focus on outcome counting the number of
occurrences a given event. Analyzing the number of
innovations, the number patents, of invention.
Again OLS fails to meet the constrain that the prediction
must be nil or positive. To explain count variables, we
assume that the dependent variable follows a Poisson
distribution.
Poisson models
Let Y be a random count variable. The probability that Y be equal to the integer yi is given by the Poisson probability density distribution:

$P(Y = y_i) = \frac{e^{-\lambda_i}\, \lambda_i^{y_i}}{y_i!}, \quad y_i = 0, 1, 2, \ldots$

with $E(Y) = \text{var}(Y) = \lambda$

To introduce the set of explanatory variables in the model, we condition λi and impose the following log linear form:

$\lambda_i = e^{x_i\beta} \quad\Leftrightarrow\quad \ln \lambda_i = x_i\beta$
Poisson distributions
[Figure: Poisson probability distributions over counts 0 to 20 for λ = 0.8, 1.5, 2.9 and 10.5]
The likelihood function

The (log) likelihood function reads:

$L(\lambda) = \prod_{i=1}^{n} \frac{e^{-\lambda_i}\, \lambda_i^{y_i}}{y_i!}$

$LL(y, x, \beta) = \sum_{i=1}^{n} \left[ y_i x_i\beta - e^{x_i\beta} - \ln(y_i!) \right]$
Poisson models
Stata Instruction : poisson
poisson y x1 x2 x3 … xk [if] [weight] [, options]
Options : noconstant : omits the constant
robust : controls for heteroskedasticity
if : select observations
weight : weights observations
Poisson models

use est_var_qual.dta, clear
poisson patent lrdi lassets spe biotech
Estimated parameters
Goodness of fit
. poisson patent lrdi lassets spe biotech

Iteration 0:   log likelihood = -3549.9316
Iteration 1:   log likelihood = -3549.8433
Iteration 2:   log likelihood = -3549.8433

Poisson regression                            Number of obs   =        431
                                              LR chi2(4)      =    1766.11
                                              Prob > chi2     =     0.0000
Log likelihood = -3549.8433                   Pseudo R2       =     0.1992

      patent |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
        lrdi |    .699484   .0326467    21.43   0.000     .6354977    .7634704
     lassets |   .4705428   .0133588    35.22   0.000     .4443601    .4967256
         spe |   .7623891   .0441729    17.26   0.000     .6758118    .8489664
     biotech |   1.073271    .051571    20.81   0.000     .9721939    1.174349
       _cons |   -3.12651   .1971912   -15.86   0.000    -3.512997   -2.740022
Interpretation of coefficients
If variables are entered in log, one can interpret the coefficients as elasticities:

$\ln \lambda_i = \beta \ln x_i \quad\Rightarrow\quad \beta = \frac{\partial \ln \lambda}{\partial \ln x} = \frac{\partial \lambda / \lambda}{\partial x / x}$

A one % increase in firm size is associated with a .47% increase in the expected number of patents.
Interpretation of coefficients

Similarly, a one % increase in R&D investment is associated with a .69% increase in the expected number of patents.
Interpretation of coefficients

If variables are not entered in log, the interpretation changes: the effect of a one-unit change is 100 × (eβ - 1) percent. A one-point rise in the degree of specialisation is thus associated with a roughly 114% increase in the expected number of patents.
Interpretation of coefficients

For dummy variables, the interpretation changes slightly: biotechnology firms have an expected number of patents which is about 192% higher than pharmaceutical companies.
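Both transformations can be checked by hand from the stored coefficients:

quietly poisson patent lrdi lassets spe biotech
display 100*(exp(_b[spe]) - 1)        // about 114, for a one-point rise in spe
display 100*(exp(_b[biotech]) - 1)    // about 192, biotech versus pharma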
All variables are very significant… but is the Poisson assumption $E(Y) = \text{var}(Y)$ satisfied?

. summarize patent

    Variable |    Obs        Mean    Std. Dev.   Min   Max
      patent |    431    10.83295      17.622      0   202

The standard deviation (17.6) is well above the mean (10.8), which hints at overdispersion.
Count Data Models Part 2Negative Binomial Models
Negative binomial models
Generally, the Poisson model is not valid, due to the presence of overdispersion in the data. This violates the equality of the mean and variance of the dependent variable implied by the Poisson model.
The negative binomial model treats this problem by adding an unobserved heterogeneity term ui to the log linear form:

$\ln \nu_i = \ln \lambda_i + \ln u_i = x_i\beta + \ln u_i$

$P(Y = y_i) = \frac{e^{-\lambda_i u_i}\, \left(\lambda_i u_i\right)^{y_i}}{y_i!}$
Negative binomial models

The density of yi is obtained by integrating out ui:

$f(y_i \mid x_i) = \int_0^{\infty} f(y_i \mid x_i, u_i)\, g(u_i)\, du_i, \quad \text{with } f(y_i \mid x_i, u_i) = \frac{e^{-\lambda_i u_i} (\lambda_i u_i)^{y_i}}{y_i!}$

Assuming that ui is distributed Gamma with mean 1, the density of yi reads:

$f(y_i \mid x_i) = \frac{\Gamma(y_i + \alpha^{-1})}{\Gamma(y_i + 1)\, \Gamma(\alpha^{-1})} \left(\frac{\alpha^{-1}}{\alpha^{-1} + \lambda_i}\right)^{\alpha^{-1}} \left(\frac{\lambda_i}{\alpha^{-1} + \lambda_i}\right)^{y_i}$
Likelihood Functions
$L(y, \lambda, \alpha) = \prod_{i=1}^{n} \frac{\Gamma(y_i + \alpha^{-1})}{\Gamma(y_i + 1)\, \Gamma(\alpha^{-1})} \left(\frac{\alpha^{-1}}{\alpha^{-1} + \lambda_i}\right)^{\alpha^{-1}} \left(\frac{\lambda_i}{\alpha^{-1} + \lambda_i}\right)^{y_i}$

$LL(y, x, \beta, \alpha) = \sum_{i=1}^{n} \left[ \ln\frac{\Gamma(y_i + \alpha^{-1})}{\Gamma(y_i + 1)\, \Gamma(\alpha^{-1})} + y_i \ln\alpha + y_i x_i\beta - \left(y_i + \alpha^{-1}\right) \ln\left(1 + \alpha e^{x_i\beta}\right) \right]$

where α is the overdispersion parameter.
Negative binomial models
Stata Instruction: nbreg
nbreg y x1 x2 x3 … xk [if] [weight] [, options]
Options : noconstant : omits the constant
robust : controls for heteroskedasticity
if : select observations
weight : weights observations
Negative binomial models

use est_var_qual.dta, clear
nbreg patent lrdi lassets spe biotech
Goodness of fit
Estimated parameters
Overdispersion parameter
Overdispersion test
. nbreg patent lrdi lassets spe biotech

Negative binomial regression                  Number of obs   =        431
                                              LR chi2(4)      =     122.62
Dispersion     = mean                         Prob > chi2     =     0.0000
Log likelihood = -1374.339                    Pseudo R2       =     0.0427

      patent |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
        lrdi |   .7823229   .1121344     6.98   0.000     .5625434    1.002102
     lassets |   .6167106   .0600372    10.27   0.000     .4990399    .7343813
         spe |   .8390795   .1950958     4.30   0.000     .4566988     1.22146
     biotech |   1.515035   .2384884     6.35   0.000     1.047606    1.982464
       _cons |  -5.179659   .8629357    -6.00   0.000    -6.870982   -3.488337
    /lnalpha |    .323967   .0759342                      .1751387    .4727953
       alpha |   1.382602   .1049868                      1.191411    1.604473

Likelihood-ratio test of alpha=0: chibar2(01) = 4351.01  Prob>=chibar2 = 0.000
Interpretation of coefficients
If variables are entered in log, one can still interpret the coefficients as elasticities
A one % increase in firm size is associated with a .61% increase in the expected number of patents
Interpretation of coefficients
If variables are entered in log, one can still interpret the coefficients as elasticities
A one % increase in R&D investment is associated with a .78% increase in the expected number of patents
Interpretation of coefficients
If variables are not entered in log, the interpretation changes
100 × (eβ – 1)
A one-point rise in the degree of specialisation is associated with a roughly 131% increase in the expected number of patents
Interpretation of coefficients
For dummy variables, the interpretation follows the same transformation
100 × (eβ – 1)
Biotechnology firms have an expected number of patents which is about 355% higher than pharmaceutical companies.
Overdispersion test
We use the LR test to compare the negative binomial model with the Poisson model:

$LR = 2\left(\ln L_{NBREG} - \ln L_{PRM}\right) = 2 \times \left(-1374.3 + 3549.8\right) = 4351$

The results indicate that the probability of rejecting H0 wrongly is almost nil (H0: α = 0). Hence there is overdispersion in the data, and as a consequence one should use the negative binomial model.
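A sketch reproducing this statistic from the two stored log likelihoods:

quietly poisson patent lrdi lassets spe biotech
scalar ll_prm = e(ll)
quietly nbreg patent lrdi lassets spe biotech
scalar ll_nb = e(ll)
display 2*(ll_nb - ll_prm)    // 4351.01, matching the chibar2(01) reported by nbreg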
Compared with the Poisson model, the negative binomial has larger standard errors and lower z values:

Variable        Poisson             NegBin
lrdi            0.699   (21.43)     0.782   (6.98)
lassets         0.471   (35.22)     0.617   (10.27)
spe             0.762   (17.26)     0.839   (4.30)
biotech         1.073   (20.81)     1.515   (6.35)
_cons           -3.127  (-15.86)    -5.180  (-6.00)
lnalpha _cons                       0.324   (4.27)
alpha                               1.383
N               431                 431

legend: b/t (z statistics in parentheses)
Extensions
ML estimators: all models can be extended to a panel context to take full account of unobserved heterogeneity (fixed effects, random effects; see the sketch below).
Heckman models: selection bias; two equations, one of which models the probability of being observed.
Survival models: discrete time (complementary log-log, logit); continuous time (Cox model).
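As a closing sketch, the panel extensions above in Stata, with hypothetical firm_id and year identifiers:

xtset firm_id year
xtlogit inno lrdi lassets spe, fe    // fixed effects (conditional logit)
xtlogit inno lrdi lassets spe, re    // random effects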