Lecture 7: Binary Dependent Variable
Introduction to Econometrics, Fall 2020
Zhaopeng Qu
Nanjing University
11/1/2020


  • 1 Review of the last lecture

    2 The Linear Probability Model (LPM)

    3 Nonlinear probability model

    4 Maximum Likelihood Estimation

  • Section 1

    Review of the last lecture

  • Nonlinear function

    We extended our model to be nonlinear in the Xs: polynomials, logarithms, and interactions.

    The multiple regression framework can be extended to handle regression functions that are nonlinear in one or more X.

    The difference from a standard multiple OLS regression is how to interpret the estimated coefficients.

  • Nonlinear in functions

    Today we extend the model to be nonlinear in functions: discrete dependent variables or limited dependent variables.

    A linear function is not a good prediction function in this case.

  • Introduction to Discrete or Categorical Dependent Variable Models

    So far the dependent variable (Y) has been continuous:

    test score
    average hourly earnings
    GDP growth rate

    What if Y is discrete?

    Y = get into college, or not; X = parental income.
    Y = person smokes, or not; X = cigarette tax rate, income.
    Y = mortgage application is accepted, or not; X = race, income, house characteristics, marital status …

  • Discrete or Categorical Dependent Variable Models

    Binary outcome models:

    Linear Probability Model (LPM)
    Logit model
    Probit model

    More models:

    Multinomial outcomes: multiple responses or choices without order (multi-logit and multi-probit)
    Ordered outcomes: ordered response models (ordered probit and logit)
    Count outcomes: the outcome is a nonnegative integer or a count (Poisson model)
    Limited dependent variables: censored, Tobit, and selection models
    Time: duration models

    We will only cover binary outcome models, as this is an introductory-level course.

  • Section 2

    The Linear Probability Model (LPM)

  • The Linear Probability Model: the conditional expectation

    If an outcome variable 𝑌 is binary, then the expectation of 𝑌 is

    𝐸[𝑌] = 1 × 𝑃𝑟(𝑌 = 1) + 0 × 𝑃𝑟(𝑌 = 0) = 𝑃𝑟(𝑌 = 1)

    which is the probability that 𝑌 = 1.

    Then we can extend this: the conditional expectation of 𝑌 equals the probability that 𝑌 = 1 conditional on the Xs, thus

    𝐸[𝑌|𝑋1𝑖, ..., 𝑋𝑘𝑖] = 𝑃𝑟(𝑌 = 1|𝑋1𝑖, ..., 𝑋𝑘𝑖)

    Recall the multiple OLS model

    𝑌𝑖 = 𝛽0 + 𝛽1𝑋1𝑖 + 𝛽2𝑋2𝑖 + ... + 𝛽𝑘𝑋𝑘𝑖 + 𝑢𝑖

    Then, based on Assumption 1, we have

    𝐸[𝑌|𝑋1𝑖, ..., 𝑋𝑘𝑖] = 𝛽0 + 𝛽1𝑋1𝑖 + 𝛽2𝑋2𝑖 + ... + 𝛽𝑘𝑋𝑘𝑖

  • The Linear Probability Model

    The conditional expectation equals the probability that 𝑌𝑖 = 1 conditional on 𝑋1𝑖, ..., 𝑋𝑘𝑖:

    𝐸[𝑌|𝑋1𝑖, ..., 𝑋𝑘𝑖] = 𝑃𝑟(𝑌 = 1|𝑋1𝑖, ..., 𝑋𝑘𝑖) = 𝛽0 + 𝛽1𝑋1𝑖 + 𝛽2𝑋2𝑖 + ... + 𝛽𝑘𝑋𝑘𝑖

    Now a Linear Probability Model can be defined as follows:

    𝑃𝑟(𝑌 = 1|𝑋1𝑖, ..., 𝑋𝑘𝑖) = 𝛽0 + 𝛽1𝑋1𝑖 + 𝛽2𝑋2𝑖 + ... + 𝛽𝑘𝑋𝑘𝑖

    The population coefficient 𝛽𝑗 satisfies

    𝜕𝑃𝑟(𝑌𝑖 = 1|𝑋1𝑖, ..., 𝑋𝑘𝑖)/𝜕𝑋𝑗 = 𝛽𝑗

    so 𝛽𝑗 can be interpreted as the change in the probability that 𝑌 = 1 associated with a unit change in 𝑋𝑗.

  • The Linear Probability Model

    Almost all of the tools of multiple OLS regression carry over to the LPM:

    The assumptions are the same as for the general multiple regression model.
    The coefficients can also be estimated by OLS.
    t-statistics and F-statistics can be constructed as before.
    The errors of the LPM are always heteroskedastic, so it is essential that heteroskedasticity-robust standard errors be used for inference.
    One difference is that 𝑅² is not a useful statistic now.

  • An Example: Mortgage applications

    Most individuals who want to buy a house apply for a mortgage at a bank. Not all mortgage applications are approved.

    Question: what determines whether a mortgage application is approved or denied?

    Boston HMDA data: a data set on mortgage applications collected by the Federal Reserve Bank of Boston.

    Variable   Description                               Mean    SD
    deny       = 1 if application is denied             0.120   0.325
    pi_ratio   monthly loan payments / monthly income   0.331   0.107
    black      = 1 if applicant is black                0.142   0.350

    Our linear probability model is

    𝑃𝑟(𝑌 = 1|𝑋1𝑖, 𝑋2𝑖) = 𝛽0 + 𝛽1𝑋1𝑖 + 𝛽2𝑋2𝑖

  • An Example: Mortgage applications

    Does the payment-to-income ratio affect whether or not a mortgage application is denied?

    𝑑𝑒𝑛𝑦 = −0.080 + 0.604 𝑃/𝐼 𝑟𝑎𝑡𝑖𝑜
            (0.032)  (0.098)

    The estimated OLS coefficient on the payment-to-income ratio is ̂𝛽1 = 0.60.

    The estimated coefficient is significantly different from 0 at the 1% significance level (the t-statistic is over 6).

    How should we interpret ̂𝛽1?

    A literal one: "if the payments/monthly income ratio increases by 1, then the probability of being denied increases by 0.6."
    A more reasonable one: "if the payments/monthly income ratio increases by 10% (0.1), then the probability of being denied increases by 6 percentage points (0.06)."
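The fitted line can be evaluated directly. A minimal sketch in plain Python, using the slide's estimates −0.080 and 0.604, that reproduces the interpretation and also previews the LPM's out-of-range problem:

```python
def lpm_denial_prob(pi_ratio):
    """Fitted LPM from the slide: deny-hat = -0.080 + 0.604 * (P/I ratio)."""
    return -0.080 + 0.604 * pi_ratio

# A 0.1 increase in the P/I ratio raises the predicted denial probability by 0.604 * 0.1
print(round(lpm_denial_prob(0.4) - lpm_denial_prob(0.3), 4))  # 0.0604

# The linear form is unbounded: an extreme P/I ratio yields an impossible "probability"
print(round(lpm_denial_prob(2.0), 3))  # 1.128
```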

  • An Example: Mortgage Applications

    What is the effect of race on the probability of denial, holding constant the P/I ratio? To keep things simple, we focus on differences between black applicants and white applicants.

    𝑑𝑒𝑛𝑦 = −0.091 + 0.559 𝑃/𝐼 𝑟𝑎𝑡𝑖𝑜 + 0.177 𝑏𝑙𝑎𝑐𝑘
            (0.029)  (0.089)          (0.025)

    The coefficient on black, 0.177, indicates that an African American applicant has a 17.7 percentage point higher probability of having a mortgage application denied than a white applicant, holding constant their payment-to-income ratio.

    This coefficient is significant at the 1% level (the t-statistic is 7.11).

  • LPM: shortcomings

    The LPM always suffers from heteroskedasticity, so use heteroskedasticity-robust standard errors!

    A more serious problem: the predicted probability can be below 0 or above 1!

  • LPM: shortcomings: Predicted value

  • Wrap up

    Disadvantages of the linear probability model:

    Predicted probabilities can be above 1 or below 0 (which makes no sense)!
    Error terms are heteroskedastic.

    Advantages of the linear probability model:

    Easy to estimate and to do inference with.
    Coefficient estimates are easy to interpret.
    Very useful in some circumstances: special IV settings.

  • Section 3

    Nonlinear probability model

  • Introduction

    Intuition: probabilities should not be less than 0 or greater than 1.

    To address this problem, consider a nonlinear probability model

    𝑃𝑟(𝑌𝑖 = 1|𝑋1, ...𝑋𝑘) = 𝐺(𝑍) = 𝐺(𝛽0 + 𝛽1𝑋1,𝑖 + 𝛽2𝑋2,𝑖 + ... + 𝛽𝑘𝑋𝑘,𝑖)

    where 𝑍 = 𝛽0 + 𝛽1𝑋1,𝑖 + 𝛽2𝑋2,𝑖 + ... + 𝛽𝑘𝑋𝑘,𝑖 and 0 ≤ 𝐺(𝑍) ≤ 1.

    We need to find a proper function 𝐺(𝑍) that keeps the predicted value less than 1 and greater than 0.

  • Logit and Probit

    Two types of nonlinear functions:

    1 Probit model

    𝐺(𝑍) = Φ(𝑍) = ∫ from −∞ to 𝑍 of 𝜙(𝑡) 𝑑𝑡

    which is the standard normal cumulative distribution function.

    2 Logit model

    𝐺(𝑍) = 1/(1 + 𝑒^(−𝑍))

    which is the logistic function.

  • Probit Model

    Probit regression models the probability that 𝑌 = 1:

    𝑃𝑟(𝑌𝑖 = 1|𝑋1, ...𝑋𝑘) = Φ(𝛽0 + 𝛽1𝑋1,𝑖 + 𝛽2𝑋2,𝑖 + ... + 𝛽𝑘𝑋𝑘,𝑖)

    where Φ(𝑍) is the standard normal c.d.f., so we have

    0 ≤ Φ(𝑍) ≤ 1

    This makes sure that the predicted probabilities of the probit model are between 0 and 1.

  • Probit Model: Prediction Value

  • Probit Model: Interpreting the Coefficients

    How should we interpret ̂𝛽1?

    Recall 𝑍 = 𝛽0 + 𝛽1𝑋1,𝑖 + 𝛽2𝑋2,𝑖 + ... + 𝛽𝑘𝑋𝑘,𝑖.

    The coefficient 𝛽𝑗 is the change in the 𝑍-value, rather than in the probability, arising from a unit change in 𝑋𝑗, holding the other regressors constant.

    The effect on the predicted probability of a change in a regressor should be computed by

    1 computing the predicted probability for the initial value of the regressors,
    2 computing the predicted probability for the new or changed value of the regressors,
    3 taking their difference.


  • The Predicted Probability: one regressor

    Suppose a probit population regression model with only one regressor, 𝑋1:

    𝑃𝑟(𝑌 = 1|𝑋1) = Φ(𝑍) = Φ(𝛽0 + 𝛽1𝑋1)

    Suppose the estimation result is ̂𝛽0 = −2 and ̂𝛽1 = 3, which means

    𝑍 = −2 + 3𝑋1

  • The Predicted Probability: one regressor

    Assume we want to know the probability that 𝑌 = 1 when 𝑋1 = 0.4. Then 𝑧 = −2 + 3 × 0.4 = −0.8, so the predicted probability is

    𝑃𝑟(𝑌 = 1|𝑋1 = 0.4) = 𝑃𝑟(𝑧 ≤ −0.8) = Φ(−0.8)

    Likewise, for the probability that 𝑌 = 1 when 𝑋1 = 0.5, we have 𝑧 = −2 + 3 × 0.5 = −0.5, so the predicted probability is

    𝑃𝑟(𝑌 = 1|𝑋1 = 0.5) = 𝑃𝑟(𝑧 ≤ −0.5) = Φ(−0.5)

    Then the difference is

    𝑃𝑟(𝑌 = 1|𝑋1 = 0.4) − 𝑃𝑟(𝑌 = 1|𝑋1 = 0.5) = Φ(−0.8) − Φ(−0.5) = 0.2119 − 0.3085 = −0.097
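These probabilities can be checked numerically. A short sketch using `scipy.stats.norm.cdf` for the standard normal c.d.f. Φ, with the hypothetical estimates −2 and 3 from the slide:

```python
from scipy.stats import norm  # norm.cdf is the standard normal c.d.f. Phi

def probit_prob(x1, b0=-2.0, b1=3.0):
    """Predicted probability Pr(Y = 1 | X1) = Phi(b0 + b1 * x1)."""
    return norm.cdf(b0 + b1 * x1)

p_04 = probit_prob(0.4)  # Phi(-0.8), about 0.2119
p_05 = probit_prob(0.5)  # Phi(-0.5), about 0.3085
print(round(p_04 - p_05, 3))  # -0.097
```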


  • The Predicted Probability: multiple regressors

    Suppose a probit population regression model with two regressors, 𝑋1 and 𝑋2:

    𝑃𝑟(𝑌 = 1|𝑋1, 𝑋2) = Φ(𝛽0 + 𝛽1𝑋1 + 𝛽2𝑋2)

    Suppose 𝛽0 = −1.6, 𝛽1 = 2 and 𝛽2 = 0.5. If 𝑋1 = 0.4 and 𝑋2 = 1, then

    𝑧 = −1.6 + 2 × 0.4 + 0.5 × 1 = −0.3

    So the probability is

    𝑃𝑟(𝑌 = 1|𝑋1, 𝑋2) = 𝑃𝑟(𝑧 ≤ −0.3) = Φ(−0.3) = 0.38

  • Example: Mortgage Applications

    Mortgage denial (deny) and the payment-to-income ratio (P/I ratio):

    ̂𝑃𝑟(𝑑𝑒𝑛𝑦 = 1|𝑃/𝐼 𝑟𝑎𝑡𝑖𝑜) = Φ(−2.19 + 2.97 𝑃/𝐼 𝑟𝑎𝑡𝑖𝑜)

    What is the change in the predicted probability that an application will be denied if the P/I ratio increases from 0.3 to 0.4?

    The probability of denial when 𝑃/𝐼 𝑟𝑎𝑡𝑖𝑜 = 0.3:

    Φ(−2.19 + 2.97 × 0.3) = Φ(−1.30) = 0.097

    The probability of denial when 𝑃/𝐼 𝑟𝑎𝑡𝑖𝑜 = 0.4:

    Φ(−2.19 + 2.97 × 0.4) = Φ(−1.00) = 0.159

    The estimated change in the probability of denial is 0.159 − 0.097 = 0.062: when the P/I ratio increases by 0.1, the denial probability increases by 6.2 percentage points.

  • Effect of a change in X: when X is continuous

    Marginal effect of 𝑋𝑗:

    𝜕𝑃𝑟(𝑌 = 1|𝑋1, ...𝑋𝑘)/𝜕𝑋𝑗 = 𝜙(𝛽0 + 𝛽1𝑋1,𝑖 + 𝛽2𝑋2,𝑖 + ... + 𝛽𝑘𝑋𝑘,𝑖) × 𝛽𝑗

    where 𝜙(⋅) is the p.d.f. of the standard normal.

    Hence the effect of a change in X depends on the starting value of X, as with other nonlinear functions.

  • Effect of a change in X

    For nonlinear models, the marginal effect (ME) varies with the point of evaluation:

    Marginal Effect at a Representative Value (MER): ME at 𝑋 = 𝑋*, at representative values of the regressors.
    Marginal Effect at the Mean (MEM): ME at 𝑋 = X̄, at the sample mean of the regressors.
    Average Marginal Effect (AME): average of the ME at each 𝑋 = 𝑋𝑖, evaluated at the sample values and then averaged.

  • Example: Mortgage applications: marginal effect

    The marginal effect:

    𝜕𝑃𝑟(𝑑𝑒𝑛𝑦 = 1|𝑃/𝐼 𝑟𝑎𝑡𝑖𝑜)/𝜕(𝑃/𝐼 𝑟𝑎𝑡𝑖𝑜) = 𝜙(−2.19 + 2.97 𝑃/𝐼 𝑟𝑎𝑡𝑖𝑜) × 2.97

    Then the Marginal Effect at the Mean (MEM), at the sample mean of the regressor (mean 𝑃/𝐼 𝑟𝑎𝑡𝑖𝑜 = 0.331), is

    𝜙(−2.19 + 2.97 × 0.331) × 2.97 ≈ 0.57
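The MEM can be evaluated with the standard normal density φ. A sketch with `scipy.stats.norm.pdf`, using the probit estimates and sample mean from the slides:

```python
from scipy.stats import norm  # norm.pdf is the standard normal density phi

b0, b1 = -2.19, 2.97   # probit coefficients from the slide
pi_mean = 0.331        # sample mean of the P/I ratio

# Marginal effect at the mean: phi(b0 + b1 * xbar) * b1
mem = norm.pdf(b0 + b1 * pi_mean) * b1
print(round(mem, 2))  # 0.57: a 0.1 increase in P/I at the mean raises Pr(deny) by about 0.057
```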

  • The explanatory variable is discrete

    If 𝑋𝑗 is a discrete variable, then we should not rely on calculus to evaluate the effect on the response probability.

    Assume 𝑋2 is a dummy variable; then the partial effect of 𝑋2 changing from 0 to 1 is

    𝐺(𝛽0 + 𝛽1𝑋1,𝑖 + 𝛽2 × 1 + ... + 𝛽𝑘𝑋𝑘,𝑖) − 𝐺(𝛽0 + 𝛽1𝑋1,𝑖 + 𝛽2 × 0 + ... + 𝛽𝑘𝑋𝑘,𝑖)

  • Example: Mortgage applications: Race

    Mortgage denial (deny), the payment-to-income ratio (P/I ratio), and race:

    ̂𝑃𝑟(𝑑𝑒𝑛𝑦 = 1|𝑃/𝐼 𝑟𝑎𝑡𝑖𝑜, 𝑏𝑙𝑎𝑐𝑘) = Φ(−2.26 + 2.74 𝑃/𝐼 𝑟𝑎𝑡𝑖𝑜 + 0.71 𝑏𝑙𝑎𝑐𝑘)

    The probability of denial at 𝑃/𝐼 𝑟𝑎𝑡𝑖𝑜 = 0.3 when 𝑏𝑙𝑎𝑐𝑘 = 0 (whites, i.e. non-blacks) is

    Φ(−2.26 + 2.74 × 0.3 + 0.71 × 0) = Φ(−1.44) = 0.075

    The probability of denial when 𝑏𝑙𝑎𝑐𝑘 = 1 (blacks) is

    Φ(−2.26 + 2.74 × 0.3 + 0.71 × 1) = Φ(−0.73) = 0.233

    So the difference between whites and blacks at 𝑃/𝐼 𝑟𝑎𝑡𝑖𝑜 = 0.3 is 0.233 − 0.075 = 0.158, which means the probability of denial for blacks is 15.8 percentage points higher than that for whites.
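The partial effect of the dummy is the difference of two fitted probabilities, as in the previous slide. A sketch using the probit estimates from the slide:

```python
from scipy.stats import norm

def denial_prob(pi_ratio, black):
    """Fitted probit from the slide: Phi(-2.26 + 2.74 * (P/I) + 0.71 * black)."""
    return norm.cdf(-2.26 + 2.74 * pi_ratio + 0.71 * black)

# Partial effect of black changing from 0 to 1, at P/I ratio = 0.3
gap = denial_prob(0.3, 1) - denial_prob(0.3, 0)
print(round(gap, 3))  # 0.158
```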

  • Logit Model

    Using the cumulative standard logistic distribution function:

    𝑃𝑟(𝑌𝑖 = 1|𝑍) = 1/(1 + 𝑒^(−𝑍))

    Similar to the probit model, 𝑍 = 𝛽0 + 𝛽1𝑋1,𝑖 + 𝛽2𝑋2,𝑖 + ... + 𝛽𝑘𝑋𝑘,𝑖.

    Since 𝐹(𝑧) = 𝑃𝑟(𝑍 ≤ 𝑧), the predicted probabilities of the logit model are between 0 and 1.

  • Logit Model: predicted probabilities

    Suppose we have only one regressor 𝑋1 and 𝑍 = −2 + 3𝑋1.

    We want to know the probability that 𝑌 = 1 when 𝑋1 = 0.4.

    Then 𝑍 = −2 + 3 × 0.4 = −0.8, so the probability is

    𝑃𝑟(𝑌 = 1|𝑋1 = 0.4) = 𝑃𝑟(𝑍 ≤ −0.8) = 𝐹(−0.8)

  • Logit Model

    𝑃𝑟(𝑌 = 1) = 𝑃𝑟(𝑍 ≤ −0.8) = 1/(1 + 𝑒^(0.8)) = 0.31
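The logistic c.d.f. is simple enough to code by hand. A sketch reproducing the number above:

```python
import math

def logistic_cdf(z):
    """Cumulative standard logistic distribution: F(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

z = -2 + 3 * 0.4  # = -0.8
print(round(logistic_cdf(z), 2))  # 0.31
```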

  • Example: Mortgage applications

    Logit model: mortgage denial (deny), the payment-to-income ratio (P/I ratio), and race:

    ̂𝑃𝑟(𝑑𝑒𝑛𝑦 = 1|𝑃/𝐼 𝑟𝑎𝑡𝑖𝑜, 𝑏𝑙𝑎𝑐𝑘) = 𝐹(−4.13 + 5.37 𝑃/𝐼 𝑟𝑎𝑡𝑖𝑜 + 1.27 𝑏𝑙𝑎𝑐𝑘)
                                          (0.35)  (0.96)           (0.15)

  • Example: Mortgage applications: Race

    The predicted denial probability of a white applicant with 𝑃/𝐼 𝑟𝑎𝑡𝑖𝑜 = 0.3 is

    1/(1 + 𝑒^(−(−4.13 + 5.37 × 0.3 + 1.27 × 0))) = 0.074

    The predicted denial probability of a black applicant with 𝑃/𝐼 𝑟𝑎𝑡𝑖𝑜 = 0.3 is

    1/(1 + 𝑒^(−(−4.13 + 5.37 × 0.3 + 1.27 × 1))) = 0.222

    The difference is

    0.222 − 0.074 = 0.148, i.e. 14.8 percentage points.

  • How to estimate Logit and Probit models

    Models that are nonlinear in the independent variables (X) can still be estimated by OLS.

    But Logit and Probit models are nonlinear in the coefficients 𝛽0, 𝛽1, ..., 𝛽𝑘, so these models can NOT be estimated by OLS.

    The method used to estimate logit and probit models is Maximum Likelihood Estimation (MLE).

  • Section 4

    Maximum Likelihood Estimation

  • Introduction to MLE

    The likelihood function is the joint probability distribution of the data, treated as a function of the unknown coefficients.

    The maximum likelihood estimator (MLE) consists of the values of the coefficients that maximize the likelihood function.

    MLE's logic: the most likely parameter values are those under which the data we observed were most likely to have been produced.

  • Introduction to MLE

    Random variables 𝑋1, 𝑋2, 𝑋3, ...𝑋𝑛 have a joint density denoted

    𝑓𝜃(𝑥1, 𝑥2, ..., 𝑥𝑛) = 𝑓(𝑥1, 𝑥2, ..., 𝑥𝑛|𝜃)

    Given observed values 𝑋1 = 𝑥1, 𝑋2 = 𝑥2, ..., 𝑋𝑛 = 𝑥𝑛, the likelihood of 𝜃 is the function

    𝑙𝑖𝑘(𝜃) = 𝑓(𝑥1, 𝑥2, ..., 𝑥𝑛|𝜃)

    considered as a function of 𝜃.

  • Maximum likelihood estimation

    Let's start with the simplest case: the MLE estimator with only Y.

    Suppose that we are flipping a coin that may be biased, so that the probability of a head may not be 0.5. Let's estimate the probability of heads using a sample.

    Let 𝑌 = 1(ℎ𝑒𝑎𝑑𝑠) be a binary variable that indicates whether or not a head is observed. The outcome of a toss is a Bernoulli random variable: 𝑃𝑟(𝑌 = 1) = 𝑝, which has only one unknown parameter to estimate, the probability of a head 𝑝.

    Since 𝑌𝑖 is a Bernoulli random variable, the density for a single observation is

    𝑃𝑟(𝑌𝑖 = 𝑦) = 𝑃𝑟(𝑌𝑖 = 1)^𝑦 (1 − 𝑃𝑟(𝑌𝑖 = 1))^(1−𝑦) = 𝑝^𝑦 (1 − 𝑝)^(1−𝑦)

  • Maximum likelihood estimation

    Step 1: write down the likelihood function, the joint probability distribution of the data.

    𝑌1, ..., 𝑌𝑛 are i.i.d., so the joint probability distribution is the product of the individual distributions:

    𝑃𝑟(𝑌1 = 𝑦1, ..., 𝑌𝑛 = 𝑦𝑛) = 𝑃𝑟(𝑌1 = 𝑦1) × ... × 𝑃𝑟(𝑌𝑛 = 𝑦𝑛)
                              = 𝑝^𝑦1 (1 − 𝑝)^(1−𝑦1) × ... × 𝑝^𝑦𝑛 (1 − 𝑝)^(1−𝑦𝑛)
                              = 𝑝^(𝑦1+𝑦2+...+𝑦𝑛) (1 − 𝑝)^(𝑛−(𝑦1+𝑦2+...+𝑦𝑛))

  • Maximum likelihood estimation

    We have the likelihood function:

    𝑓𝑏𝑒𝑟𝑛𝑜𝑢𝑙𝑙𝑖(𝑝; 𝑌1 = 𝑦1, ..., 𝑌𝑛 = 𝑦𝑛) = 𝑝^(∑𝑦𝑖) (1 − 𝑝)^(𝑛−∑𝑦𝑖)

    Step 2: maximize the likelihood function.

    It is easier to maximize the logarithm of the likelihood function:

    𝑙𝑛(𝑓𝑏𝑒𝑟𝑛𝑜𝑢𝑙𝑙𝑖(𝑝; 𝑌1 = 𝑦1, ..., 𝑌𝑛 = 𝑦𝑛)) = (∑𝑦𝑖) 𝑙𝑛(𝑝) + (𝑛 − ∑𝑦𝑖) 𝑙𝑛(1 − 𝑝)

    Since the logarithm is a strictly increasing function, maximizing the likelihood or the log likelihood gives the same estimator.

    Then the maximization problem is

    arg max over 𝑝 of 𝑙𝑛(𝑓𝑏𝑒𝑟𝑛𝑜𝑢𝑙𝑙𝑖(𝑝; 𝑌1 = 𝑦1, ..., 𝑌𝑛 = 𝑦𝑛))

  • Maximum likelihood estimation

    F.O.C.: take the derivative and set it to zero.

    𝑑/𝑑𝑝 [𝑙𝑛(𝑓𝑏𝑒𝑟𝑛𝑜𝑢𝑙𝑙𝑖(𝑝; 𝑌1 = 𝑦1, ..., 𝑌𝑛 = 𝑦𝑛))] = 0

    𝑑/𝑑𝑝 [(∑𝑦𝑖) 𝑙𝑛(𝑝) + (𝑛 − ∑𝑦𝑖) 𝑙𝑛(1 − 𝑝)] = 0

    (∑𝑦𝑖)/𝑝 − (𝑛 − ∑𝑦𝑖)/(1 − 𝑝) = 0

    𝑝(𝑛 − ∑𝑦𝑖) = (1 − 𝑝) ∑𝑦𝑖

    Solving for 𝑝 yields the MLE; that is, ̂𝑝𝑀𝐿𝐸 satisfies

    ̂𝑝𝑀𝐿𝐸 = (1/𝑛) ∑𝑦𝑖 = Ȳ

    You can prove that ̂𝑝𝑀𝐿𝐸 is an unbiased, consistent, and normally distributed estimator of 𝑝.
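The closed-form answer can be checked against a direct numerical maximization of the log likelihood. A sketch with a hypothetical coin-flip sample (the data are made up for illustration):

```python
import numpy as np
from scipy.optimize import minimize_scalar

y = np.array([1, 0, 1, 1, 0, 1, 0, 1])  # hypothetical sample: 5 heads out of 8 tosses

def neg_log_lik(p):
    """Negative Bernoulli log likelihood: -[sum(y) ln(p) + (n - sum(y)) ln(1 - p)]."""
    return -(y.sum() * np.log(p) + (len(y) - y.sum()) * np.log(1 - p))

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(round(res.x, 3), y.mean())  # the numerical MLE matches the sample mean 0.625
```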

  • MLE of the probit model

    Assume our probit model is

    𝑃𝑟(𝑌𝑖 = 1|𝑋1𝑖) = Φ(𝛽0 + 𝛽1𝑋1𝑖) = 𝑝𝑖

    Step 1: write down the likelihood function.

    𝑃𝑟(𝑌1 = 𝑦1, ..., 𝑌𝑛 = 𝑦𝑛) = 𝑃𝑟(𝑌1 = 𝑦1) × ... × 𝑃𝑟(𝑌𝑛 = 𝑦𝑛)
                              = 𝑝1^𝑦1 (1 − 𝑝1)^(1−𝑦1) × ... × 𝑝𝑛^𝑦𝑛 (1 − 𝑝𝑛)^(1−𝑦𝑛)
                              = [Φ(𝛽0 + 𝛽1𝑋11)^𝑦1 (1 − Φ(𝛽0 + 𝛽1𝑋11))^(1−𝑦1)] × ... × [Φ(𝛽0 + 𝛽1𝑋1𝑛)^𝑦𝑛 (1 − Φ(𝛽0 + 𝛽1𝑋1𝑛))^(1−𝑦𝑛)]

  • MLE of the probit model

    It is easier to take the logarithm of the likelihood function:

    𝑙𝑛(𝑓𝑝𝑟𝑜𝑏𝑖𝑡(𝛽0, 𝛽1; 𝑌1, ..., 𝑌𝑛|𝑋1𝑖, 𝑖 = 1, ..., 𝑛)) = ∑𝑖 𝑌𝑖 𝑙𝑛[Φ(𝛽0 + 𝛽1𝑋1𝑖)] + ∑𝑖 (1 − 𝑌𝑖) 𝑙𝑛[1 − Φ(𝛽0 + 𝛽1𝑋1𝑖)]

    Then the maximization problem is

    arg max over 𝛽0, 𝛽1 of 𝑙𝑛(𝑓𝑝𝑟𝑜𝑏𝑖𝑡(𝛽0, 𝛽1; 𝑌1 = 𝑦1, ..., 𝑌𝑛 = 𝑦𝑛|𝑋1𝑖, 𝑖 = 1, ..., 𝑛))

  • MLE of the probit model

    With multiple regressors, Step 1 is the same: write down the likelihood function

    𝑃𝑟(𝑌1 = 𝑦1, ..., 𝑌𝑛 = 𝑦𝑛) = 𝑝1^𝑦1 (1 − 𝑝1)^(1−𝑦1) × ... × 𝑝𝑛^𝑦𝑛 (1 − 𝑝𝑛)^(1−𝑦𝑛)

    So far this is very similar to the case without explanatory variables, except that 𝑝𝑖 depends on 𝑋1𝑖, ..., 𝑋𝑘𝑖:

    𝑝𝑖 = Φ(𝛽0 + 𝛽1𝑋1𝑖 + ... + 𝛽𝑘𝑋𝑘𝑖)

    Substituting for 𝑝𝑖 gives the likelihood function:

    [Φ(𝛽0 + 𝛽1𝑋11 + ... + 𝛽𝑘𝑋𝑘1)^𝑦1 (1 − Φ(𝛽0 + 𝛽1𝑋11 + ... + 𝛽𝑘𝑋𝑘1))^(1−𝑦1)] × ... × [Φ(𝛽0 + 𝛽1𝑋1𝑛 + ... + 𝛽𝑘𝑋𝑘𝑛)^𝑦𝑛 (1 − Φ(𝛽0 + 𝛽1𝑋1𝑛 + ... + 𝛽𝑘𝑋𝑘𝑛))^(1−𝑦𝑛)]

  • MLE of the probit model

    Step 2: maximize the log likelihood function

    𝑙𝑛(𝑓𝑝𝑟𝑜𝑏𝑖𝑡(𝛽0, ..., 𝛽𝑘; 𝑌1, ..., 𝑌𝑛|𝑋1𝑖, ..., 𝑋𝑘𝑖, 𝑖 = 1, ..., 𝑛)) = ∑ 𝑌𝑖 𝑙𝑛[Φ(𝛽0 + 𝛽1𝑋1𝑖 + ... + 𝛽𝑘𝑋𝑘𝑖)] + ∑ (1 − 𝑌𝑖) 𝑙𝑛[1 − Φ(𝛽0 + 𝛽1𝑋1𝑖 + ... + 𝛽𝑘𝑋𝑘𝑖)]
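Since there is no closed-form solution, this log likelihood is maximized numerically in practice. A self-contained sketch on simulated data (the sample size, seed, and true coefficients are arbitrary choices for illustration):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
y = (rng.uniform(size=n) < norm.cdf(-0.5 + 1.0 * x)).astype(float)  # true betas: -0.5, 1.0

def neg_log_lik(beta):
    """Negative probit log likelihood for a single regressor."""
    p = np.clip(norm.cdf(beta[0] + beta[1] * x), 1e-10, 1 - 1e-10)  # guard against log(0)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

res = minimize(neg_log_lik, x0=[0.0, 0.0], method="BFGS")
print(np.round(res.x, 2))  # close to the true coefficients (-0.5, 1.0)
```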

  • MLE of the logit model

    Step 1: write down the likelihood function

    𝑃𝑟(𝑌1 = 𝑦1, ..., 𝑌𝑛 = 𝑦𝑛) = 𝑝1^𝑦1 (1 − 𝑝1)^(1−𝑦1) × ... × 𝑝𝑛^𝑦𝑛 (1 − 𝑝𝑛)^(1−𝑦𝑛)

    This is very similar to the probit model, but with a different function for 𝑝𝑖:

    𝑝𝑖 = 1/(1 + 𝑒^(−(𝛽0 + 𝛽1𝑋1𝑖 + ... + 𝛽𝑘𝑋𝑘𝑖)))

  • MLE of the logit model

    Step 2: maximize the log likelihood function

    𝑙𝑛(𝑓𝑙𝑜𝑔𝑖𝑡(𝛽0, ..., 𝛽𝑘; 𝑌1, ..., 𝑌𝑛|𝑋1𝑖, ..., 𝑋𝑘𝑖, 𝑖 = 1, ..., 𝑛))
    = ∑ 𝑌𝑖 𝑙𝑛(1/(1 + 𝑒^(−(𝛽0 + 𝛽1𝑋1𝑖 + ... + 𝛽𝑘𝑋𝑘𝑖))))
    + ∑ (1 − 𝑌𝑖) 𝑙𝑛(1 − 1/(1 + 𝑒^(−(𝛽0 + 𝛽1𝑋1𝑖 + ... + 𝛽𝑘𝑋𝑘𝑖))))

  • Computation of MLE estimators

    In most cases the computation of maximum likelihood estimators is not straightforward, since the first order conditions do not have closed-form solutions.

    We can still obtain the values of the estimators using a numerical algorithm with iterative methods.

    One of the most common methods is the Newton-Raphson method, based on low-order Taylor series expansions.

  • Computation of MLE estimators: Taylor expansions

    Recall the Taylor series of a function 𝑓(𝑥) at a certain value 𝑥0:

    𝑓(𝑥) = 𝑓(𝑥0) + 𝑓′(𝑥0)(𝑥 − 𝑥0)/1! + 𝑓″(𝑥0)(𝑥 − 𝑥0)²/2! + ... = ∑ from 𝑛=0 to ∞ of 𝑓⁽ⁿ⁾(𝑥0)(𝑥 − 𝑥0)ⁿ/𝑛!

    Then the first-order Taylor expansion of 𝑓(𝑥) is

    𝑓(𝑥) ≃ 𝑓(𝑥0) + 𝑓′(𝑥0)(𝑥 − 𝑥0)

  • Computation of MLE estimators: Newton-Raphson method

    Suppose we wish to find the solution 𝑥 such that 𝑓(𝑥) = 0; that is, find some 𝑥 that makes

    𝑓(𝑥0) + 𝑓′(𝑥0)(𝑥 − 𝑥0) = 0

    where 𝑥0 is some initial value we guess, close to the desired solution. Then we obtain a better approximation:

    𝑥1 = 𝑥0 − 𝑓(𝑥0)/𝑓′(𝑥0)

    We keep repeating this procedure until 𝑓(𝑥𝑗) ≈ 0; this 𝑥𝑗 is the solution of the equation.
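The iteration above is easy to code. A generic sketch of Newton-Raphson for a root of f, illustrated on f(x) = x² − 2 (the example function is my choice, not from the slides):

```python
def newton(f, fprime, x0, tol=1e-10, max_iter=100):
    """Newton-Raphson: repeat x <- x - f(x)/f'(x) until f(x) is (almost) zero."""
    x = x0
    for _ in range(max_iter):
        x = x - f(x) / fprime(x)
        if abs(f(x)) < tol:
            break
    return x

root = newton(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0)
print(round(root, 6))  # 1.414214, the square root of 2
```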

  • Computation of MLE estimators

    For simplicity, assume that we work with only one parameter 𝜃, with log likelihood 𝐿(𝜃).

    The F.O.C. for the maximization problem is

    𝜕𝐿( ̂𝜃𝑀𝐿𝐸)/𝜕𝜃 = 0

    Take an initial guess of the parameter value, denoted 𝜃0. A first-order expansion of the F.O.C. around 𝜃0 gives

    𝜕𝐿( ̂𝜃𝑀𝐿𝐸)/𝜕𝜃 ≃ 𝜕𝐿(𝜃0)/𝜕𝜃 + ( ̂𝜃𝑀𝐿𝐸 − 𝜃0) 𝜕²𝐿(𝜃0)/𝜕𝜃²

    Setting this to zero and rearranging, the updated estimate is

    ̂𝜃𝑀𝐿𝐸 ≃ 𝜃0 − [𝜕²𝐿(𝜃0)/𝜕𝜃²]⁻¹ 𝜕𝐿(𝜃0)/𝜕𝜃

    We keep repeating this procedure until the derivative is (approximately) zero; the resulting value is the MLE.

  • MLE estimators in practice

    There is no simple formula for the probit and logit MLE, so the maximization must be done using a numerical algorithm on a computer.

    Because regression software commonly computes the MLE of the coefficients, this estimator is easy to use in practice.

    It can be proved that the MLE estimator is consistent and normally distributed in large samples.

  • Statistical inference based on the MLE

    Because the MLE is normally distributed in large samples, statistical inference about the probit and logit coefficients based on the MLE proceeds in the same way as inference about the linear regression coefficients based on the OLS estimator.

    That is, hypothesis tests are performed using the t-statistic, and 95% confidence intervals are formed as ±1.96 standard errors around the estimate.

    Tests of joint hypotheses on multiple coefficients use the F-statistic, in a way similar to that discussed for the linear regression model.

    F-statistic and chi-squared statistic:

    𝐹𝑠𝑡𝑎𝑡 ⟶ 𝜒²𝑞/𝑞

    where 𝑞 is the number of restrictions being tested.

  • Measures of Fit

    𝑅² is a poor measure of fit for the linear probability model. This is also true for probit and logit regression.

    Two measures of fit for models with binary dependent variables:

    1 The fraction correctly predicted

    If 𝑌𝑖 = 1 and the predicted probability exceeds 50%, or if 𝑌𝑖 = 0 and the predicted probability is less than 50%, then 𝑌𝑖 is said to be correctly predicted.

  • Measures of Fit

    2 The pseudo-𝑅²

    The pseudo-𝑅² compares the value of the likelihood of the estimated model to the value of the likelihood when none of the Xs are included as regressors:

    pseudo-𝑅² = 1 − 𝑙𝑛(𝑓^max_probit) / 𝑙𝑛(𝑓^max_bernoulli)

    𝑓^max_probit is the value of the maximized probit likelihood (which includes the Xs); 𝑓^max_bernoulli is the value of the maximized Bernoulli likelihood (the probit model excluding all the Xs).
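Given the two maximized log likelihoods, the computation is one line. A sketch with hypothetical values (the log likelihoods below are made up for illustration):

```python
def pseudo_r2(loglik_model, loglik_null):
    """McFadden's pseudo-R^2: 1 - ln L(model with Xs) / ln L(null model)."""
    return 1.0 - loglik_model / loglik_null

# Hypothetical maximized log likelihoods (both negative; the model with Xs fits better)
print(round(pseudo_r2(-700.0, -900.0), 3))  # 0.222
```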

  • Comparing the LPM, Probit and Logit

    All three models (linear probability, probit, and logit) are just approximations to the unknown population regression function 𝐸(𝑌|𝑋) = 𝑃𝑟(𝑌 = 1|𝑋).

    The LPM is easiest to use and to interpret, but it cannot capture the nonlinear nature of the true population regression function.

    Probit and logit regressions model this nonlinearity in the probabilities, but their regression coefficients are more difficult to interpret.

    So which should you use in practice? There is no one right answer, and different researchers use different models. Probit and logit regressions frequently produce similar results.

  • Logit vs. Probit

  • Comparing the LPM, Probit and Logit

    The marginal effects and predicted probabilities are much more similar across models than the raw coefficients.

    Coefficients can be compared across models using the following rough conversion factors (Amemiya 1981):

    ̂𝛽𝑙𝑜𝑔𝑖𝑡 ≃ 4 ̂𝛽𝑜𝑙𝑠
    ̂𝛽𝑝𝑟𝑜𝑏𝑖𝑡 ≃ 2.5 ̂𝛽𝑜𝑙𝑠
    ̂𝛽𝑙𝑜𝑔𝑖𝑡 ≃ 1.6 ̂𝛽𝑝𝑟𝑜𝑏𝑖𝑡

  • Example: Mortgage Applications (short regression)

  • Example: Mortgage Applications (long regression)

  • More Extensions: Categorical and Limited Dependent Variable families

    Binary outcomes: LPM, logit and probit
    Multinomial outcomes: no order, such as multi-logit and multi-probit
    Ordered outcomes: ordered response models (ordered probit and logit)
    Count outcomes: the outcome is a nonnegative integer or a count (Poisson model)
    Limited dependent variables: censored, Tobit, and selection models
    Time: duration models
