Logistic Regression
Database Marketing
Instructor: N. Kumar
Logistic Regression vs TGDA

- Two-Group Discriminant Analysis implicitly assumes that the Xs are Multivariate Normally (MVN) distributed
- This assumption is violated if the Xs are categorical variables
- Logistic Regression does not impose any restriction on the distribution of the Xs
- Logistic Regression is the recommended approach if at least some of the Xs are categorical variables
Data

Favored Stock        Less Favored Stock
Success   Size       Success   Size
1         1          0         1
1         1          0         0
1         1          0         0
1         1          0         0
1         1          0         0
1         1          0         0
1         1          0         0
1         1          0         0
1         1          0         0
1         1          0         0
1         0          0         0
1         0          0         0
Contingency Table

                 Type of Stock
                 Large   Small   Total
Preferred        10      2       12
Not Preferred    1       11      12
Total            11      13      24
Basic Concepts

Probability
- Probability of being a preferred stock = 12/24 = 0.5
- Probability that a company's stock is preferred given that the company is large = 10/11 = 0.909
- Probability that a company's stock is preferred given that the company is small = 2/13 = 0.154
Concepts … contd.

Odds
- Odds of a preferred stock = 12/12 = 1
- Odds of a preferred stock given that the company is large = 10/1 = 10
- Odds of a preferred stock given that the company is small = 2/11 = 0.182
Odds and Probability
Odds(Event) = Prob(Event)/(1-Prob(Event))
Prob(Event) = Odds(Event)/(1+Odds(Event))
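These two conversions are easy to sanity-check numerically; a minimal sketch in Python, using the counts from the contingency table:

```python
def odds(p):
    """Odds(Event) = Prob(Event) / (1 - Prob(Event))."""
    return p / (1 - p)

def prob(o):
    """Prob(Event) = Odds(Event) / (1 + Odds(Event))."""
    return o / (1 + o)

# Counts from the contingency table
print(odds(12 / 24))            # odds of a preferred stock = 1.0
print(round(odds(10 / 11), 3))  # preferred given large = 10.0
print(round(odds(2 / 13), 3))   # preferred given small = 0.182
```

Note that the two functions are inverses of each other, so converting a probability to odds and back recovers the original value.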
Logistic Regression

Take the natural log of the odds:
- ln(odds(Preferred|Large)) = ln(10) = 2.303
- ln(odds(Preferred|Small)) = ln(0.182) = -1.704

Combining these relationships:
- ln(odds(Preferred|Size)) = -1.704 + 4.007*Size
- The log of the odds is a linear function of size
- The coefficient of size can be interpreted like a coefficient in regression analysis
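The intercept and slope can be recovered directly from the two conditional odds; a quick check in Python (ln(2/11) is -1.7047, which the slide rounds via ln(0.182) to -1.704):

```python
import math

odds_large = 10 / 1   # odds(Preferred | Large)
odds_small = 2 / 11   # odds(Preferred | Small)

intercept = math.log(odds_small)                      # log-odds at Size = 0
slope = math.log(odds_large) - math.log(odds_small)   # shift when Size = 1

print(round(intercept, 3), round(slope, 3))  # -1.705 4.007
```

With a single binary predictor, the logistic regression fit reproduces the contingency table exactly, which is why these two numbers are all there is to estimate.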
Interpretation

- Positive sign: ln(odds) is increasing in the size of the company, i.e., a large company is more likely to have a preferred stock vis-à-vis a small company
- The magnitude of the coefficient gives a measure of how much more likely
General Model

ln(odds) = β0 + β1X1 + β2X2 + … + βkXk   (1)

Recall: Odds = p/(1-p)

ln(p/(1-p)) = β0 + β1X1 + β2X2 + … + βkXk   (2)

Solving (2) for p:

p = e^(β0 + β1X1 + … + βkXk) / (1 + e^(β0 + β1X1 + … + βkXk))
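The closed-form expression for p is straightforward to code; a small sketch (the coefficient values -1.705 and 4.007 are the ones derived from the contingency table earlier):

```python
import math

def logistic_p(xs, betas):
    """p = e^(b0 + sum bj*xj) / (1 + e^(b0 + sum bj*xj)) for one observation."""
    z = betas[0] + sum(b * x for b, x in zip(betas[1:], xs))
    return math.exp(z) / (1 + math.exp(z))

betas = [-1.705, 4.007]                  # intercept and Size coefficient
print(round(logistic_p([1], betas), 3))  # large company: 0.909
print(round(logistic_p([0], betas), 3))  # small company: 0.154
```

The model recovers exactly the conditional probabilities computed from the contingency table, as it must with a single binary predictor.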
Logistic Function
[Figure: the S-shaped logistic curve, with p (0 to 1) on the vertical axis plotted against X (-50 to 50) on the horizontal axis]
Estimation

- Coefficients in the linear regression model are estimated by minimizing the sum of squared errors
- Since p is non-linear in the parameters, we need a non-linear estimation technique:
  - Maximum-Likelihood Approach
  - Non-Linear Least Squares
Maximum Likelihood Approach

- Conditional on the parameter vector β, write out the probability of observing the data
- Write this probability out for each observation
- Multiply the probabilities of the observations together to get the joint probability of observing the data conditional on β
- Find the β that maximizes this conditional probability of realizing the data
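For the 24-stock example, these steps can be written out with nothing more than the standard library; a sketch that also reproduces the -2 Log L values reported in the output discussed below (working on the log scale, so the product of probabilities becomes a sum of logs):

```python
import math

# (success, size) pairs implied by the contingency table
data = [(1, 1)] * 10 + [(1, 0)] * 2 + [(0, 1)] * 1 + [(0, 0)] * 11

def log_likelihood(b0, b1):
    """Log of the joint probability of the data conditional on (b0, b1)."""
    ll = 0.0
    for y, x in data:
        p = 1 / (1 + math.exp(-(b0 + b1 * x)))   # P(success = 1 | x)
        ll += math.log(p if y == 1 else 1 - p)   # ln of this observation's probability
    return ll

print(round(-2 * log_likelihood(0.0, 0.0), 3))         # intercept only: 33.271
print(round(-2 * log_likelihood(-1.7047, 4.0073), 3))  # with Size: 17.864
```

Maximizing the likelihood is the same as minimizing -2 Log L, which is why the fitted coefficients from the contingency table give the 17.864 figure.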
Logistic Regression
Logistic Regression with one categorical explanatory variable reduces to an analysis of the contingency table
Interpretation of Results

- Look at the -2 Log L statistic:
  - Intercept only: 33.271
  - Intercept and covariates: 17.864
  - Difference: 15.407 with 1 DF (p = 0.0001)
- This means that the size variable is explaining a lot

Do the Variables Have a Significant Impact?

- This is like testing whether the coefficients in the regression model are different from zero
- Look at the output from Analysis of Maximum Likelihood Estimates
- Loosely, the Pr > Chi-Square column gives you the probability of realizing the value in the Parameter Estimate column if the true coefficient were zero; if this value is < 0.05 the estimate is considered significant
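The likelihood-ratio comparison above can be checked in a couple of lines; the chi-square(1) upper-tail probability is computable from the complementary error function (a sketch, taking the two -2 Log L values as given from the output):

```python
import math

neg2ll_intercept_only = 33.271   # from the output: intercept only
neg2ll_full = 17.864             # intercept and covariates

lr_stat = neg2ll_intercept_only - neg2ll_full
# For a chi-square variable with 1 DF, P(X > x) = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(lr_stat / 2))

print(round(lr_stat, 3))   # 15.407
print(round(p_value, 4))   # 0.0001
```

A p-value this small says the drop in -2 Log L is far larger than chance would produce if Size had no effect.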
Other Things to Look For

- Akaike's Information Criterion (AIC) and Schwarz's Criterion (SC): these are like Adj-R2 in that there is a penalty for having additional covariates
- The larger the difference between the intercept-only and intercept-and-covariates values, the better the model fit
Interpretation of the Parameter Estimates

ln(p/(1-p)) = -1.705 + 4.007*Size

p/(1-p) = e^(-1.705) * e^(4.007*Size)

For a unit increase in size, the odds of being a favored stock go up by a factor of e^4.007 = 54.982
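The odds-ratio arithmetic is a one-liner, and comparing the odds at Size = 1 against Size = 0 shows where the factor comes from:

```python
import math

odds_small = math.exp(-1.705)            # odds at Size = 0
odds_large = math.exp(-1.705 + 4.007)    # odds at Size = 1

print(round(odds_large / odds_small, 3))  # 54.982
print(round(math.exp(4.007), 3))          # the same factor, e^4.007
```

The intercept cancels in the ratio, so the odds ratio depends only on the Size coefficient.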
Predicted Probabilities and Observed Responses

- The response variable (success) classifies an observation as an event or a no-event
- A concordant pair is a pair formed by an event with a PHAT higher than that of the no-event
- The higher the concordant-pair %, the better
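Concordance can be computed by comparing every event with every no-event; a small sketch (the PHAT values here are the fitted probabilities from the Size-only model, and the resulting 76.4% is just what this toy dataset yields, not a figure from the original output):

```python
# (observed y, predicted PHAT) for the 24 stocks under the Size-only model
pairs = ([(1, 0.909)] * 10 + [(1, 0.154)] * 2 +
         [(0, 0.909)] * 1 + [(0, 0.154)] * 11)

def concordant_pct(pairs):
    """% of event/no-event pairs in which the event has the higher PHAT."""
    events = [p for y, p in pairs if y == 1]
    noevents = [p for y, p in pairs if y == 0]
    concordant = sum(pe > pn for pe in events for pn in noevents)
    return 100 * concordant / (len(events) * len(noevents))

print(round(concordant_pct(pairs), 1))  # 76.4
```

Pairs with equal PHATs (ties) count as neither concordant nor discordant, which is why the figure falls short of 100% even though the model separates large and small firms well.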
Classification

- For a set of new observations where you have information on size alone
- You can use the model to predict the probability that success = 1, i.e., that the stock is favored
- If PHAT > 0.5, predict success = 1; else predict success = 0
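The cutoff rule as code (a minimal sketch; 0.5 is the cutoff named on the slide, though other cutoffs can be used):

```python
def classify(phat, cutoff=0.5):
    """Predict success = 1 when PHAT exceeds the cutoff, else 0."""
    return 1 if phat > cutoff else 0

print(classify(0.909))  # a large company is classified as favored: 1
print(classify(0.154))  # a small company is classified as not favored: 0
```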
Logistic Regression with multiple independent variables
Independent variables a mixture of continuous and categorical variables
Data

Favored Stock              Less Favored Stock
Success   Size   fp        Success   Size   fp
1         1      0.58      0         1      2.28
1         1      2.8       0         0      1.06
1         1      2.77      0         0      1.08
1         1      3.5       0         0      0.07
1         1      2.67      0         0      0.16
1         1      2.97      0         0      0.7
1         1      2.18      0         0      0.75
1         1      3.24      0         0      1.61
1         1      1.49      0         0      0.34
1         1      2.19      0         0      1.15
1         0      2.7       0         0      0.44
1         0      2.57      0         0      0.86
General Model

ln(odds) = β0 + β1*Size + β2*FP

ln(p/(1-p)) = β0 + β1*Size + β2*FP

Solving for p:

p = e^(β0 + β1*Size + β2*FP) / (1 + e^(β0 + β1*Size + β2*FP))
Estimation & Interpretation of the Results
Identical to the case with one categorical variable
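A minimal sketch of fitting this two-covariate model by maximum likelihood, using plain gradient ascent on the log-likelihood (the data are the 24 rows above; the step size and iteration count are arbitrary choices for illustration, and a real analysis would use a packaged routine such as SAS PROC LOGISTIC):

```python
import math

# (success, size, fp) rows from the data slide
data = [(1, 1, 0.58), (1, 1, 2.8), (1, 1, 2.77), (1, 1, 3.5), (1, 1, 2.67),
        (1, 1, 2.97), (1, 1, 2.18), (1, 1, 3.24), (1, 1, 1.49), (1, 1, 2.19),
        (1, 0, 2.7), (1, 0, 2.57),
        (0, 1, 2.28), (0, 0, 1.06), (0, 0, 1.08), (0, 0, 0.07), (0, 0, 0.16),
        (0, 0, 0.7), (0, 0, 0.75), (0, 0, 1.61), (0, 0, 0.34), (0, 0, 1.15),
        (0, 0, 0.44), (0, 0, 0.86)]

def fit(data, step=0.03, iters=15000):
    """Gradient ascent on the log-likelihood; returns [b0, b_size, b_fp]."""
    b = [0.0, 0.0, 0.0]
    for _ in range(iters):
        grad = [0.0, 0.0, 0.0]
        for y, size, fp in data:
            p = 1 / (1 + math.exp(-(b[0] + b[1] * size + b[2] * fp)))
            for j, xj in enumerate((1.0, size, fp)):
                grad[j] += (y - p) * xj   # score contribution of this row
        b = [bj + step * gj for bj, gj in zip(b, grad)]
    return b

b0, b_size, b_fp = fit(data)
print(b_size > 0, b_fp > 0)  # True True: both covariates raise the log-odds
```

As the slide says, interpretation is the same as before: each coefficient is the change in the log-odds for a unit change in that covariate, holding the other fixed.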
Summary

- Logistic Regression or Discriminant Analysis?
- The techniques differ in their underlying assumptions about the distribution of the explanatory (independent) variables
- Use logistic regression if you have a mix of categorical and continuous variables