Logistic Regression
Database Marketing
Instructor: N. Kumar
Logistic Regression vs TGDA

- Two-Group Discriminant Analysis implicitly assumes that the Xs are Multivariate Normally (MVN) distributed
- This assumption is violated if the Xs are categorical variables
- Logistic Regression does not impose any restriction on the distribution of the Xs
- Logistic Regression is the recommended approach if at least some of the Xs are categorical variables
Data

Favored Stock        Less Favored Stock
Success   Size       Success   Size
1         1          0         1
1         1          0         0
1         1          0         0
1         1          0         0
1         1          0         0
1         1          0         0
1         1          0         0
1         1          0         0
1         1          0         0
1         1          0         0
1         0          0         0
1         0          0         0
Contingency Table

                 Type of Stock
                 Large   Small   Total
Preferred        10      2       12
Not Preferred    1       11      12
Total            11      13      24
Basic Concepts

Probability
- Probability of being a preferred stock = 12/24 = 0.5
- Probability that a company's stock is preferred given that the company is large = 10/11 = 0.909
- Probability that a company's stock is preferred given that the company is small = 2/13 = 0.154
Concepts … contd.

Odds
- Odds of a preferred stock = 12/12 = 1
- Odds of a preferred stock given that the company is large = 10/1 = 10
- Odds of a preferred stock given that the company is small = 2/11 = 0.182
Odds and Probability
Odds(Event) = Prob(Event)/(1-Prob(Event))
Prob(Event) = Odds(Event)/(1+Odds(Event))
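These two conversions are easy to sanity-check numerically; a minimal sketch in Python, using the counts from the contingency table:

```python
def odds(p):
    """Odds(Event) = Prob(Event) / (1 - Prob(Event))."""
    return p / (1 - p)

def prob(o):
    """Prob(Event) = Odds(Event) / (1 + Odds(Event))."""
    return o / (1 + o)

# Counts from the contingency table
print(odds(12 / 24))            # odds of a preferred stock = 1.0
print(round(odds(10 / 11), 3))  # preferred given large = 10.0
print(round(odds(2 / 13), 3))   # preferred given small = 0.182
```

Note that the two functions are inverses of each other, so converting a probability to odds and back recovers the original value.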
Logistic Regression

Take the natural log of the odds:
- ln(odds(Preferred|Large)) = ln(10) = 2.303
- ln(odds(Preferred|Small)) = ln(0.182) = -1.704

Combining these relationships:
- ln(odds(Preferred|Size)) = -1.704 + 4.007*Size
- The log of the odds is a linear function of size
- The coefficient of size can be interpreted like a coefficient in regression analysis
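The intercept and slope can be recovered directly from the two conditional odds; a quick check in Python (ln(2/11) is -1.7047, which the slide rounds via ln(0.182) to -1.704):

```python
import math

odds_large = 10 / 1   # odds(Preferred | Large)
odds_small = 2 / 11   # odds(Preferred | Small)

intercept = math.log(odds_small)                      # log-odds at Size = 0
slope = math.log(odds_large) - math.log(odds_small)   # shift when Size = 1

print(round(intercept, 3), round(slope, 3))  # -1.705 4.007
```

With a single binary predictor, the logistic regression fit reproduces the contingency table exactly, which is why these two numbers are all there is to estimate.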
Interpretation

- Positive sign: ln(odds) is increasing in the size of the company, i.e., a large company is more likely to have a preferred stock vis-à-vis a small company
- The magnitude of the coefficient gives a measure of how much more likely
General Model

ln(odds) = β0 + β1X1 + β2X2 + … + βkXk   (1)

Recall: Odds = p/(1-p)

ln(p/(1-p)) = β0 + β1X1 + β2X2 + … + βkXk   (2)

Solving (2) for p:

p = e^(β0 + β1X1 + … + βkXk) / (1 + e^(β0 + β1X1 + … + βkXk))
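The closed-form expression for p is straightforward to code; a small sketch (the coefficient values -1.705 and 4.007 are the ones derived from the contingency table earlier):

```python
import math

def logistic_p(xs, betas):
    """p = e^(b0 + sum bj*xj) / (1 + e^(b0 + sum bj*xj)) for one observation."""
    z = betas[0] + sum(b * x for b, x in zip(betas[1:], xs))
    return math.exp(z) / (1 + math.exp(z))

betas = [-1.705, 4.007]                  # intercept and Size coefficient
print(round(logistic_p([1], betas), 3))  # large company: 0.909
print(round(logistic_p([0], betas), 3))  # small company: 0.154
```

The model recovers exactly the conditional probabilities computed from the contingency table, as it must with a single binary predictor.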
Logistic Function
[Figure: the S-shaped logistic curve, with p (0 to 1) on the vertical axis plotted against X (-50 to 50) on the horizontal axis]
Estimation

- Coefficients in the linear regression model are estimated by minimizing the sum of squared errors
- Since p is non-linear in the parameters, we need a non-linear estimation technique:
  - Maximum-Likelihood Approach
  - Non-Linear Least Squares
Maximum Likelihood Approach

- Conditional on the parameter vector β, write out the probability of observing the data
- Write this probability out for each observation
- Multiply the probabilities of the observations together to get the joint probability of observing the data conditional on β
- Find the β that maximizes this conditional probability of realizing the data
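For the 24-stock example, these steps can be written out with nothing more than the standard library; a sketch that also reproduces the -2 Log L values reported in the output discussed below (working on the log scale, so the product of probabilities becomes a sum of logs):

```python
import math

# (success, size) pairs implied by the contingency table
data = [(1, 1)] * 10 + [(1, 0)] * 2 + [(0, 1)] * 1 + [(0, 0)] * 11

def log_likelihood(b0, b1):
    """Log of the joint probability of the data conditional on (b0, b1)."""
    ll = 0.0
    for y, x in data:
        p = 1 / (1 + math.exp(-(b0 + b1 * x)))   # P(success = 1 | x)
        ll += math.log(p if y == 1 else 1 - p)   # ln of this observation's probability
    return ll

print(round(-2 * log_likelihood(0.0, 0.0), 3))         # intercept only: 33.271
print(round(-2 * log_likelihood(-1.7047, 4.0073), 3))  # with Size: 17.864
```

Maximizing the likelihood is the same as minimizing -2 Log L, which is why the fitted coefficients from the contingency table give the 17.864 figure.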
Logistic Regression
Logistic Regression with one categorical explanatory variable reduces to an analysis of the contingency table
Interpretation of Results

- Look at the -2 Log L statistic:
  - Intercept only: 33.271
  - Intercept and covariates: 17.864
  - Difference: 15.407 with 1 DF (p = 0.0001)
- This means that the size variable is explaining a lot

Do the Variables Have a Significant Impact?

- This is like testing whether the coefficients in the regression model are different from zero
- Look at the output from Analysis of Maximum Likelihood Estimates
- Loosely, the Pr > Chi-Square column gives you the probability of realizing the value in the Parameter Estimate column if the true coefficient were zero; if this value is < 0.05 the estimate is considered significant
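The likelihood-ratio comparison above can be checked in a couple of lines; the chi-square(1) upper-tail probability is computable from the complementary error function (a sketch, taking the two -2 Log L values as given from the output):

```python
import math

neg2ll_intercept_only = 33.271   # from the output: intercept only
neg2ll_full = 17.864             # intercept and covariates

lr_stat = neg2ll_intercept_only - neg2ll_full
# For a chi-square variable with 1 DF, P(X > x) = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(lr_stat / 2))

print(round(lr_stat, 3))   # 15.407
print(round(p_value, 4))   # 0.0001
```

A p-value this small says the drop in -2 Log L is far larger than chance would produce if Size had no effect.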
Other Things to Look For

- Akaike's Information Criterion (AIC) and Schwarz's Criterion (SC): these are like Adj-R2 in that there is a penalty for having additional covariates
- The larger the difference between the intercept-only and intercept-and-covariates values, the better the model fit
Interpretation of the Parameter Estimates

ln(p/(1-p)) = -1.705 + 4.007*Size

p/(1-p) = e^(-1.705) * e^(4.007*Size)

For a unit increase in size, the odds of being a favored stock go up by a factor of e^4.007 = 54.982
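The odds-ratio arithmetic is a one-liner, and comparing the odds at Size = 1 against Size = 0 shows where the factor comes from:

```python
import math

odds_small = math.exp(-1.705)            # odds at Size = 0
odds_large = math.exp(-1.705 + 4.007)    # odds at Size = 1

print(round(odds_large / odds_small, 3))  # 54.982
print(round(math.exp(4.007), 3))          # the same factor, e^4.007
```

The intercept cancels in the ratio, so the odds ratio depends only on the Size coefficient.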
Predicted Probabilities and Observed Responses

- The response variable (success) classifies an observation as an event or a no-event
- A concordant pair is a pair formed by an event with a PHAT higher than that of the no-event
- The higher the concordant-pair %, the better
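Concordance can be computed by comparing every event with every no-event; a small sketch (the PHAT values here are the fitted probabilities from the Size-only model, and the resulting 76.4% is just what this toy dataset yields, not a figure from the original output):

```python
# (observed y, predicted PHAT) for the 24 stocks under the Size-only model
pairs = ([(1, 0.909)] * 10 + [(1, 0.154)] * 2 +
         [(0, 0.909)] * 1 + [(0, 0.154)] * 11)

def concordant_pct(pairs):
    """% of event/no-event pairs in which the event has the higher PHAT."""
    events = [p for y, p in pairs if y == 1]
    noevents = [p for y, p in pairs if y == 0]
    concordant = sum(pe > pn for pe in events for pn in noevents)
    return 100 * concordant / (len(events) * len(noevents))

print(round(concordant_pct(pairs), 1))  # 76.4
```

Pairs with equal PHATs (ties) count as neither concordant nor discordant, which is why the figure falls short of 100% even though the model separates large and small firms well.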
Classification

- For a set of new observations where you have information on size alone
- You can use the model to predict the probability that success = 1, i.e., that the stock is favored
- If PHAT > 0.5, predict success = 1; else predict success = 0
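The cutoff rule as code (a minimal sketch; 0.5 is the cutoff named on the slide, though other cutoffs can be used):

```python
def classify(phat, cutoff=0.5):
    """Predict success = 1 when PHAT exceeds the cutoff, else 0."""
    return 1 if phat > cutoff else 0

print(classify(0.909))  # a large company is classified as favored: 1
print(classify(0.154))  # a small company is classified as not favored: 0
```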
Logistic Regression with multiple independent variables
Independent variables a mixture of continuous and categorical variables
Data

Favored Stock              Less Favored Stock
Success   Size   fp        Success   Size   fp
1         1      0.58      0         1      2.28
1         1      2.8       0         0      1.06
1         1      2.77      0         0      1.08
1         1      3.5       0         0      0.07
1         1      2.67      0         0      0.16
1         1      2.97      0         0      0.7
1         1      2.18      0         0      0.75
1         1      3.24      0         0      1.61
1         1      1.49      0         0      0.34
1         1      2.19      0         0      1.15
1         0      2.7       0         0      0.44
1         0      2.57      0         0      0.86
General Model

ln(odds) = β0 + β1*Size + β2*FP

ln(p/(1-p)) = β0 + β1*Size + β2*FP

Solving for p:

p = e^(β0 + β1*Size + β2*FP) / (1 + e^(β0 + β1*Size + β2*FP))
Estimation & Interpretation of the Results
Identical to the case with one categorical variable
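A minimal sketch of fitting this two-covariate model by maximum likelihood, using plain gradient ascent on the log-likelihood (the data are the 24 rows above; the step size and iteration count are arbitrary choices for illustration, and a real analysis would use a packaged routine such as SAS PROC LOGISTIC):

```python
import math

# (success, size, fp) rows from the data slide
data = [(1, 1, 0.58), (1, 1, 2.8), (1, 1, 2.77), (1, 1, 3.5), (1, 1, 2.67),
        (1, 1, 2.97), (1, 1, 2.18), (1, 1, 3.24), (1, 1, 1.49), (1, 1, 2.19),
        (1, 0, 2.7), (1, 0, 2.57),
        (0, 1, 2.28), (0, 0, 1.06), (0, 0, 1.08), (0, 0, 0.07), (0, 0, 0.16),
        (0, 0, 0.7), (0, 0, 0.75), (0, 0, 1.61), (0, 0, 0.34), (0, 0, 1.15),
        (0, 0, 0.44), (0, 0, 0.86)]

def fit(data, step=0.03, iters=15000):
    """Gradient ascent on the log-likelihood; returns [b0, b_size, b_fp]."""
    b = [0.0, 0.0, 0.0]
    for _ in range(iters):
        grad = [0.0, 0.0, 0.0]
        for y, size, fp in data:
            p = 1 / (1 + math.exp(-(b[0] + b[1] * size + b[2] * fp)))
            for j, xj in enumerate((1.0, size, fp)):
                grad[j] += (y - p) * xj   # score contribution of this row
        b = [bj + step * gj for bj, gj in zip(b, grad)]
    return b

b0, b_size, b_fp = fit(data)
print(b_size > 0, b_fp > 0)  # True True: both covariates raise the log-odds
```

As the slide says, interpretation is the same as before: each coefficient is the change in the log-odds for a unit change in that covariate, holding the other fixed.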
Summary

- Logistic Regression or Discriminant Analysis?
- The techniques differ in their underlying assumptions about the distribution of the explanatory (independent) variables
- Use logistic regression if you have a mix of categorical and continuous variables