logistic regression - gtk.uni-miskolc.hugtk.uni-miskolc.hu/files/11485/l10_logistic+reg.pdf · •...

• Gazdaságtudományi Kar (Faculty of Economics)

• Gazdaságelméleti és Módszertani Intézet (Institute of Economic Theory and Methodology)

Logistic regression

Quantitative Statistical Methods

Dr. Szilágyi Roland


Connection Analysis

                          Independent variable (x)
Dependent variable (y)    Qualitative    Quantitative
Qualitative               Crosstabs      Discriminant analysis, logistic regression
Quantitative              ANOVA          Correlation and regression analysis


Logistic regression

• Logistic regression is a multivariate method that helps predict the classification of cases into groups on the basis of independent variables. The analysis identifies those independent variables (x) that cause significant differences between the categories of the dependent variable.

– Binary (the dependent variable has two categories)

– Multinomial (the dependent variable has more than two categories)


Logistic regression in practice

• Market research

• Modelling (buy or not buy)

• Segmentation reliability

• Enterprise analysis (default, non default)

• etc.


Stages of Analysis

1. General Purpose

2. Assumptions

3. Estimation of Function Coefficients

4. Interpretation of Results

5. Validity Tests


General purposes

• To create the logistic regression function, a linear combination of the independent variables that best separates the categories of the dependent variable.

• To determine whether the groups differ significantly with respect to the independent variables.

• To determine which independent variables explain most of the differences among groups.

• To predict the group membership of new cases from their independent variables, based on the experience gained from a known classification.

• To measure the accuracy of classification.


The Assumptions for Logistic regression
1. Measure of variables

• The dependent variable should be categorical, with m (at least 2) labelled values (e.g. 1 = good student, 2 = bad student; or 1 = prominent student, 2 = average student, 3 = bad student).

• Independent variables can be measured on any scale.


The Assumptions for Logistic regression
2. Independence

Not only the explanatory variables, but also all cases must be independent. Therefore, panel, longitudinal, or pre-test data cannot be used for logistic regression analysis.


The Assumptions for Logistic regression
3. Sample size

As a general rule, the larger the sample, the more reliable the model estimates and the easier it is to detect significant effects. The ratio of the number of observations to the number of variables is also important. The results generalize better if there are at least 60 observations.


The Assumptions for Logistic regression
4. Multivariate normal distribution

In the case of a normal distribution, the estimation of the parameters is easier, because the parameters can be defined through the density or distribution function. Normality can be checked with histograms of the frequency distributions or with hypothesis tests.


The Assumptions for Logistic regression
5. Multicollinearity

Independent variables should be correlated with the dependent variable, but they should not be correlated with one another, because multicollinearity can bias the results of the analysis.


Binary Logistic Regression

• The logistic function is useful because it can take any linear combination of the independent variables (Xi) as input, whereas its output always lies between zero and one and hence is interpretable as a probability. The logistic function is defined as follows:

\[
F(x) = P(Y_i = 1 \mid X) = \frac{e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}}
\]

• Note that F(x) is interpreted as the probability of the dependent variable equaling a "success" or "case" rather than a failure or non-case.
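A minimal Python sketch of the logistic function defined on this slide, using only the standard library. The function name `logistic_probability` and the coefficient values in the usage lines are illustrative assumptions, not values from the lecture.

```python
import math

def logistic_probability(coefficients, intercept, x):
    """Return P(Y = 1 | x) for the linear predictor b0 + b1*x1 + ... + bp*xp."""
    z = intercept + sum(b * xi for b, xi in zip(coefficients, x))
    # Numerically stable evaluation: never call exp() on a large positive number.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Hypothetical coefficients, chosen only to show the shape of the function.
p_mid = logistic_probability([0.5, -1.0], 0.0, [2.0, 1.0])    # linear predictor is 0
p_high = logistic_probability([0.5, -1.0], 0.0, [40.0, 1.0])  # linear predictor is 19
```

Whatever the size of the linear predictor, the result stays within the interval from zero to one, which is what allows it to be read as a probability.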


Binary Logistic Regression

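• We can now define the inverse of the logistic function, the "logit" (log odds):

\[
Y = \ln\frac{F(x)}{1 - F(x)} = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p
\]

after exponentiating

\[
\mathrm{odds}_x = e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}
\]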



• "The odds of the dependent variable equaling a case (given some linear combination x of the predictors) is equivalent to the exponential function of the linear regression expression."

\[
\mathrm{odds}_x = \frac{P_x}{1 - P_x}
\]

\[
P_x = \frac{e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}}
\]
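The link between the probability and the odds can be checked in two lines. The symbol η is shorthand introduced here for the linear predictor; it does not appear on the slides.

```latex
\[
\eta = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p, \qquad
P_x = \frac{e^{\eta}}{1 + e^{\eta}}, \qquad
1 - P_x = \frac{1}{1 + e^{\eta}}
\]
\[
\mathrm{odds}_x = \frac{P_x}{1 - P_x} = e^{\eta}, \qquad
\ln(\mathrm{odds}_x) = \eta
\]
```

Dividing P_x by 1 − P_x cancels the common denominator, which is why the odds are exactly the exponential of the linear expression.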


Maximum Likelihood Method

• The maximum likelihood method finds a set of coefficients (β), called the maximum likelihood estimates, at which the log-likelihood function attains its local maximum:

\[
L = \prod_{i=1}^{n} \frac{e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}} \rightarrow \max
\]

Source: Hajdu Ottó: Többváltozós statisztikai számítások; KSH, Budapest, 2003.
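As a rough illustration of how such estimates can be found, the sketch below maximizes the standard Bernoulli log-likelihood, the sum over cases of y·ln(P) + (1 − y)·ln(1 − P), for one predictor with Newton-Raphson steps. The slides do not name an algorithm, so Newton-Raphson is an assumption here, and the data `XS`, `YS` and all function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(b0, b1, xs, ys):
    """Bernoulli log-likelihood of a one-predictor logistic model."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def fit_newton(xs, ys, iterations=25):
    """Newton-Raphson for intercept b0 and slope b1, starting from zero."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0            # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0    # negative Hessian (X'WX)
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system (X'WX) * step = gradient by hand.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical toy data: the two classes overlap, so finite estimates exist.
XS = [0, 1, 2, 3, 4, 5, 6, 7]
YS = [0, 0, 1, 0, 1, 1, 1, 1]
b0_hat, b1_hat = fit_newton(XS, YS)
```

At the maximum the gradient is zero, so the fitted probabilities sum to the observed number of cases; this is a convenient check that the iteration has converged.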


Tests of Model Fit

The Binary Logistic Regression procedure reports the Hosmer–Lemeshow goodness-of-fit statistic. It helps you determine whether the model adequately describes the data.

H0: the model fits

H1: the model does not fit

The Hosmer–Lemeshow test groups the cases into subgroups, typically deciles of the fitted risk values. Models for which the expected and observed event rates in the subgroups are similar (compared with a chi-square statistic) are called well fitted (well calibrated).


Testing of parameters (β)

\[
H_0: \beta_i = 0
\]

\[
H_1: \beta_i \neq 0
\]

\[
\mathrm{Wald} = \left( \frac{b_i}{s(b_i)} \right)^2
\]
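A small sketch of the Wald test above, assuming the statistic is compared with a chi-square distribution on 1 degree of freedom (consistent with the df column of the output tables later in the lecture). The function name `wald_test` is hypothetical; the inputs are the rounded Step 4 constant and its standard error from the "Variables in the Equation" table.

```python
import math

def wald_test(b, se):
    """Return the Wald statistic (b / se)**2 and its p-value on 1 df."""
    wald = (b / se) ** 2
    # For 1 degree of freedom, P(chi-square > w) = erfc(sqrt(w / 2)).
    p_value = math.erfc(math.sqrt(wald / 2.0))
    return wald, p_value

# Step 4 constant: B = -0.605, S.E. = 0.301 (rounded values from the table).
wald, p_value = wald_test(-0.605, 0.301)
```

Because B and S.E. are rounded to three decimals, the result is close to, but not exactly, the 4,034 and ,045 reported by SPSS.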


Choosing the Right Model

• Based on residual sum of squares (linear regression)

• Based on Likelihood ratio (compare the Likelihood of the model with the Likelihood of a baseline (minimal) model)

• Proportion of good predictions.


Pseudo R2

• Cox and Snell's R2 is based on the log likelihood for the model compared to the log likelihood for a baseline model. However, with categorical outcomes, it has a theoretical maximum value of less than 1, even for a "perfect" model.

• Nagelkerke's R2 is an adjusted version of the Cox & Snell R-square that adjusts the scale of the statistic to cover the full range from 0 to 1.
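For reference, the usual textbook definitions of the two measures are given below. The notation is an assumption of this note: L_0 is the likelihood of the baseline model, L_1 the likelihood of the fitted model, n the number of cases, and D = −2 ln L the "-2 Log likelihood" reported by SPSS.

```latex
\[
R^2_{CS} = 1 - \left( \frac{L_0}{L_1} \right)^{2/n}
         = 1 - \exp\!\left( -\frac{D_0 - D_1}{n} \right)
\]
\[
R^2_{N} = \frac{R^2_{CS}}{1 - L_0^{2/n}}
        = \frac{R^2_{CS}}{1 - \exp\left( -D_0 / n \right)}
\]
```

The denominator of Nagelkerke's measure is the largest value Cox and Snell's measure can reach, which is what rescales the statistic to the range from 0 to 1.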


Example

• If you are a loan officer at a bank, you want to identify characteristics that are indicative of people who are likely to default on loans, and use those characteristics to identify good and bad credit risks.

Variables

Age in years

Level of education

Years with current employer

Years at current address

Household income in thousands

Debt to income ratio (x100)

Credit card debt in thousands

Other debt in thousands

Previously defaulted


Outputs

Classification Table (a, b)

                                        Predicted
                                        Selected Cases (c)                Unselected Cases (d, e)
                                        Previously defaulted  Percentage  Previously defaulted  Percentage
        Observed                        No        Yes         Correct     No        Yes         Correct
Step 0  Previously defaulted  No        375       0           100,0       142       0           100,0
                              Yes       124       0           ,0          59        0           ,0
        Overall Percentage                                    75,2                              70,6

a. Constant is included in the model.

b. The cut value is ,500

Source: Help- IBM SPSS Statistics


Hosmer and Lemeshow Test

Step  Chi-square  df  Sig.
1     3,292       8   ,915
2     11,866      8   ,157
3     9,447       8   ,306
4     4,027       8   ,855

Source: Help- IBM SPSS Statistics
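The Sig. column of the Hosmer and Lemeshow table can be reproduced from the chi-square values. The sketch below relies on the closed form of the chi-square upper-tail probability that holds for an even number of degrees of freedom; the function name `chi2_sf_even_df` is hypothetical, and decimal commas from the output are written as decimal points.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability P(chi-square > x); valid only for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    # Sum of (x/2)^j / j! for j = 0 .. df/2 - 1.
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total

# Chi-square values for steps 1 to 4, each with df = 8.
hl_chi_square = {1: 3.292, 2: 11.866, 3: 9.447, 4: 4.027}
hl_sig = {step: chi2_sf_even_df(value, 8) for step, value in hl_chi_square.items()}
```

All four probabilities are well above 0.05, so H0 (the model fits) is not rejected at any step.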


Model Summary

Step  -2 Log likelihood  Cox & Snell R Square  Nagelkerke R Square
1     498,012 (a)        ,116                  ,172
2     447,301 (b)        ,201                  ,299
3     411,553 (b)        ,257                  ,381
4     394,721 (c)        ,281                  ,417

Source : Help- IBM SPSS Statistics
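The pseudo R-square columns can be recomputed from the "-2 Log likelihood" values. The sketch assumes that the baseline is the constant-only Step 0 model fitted on the 499 selected cases (375 "No", 124 "Yes" in the Step 0 classification table), and that the measures are Cox and Snell's 1 − exp(−(D0 − D1)/n) and Nagelkerke's rescaling by 1 − exp(−D0/n), where D is the -2 log likelihood. Function names are hypothetical.

```python
import math

def null_deviance(n_no, n_yes):
    """-2 log likelihood of a constant-only model with the observed class shares."""
    n = n_no + n_yes
    return -2.0 * (n_no * math.log(n_no / n) + n_yes * math.log(n_yes / n))

def cox_snell_r2(d0, d1, n):
    return 1.0 - math.exp(-(d0 - d1) / n)

def nagelkerke_r2(d0, d1, n):
    # Divide by the maximum attainable Cox & Snell value.
    return cox_snell_r2(d0, d1, n) / (1.0 - math.exp(-d0 / n))

N_SELECTED = 375 + 124                  # selected cases in the Step 0 table
D0 = null_deviance(375, 124)            # assumed baseline deviance
MINUS_2LL = {1: 498.012, 2: 447.301, 3: 411.553, 4: 394.721}

pseudo_r2 = {
    step: (cox_snell_r2(D0, d1, N_SELECTED), nagelkerke_r2(D0, d1, N_SELECTED))
    for step, d1 in MINUS_2LL.items()
}
```

Under these assumptions the computed pairs agree with the table to three decimals, for example about 0.281 and 0.417 at Step 4.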


Classification Table (a)

                                        Predicted
                                        Selected Cases (b)                Unselected Cases (c, d)
                                        Previously defaulted  Percentage  Previously defaulted  Percentage
        Observed                        No        Yes         Correct     No        Yes         Correct
Step 1  Previously defaulted  No        361       14          96,3        137       5           96,5
                              Yes       100       24          19,4        45        14          23,7
Step 2  Previously defaulted  No        351       24          93,6        136       6           95,8
                              Yes       80        44          35,5        36        23          39,0
Step 3  Previously defaulted  No        348       27          92,8        135       7           95,1
                              Yes       72        52          41,9        28        31          52,5
Step 4  Previously defaulted  No        352       23          93,9        130       12          91,5
                              Yes       67        57          46,0        27        32          54,2

a. The cut value is ,500


Classification table (Confusion matrix)

                  Predicted no (0)                    Predicted yes (1)
Observed no (0)   True negative (TN)                  False positive (FP), Type I error         → specificity: TN/(TN+FP)
Observed yes (1)  False negative (FN), Type II error  True positive (TP)                        → sensitivity: TP/(FN+TP)
                  ↓                                   ↓
                  negative predictive value:          positive predictive value (precision):
                  TN/(TN+FN)                          TP/(FP+TP)

Accuracy: (TP+TN)/(TN+FP+FN+TP)
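The measures in the confusion matrix can be computed directly from the four cell counts. The sketch below applies them to the Step 4 selected cases of the classification table (TN = 352, FP = 23, FN = 67, TP = 57); the function name `classification_metrics` is hypothetical.

```python
def classification_metrics(tn, fp, fn, tp):
    """Return the standard ratios derived from a 2x2 confusion matrix."""
    return {
        "specificity": tn / (tn + fp),
        "sensitivity": tp / (fn + tp),
        "negative_predictive_value": tn / (tn + fn),
        "positive_predictive_value": tp / (fp + tp),
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
    }

# Step 4, selected cases, cut value 0.5.
metrics = classification_metrics(tn=352, fp=23, fn=67, tp=57)
```

Specificity and sensitivity correspond to the "Percentage Correct" entries of the "No" and "Yes" rows (93,9 and 46,0), which shows that the model identifies non-defaulters far better than defaulters at this cut value.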


Variables in the Equation

                                                                                    95% C.I. for Exp(B)
Step        Variable                       B       S.E.  Wald     df  Sig.  Exp(B)  Lower  Upper
Step 1 (a)  Debt to income ratio (x100)    ,121    ,017  52,676   1   ,000  1,129   1,092  1,166
            Constant                       -2,476  ,230  116,315  1   ,000  ,084
Step 2 (b)  Years with current employer    -,140   ,023  38,158   1   ,000  ,869    ,831   ,909
            Debt to income ratio (x100)    ,134    ,018  54,659   1   ,000  1,143   1,103  1,185
            Constant                       -1,621  ,259  39,038   1   ,000  ,198
Step 3 (c)  Years with current employer    -,244   ,033  54,676   1   ,000  ,783    ,734   ,836

            Credit card debt in thousands  ,506    ,101  25,127   1   ,000  1,658   1,361  2,021
            Constant                       -1,058  ,280  14,249   1   ,000  ,347
Step 4 (d)  Years with current employer    -,247   ,034  51,826   1   ,000  ,781    ,731   ,836
            Years at current address       -,089   ,023  15,109   1   ,000  ,915    ,875   ,957

            Credit card debt in thousands  ,602    ,111  29,606   1   ,000  1,826   1,470  2,269
            Constant                       -,605   ,301  4,034    1   ,045  ,546


Meaning of coefficients

• The meaning of a logistic regression coefficient is not as straightforward as that of a linear regression coefficient. While B is convenient for testing the usefulness of predictors, Exp(B) is easier to interpret. Exp(B) is the factor by which the odds of the event of interest change for a one-unit change in the predictor (Xi), ceteris paribus (all other things being equal).
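As a worked example, Exp(B) and its confidence interval can be recomputed from the Step 1 row for "Debt to income ratio (x100)" (B = 0.121, S.E. = 0.017). The interval exp(B ± 1.96 · S.E.) is the usual Wald interval and is an assumption about how the reported bounds were obtained; the function name `odds_ratio_with_ci` is hypothetical.

```python
import math

def odds_ratio_with_ci(b, se, z=1.96):
    """Return Exp(B) and an approximate 95% Wald interval exp(b -/+ z * se)."""
    return math.exp(b), math.exp(b - z * se), math.exp(b + z * se)

# Step 1: Debt to income ratio (x100), rounded values from the table.
odds_ratio, ci_lower, ci_upper = odds_ratio_with_ci(0.121, 0.017)

# Percentage change in the odds for a one-unit increase in the predictor.
percent_change = (odds_ratio - 1.0) * 100.0
```

An odds ratio of about 1.129 means that each additional point of debt-to-income ratio multiplies the odds of default by roughly 1.13, an increase of about 13 percent, all other things being equal. Small differences from the printed bounds come from the rounding of B and S.E.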



Thank you for your attention!

email: [email protected]
