# logistic regression for survey data - naval postgraduate...

Post on 19-Apr-2018


Logistic Regression for Survey Data

Professor Ron Fricker, Naval Postgraduate School

Monterey, California

Goals for this Lecture

- Introduction to logistic regression
  - Discuss when and why it is useful
  - Interpret output
    - Odds and odds ratios
  - Illustrate use with examples
- Show how to run in JMP
- Discuss other software for fitting linear and logistic regression models to complex survey data

Logistic Regression

- Logistic regression
  - Response (Y) is binary, representing event or not
  - Model, where p_i = Pr(Y_i = 1):

$$\ln\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki}$$

- In surveys, useful for modeling:
  - Probability respondent says "yes" (or "no")
    - Can also dichotomize other questions
  - Probability respondent is in a (binary) class

Why Logistic Regression?

Some reasons:

- Resulting "S" curve fits many observed phenomena
- Model follows the same general principles as linear regression
- Can estimate probability p of a binary outcome
  - Estimates of p are bounded between 0 and 1

$$p = \frac{\exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k)}{1 + \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k)}$$
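As a minimal sketch of why the estimates stay between 0 and 1, the logit and its inverse can be written as two small Python functions. The function names and the linear predictor values below are illustrative assumptions, not part of the lecture.

```python
import math

def logit(p):
    """Log odds ln(p / (1 - p)) for a probability 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Probability exp(eta) / (1 + exp(eta)) for any real linear predictor eta."""
    # Algebraically equivalent forms are used so that exp() never overflows.
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)

# Hypothetical linear predictor values, chosen only for illustration.
for eta in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(eta, inv_logit(eta))
```

However large or small the linear predictor gets, the result stays strictly inside (0, 1), which a straight-line model for a binary response cannot guarantee.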

Linear Regression with Binary Ys

Example: modeling presence or absence of coronary heart disease (CHD) as a function of age

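Data looks like this: 100 observations, minimum age = 20, maximum age = 69, 43 with CHD.

| ID | Age | CHD |
|----|-----|-----|
| 1 | 20 | 0 |
| 2 | 23 | 0 |
| 3 | 24 | 0 |
| 4 | 25 | 0 |
| 5 | 25 | 1 |
| 6 | 26 | 0 |
| 7 | 26 | 0 |
| 8 | 28 | 0 |
| ... | ... | ... |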

Modeling CHD Existence

- Imagine each subject flips a coin: heads = CHD, tails = no CHD
  - Each coin has a different probability of heads, related to the subject's age
- We only observe the existence of CHD: y = 1, has CHD; y = 0, does not
- We want to model the chance of getting CHD as a function of age

Proportion with CHD by Age

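| Age Group | n | CHD Absent | CHD Present | Proportion with CHD |
|-----------|-----|--------|---------|------------|
| 20-29 | 10 | 9 | 1 | 0.10 |
| 30-34 | 15 | 13 | 2 | 0.13 |
| 35-39 | 12 | 9 | 3 | 0.25 |
| 40-44 | 15 | 10 | 5 | 0.33 |
| 45-49 | 13 | 7 | 6 | 0.46 |
| 50-54 | 8 | 3 | 5 | 0.63 |
| 55-59 | 17 | 4 | 13 | 0.76 |
| 60-69 | 10 | 2 | 8 | 0.80 |
| Total | 100 | 57 | 43 | 0.43 |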
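The proportions in this table are simply the count with CHD divided by the group size. A short Python sketch, using the counts copied from the table (the variable and function names are illustrative assumptions):

```python
# (age group, CHD absent, CHD present) counts copied from the table.
chd_counts = [
    ("20-29", 9, 1), ("30-34", 13, 2), ("35-39", 9, 3), ("40-44", 10, 5),
    ("45-49", 7, 6), ("50-54", 3, 5), ("55-59", 4, 13), ("60-69", 2, 8),
]

def proportions(rows):
    """Return (label, n, proportion with CHD) for each age group."""
    return [(label, absent + present, present / (absent + present))
            for label, absent, present in rows]

table = proportions(chd_counts)
for label, n, prop in table:
    print(f"{label:>6}  n={n:3d}  proportion={prop:.3f}")

total_n = sum(n for _, n, _ in table)
total_present = sum(present for _, _, present in chd_counts)
print("overall proportion:", total_present / total_n)
```

The proportion rises steadily with age group, which is the pattern the logistic curve is meant to capture.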

Plotting the Proportions

[Figure: proportion with CHD (vertical axis, 0 to 1) versus mean group age (horizontal axis, 20 to 70).]

Interpreting Model Results

[Figure: fitted p(CHD) (vertical axis, 0 to 1) versus Age (horizontal axis, 10 to 90).]

If age is 50 years, then the probability of CHD is about 0.56.

Logistic Regression: The Picture

[Figure: probability of CHD (vertical axis, -0.2 to 1.2) versus Age (horizontal axis, 0 to 100), showing the observed data and the fitted curve p(age).]

Where Logistic Regression Fits

| Dependent or response variable | Independent or predictor variable: continuous | Independent or predictor variable: categorical |
|---|---|---|
| Continuous | Linear regression | Linear reg. w/ dummy variables |
| Categorical | Logistic regression | Logistic reg. w/ dummy variables |

Logistic Regression in JMP

Fit much like multiple regression: Analyze > Fit Model

- Fill in Y with the nominal binary dependent variable
- Put Xs in the model by highlighting and then clicking "Add"
  - Use "Remove" to take out Xs
- Click "Run Model" when done
- Takes care of missing values and non-numeric data automatically

Estimating the Parameters

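- JMP estimates the βs via maximum likelihood
- Given the estimated βs, probabilities are estimated as

$$\hat{p} = \frac{\exp(\hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_k x_k)}{1 + \exp(\hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_k x_k)}$$

- Calculating probabilities in JMP is easy
  - After Fit Model, red triangle > Save Probability Formula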
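To make "maximum likelihood" concrete, here is a minimal Newton-Raphson sketch in Python for a one-predictor logistic model. It is not how JMP is implemented. It uses the grouped CHD counts from the age-group table, with each group represented by an assumed midpoint age (an approximation, since the lecture's model was fit to the 100 individual ages), so the estimates will differ somewhat from the lecture's.

```python
import math

# (assumed group midpoint age, group size n, number with CHD)
# Counts come from the age-group table; the midpoints are an assumption.
groups = [
    (24.5, 10, 1), (32.0, 15, 2), (37.0, 12, 3), (42.0, 15, 5),
    (47.0, 13, 6), (52.0, 8, 5), (57.0, 17, 13), (64.5, 10, 8),
]

def fit_logistic(data, iterations=25):
    """Maximum likelihood for ln(p/(1-p)) = b0 + b1*x via Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, n, y in data:
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            resid = y - n * p        # observed minus expected count
            w = n * p * (1.0 - p)    # binomial variance of the count
            g0 += resid              # score for the intercept
            g1 += resid * x          # score for the slope
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve (information matrix) * step = score for the 2x2 case.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

b0_hat, b1_hat = fit_logistic(groups)
print("intercept:", b0_hat, "slope:", b1_hat)
```

At the maximum, the fitted model reproduces the observed total: the expected number of CHD cases summed over groups equals the 43 observed.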

Probability, Odds, and Log Odds

- Probability (p)
  - Number between 0 and 1
  - Example: Pr(Red Sox win next World Series) = 5/8 = 0.625
- Odds: p/(1-p)
  - Any number > 0
  - Example: Odds Red Sox win World Series are 5/3 = 1.667
- Log odds: ln(p/(1-p))
  - Any number from -∞ to +∞
  - Log odds is sometimes called the "logit"
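The three scales are one-to-one transformations of each other. A small Python sketch of the conversions, using the slide's 5/8 example (the function names are illustrative assumptions):

```python
import math

def odds(p):
    """Odds p / (1 - p) for a probability 0 <= p < 1."""
    return p / (1.0 - p)

def log_odds(p):
    """Log odds (logit) ln(p / (1 - p)) for 0 < p < 1."""
    return math.log(odds(p))

def prob_from_odds(o):
    """Invert the odds: p = odds / (1 + odds)."""
    return o / (1.0 + o)

p = 5 / 8
print("probability:", p)
print("odds:", odds(p))          # 5/3, about 1.667
print("log odds:", log_odds(p))
```

A probability of 0.5 corresponds to odds of 1 and log odds of 0, so positive log odds mean the event is more likely than not.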

Interpreting the βs

[Figure: model output with callouts labeled "slope", "p-value", and "Log odds of having CHD".]

- Slope is positive and significant
  - Increasing age means higher probability of coronary heart disease
  - Increase Age by 1 year and the log odds of CHD increases by 0.11
- No t-test; chi-square (χ²) test instead
  - p-value still means the same thing
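Because the slope acts on the log odds, exponentiating it gives the factor by which the odds change. A minimal Python sketch, using the slope of 0.111 reported in the lecture's final CHD model (rounded to 0.11 on this slide); the function name is an illustrative assumption:

```python
import math

SLOPE_AGE = 0.111  # change in log odds of CHD per additional year of age

def odds_multiplier(years, slope=SLOPE_AGE):
    """Factor by which the odds of CHD are multiplied after `years` more years."""
    return math.exp(slope * years)

print("1 year: ", odds_multiplier(1))    # roughly a 12% increase in the odds
print("10 years:", odds_multiplier(10))
```

The effect is additive on the log-odds scale and multiplicative on the odds scale, so ten years multiplies the odds by the one-year factor raised to the tenth power.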

Final Model and Results

Age can be any (positive) number and the answer still makes sense.

[Figure: fitted p(CHD) (vertical axis, 0 to 1) versus Age (horizontal axis, 10 to 90).]

$$\hat{p}(\text{CHD}) = \frac{\exp(-5.31 + 0.111 \times \text{age})}{1 + \exp(-5.31 + 0.111 \times \text{age})}$$
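The fitted model can be evaluated directly. The Python sketch below assumes the intercept is -5.31 (the minus sign does not survive in the transcript; it is inferred because it reproduces the lecture's statement that the probability of CHD at age 50 is about 0.56) and the slope is 0.111. The function name is an illustrative assumption.

```python
import math

INTERCEPT = -5.31  # assumed sign, inferred from p(CHD) of about 0.56 at age 50
SLOPE = 0.111      # increase in log odds of CHD per year of age

def p_chd(age):
    """Estimated probability of CHD at a given age under the fitted model."""
    eta = INTERCEPT + SLOPE * age
    return 1.0 / (1.0 + math.exp(-eta))

for age in (20, 30, 40, 50, 60, 70):
    print(age, round(p_chd(age), 2))
```

Any positive age gives a value strictly between 0 and 1, which is the point of the slide's first line.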

Odds Ratios: An Example

- An odds ratio is, literally, the ratio of two odds
- Example from some recent (non-survey) work:
  - Odds IAer retained = 2.01
  - Odds non-IAer retained = 1.55
  - Odds ratio = 2.01/1.55 = 1.30
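The arithmetic in this example can be checked in a few lines of Python, which also converts each odds back to a retention probability. The variable names are illustrative assumptions; the two odds are the values quoted in the example.

```python
odds_ia = 2.01       # odds an IAer is retained
odds_non_ia = 1.55   # odds a non-IAer is retained

odds_ratio = odds_ia / odds_non_ia

# Convert each odds back to a probability: p = odds / (1 + odds).
p_ia = odds_ia / (1.0 + odds_ia)
p_non_ia = odds_non_ia / (1.0 + odds_non_ia)

print("odds ratio:", round(odds_ratio, 2))
print("Pr(retained | IAer):", round(p_ia, 3))
print("Pr(retained | non-IAer):", round(p_non_ia, 3))
```

An odds ratio of 1.30 does not mean the retention probability is 30% higher: the implied probabilities are about 0.67 and 0.61.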

Interpreting the Slope of an Indicator Variable

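- Let x1 be an indicator variable
  - Say, x1 = 1 means male and x1 = 0 means female
- Consider two logistic regression models, one for males and one for females:

$$\ln\left(\frac{p_i}{1 - p_i} \,\Big|\, \text{male}\right) = \beta_0 + \beta_1 + \beta_2 X_{2i} + \cdots + \beta_k X_{ki}$$

$$\ln\left(\frac{p_i}{1 - p_i} \,\Big|\, \text{female}\right) = \beta_0 + \beta_2 X_{2i} + \cdots + \beta_k X_{ki}$$

- Exponentiate both sides of each model and take the ratio of the resulting odds (males in the numerator, females in the denominator):

$$\frac{\exp(\beta_0)\exp(\beta_1)\exp(\beta_2 X_{2i}) \cdots \exp(\beta_k X_{ki})}{\exp(\beta_0)\exp(\beta_2 X_{2i}) \cdots \exp(\beta_k X_{ki})} = \exp(\beta_1) = \text{O.R.}$$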
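A quick numerical check of this result in Python: for a model with an indicator x1 and one other predictor x2, the ratio of male odds to female odds equals exp(β1) regardless of x2. The coefficient values are hypothetical, chosen only for illustration.

```python
import math

# Hypothetical coefficients, not taken from the lecture.
b0, b1, b2 = -1.0, 0.5, 0.3

def odds(male, x2):
    """Odds p/(1-p) implied by ln(p/(1-p)) = b0 + b1*male + b2*x2."""
    return math.exp(b0 + b1 * male + b2 * x2)

for x2 in (-2.0, 0.0, 3.5):
    ratio = odds(1, x2) / odds(0, x2)
    print(f"x2={x2:5.1f}  odds ratio={ratio:.6f}  exp(b1)={math.exp(b1):.6f}")
```

The other terms cancel in the ratio, so the exponentiated indicator coefficient is the odds ratio for males versus females with all other predictors held fixed.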

Example: Using Logistic Regression in NPS New Student Survey

Dichotomize Q1 into satisfied (4 or 5) and not satisfied (1, 2, or 3)

Model satisfied on Gender and Type Student


Compare the Output to Raw Data


Regression in Complex Surveys

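Parameters are fit to minimize the sum of squared errors over the population:

$$SSE = \sum_{i=1}^{N} \left[ y_i - (B_0 + B_1 x_i) \right]^2$$

Resulting estimators:

$$\hat{B}_1 = \frac{\sum_{i \in S} w_i x_i y_i - \dfrac{\sum_{i \in S} w_i y_i \sum_{i \in S} w_i x_i}{\sum_{i \in S} w_i}}{\sum_{i \in S} w_i x_i^2 - \dfrac{\left(\sum_{i \in S} w_i x_i\right)^2}{\sum_{i \in S} w_i}}$$

and

$$\hat{B}_0 = \frac{\sum_{i \in S} w_i y_i - \hat{B}_1 \sum_{i \in S} w_i x_i}{\sum_{i \in S} w_i}$$

Still need to estimate standard errors.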
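The weighted point estimators on this slide can be sketched in plain Python as sums over the sample S with sampling weights w_i. The function name and the sample values are hypothetical. The sketch gives point estimates only; it does not estimate standard errors, which depend on the survey design.

```python
def weighted_line(x, y, w):
    """Survey-weighted estimates (B0_hat, B1_hat) of the population regression line.

    B1_hat = [sum(w*x*y) - sum(w*y)*sum(w*x)/sum(w)] / [sum(w*x^2) - sum(w*x)^2/sum(w)]
    B0_hat = [sum(w*y) - B1_hat*sum(w*x)] / sum(w)
    """
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, x))
    swy = sum(wi * yi for wi, yi in zip(w, y))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    b1 = (swxy - swy * swx / sw) / (swxx - swx ** 2 / sw)
    b0 = (swy - b1 * swx) / sw
    return b0, b1

# Hypothetical sample: y lies exactly on the line 2 + 3x, with unequal weights.
x = [1.0, 2.0, 3.0, 4.0]
y = [5.0, 8.0, 11.0, 14.0]
w = [10.0, 25.0, 40.0, 5.0]

b0_hat, b1_hat = weighted_line(x, y, w)
print("B0_hat:", b0_hat, "B1_hat:", b1_hat)
```

With equal weights these formulas reduce to ordinary least squares; with unequal weights, each sampled unit counts in proportion to the number of population units it represents.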

Using SAS for Regression

- SAS procedures for regression assuming SRS:
  - PROC REG
  - PROC LOGISTIC
- In SAS v9.1, for complex surveys:
  - PROC SURVEYREG
  - PROC SURVEYLOGISTIC
- See http://support.sas.com/onlinedoc/913/docMainpage.jsp

Using Stata for Regression

- Stata 9: SVY procedures for regression include
  - svy:regress
  - svy:logistic
  - svy:logit
- See www.stata.com/stata9/svy.html for more detail

Using R / S+ for Regression

- "survey" package by Thomas Lumley
  - Must install as a library for S+ or R
  - Copy up on Blackboard
- Has svyglm for generalized linear models
  - If like the usual glm in S+, can do linear and logistic modeling
  - But I need to look more closely at it
- See http://faculty.washington.edu/tlumley/survey/

What We Have Just Learned

- Introduced logistic regression
  - Discussed when and why it is useful
  - Interpreted output
    - Odds and odds ratios
  - Illustrated use with examples
- Showed how to run in JMP
- Discussed other software for fitting linear and logistic regression models to complex survey data
