
Post on 21-Dec-2015


Logistic regression

Who survived Titanic?

2

The sinking of Titanic

The Titanic sank on the night of April 14th to 15th, 1912, with 2228 souls on board; 705 survived. A dataset of 1309 passengers has been preserved. Who survived?

3

The data

• sibsp is the number of siblings and/or spouses accompanying the passenger

• parch is the number of parents and/or children accompanying the passenger

• Some values are missing

• Can we predict who will survive Titanic II?

pclass  survived  name                                             sex     age     sibsp  parch
1       1         Allen, Miss. Elisabeth Walton                    female  29      0      0
1       1         Allison, Master. Hudson Trevor                   male    0.9167  1      2
1       0         Allison, Miss. Helen Loraine                     female  2       1      2
1       0         Allison, Mr. Hudson Joshua Creighton             male    30      1      2
1       0         Allison, Mrs. Hudson J C (Bessie Waldo Daniels)  female  25      1      2
1       1         Anderson, Mr. Harry                              male    48      0      0
1       1         Andrews, Miss. Kornelia Theodosia                female  63      1      0
1       0         Andrews, Mr. Thomas Jr                           male    39      0      0
1       1         Appleton, Mrs. Edward Dale (Charlotte Lamson)    female  53      2      0

Carsten D. Mørch: Will this be dealt with?

4

Analyzing the data in a (too) simple manner

• Associations between factors without considering interactions


9

Could we use multiple linear regression to predict survival?

$$E(y) = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$$

Multiple linear regression                          Logistic regression
Response variable is defined between -inf and +inf  Response variable is defined between 0 and 1
Normally distributed                                Bernoulli distributed
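
To illustrate why a straight-line model sits awkwardly with a 0/1 outcome, here is a minimal sketch that fits ordinary least squares to a small toy dataset. The numbers are made up for illustration and are not taken from the Titanic data; `fit_line` is a hypothetical helper name. For extreme inputs the fitted line returns values below 0 and above 1, which cannot be read as probabilities.

```python
# Made-up toy data: x is a generic predictor, y is a 0/1 outcome.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [0, 0, 0, 1, 0, 1, 1, 1]


def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


b0, b1 = fit_line(x, y)

# Predictions inside and outside the observed range of x.
for x_new in (-2.0, 4.5, 12.0):
    print(x_new, round(b0 + b1 * x_new, 3))
```

The middle value stays between 0 and 1, while the two extrapolated values fall outside that range.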

10

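Logit transformation is modeled linearly

$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$$

The logistic function

$$p = \frac{\exp(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n)}{1 + \exp(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n)} = \frac{1}{1 + \exp\left(-(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n)\right)}$$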
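
The logit and the logistic function are inverses of each other. The short sketch below checks this numerically. The function names `logit` and `logistic` are chosen here for illustration and do not come from the slides.

```python
import math


def logit(p):
    """Log-odds of a probability p with 0 < p < 1."""
    return math.log(p / (1.0 - p))


def logistic(z):
    """Inverse of logit: maps any real z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Round trip: probability -> log-odds -> probability.
for p in (0.1, 0.382, 0.5, 0.9):
    z = logit(p)
    print(p, round(z, 4), round(logistic(z), 4))
```

The last column reproduces the first, and a probability of 0.5 corresponds to log-odds of 0.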

11

The sigmoid curve

$$p = \frac{1}{1 + e^{-z}}, \qquad z = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$$

[Figure: sigmoid curve, p (0 to 1) plotted against x (-6 to 6), for $\beta_0 = 0$; $\beta_1 = 1$]

12

The sigmoid curve

[Figure: sigmoid curves, p (0 to 1) plotted against x (-6 to 6), for $\beta_0 = 0$; $\beta_1 = 1$, for $\beta_0 = 2$; $\beta_1 = 1$, and for $\beta_0 = -2$; $\beta_1 = 1$]

• The intercept basically just shifts the curve along the x-axis

13

The sigmoid curve

[Figure: sigmoid curves, p (0 to 1) plotted against x (-6 to 6), for $\beta_0 = 0$; $\beta_1 = 1$, for $\beta_0 = 0$; $\beta_1 = 2$, and for $\beta_0 = 0$; $\beta_1 = 0.5$]

• Large regression coefficient → risk factor strongly influences the probability

14

The sigmoid curve

[Figure: sigmoid curves, p (0 to 1) plotted against x (-6 to 6), for $\beta_0 = 0$; $\beta_1 = 1$ and for $\beta_0 = 0$; $\beta_1 = -1$]

• Positive regression coefficient → risk factor increases the probability
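
The effect of the intercept and the regression coefficient can be checked numerically. The sketch below evaluates the curve for the parameter pairs shown in the figure legends, assuming a single predictor x. The function name `sigmoid` and its argument names are chosen here for illustration.

```python
import math


def sigmoid(x, beta0=0.0, beta1=1.0):
    """p = 1 / (1 + exp(-z)) with z = beta0 + beta1 * x (single predictor)."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))


# (beta0, beta1) pairs from the figure legends on the slides.
settings = [(0, 1), (2, 1), (-2, 1), (0, 2), (0, 0.5), (0, -1)]

for beta0, beta1 in settings:
    # Probabilities at x = -2, 0, 2 for this parameter pair.
    row = [round(sigmoid(x, beta0, beta1), 3) for x in (-2, 0, 2)]
    print(f"beta0={beta0}, beta1={beta1}: {row}")
```

The curve crosses p = 0.5 where z = 0, that is at x = -beta0/beta1. Changing beta0 therefore moves the crossing point, a larger |beta1| makes the transition steeper, and a negative beta1 mirrors the curve so that p falls as x grows.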

15

Logistic regression of the Titanic data

16

Logistic regression of the Titanic data

1. Summary of data

2. Coding of the dependent variable

3. Coding of the categorical explanatory variable: First class: 1; Second class: 2; Third class: reference
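
One common way to read a categorical coding with a reference class is as indicator (dummy) variables. The sketch below assumes that this is what happens behind the SPSS output, with third class as the reference. The function name `dummy_code_pclass` is hypothetical.

```python
def dummy_code_pclass(pclass):
    """Return (is_first, is_second) indicators; third class is the reference."""
    if pclass not in (1, 2, 3):
        raise ValueError(f"unexpected pclass: {pclass!r}")
    return (1 if pclass == 1 else 0, 1 if pclass == 2 else 0)


# Third class maps to (0, 0), so its effect is absorbed by the intercept.
for pclass in (1, 2, 3):
    print(pclass, dummy_code_pclass(pclass))
```

With this coding each class coefficient is a comparison against third class.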

17

Logistic regression of the Titanic data

A fit of the null model, which is basically just the intercept. Usually not interesting.

• The overall probability of survival is 500/1309 = 0.382. The cutoff is 0.5, so all passengers are classified as non-survivors.

• Basically tests if the null-model is sufficient. It almost certainly is not.

• Shows that survival is related to pclass (which is not in the null-model)
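
The null-model numbers can be reproduced from the two counts quoted above. This is a minimal sketch: the intercept is the log-odds of the overall survival probability, and the accuracy shown is simply what follows from classifying everyone by the 0.5 cutoff. Variable names are chosen here for illustration.

```python
import math

survivors, passengers = 500, 1309   # counts quoted on the slide

p_survive = survivors / passengers
intercept = math.log(p_survive / (1 - p_survive))   # log-odds of survival

# With a 0.5 cutoff and p < 0.5, everyone is predicted "did not survive",
# so the null model is correct exactly for the non-survivors.
predicted_survive = p_survive >= 0.5
accuracy = p_survive if predicted_survive else 1 - p_survive

print(round(p_survive, 3), round(intercept, 3), round(accuracy, 3))
```

Under these assumptions the null model classifies about 62% of passengers correctly, which is the baseline that the later models are compared against.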

18

Logistic regression of the Titanic data

1. Omnibus test: uses the likelihood ratio (LR) to test whether adding the pclass variable improves the model. It did! But the comparison is with the null model, so that is no surprise.

2. Model Summary. Other measures of the goodness of fit.

3. Classification table: by including pclass, 67.7% of the passengers were correctly categorized.

4. Variables in the equation: the first line repeats that pclass has a significant effect on survival. B is the fitted logistic parameter. Exp(B) is the odds ratio, so the odds of survival are 4.7 (3.6-6.3) times the odds for passengers in third class (the reference class).
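
The link between B, Exp(B), and the interval around it can be sketched as follows. The standard error is not given on the slide, so the value below is an assumption back-calculated from the quoted interval, and the 1.96 multiplier assumes a 95% interval. The result is therefore only expected to be close to 3.6-6.3, not identical.

```python
import math


def odds_ratio_with_ci(b, se, z_crit=1.96):
    """Return (odds ratio, lower, upper) for a coefficient b with standard error se."""
    return (math.exp(b), math.exp(b - z_crit * se), math.exp(b + z_crit * se))


b = math.log(4.7)   # B implied by the Exp(B) = 4.7 quoted on the slide
se = 0.143          # ASSUMED standard error, back-calculated from the quoted interval

odds_ratio, lower, upper = odds_ratio_with_ci(b, se)
print(round(odds_ratio, 1), round(lower, 1), round(upper, 1))
```

The interval is symmetric around B on the log scale, which is why it looks lopsided around 4.7 once exponentiated.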

19

Logistic regression of the Titanic data now adding family relations

1. ‘3 or more’ is set as the reference group by SPSS

20

Logistic regression of the Titanic data now adding family relations

1. The model correctly classifies 79.1% of the passengers

21

Logistic regression of the Titanic data now adding family relations

1. Basically all factors seem to affect the probability of survival.

22

What about age?

• Linear associations are easy to model, because the factor enters the predictive value directly.

• But the association does not really look linear; maybe a third-order polynomial?

• Three new factors for age are calculated: the first, second, and third order of the age divided by the standard deviation.
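
A minimal sketch of how such polynomial age factors could be computed. The worked prediction later in the slides uses (-5)/14.41 for a 25-year-old, which suggests that age is centred at about 30 before dividing by a standard deviation of 14.41. That centring value is an inference, not something the slides state, and `age_terms` is a hypothetical helper name.

```python
AGE_CENTER = 30.0   # ASSUMED centring value, inferred from the (-5) for a 25-year-old
AGE_SD = 14.41      # standard deviation used in the worked prediction


def age_terms(age):
    """Return first-, second-, and third-order terms of standardized age."""
    a = (age - AGE_CENTER) / AGE_SD
    return a, a ** 2, a ** 3


print([round(t, 4) for t in age_terms(25)])
```

Standardizing before raising to powers keeps the three factors on comparable scales, which tends to make the fitted coefficients easier to compare.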

23

What about age?

• The third-order age factor did not add significantly to the model.

• By adding the third-order polynomial, the model correctly categorizes 79.4% of the passengers vs 79.1% before.

• ParChild is no longer a significant factor and can be omitted from the model.

24

Using the model to predict survival

• Omitting the second and third order age and ParChild factors

• What is the probability that a 25-year-old woman holding a second-class ticket and accompanied only by her husband would survive Titanic?

$$z = -3.929 - 0.589 \cdot \frac{-5}{14.41} + 1.718 + 2.552 + 0.926 = 1.4714$$

$$p = \frac{1}{1 + e^{-z}} = \frac{1}{1 + e^{-1.4714}} = 0.8133$$
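
The prediction can be reproduced with a few lines of arithmetic. The sketch uses the numbers exactly as printed in the slide. The slide does not say which of the three remaining terms belongs to second class, female, or one sibling/spouse, so they are kept as an unlabeled list, and the variable names are chosen here for illustration.

```python
import math

intercept = -3.929
age_coef = -0.589                 # multiplies the standardized age term
age_std = (25 - 30) / 14.41       # the (-5)/14.41 term from the slide

# Remaining terms as printed on the slide; which one belongs to which
# factor (second class, female, one sibling/spouse) is not stated there.
other_terms = [1.718, 2.552, 0.926]

z = intercept + age_coef * age_std + sum(other_terms)
p = 1.0 / (1.0 + math.exp(-z))

print(round(z, 4), round(p, 4))
```

The result matches the slide: z is about 1.4714, giving a survival probability of about 0.81.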

25

Analysing interaction of selected factors

pclass * sex, age * sex, pclass * Siblings/Parents

But the model does not converge…
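
An interaction such as pclass * sex typically enters a regression model as the product of the coded predictors. The sketch below assumes indicator coding with third class and male as reference levels and a standardized age. All names are hypothetical, and the way SPSS builds these terms internally may differ.

```python
def interaction_terms(is_first, is_second, is_female, age_std):
    """Build product terms for pclass * sex and age * sex from coded predictors."""
    return {
        "first_x_female": is_first * is_female,
        "second_x_female": is_second * is_female,
        "age_x_female": age_std * is_female,
    }


# A first-class woman slightly younger than average (illustrative values).
print(interaction_terms(1, 0, 1, -0.35))
# A third-class man: every interaction term is zero.
print(interaction_terms(0, 0, 0, 0.5))
```

Each product adds one more parameter to estimate. With many categories, some combinations contain few or no passengers, which is one possible reason for a fit failing to converge.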

26

Analysing interaction of selected factors

Collapsing the sibling/spouse number eradicated their mutual interaction

27

Is it realistic that Leonardo survives and the chick dies?
