1 chapter 1: stratified data analysis 1.1 introduction 1.2 examining associations among variables...


Chapter 1: Stratified Data Analysis

1.1 Introduction

1.2 Examining Associations among Variables

1.3 Recursive Partitioning

1.4 Introduction to Logistic Regression

1.1 Introduction

Objectives
- Recognize the differences between categorical and continuous data analysis.
- Identify the scale of measurement for your response variable.

Categorical versus Continuous Data Analysis

Identifying the Scale of Measurement

Before analyzing, select the measurement scale for each variable.

[Figure: a variable with the levels AGREE, NO OPINION, and DISAGREE.]

Nominal Variables
Variable: Type of Beverage (levels coded 1, 2, 3)

Ordinal Variables
Variable: Size of Beverage (Small, Medium, Large)

Continuous Variables
Variable: Volume of Beverage (scale from 0 to 4.0)

1.01 Quiz
A car dealer records several inventory variables, including Type (automatic or standard), Time (the number of seconds it takes for the car to go from 0 to 60 mph), and Model (basic, middle, or luxury).

Match the modeling type on the left with the appropriate component on the right.

1. Continuous    A. Type
2. Ordinal       B. Time
3. Nominal       C. Model

1.01 Quiz – Correct Answer
1-B, 2-C, 3-A

What’s Next?

[Figure: the opinion variable, with the levels Agree, No Opinion, and Disagree, is ordinal.]

1.2 Examining Associations among Variables

Objectives
- Examine the distribution of categorical variables.
- Determine whether an association exists among categorical variables.
- Perform a stratified analysis of categorical variables.

Sample Data Set


This demonstration illustrates the concepts discussed previously.

Examining Distributions

Association
- An association exists between two variables if the distribution of one variable changes when the level (or value) of the other variable changes.
- If there is no association, the distribution of the first variable is the same, regardless of the level of the other variable.

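No Association
Is your manager’s mood associated with the weather?
[Figure: the mood distribution is 72% / 28% in both weather groups.]

Association
Is your manager’s mood associated with the weather?
[Figure: the mood distribution is 82% / 18% in one weather group and 60% / 40% in the other.]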


Recognizing Associations

1.02 Quiz
Is there an association between finishing a prescription (Rx) and experiencing a relapse?

1.02 Quiz – Correct Answer
Yes. The distribution of Yes/No for Did not finish Rx is different from the distribution of Yes/No for Finished Rx.

Tests for Association

Row percents of Income by Purchase

Income    Purchase $100 +    Purchase Under $100
Low       32%                68%
Medium    32%                68%
High      48%                52%

Null Hypothesis
- There is no association between Income and Purchase.
- The probability of purchasing items of $100 or more is the same, regardless of income level.

Alternative Hypothesis
- There is an association between Income and Purchase.
- The probability of purchasing items of $100 or more is different among Low, Medium, and High income customers.

Chi-Square Test

NO ASSOCIATION: observed frequencies = expected frequencies

ASSOCIATION: observed frequencies ≠ expected frequencies

p-Value for Chi-Square Test
This p-value is
- the probability of observing a chi-square statistic at least as large as the one actually observed, given that there is no association between the variables
- the probability of the association you observe in the data occurring by chance.

Chi-Square Tests
Chi-square tests and the corresponding p-values
- determine whether an association exists
- do not measure the strength of an association
- depend on and reflect the sample size.


Chi-Square Test

1.03 Quiz
Is there sufficient evidence that an association exists between Relapsed and Rx Status?

1.03 Quiz – Correct Answer
Yes, there is sufficient evidence that an association exists between Relapsed and Rx Status. The p-value for the Pearson chi-square statistic is .0005, so at alpha=.05, there is sufficient evidence to reject the null (that no association exists) in favor of the alternative (that an association exists).

When Not to Use the Chi-Square Test

When more than 20% of the cells have expected counts less than five.

Observed versus Expected Values

Observed Values
1     5     8
5     6     7
6     5     6

Expected Values
3.43  4.57  6.00
4.41  5.88  7.71
4.16  5.55  7.29
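The expected values on this slide can be reproduced from the observed table as (row total × column total) / grand total. The following is a minimal sketch in plain Python, not JMP output; the function names are illustrative only. It also counts how many cells fall under the "expected count less than five" rule.

```python
# Observed counts from the slide (3 x 3 table).
observed = [
    [1, 5, 8],
    [5, 6, 7],
    [6, 5, 6],
]

def expected_counts(table):
    """Expected count per cell under no association:
    row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def pearson_chi_square(table):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over all cells."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )

expected = expected_counts(observed)
n_cells = sum(len(row) for row in expected)
small_cells = sum(e < 5 for row in expected for e in row)
share_small = small_cells / n_cells  # compare against the 20% guideline

for row in expected:
    print([round(e, 2) for e in row])
print("cells with expected count < 5:", small_cells, "of", n_cells)
print("chi-square statistic:", pearson_chi_square(observed))
```

Here 4 of the 9 cells have expected counts below five, which is more than 20% of the cells, so this table is a case where the chi-square test should not be relied on.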

Small Samples – Fisher’s Exact Test

[Figure: sample size scale from Small to Large, with Fisher’s Exact Test indicated.]

Example: Tea and Milk
Suppose you want to test whether someone can determine if a cup of tea with milk had the milk poured first or the tea poured first.

Fisher’s Exact Test Example
9 Cups of Tea: 4 with Milk First and 5 with Tea First

Predict which cups had tea poured first.

[Figure: 2x2 table of Actual (M, T) by Guess (M, T), with fixed marginal totals of 4 (M) and 5 (T) for both rows and columns.]

Basis for Fisher’s Exact Test

[Figure: with the row and column totals fixed, the other possible 2x2 tables of Actual by Guess are enumerated; the count in the M/M cell ranges from 0 to 4.]

Fisher’s Exact Test Hypotheses
Null Hypothesis: There is no association.

Alternative Hypothesis: There is an association.
- Two-tailed
- Left-tailed
- Right-tailed
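The three p-values can be sketched directly from the tea example. With both margins fixed at 4 and 5, the count in the M/M cell follows a hypergeometric distribution. The sketch below is plain Python, not JMP output. It assumes an observed M/M count of 1, which is how the tail figures in this deck appear to read; treat that value as an assumption. The two-tailed rule used here (sum the tables no more probable than the observed one) is one common convention.

```python
from math import comb

def table_probabilities(row1, row2, col1):
    """Hypergeometric probability of each possible count k in the
    top-left cell when row totals (row1, row2) and the first column
    total (col1) are fixed."""
    n = row1 + row2
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    return {
        k: comb(row1, k) * comb(row2, col1 - k) / comb(n, col1)
        for k in range(lo, hi + 1)
    }

def fisher_p_values(k_obs, row1, row2, col1):
    """Left-, right-, and two-tailed Fisher exact p-values."""
    probs = table_probabilities(row1, row2, col1)
    p_obs = probs[k_obs]
    left = sum(p for k, p in probs.items() if k <= k_obs)
    right = sum(p for k, p in probs.items() if k >= k_obs)
    # Two-tailed: tables at most as probable as the observed one
    # (small tolerance guards against floating-point ties).
    two = sum(p for p in probs.values() if p <= p_obs * (1 + 1e-9))
    return left, right, two

# 4 cups actually milk-first, 5 tea-first, 4 guessed milk-first.
# k_obs = 1 is an assumed observed count, for illustration only.
left, right, two = fisher_p_values(k_obs=1, row1=4, row2=5, col1=4)
print(f"left={left:.4f} right={right:.4f} two-tailed={two:.4f}")
```

Because the tables are enumerated exactly, no large-sample approximation is involved, which is why the test suits small samples.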

Left-Tailed Alternative Hypothesis

[Figure: Actual by Guess tables; the left-tailed p-value sums the probabilities of the tables with 0 and 1 in the M/M cell.]

Right-Tailed Alternative Hypothesis

[Figure: Actual by Guess tables; the right-tailed p-value sums the probabilities of the tables with 1, 2, 3, and 4 in the M/M cell.]

Two-Tailed Alternative Hypothesis

[Figure: the two-tailed p-value, shown against all five possible Actual by Guess tables (0 through 4 in the M/M cell).]


Fisher’s Exact Test

1.04 Quiz
What can you conclude from each of the p-values from the Fisher’s Exact Test for the association between Relapsed and Rx Status?

1.04 Quiz – Correct Answer
The Left p-value = .0007, so there is sufficient evidence to conclude that the probability of a relapse is greater for those who did not finish the Rx than for those who did.

The Right p-value = .9999, so there is not sufficient evidence to conclude that the probability of a relapse is greater for those who finished the Rx than for those who did not.

The 2-Tail p-value = .0008, so there is sufficient evidence to conclude that the probability of a relapse is different depending on whether a Rx was finished or not.

What Happens If There Is a Third Variable?

[Figure: diagram relating Income, Gender, and purchases of $100.]

Stratified Data Analysis
- Stratified data analysis is the process of dividing subjects into groups defined by the levels of a third variable.
- Use this analysis when you want to examine the association between two variables within the levels of a third variable.

Stratified Data Analysis

Of the 39 single people, 23% have lung cancer and 77% do not. Of the 36 married people, 17% have lung cancer and 83% do not.

Of the 28 single smokers, 28% have lung cancer and 72% do not. Of the 14 married smokers, 28% have lung cancer and 72% do not.

Cochran-Mantel-Haenszel Statistics

CMH versus Chi-Square

1. Correlation of Scores: test linear association
2. Row Scores by Column Categories: test equal row scores
3. Column Scores by Row Categories: test equal column scores
4. General Association of Categories: test general association

CMH Statistics and 2x2 Tables
For 2x2 tables, the CMH statistics are all equal.

When Do CMH Statistics Lack Power?

Response Reversed in Strata

CMH Tests

Exercise

This exercise reinforces the concepts discussed previously.

1.05 Multiple Choice Poll
The Correlation of Scores CMH test has which null hypothesis?

a. There is no linear association between the row and column variables in any stratum.
b. The mean scores for each column are equal in each stratum.
c. The mean scores for each row are equal in each stratum.
d. There is no association between the row and column variables in any stratum.

1.05 Multiple Choice Poll – Correct Answer
a. There is no linear association between the row and column variables in any stratum.


1.3 Recursive Partitioning

Objectives
- Define partitioning.
- Understand the splitting criteria used in JMP.
- Review algorithm parameters available in JMP.
- Use the Partition platform in JMP.

Recursive Partitioning
Partitioning refers to segmenting the data into groups that are as homogeneous as possible with respect to the dependent variable (Y).

Divide and Conquer

What factors affect the country from which cars are purchased?

[Figure: tree with root node Country (n = 303), split on size into size (Large), n = 42, and size (Medium, Small), n = 261.]

Tree Algorithm

1. Calculate the separation of the response for an independent variable (X1).
2. Find the best split for that independent variable.
3. Repeat for the other independent variables (X2, and so on).
4. Compare the best splits.
5. Partition with the best split.
6. Repeat within partitions.
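One pass of the tree algorithm (find the best split for each predictor, then compare) can be sketched as follows. This is a simplified illustration in plain Python, not JMP's implementation: it ranks candidate splits by the likelihood-ratio chi-square (G^2) rather than by LogWorth, and the data, variable names, and functions are hypothetical.

```python
import math
from collections import Counter

def g2(left, right):
    """Likelihood-ratio chi-square (G^2) for a two-group split of
    class labels: 2 * sum(observed * ln(observed / expected))."""
    totals = Counter(left) + Counter(right)
    n = len(left) + len(right)
    stat = 0.0
    for group in (left, right):
        for label, observed in Counter(group).items():
            expected = len(group) * totals[label] / n
            stat += 2.0 * observed * math.log(observed / expected)
    return stat

def best_split(x, y):
    """Try every cut point between distinct sorted x values and
    return (best G^2, cut point) for this predictor."""
    pairs = sorted(zip(x, y))
    best = (0.0, None)
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no cut point between equal values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        stat = g2(left, right)
        if stat > best[0]:
            best = (stat, cut)
    return best

# Hypothetical data: a two-level response and two predictors.
y = ["A", "A", "A", "B", "B", "B"]
x1 = [1, 2, 3, 4, 5, 6]   # separates the response cleanly
x2 = [2, 5, 1, 6, 3, 4]   # separates it only partly

candidates = {"X1": best_split(x1, y), "X2": best_split(x2, y)}
winner = max(candidates, key=lambda name: candidates[name][0])
print(candidates, "-> split on", winner)
```

Applying the same search again inside each resulting partition gives the recursive part of recursive partitioning.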


Recursive Partitioning

Exercise

1.06 Quiz
In which leaf, and on what variable, will JMP next split?

1.06 Quiz – Correct Answer
Of the leaves, the highest LogWorth is for Age (.7313), in the Gender(Female) leaf. This is where JMP will next split.


1.4 Introduction to Logistic Regression

Objectives
- Explain the concepts of logistic regression.
- Fit a logistic regression model using JMP software.
- Examine logistic regression output.

Overview

Predictor                    Response       Analysis
Categorical or Continuous    Continuous     Linear Regression Analysis
Categorical or Continuous    Categorical    Logistic Regression Analysis

Types of Logistic Regression

Response Variable                        Type of Logistic Regression
Two Categories (Yes/No)                  Binary
Three or More Categories, Nominal        Nominal
Three or More Categories, Ordinal        Ordinal

What Does Logistic Regression Do?
- The logistic regression model uses the predictor variables, which can be categorical or continuous, to predict the probability of specific outcomes.
- In other words, logistic regression is designed to describe probabilities associated with the values of the response variable.

The Logistic Curve
- The relationship between the probability of a response variable and a predictor variable might be an S-shaped curve.
- Linear regression cannot model this relationship, but logistic regression can.

Logistic Regression Curves
This graph shows the relationship between the probability of Sale and Price.

Logit Transformation

logit(pi) = log( pi / (1 - pi) )

where

i indexes all cases (observations).

pi is the probability that the event (a sale, for example) occurs in the ith case.

1 - pi is the probability that the event (a sale, for example) does not occur in the ith case.

log is the natural log (to the base e).
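The logit transformation and its inverse can be sketched in a few lines of plain Python. The function names are illustrative and are not part of JMP.

```python
import math

def logit(p):
    """Log odds: natural log of p / (1 - p), for 0 < p < 1."""
    return math.log(p / (1 - p))

def inverse_logit(x):
    """Map a logit value back to a probability between 0 and 1."""
    return 1 / (1 + math.exp(-x))

for p in (0.10, 0.25, 0.50, 0.75, 0.90):
    print(p, round(logit(p), 4), round(inverse_logit(logit(p)), 4))
```

Probabilities below .5 give negative logits, .5 gives a logit of 0, and probabilities above .5 give positive logits, so the transformed scale is unbounded in both directions.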

Assumption

[Figure: pi versus the predictor, and the logit transform of pi versus the predictor.]

Logistic Regression Model

logit(pi) = B0 + B1X1

where

logit(pi) is the logit transformation of the probability of the event

B0 is the intercept of the regression line

B1 is the slope of the regression line.

Likelihood Function
- A likelihood function expresses the probability of the observed data as a function of the unknown parameters.
- The goal is to derive values of the parameters such that the probability of the observed data is as large as possible.
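To make the likelihood idea concrete, the sketch below evaluates the log-likelihood of a one-predictor logistic model, logit(pi) = B0 + B1X1, at two candidate parameter settings. The data are hypothetical and the function is illustrative; JMP's own fitting routine is not shown here.

```python
import math

def log_likelihood(b0, b1, xs, ys):
    """Log-likelihood of binary outcomes ys (0 or 1) under
    logit(p_i) = b0 + b1 * x_i."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = 1 / (1 + math.exp(-(b0 + b1 * x)))
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

# Hypothetical data: predictor value and whether the event occurred.
xs = [0, 0, 1, 1]
ys = [0, 1, 1, 1]

ll_flat = log_likelihood(0.0, 0.0, xs, ys)    # every p_i equals 0.5
ll_slope = log_likelihood(0.0, 2.0, xs, ys)   # higher p_i when x = 1
print(ll_flat, ll_slope)
```

The setting with the positive slope gives the larger log-likelihood for these data. Maximum likelihood estimation searches for the parameter values where this quantity peaks.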

Maximum Likelihood Estimate

[Figure: the log-likelihood function.]

Model Inference

[Figure: the log-likelihood function, with LogL0 and LogL1 marked.]

Logistic Curve

[Figure: logistic curves showing a weak relationship, a strong relationship, and a very strong relationship.]

Example of Binary Logistic Regression Model
You want to predict the probability of defaulting on credit card payments based on having or not having a history of late payments. You can postulate this model:

logit(Probability of Defaulting) = B0 + B1*(Late Payment)


Binary Logistic Regression

1.07 Quiz
You want to predict the probability of a defect, given the width of a product. What kind of association exists between Defect and Width – a strong relationship or a weak relationship?

1.07 Quiz – Correct Answer
Weak – The fitted regression line is nearly flat, indicating a weak association between Defect and Width.

Multiple Logistic Regression

Interaction


Multiple Logistic Regression

What Is an Odds Ratio?
An odds ratio indicates how much more likely, with respect to odds, a certain event occurs in one group relative to its occurrence in another group.

Example: How much more likely are females to purchase 100 dollars or more in items compared to males?

Example: How much more likely is a person with a history of late payments on credit cards to default on a loan relative to a person who does not have a history of late payments?

Probability of Outcome

                               Default on Loan
                               Yes    No     Total
Late Payments: Yes (Group A)   20     60     80
Late Payments: No (Group B)    10     90     100
Total                          30     150    180

Probability of defaulting in Group A = 20/80 (.25)

Probability of not defaulting in Group A = 60/80 (.75)

Odds

Odds of Outcome in Group A

(probability of defaulting in group with history of late payments) ÷ (probability of not defaulting in group with history of late payments)

0.25 ÷ 0.75 = 0.33

Odds Ratio

Odds Ratio of Group A to Group B

(odds of defaulting in group with history of late payments) ÷ (odds of defaulting in group with no history of late payments)

0.33 ÷ 0.11 = 3
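The odds and odds ratio arithmetic can be checked against the Default on Loan table (Group A: 20 defaults out of 80; Group B: 10 defaults out of 100). This is a minimal sketch in plain Python with illustrative names.

```python
def odds(p):
    """Odds of an event with probability p: p / (1 - p)."""
    return p / (1 - p)

# Counts from the Default on Loan table.
p_default_a = 20 / 80    # Group A: history of late payments
p_default_b = 10 / 100   # Group B: no history of late payments

odds_a = odds(p_default_a)     # 0.25 / 0.75
odds_b = odds(p_default_b)     # 0.10 / 0.90
odds_ratio = odds_a / odds_b   # Group A relative to Group B

print(round(odds_a, 2), round(odds_b, 2), round(odds_ratio, 2))
```

An odds ratio of 3 means the odds of defaulting are three times as high in Group A as in Group B. It does not mean the probability of defaulting is three times as high (.25 versus .10).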

Properties of the Odds Ratio

Odds Ratio from a Logistic Regression Model
For a predictor variable that has only two levels, you can exponentiate twice the parameter estimate that JMP provides to obtain the odds ratio.

Estimated odds ratio = exp(2*parameter estimate)

What are the odds that a female purchases more than 100 dollars in items compared to a male?

logit(p-hat) = -0.6141 + 0.3019*(Gender)

odds ratio = exp(2*0.3019) = 1.83
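The factor of 2 can be illustrated numerically. The sketch below assumes that the two-level predictor is coded +1 and -1, so the two levels differ by 2 on the coded scale. That coding, and the negative sign on the intercept, are assumptions made for illustration; the intercept cancels out of the odds ratio in any case.

```python
import math

# Parameter estimates as reconstructed from the slide; the sign of
# the intercept is an assumption and does not affect the odds ratio.
intercept = -0.6141
gender_estimate = 0.3019

def fitted_logit(gender_code):
    """Fitted log odds for a level coded +1 or -1 (assumed coding)."""
    return intercept + gender_estimate * gender_code

# Difference in log odds between the two levels is 2 * estimate.
logit_difference = fitted_logit(+1) - fitted_logit(-1)
odds_ratio = math.exp(logit_difference)

print(round(logit_difference, 4), round(odds_ratio, 2))
```

Exponentiating the difference in log odds gives the same value as exp(2*parameter estimate), about 1.83.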


Odds Ratios


Exercise


1.08 Multiple Choice Poll
Suppose processes A and B are used to make a product, and each product is evaluated as defective or non-defective. Suppose the probability of a defective from A is .2 and of a non-defective from A is .8. Which is true?

a. The odds of a defective from group A is given by .8/.2 = 4.
b. The odds of a defective from group A is given by .2/.8 = .25.

1.08 Multiple Choice Poll – Correct Answer
b. The odds of a defective from group A is given by .2/.8 = .25.

1.09 Multiple Choice Poll
The odds of getting a defective product from process A is .25. What is its interpretation?

a. You expect only 1/4 as many defectives as non-defectives from process A.
b. You expect only 1/4 as many defectives as non-defectives from process B.

1.09 Multiple Choice Poll – Correct Answer
a. You expect only 1/4 as many defectives as non-defectives from process A.
