Chapter 1: Stratified Data Analysis
1.1 Introduction
1.2 Examining Associations among Variables
1.3 Recursive Partitioning
1.4 Introduction to Logistic Regression
1.1 Introduction
Objectives
Recognize the differences between categorical and continuous data analysis.
Identify the scale of measurement for your response variable.
Categorical versus Continuous Data Analysis
Identifying the Scale of Measurement
Before analyzing your data, identify the measurement scale of each variable.
[Figure: a variable with the response levels Agree, No Opinion, and Disagree]
Nominal Variables
Variable: Type of Beverage
[Figure: the beverage types can be numbered 1, 2, 3 in one order or in another]
Ordinal Variables
Variable: Size of Beverage
Small Medium Large
Continuous Variables
Variable: Volume of Beverage
[Figure: continuous scale running from 0 to 4.0]
1.01 Quiz
A car dealer records several inventory variables, including Type (automatic or standard), Time (the number of seconds it takes for the car to go from 0 to 60 mph), and Model (basic, middle, or luxury).
Match the modeling type on the left with the appropriate variable on the right.
1. Continuous A. Type
2. Ordinal B. Time
3. Nominal C. Model
1.01 Quiz – Correct Answer
1-B, 2-C, 3-A
Time is continuous, Model is ordinal, and Type is nominal.
What’s Next?
[Figure: the opinion variable (Agree, No Opinion, Disagree) is recognized as ordinal]
1.2 Examining Associations among Variables
Objectives
Examine the distribution of categorical variables.
Determine whether an association exists among categorical variables.
Perform a stratified analysis of categorical variables.
Sample Data Set
This demonstration illustrates the concepts discussed previously.
Examining Distributions
Association
An association exists between two variables if the distribution of one variable changes when the level (or value) of the other variable changes.
If there is no association, the distribution of the first variable is the same, regardless of the level of the other variable.
No Association
Is your manager’s mood associated with the weather?
[Figure: mood is split 72% to 28% under both weather conditions]
Association
Is your manager’s mood associated with the weather?
[Figure: mood is split 82% to 18% under one weather condition and 60% to 40% under the other]
This demonstration illustrates the concepts discussed previously.
Recognizing Associations
1.02 Quiz
Is there an association between finishing a prescription (Rx) and experiencing a relapse?
1.02 Quiz – Correct Answer
Yes. The distribution of Yes/No for Did not finish Rx is different from the distribution of Yes/No for Finished Rx.
Tests for Association
Row percents of Income by Purchase
Income    Purchase: $100 +    Purchase: Under $100
Low       32%                 68%
Medium    32%                 68%
High      48%                 52%
Null Hypothesis
There is no association between Income and Purchase.
The probability of purchasing items of $100 or more is the same, regardless of income level.
Alternative Hypothesis
There is an association between Income and Purchase.
The probability of purchasing items of $100 or more is not the same for Low, Medium, and High income customers.
Chi-Square Test
NO ASSOCIATION: observed frequencies = expected frequencies
ASSOCIATION: observed frequencies ≠ expected frequencies
p-Value for Chi-Square Test
This p-value is the probability of observing a chi-square statistic at least as large as the one actually observed, given that there is no association between the variables.
Informally, it is the probability of the association you observe in the data occurring by chance.
Chi-Square Tests
Chi-square tests and the corresponding p-values
determine whether an association exists
do not measure the strength of an association
depend on and reflect the sample size.
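The dependence on sample size can be seen in a minimal Python sketch (not JMP output) that computes the Pearson chi-square statistic for a two-way table. The counts are hypothetical and chosen only for illustration. Multiplying every cell by 10 leaves the row percentages unchanged but makes the statistic 10 times larger.

```python
def pearson_chi_square(table):
    """Pearson chi-square statistic for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under no association
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts (not from the course data)
small = [[12, 8], [8, 12]]
large = [[120, 80], [80, 120]]  # same percentages, 10 times the sample size

chi_small = pearson_chi_square(small)
chi_large = pearson_chi_square(large)
print(chi_small, chi_large)
```

The strength of the association is identical in both tables, so a larger statistic (and smaller p-value) here reflects only the larger sample.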
This demonstration illustrates the concepts discussed previously.
Chi-Square Test
1.03 Quiz
Is there sufficient evidence that an association exists between Relapsed and Rx Status?
1.03 Quiz – Correct Answer
Yes, there is sufficient evidence that an association exists between Relapsed and Rx Status. The p-value for the Pearson chi-square statistic is .0005, so at alpha=.05, there is sufficient evidence to reject the null hypothesis (that no association exists) in favor of the alternative (that an association exists).
When Not to Use the Chi-Square Test
When more than 20% of the cells have expected counts less than five
Observed versus Expected Values
Observed Values      Expected Values
1  5  8              3.43  4.57  6.00
5  6  7              4.41  5.88  7.71
6  5  6              4.16  5.55  7.29
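The expected values can be checked with a short Python sketch. It assumes the usual formula under no association, expected count = (row total × column total) / grand total, and then applies the 20% rule to the observed table shown here. The function name is illustrative.

```python
def expected_counts(observed):
    """Expected cell counts under no association."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

observed = [[1, 5, 8],
            [5, 6, 7],
            [6, 5, 6]]

expected = expected_counts(observed)
for row in expected:
    print([round(e, 2) for e in row])

# Share of cells whose expected count is below five
cells = [e for row in expected for e in row]
share_small = sum(e < 5 for e in cells) / len(cells)
print(f"{share_small:.0%} of cells have an expected count below 5")
```

Four of the nine expected counts are below five, which is more than 20% of the cells, so the chi-square test is not recommended for this table.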
Small Samples – Fisher’s Exact Test
[Figure: Fisher’s Exact Test shown along a sample size scale from Small to Large]
Example: Tea and Milk
Suppose you want to test whether someone can determine if a cup of tea with milk had the milk poured first or the tea poured first.
Fisher’s Exact Test Example
9 Cups of Tea: 4 with Milk First and 5 with Tea First
Predict which cups had tea poured first.
[Figure: 2x2 table of Actual (M, T) by Guess (M, T) with fixed marginal totals of 4 and 5 for the rows and 4 and 5 for the columns]
Basis for Fisher’s Exact Test
[Figure: with the row and column totals fixed at 4 and 5, the possible Actual (rows M, T) by Guess (columns M, T) tables are (0, 4 / 4, 1), (1, 3 / 3, 2), (2, 2 / 2, 3), (3, 1 / 1, 4), and (4, 0 / 0, 5)]
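With the margins fixed, each possible table has a hypergeometric probability under the null hypothesis. The following Python sketch enumerates the tables for the tea example (4 milk-first cups, 5 tea-first cups, 4 cups guessed as milk-first). The function name and the choice of the upper-left cell as the index are illustrative assumptions.

```python
from math import comb

def table_probability(a, row1=4, row2=5, col1=4):
    """Hypergeometric probability of a 2x2 table with fixed margins.

    a is the upper-left cell: milk-first cups that were guessed milk-first.
    """
    n = row1 + row2
    return comb(row1, a) * comb(row2, col1 - a) / comb(n, col1)

# a can range from 0 to 4 for these margins
probs = {a: table_probability(a) for a in range(0, 5)}
for a, p in probs.items():
    print(a, round(p, 4))
```

The five probabilities sum to 1, and the p-values of Fisher’s exact test are sums over subsets of these tables.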
Fisher’s Exact Test Hypotheses
Null Hypothesis: There is no association.
Alternative Hypothesis: There is an association (two-tailed, left-tailed, or right-tailed).
Left-Tailed Alternative Hypothesis
[Figure: left-tailed p-value, showing the Actual by Guess tables (0, 4 / 4, 1) and (1, 3 / 3, 2)]
Right-Tailed Alternative Hypothesis
[Figure: right-tailed p-value, showing the Actual by Guess tables (1, 3 / 3, 2), (2, 2 / 2, 3), (3, 1 / 1, 4), and (4, 0 / 0, 5)]
Two-Tailed Alternative Hypothesis
[Figure: two-tailed p-value, showing the Actual by Guess tables (0, 4 / 4, 1), (1, 3 / 3, 2), (2, 2 / 2, 3), (3, 1 / 1, 4), and (4, 0 / 0, 5)]
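The three p-values can be sketched in Python for the tea example. This assumes the observed table has 1 in the upper-left cell (one milk-first cup guessed as milk-first), and it assumes the common two-tailed definition that sums the probabilities of all tables no more likely than the observed one. Software may label the tails differently depending on how the table is arranged.

```python
from math import comb

def fisher_p_values(a_obs, row1, row2, col1):
    """Left-, right-, and two-tailed p-values for a 2x2 table with fixed margins."""
    n = row1 + row2
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    prob = {a: comb(row1, a) * comb(row2, col1 - a) / comb(n, col1)
            for a in range(lo, hi + 1)}
    left = sum(p for a, p in prob.items() if a <= a_obs)
    right = sum(p for a, p in prob.items() if a >= a_obs)
    # The small tolerance guards against floating-point ties
    two = sum(p for p in prob.values() if p <= prob[a_obs] * (1 + 1e-9))
    return left, right, two

left, right, two_sided = fisher_p_values(a_obs=1, row1=4, row2=5, col1=4)
print(round(left, 4), round(right, 4), round(two_sided, 4))
```

The left and right tails overlap at the observed table, so the two one-tailed p-values add up to more than 1.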
This demonstration illustrates the concepts discussed previously.
Fisher’s Exact Test
1.04 Quiz
What can you conclude from each of the p-values from the Fisher’s Exact Test for the association between Relapsed and Rx Status?
1.04 Quiz – Correct Answer
The Left p-value = .0007, so there is sufficient evidence to conclude that the probability of a relapse is greater for those who did not finish the Rx than for those who did.
The Right p-value = .9999, so there is not sufficient evidence to conclude that the probability of a relapse is greater for those who finished the Rx than for those who did not.
The 2-Tail p-value = .0008, so there is sufficient evidence to conclude that the probability of a relapse is different depending on whether a Rx was finished or not.
What Happens If There Is a Third Variable?
[Figure: Income, Gender, and $100 purchases]
Stratified Data Analysis
Stratified data analysis is the process of dividing subjects into groups defined by the levels of a third variable.
Use this analysis when you want to examine the association between two variables within the levels of a third variable.
Stratified Data Analysis
Of the 39 single people, 23% have lung cancer and 77% do not. Of the 36 married people, 17% have lung cancer and 83% do not.
Of the 28 single smokers, 28% have lung cancer and 72% do not. Of the 14 married smokers, 28% have lung cancer and 72% do not.
Within the smoker stratum, the lung cancer rate is the same for single and married people.
Cochran-Mantel-Haenszel Statistics
CMH versus Chi-Square
1. Correlation of Scores
Test linear association
2. Row Scores by Column Categories
Test equal row scores
3. Column Scores by Row Categories
Test equal column scores
4. General Association of Categories
Test general association
CMH Statistics and 2x2 Tables
For 2 x 2 tables, the CMH statistics are all equal.
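When Do CMH Statistics Lack Power?
Response Reversed in Strata
CMH statistics lack power when the direction of the association is reversed from one stratum to another.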
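The loss of power can be illustrated with a small Python sketch of the CMH statistic for a set of 2x2 tables, one per stratum. It uses the standard form without a continuity correction, which is an assumption about the variant rather than a statement of what JMP reports. The counts are hypothetical.

```python
def cmh_statistic(strata):
    """CMH statistic for a list of 2x2 tables [[a, b], [c, d]], one per stratum."""
    diff_sum = 0.0  # sum over strata of (observed a - expected a)
    var_sum = 0.0   # sum over strata of the variance of a
    for (a, b), (c, d) in strata:
        n = a + b + c + d
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        diff_sum += a - row1 * col1 / n
        var_sum += row1 * row2 * col1 * col2 / (n ** 2 * (n - 1))
    return diff_sum ** 2 / var_sum

# Hypothetical strata: same direction of association in both
same_direction = [[[30, 10], [10, 30]], [[30, 10], [10, 30]]]
# Hypothetical strata: direction reversed in the second stratum
reversed_direction = [[[30, 10], [10, 30]], [[10, 30], [30, 10]]]

cmh_same = cmh_statistic(same_direction)
cmh_reversed = cmh_statistic(reversed_direction)
print(cmh_same, cmh_reversed)
```

Each stratum shows a strong association on its own, but when the directions are opposite the differences cancel in the numerator and the statistic drops to zero.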
This demonstration illustrates the concepts discussed previously.
CMH Tests
Exercise
This exercise reinforces the concepts discussed previously.
1.05 Multiple Choice Poll
The Correlation of Scores CMH test has which null hypothesis?
a. There is no linear association between the row and column variables in any stratum.
b. The mean scores for each column are equal in each stratum.
c. The mean scores for each row are equal in each stratum.
d. There is no association between the row and column variables in any stratum.
1.05 Multiple Choice Poll – Correct Answer
a. There is no linear association between the row and column variables in any stratum.
1.3 Recursive Partitioning
Objectives
Define partitioning.
Understand the splitting criteria used in JMP.
Review algorithm parameters available in JMP.
Use the Partition platform in JMP.
Recursive Partitioning
Partitioning refers to segmenting the data into groups that are as homogeneous as possible with respect to the dependent variable (Y).
Divide and Conquer
What factors affect the country from which cars are purchased?
[Figure: tree for the response Country; the root node (n = 303) is split on size into Large (n = 42) and Medium, Small (n = 261)]
Tree Algorithm: Calculate Separation of the Response
[Figure: separation of the response along X1]
Tree Algorithm: Find Best Split for the Independent Variable
[Figure: best split on X1]
Tree Algorithm: Repeat for the Other Independent Variables
[Figure: separation of the response along X2]
Tree Algorithm: Compare the Best Splits
[Figure: best split on X2 compared with best split on X1]
Tree Algorithm: Partition with Best Split
Tree Algorithm: Repeat within Partitions
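The steps of the tree algorithm can be sketched in Python for a categorical response. This is a simplified illustration, not JMP’s implementation: it handles only numeric predictors, scores each candidate split with the raw likelihood-ratio chi-square (G²) rather than a LogWorth, and uses hypothetical data and names (x1, x2, y).

```python
from math import log

def g_squared(left, right):
    """Likelihood-ratio chi-square (G^2) for splitting labels into two groups."""
    classes = set(left) | set(right)
    n = len(left) + len(right)
    stat = 0.0
    for group in (left, right):
        for c in classes:
            observed = group.count(c)
            expected = len(group) * (left.count(c) + right.count(c)) / n
            if observed > 0:
                stat += 2 * observed * log(observed / expected)
    return stat

def best_split(rows, predictors, response):
    """Return (score, predictor, cut) for the best binary split x < cut."""
    best = None
    for x in predictors:                      # repeat for every predictor
        values = sorted({row[x] for row in rows})
        for cut in values[1:]:                # both sides stay non-empty
            left = [row[response] for row in rows if row[x] < cut]
            right = [row[response] for row in rows if row[x] >= cut]
            score = g_squared(left, right)
            if best is None or score > best[0]:
                best = (score, x, cut)        # keep the best split seen so far
    return best

# Hypothetical data
rows = [
    {"x1": 1, "x2": 5, "y": "A"},
    {"x1": 2, "x2": 3, "y": "A"},
    {"x1": 3, "x2": 6, "y": "A"},
    {"x1": 4, "x2": 2, "y": "B"},
    {"x1": 5, "x2": 4, "y": "B"},
    {"x1": 6, "x2": 1, "y": "B"},
]

score, predictor, cut = best_split(rows, ["x1", "x2"], "y")
print(predictor, cut, round(score, 3))
```

To grow the tree, the same search would be repeated within each of the two partitions produced by the winning split.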
This demonstration illustrates the concepts discussed previously.
Recursive Partitioning
Exercise
This exercise reinforces the concepts discussed previously.
1.06 Quiz
In which leaf, and on what variable, will JMP next split?
1.06 Quiz – Correct Answer
Of the leaves, the highest LogWorth is for Age (.7313), in the Gender(Female) leaf. This is where JMP will next split.
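LogWorth is defined as the negative base-10 logarithm of a p-value, so a larger LogWorth means a smaller p-value. The sketch below converts in both directions. It treats the p-value as given, since JMP applies its own adjustment when computing the p-value for a candidate split.

```python
from math import log10

def logworth(p_value):
    """LogWorth = -log10(p-value)."""
    return -log10(p_value)

def p_from_logworth(value):
    """Invert the LogWorth back to a p-value."""
    return 10 ** (-value)

# p-value implied by the LogWorth of .7313 reported for Age
p_age = p_from_logworth(0.7313)
print(round(p_age, 3))

# A p-value of 0.01 corresponds to a LogWorth of 2
print(logworth(0.01))
```

The split on Age is the best candidate among the leaves, but a LogWorth below 1 corresponds to a p-value above 0.1, so the evidence for that split is modest.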
1.4 Introduction to Logistic Regression
Objectives
Explain the concepts of logistic regression.
Fit a logistic regression model using JMP software.
Examine logistic regression output.
Overview
Predictor                    Response       Analysis
Categorical or Continuous    Continuous     Linear Regression Analysis
Categorical or Continuous    Categorical    Logistic Regression Analysis
Types of Logistic Regression
Response Variable                       Type of Logistic Regression
Two Categories (Yes, No)                Binary
Three or More Categories, Nominal       Nominal
Three or More Categories, Ordinal       Ordinal
What Does Logistic Regression Do?
The logistic regression model uses the predictor variables, which can be categorical or continuous, to predict the probability of specific outcomes.
In other words, logistic regression is designed to describe probabilities associated with the values of the response variable.
The Logistic Curve
The relationship between the probability of a response and a predictor variable might be an S-shaped curve.
Linear regression cannot model this relationship, but logistic regression can.
Logistic Regression Curves
This graph shows the relationship between the probability of Sale and Price.
Logit Transformation
$\operatorname{logit}(p_i) = \log\left(\frac{p_i}{1 - p_i}\right)$
where
i indexes all cases (observations).
p_i is the probability that the event (a sale, for example) occurs in the ith case.
1 - p_i is the probability that the event (a sale, for example) does not occur in the ith case.
log is the natural log (to the base e).
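The logit transformation and its inverse can be written in a few lines of Python. The function names are illustrative. The inverse maps any real number back to a probability between 0 and 1, which is what produces the S-shaped curve.

```python
from math import exp, log

def logit(p):
    """Log odds of a probability p, with 0 < p < 1."""
    return log(p / (1 - p))

def inverse_logit(x):
    """Probability corresponding to a log odds value x."""
    return 1 / (1 + exp(-x))

for p in (0.1, 0.25, 0.5, 0.9):
    x = logit(p)
    print(p, round(x, 4), round(inverse_logit(x), 4))
```

A probability of 0.5 has a logit of 0, and probabilities that are equally far from 0.5 (such as 0.1 and 0.9) have logits of equal size and opposite sign.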
Assumption
[Figure: p_i plotted against the predictor, and the logit transform of p_i plotted against the predictor]
Logistic Regression Model
logit (pi) = B0 + B1X1
where
logit(pi) is the logit transformation of the probability of the event
B0 is the intercept of the regression line
B1 is the slope of the regression line.
Likelihood Function
A likelihood function expresses the probability of the observed data as a function of the unknown parameters.
The goal is to derive values of the parameters such that the probability of the observed data is as large as possible.
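As a rough illustration of this idea, the Python sketch below evaluates the log-likelihood of a binary logistic model, logit(p_i) = b0 + b1 * x_i, at two candidate parameter settings. The data and the parameter values are hypothetical, and the sketch only compares two candidates; it does not search for the maximum.

```python
from math import exp, log

def log_likelihood(b0, b1, xs, ys):
    """Log-likelihood of logit(p_i) = b0 + b1 * x_i for 0/1 responses ys."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = 1 / (1 + exp(-(b0 + b1 * x)))  # inverse logit
        total += log(p) if y == 1 else log(1 - p)
    return total

# Hypothetical data: x is a predictor, y marks whether the event occurred
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 1, 0, 1, 1]

flat = log_likelihood(0.0, 0.0, xs, ys)     # every p_i equals 0.5
sloped = log_likelihood(-3.5, 1.0, xs, ys)  # p_i rises with x
print(round(flat, 4), round(sloped, 4))
```

The setting with the higher log-likelihood makes the observed data more probable. Maximum likelihood estimation looks for the parameter values where this quantity is largest.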
Maximum Likelihood Estimate
[Figure: log-likelihood curve]
Model Inference
[Figure: log-likelihood function, with LogL0 and LogL1 marked on the log-likelihood axis]
Logistic Curve
[Figure: logistic curves for a weak relationship, a strong relationship, and a very strong relationship]
Example of Binary Logistic Regression Model
You want to predict the probability of defaulting on credit card payments based on having or not having a history of late payments. You can postulate this model:
logit (Probability of Defaulting) = B0 + B1*(Late Payment)
This demonstration illustrates the concepts discussed previously.
Binary Logistic Regression
1.07 Quiz
You want to predict the probability of a defect, given the width of a product. What kind of association exists between Defect and Width – a strong relationship or a weak relationship?
1.07 Quiz – Correct Answer
Weak – The fitted regression line is nearly flat, indicating a weak association between Defect and Width.
Multiple Logistic Regression
Interaction
This demonstration illustrates the concepts discussed previously.
Multiple Logistic Regression
What Is an Odds Ratio?
An odds ratio indicates how much more likely, with respect to odds, a certain event occurs in one group relative to its occurrence in another group.
Example: How much more likely are females to purchase 100 dollars or more in items compared to males?
Example: How much more likely is a person with a history of late payments on credit cards to default on a loan relative to a person who does not have a history of late payments?
Probability of Outcome
                               Default on Loan
                               Yes     No      Total
Yes Late Payments (Group A)    20      60      80
No Late Payments (Group B)     10      90      100
Total                          30      150     180
Probability of defaulting in Group A = 20/80 (.25)
Probability of not defaulting in Group A = 60/80 (.75)
Odds
Odds of Outcome in Group A
(probability of defaulting in group with history of late payments) ÷ (probability of not defaulting in group with history of late payments)
0.25 ÷ 0.75 = 0.33
Odds Ratio
Odds Ratio of Group A to Group B
(odds of defaulting in group with history of late payments) ÷ (odds of defaulting in group with no history of late payments)
0.33 ÷ 0.11 = 3
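The same arithmetic can be reproduced from the counts in the Default on Loan table. This Python sketch uses illustrative function names and works from the unrounded values, so the odds ratio comes out as 3 without the intermediate rounding to 0.33 and 0.11.

```python
def probability(events, total):
    """Proportion of cases in which the event occurs."""
    return events / total

def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1 - p)

# Group A: history of late payments (20 defaults out of 80)
# Group B: no history of late payments (10 defaults out of 100)
p_a = probability(20, 80)
p_b = probability(10, 100)

odds_a = odds(p_a)
odds_b = odds(p_b)
odds_ratio = odds_a / odds_b
print(round(odds_a, 2), round(odds_b, 2), round(odds_ratio, 2))
```

The odds of defaulting are three times as high in the group with a history of late payments. Note that this is a ratio of odds, not of probabilities: the ratio of the probabilities is .25 ÷ .10 = 2.5.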
Properties of the Odds Ratio
Odds Ratio from a Logistic Regression Model
For a predictor variable that has only two levels, you can exponentiate twice the parameter estimate that JMP provides to obtain the odds ratio.
Estimated odds ratio = exp(2*parameter estimate)
What are the odds that a female purchases more than 100 dollars in items compared to a male?
[Equation: fitted model for logit(p̂) with an intercept term of 0.6141 and a Gender coefficient of 0.3019; the operators and signs are missing in the source]
odds ratio = exp(2 × 0.3019) = 1.83
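The factor of 2 can be checked with a short Python sketch. It assumes the two levels of the predictor are coded +1 and -1, so that they sit 2 units apart on the logit scale; that coding is an assumption stated here to explain the formula, not something shown in the output above.

```python
from math import exp, log

estimate = 0.3019                # Gender parameter estimate from the example
odds_ratio = exp(2 * estimate)
print(round(odds_ratio, 2))      # 1.83

# Under the assumed +1/-1 coding, the logits of the two levels differ by
# estimate * (+1) - estimate * (-1) = 2 * estimate.
logit_gap = estimate * (+1) - estimate * (-1)
same_ratio = exp(logit_gap)

# Going the other way: an odds ratio of 3 would correspond to an
# estimate of log(3) / 2 under the same coding.
estimate_for_3 = log(3) / 2
print(round(estimate_for_3, 4), round(exp(2 * estimate_for_3), 2))
```

If the predictor were coded 0 and 1 instead, the levels would be 1 unit apart and the odds ratio would be exp(parameter estimate) without the factor of 2.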
This demonstration illustrates the concepts discussed previously.
Odds Ratios
Exercise
This exercise reinforces the concepts discussed previously.
1.08 Multiple Choice Poll
Suppose processes A and B are used to make a product, and each product is evaluated as defective or non-defective. Suppose the probability of a defective from A is .2 and of a non-defective from A is .8. Which is true?
a. The odds of a defective from process A is given by .8/.2 = 4.
b. The odds of a defective from process A is given by .2/.8 = .25.
1.08 Multiple Choice Poll – Correct Answer
b. The odds of a defective from process A is given by .2/.8 = .25.
1.09 Multiple Choice Poll
The odds of getting a defective product from process A is .25. What is its interpretation?
a. You expect only 1/4 as many defectives as non-defectives from process A.
b. You expect only 1/4 as many defectives as non-defectives from process B.
1.09 Multiple Choice Poll – Correct Answer
a. You expect only 1/4 as many defectives as non-defectives from process A.