Chi-Square test
Presentation 12
What does it mean for two categorical variables to be related?
Remember that Chi-Square is used to test for a relationship between 2 categorical variables.
H0: There is no relationship between the variables.
Ha: There is a relationship between the variables.
If two categorical variables are related, it means the chance that an individual falls into a particular category for one variable depends upon the particular category they fall into for the other variable.
Let's say that we wanted to determine if there is a relationship between religion (Christian, Jew, Muslim, Other) and smoking. When we test if there is a relationship between these two variables, we are trying to determine if being part of a particular religion makes an individual more likely to be a smoker. If that is the case, then we can say that Religion and Smoking are related or associated.
Chi-Square test for 2-way tables
Suppose we are studying two categorical variables in a population, where the first variable has r levels (i.e. possible outcomes) and the second one has c levels.
We can summarize a sample from this population using a table with r rows and c columns.
A two-way table, also called a contingency table, displays the counts of how many individuals fall into each possible combination of categories of two categorical variables. So, each cell of the table (total number of cells is r x c) represents a combination of categories of the two variables.
The following table presents the data on race and smoking. The two variables of interest, race and smoking, have r = 4 and c = 2, resulting in 4x2=8 combinations of categories.
Race        NSmoke   Smoke
Caucasian   620      75
Black       240      41
Hispanic    130      29
Other       190      38
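To make "related" concrete for this table, one can compare the proportion of smokers within each race: if race and smoking were unrelated, these proportions would be roughly equal. The following is a minimal Python sketch; the names `counts` and `smoke_rate` are illustrative and not part of the presentation.

```python
# Observed counts from the race/smoking table: (non-smokers, smokers).
counts = {
    "Caucasian": (620, 75),
    "Black": (240, 41),
    "Hispanic": (130, 29),
    "Other": (190, 38),
}

# Proportion of smokers within each race (row-wise conditional proportion).
smoke_rate = {race: smoke / (nsmoke + smoke)
              for race, (nsmoke, smoke) in counts.items()}

for race, rate in smoke_rate.items():
    print(f"{race:10s} {rate:.3f}")
```

Whether the differences between these proportions are larger than chance alone would produce is the question the chi-square test addresses.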
By considering the number of observations falling into each category, we will see how to test hypotheses of the form:
H0: The two variables are not associated.
Ha: The two variables are associated.
Two different experimental situations will lead to contingency tables:
1. We have two populations under study, and both are classified by a particular trait, i.e. a categorical variable. In this case the null hypothesis is a statement of homogeneity among the two populations.
2. We have one population under study, and we are interested in checking the relationship between two categorical variables. In this case the null hypothesis is a statement of independence between the two variables.
For sufficiently large samples, the same test is appropriate for both of these situations. This test is called the chi-square test, and in the following we will go over the steps for testing the relationship between two variables.
Some Notation!
For i taking values from 1 to r (number of rows) and j taking values from 1 to c (number of columns), denote:
R_i = total count of observations in the i-th row.
C_j = total count of observations in the j-th column.
O_ij = observed count for the cell in the i-th row and the j-th column.
E_ij = expected count for the cell in the i-th row and the j-th column if the two variables were independent, i.e. if H0 was true. These counts are calculated as
$$E_{ij} = \frac{R_i \times C_j}{n},$$
thus,
$$\text{Expected} = \frac{\text{Row total} \times \text{Column total}}{\text{Total sample size } n}.$$
Example
Race        NSmoke        Smoke        Total
Caucasian   O_11 = 620    O_12 = 75    R_1 = 695
Black       O_21 = 240    O_22 = 41    R_2 = 281
Hispanic    O_31 = 130    O_32 = 29    R_3 = 159
Other       O_41 = 190    O_42 = 38    R_4 = 228
Total       C_1 = 1180    C_2 = 183    n = 1363
E_11 = (695x1180)/1363    E_12 = (695x183)/1363
E_21 = (281x1180)/1363    E_22 = (281x183)/1363
E_31 = (159x1180)/1363    E_32 = (159x183)/1363
E_41 = (228x1180)/1363    E_42 = (228x183)/1363
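The expected counts above can be computed mechanically from the observed table. The sketch below is plain Python; the variable names (`observed`, `row_totals`, `col_totals`, `expected`) are chosen for illustration and mirror O_ij, R_i, C_j and E_ij.

```python
# Observed counts: rows = race, columns = (NSmoke, Smoke).
observed = [
    [620, 75],   # Caucasian
    [240, 41],   # Black
    [130, 29],   # Hispanic
    [190, 38],   # Other
]

row_totals = [sum(row) for row in observed]          # R_i
col_totals = [sum(col) for col in zip(*observed)]    # C_j
n = sum(row_totals)                                  # total sample size

# E_ij = R_i * C_j / n
expected = [[r * c / n for c in col_totals] for r in row_totals]

for row in expected:
    print([round(e, 2) for e in row])
```

Note that the expected counts in each row add up to that row's total, so the expected table has the same margins as the observed one.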
Chi-Square Analysis Details
The 5 Steps in a Chi-Square Test:
Step 1: Write the null and alternative hypotheses.
H0: There is no relationship between the variables.
Ha: There is a relationship between the variables.
Step 2: Check conditions.
A) All expected counts should be > 1.
B) At least 80% of expected counts should be > 5.
Step 3: Calculate Test Statistic and p-value.
The test statistic measures the difference between the observed counts and the expected counts assuming independence:
$$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} = \sum_{\text{all cells}} \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$
This is called the chi-square statistic because if the null hypothesis is true, then it has a chi-square distribution with (r-1)x(c-1) degrees of freedom.
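As a sketch of how the statistic and its degrees of freedom follow from a table of counts, the function below sums (Observed - Expected)^2 / Expected over all cells. The function name `chi_square_statistic` is hypothetical, and the race/smoking counts are used only as sample input.

```python
def chi_square_statistic(observed):
    """Return (statistic, degrees of freedom) for a two-way table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n   # expected count E_ij
            stat += (o - e) ** 2 / e
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df

race_smoking = [[620, 75], [240, 41], [130, 29], [190, 38]]
stat, df = chi_square_statistic(race_smoking)
print(f"chi-square = {stat:.3f}, df = {df}")
```

A table whose rows are exactly proportional to each other gives a statistic of 0, since every observed count then equals its expected count.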
Step 3 Cont. Find the p-value.
If the χ²-statistic is large, it implies that the observed counts are not close to the counts we would expect to see if the two variables were independent. Thus, a "large" χ² gives evidence against the null hypothesis and supports the alternative.
The p-value of the chi-square test is the probability that the χ²-statistic is as large as or larger than the value we obtained if H0 is true. Also, if H0 is true, the χ²-statistic has a chi-square distribution with (r-1)x(c-1) df.
Thus, the p-value for the Chi-Square test is ALWAYS the area to the right of the test statistic under the curve, i.e. p-value = P(X > χ²), where X has a chi-square distribution with (r-1)x(c-1) df.
To get this probability we need to use a chi-square distribution with (r-1)x(c-1) df (Table A.4). Using Minitab, or any other statistical software, you can obtain the p-value from the output. Otherwise, you can report a range for the p-value using Table A.4 (since usually you will not be able to find the exact p-value in the table).
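For readers without Minitab or a table at hand, the right-tail area can be sketched with Python's standard library alone. The function name `chi_square_p_value` is hypothetical. It relies on the closed-form tail of the chi-square distribution for whole-number degrees of freedom, which is not part of the presentation. If SciPy is installed, `scipy.stats.chi2.sf(stat, df)` should give the same area.

```python
import math

def chi_square_p_value(stat, df):
    """Right-tail area P(X > stat) for X ~ chi-square with integer df >= 1."""
    y = stat / 2.0
    if df % 2 == 0:
        a, tail = 1.0, math.exp(-y)              # df = 2: P(X > x) = exp(-x/2)
    else:
        a, tail = 0.5, math.erfc(math.sqrt(y))   # df = 1: P(X > x) = erfc(sqrt(x/2))
    # Each pass raises the degrees of freedom by 2 (shape parameter a by 1).
    while a < df / 2.0:
        tail += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return tail

print(chi_square_p_value(3.841, 1))   # close to 0.05
print(chi_square_p_value(7.815, 3))   # close to 0.05
```

The two printed values use the familiar 5% critical values for 1 and 3 df, so both should come out near 0.05.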
Step 4: Decide whether or not the result is statistically significant.
The results are statistically significant if the p-value is less than alpha, where alpha is the significance level (usually α = 0.05).
Step 5: Report the conclusion in the context of the situation.
The p-value is ______ which is < α, so this result is statistically significant. Reject H0. Conclude that (the two variables) are related.
The p-value is ______ which is > α, so this result is NOT statistically significant. We cannot reject H0. Cannot conclude that (the two variables) are related.
Detailed Example
Derek wants to know if the geographical area that a student grew up in is associated with whether or not the student drinks alcohol. Below are the results he obtained from a random sample of PSU students:
             No    Yes   Total
Big City     21    65    86
Rural        11    130   141
Small Town   18    198   216
Suburban     37    345   382
Total        87    738   825
1. H0: There is no relationship between the geographical area that a student grew up in and whether or not the student drinks alcohol.
Ha: There is a relationship between the geographical area that a student grew up in and whether or not the student drinks alcohol.
2. To check the conditions we need to calculate the expected counts for each cell.
E_11 = (R_1 x C_1)/n = (86x87)/825 = 9.07,
E_12 = (R_1 x C_2)/n = (86x738)/825 = 76.93, …
E_32 = (R_3 x C_2)/n = ___________________, …
Here is the Minitab output with the Observed and Expected counts for each cell (observed count on the first line of each row, expected count beneath it). We can see that the conditions are satisfied!
             No      Yes      All
Big_City     21      65       86
             9.07    76.93    86.00
Rural        11      130      141
             14.87   126.13   141.00
SmallTow     18      198      216
             22.78   193.22   216.00
Suburban     37      345      382
             40.28   341.72   382.00
All          87      738      825
             87.00   738.00   825.00
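The two conditions from Step 2 can also be checked programmatically. This is a minimal sketch using Derek's counts; the variable names (`all_above_1`, `share_above_5`, `conditions_ok`) are illustrative.

```python
# Rows: Big City, Rural, Small Town, Suburban; columns: (No, Yes).
observed = [[21, 65], [11, 130], [18, 198], [37, 345]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)
expected = [r * c / n for r in row_totals for c in col_totals]  # flattened E_ij

all_above_1 = all(e > 1 for e in expected)                    # condition A
share_above_5 = sum(e > 5 for e in expected) / len(expected)  # for condition B
conditions_ok = all_above_1 and share_above_5 >= 0.80

print(min(expected), share_above_5, conditions_ok)
```

Here the smallest expected count is the Big City / No cell (about 9.07), so every expected count exceeds 5 and both conditions hold.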
3. Chi-Square statistic and p-value:
χ² = sum {(Observed - Expected)²/Expected}
= (21-9.07)²/9.07 + (65-76.93)²/76.93
+ (11-14.87)²/14.87 + (130-126.13)²/126.13
+ (18-22.78)²/22.78 + (198-193.22)²/193.22
+ (37-40.28)²/40.28 + (345-341.72)²/341.72
= 20.091
df = (4-1)x(2-1) = 3
p-value = P(X > 20.091) < P(X > 16.27) = 0.001 (Table A.4)
4. Since the p-value < 0.05, the test is significant, and we can reject the null.
5. We can conclude that there is a relationship between the geographical area that a student grew up in and whether or not the student drinks alcohol.
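The arithmetic of this example can be reproduced in a few lines of Python. This is a sketch, not Minitab's procedure. The p-value line uses a closed-form tail area that is valid only for 3 degrees of freedom, which is an addition for illustration and does not appear in the presentation.

```python
import math

# Rows: Big City, Rural, Small Town, Suburban; columns: (No, Yes).
observed = [[21, 65], [11, 130], [18, 198], [37, 345]]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Chi-square statistic: sum over all cells of (O - E)^2 / E, with E = R * C / n.
stat = sum((o - r * c / n) ** 2 / (r * c / n)
           for row, r in zip(observed, row_totals)
           for o, c in zip(row, col_totals))
df = (len(row_totals) - 1) * (len(col_totals) - 1)

# Right-tail area for df = 3 only, with y = stat / 2:
# erfc(sqrt(y)) + sqrt(y) * exp(-y) / Gamma(3/2).
y = stat / 2
p_value = math.erfc(math.sqrt(y)) + math.sqrt(y) * math.exp(-y) / math.gamma(1.5)

print(f"chi-square = {stat:.3f}, df = {df}, p-value = {p_value:.5f}")
```

Because the expected counts are not rounded before squaring, the statistic agrees with the 20.091 reported above, and the p-value falls below 0.001 as the table lookup indicates.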
Special Case - Analyzing 2x2 tables
In a lot of cases the categorical variables of interest have two levels each. In this case, we can summarize the data using a contingency table having two rows and two columns (i.e. r=c=2). The general form of a 2x2 table is
         Column 1   Column 2   Total
Row 1    A          B          R_1
Row 2    C          D          R_2
Total    C_1        C_2        n
In this case, the chi-square statistic has the following simplified form,
$$\chi^2 = \frac{n\,(AD - BC)^2}{R_1 R_2 C_1 C_2}.$$
Under the null hypothesis, the χ²-statistic has a chi-square distribution with (2-1)x(2-1)=1 degree of freedom.
Example for 2x2 table: Is there a relationship between gender and smoking habits?
Gender   NSmoke   Smoke   Total
Male     540      52      592
Female   325      31      356
Total    865      83      948
Minitab Output
         C1       C2      Total
1        540      52      592
         540.17   51.83
2        325      31      356
         324.83   31.17
Total    865      83      948
Chi-Sq = 0.000 + 0.001 + 0.000 + 0.001 = 0.002
DF = 1, P-Value = 0.968
Minitab uses the general formula of the χ² test statistic. Using the simplified formula for 2x2 tables,
$$\chi^2 = \frac{948\,(540 \times 31 - 52 \times 325)^2}{592 \times 356 \times 865 \times 83} = 0.0016;$$
the difference from the chi-square statistic in the output arises because Minitab rounded to 3 decimal places.
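A short Python check that the simplified 2x2 formula and the general sum over cells agree on the gender/smoking counts. The variable names are illustrative: `a`, `b`, `c`, `d` follow the A, B, C, D layout of the general 2x2 table.

```python
a, b = 540, 52   # Male:   NSmoke, Smoke
c, d = 325, 31   # Female: NSmoke, Smoke

r1, r2 = a + b, c + d      # row totals
c1, c2 = a + c, b + d      # column totals
n = r1 + r2

# Simplified 2x2 formula: n (AD - BC)^2 / (R1 R2 C1 C2).
shortcut = n * (a * d - b * c) ** 2 / (r1 * r2 * c1 * c2)

# General formula: sum of (O - E)^2 / E over the four cells.
cells = [(a, r1, c1), (b, r1, c2), (c, r2, c1), (d, r2, c2)]
general = sum((o - r * col / n) ** 2 / (r * col / n) for o, r, col in cells)

print(f"{shortcut:.4f} {general:.4f}")
```

Both expressions give the same value up to floating-point error, so any gap from the printed Minitab total comes from rounding in the output, not from the choice of formula.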
Relationship Between Chi-Square and 2 Proportions Tests
When do we use Chi-Square and when do we use 2 proportions?
Situation 1: Both categorical variables of interest have exactly 2 levels.
Question - Is there a relationship between the variables, or is there a difference in the proportions?
Answer - Either Chi-Square or a Two-Sided Test of 2 proportions will lead to the same conclusion!
In this case, the χ²-statistic = (z-statistic)², and the p-values of the two tests are equal, i.e.
P(X(1df) > χ²-stat) = 2 P(Z > |z-stat|).
Situation 2: Both categorical variables of interest have exactly 2 levels.
Question - Is one proportion greater/smaller than the other?
Answer - This is a one-sided test and you MUST use a test of 2 proportions.
Situation 3: At least one of the two categorical variables of interest has MORE than 2 levels.
Question - Is there a relationship between the variables?
Answer - MUST use a Chi-Square Test.
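The identity in Situation 1 can be checked numerically. The sketch below uses the gender/smoking counts from the 2x2 example and assumes the z-statistic is computed with the pooled proportion under H0. That pooling is an assumption here, since the presentation does not spell out the z-test formula.

```python
import math

smoke_m, n_m = 52, 592    # male smokers, total males
smoke_f, n_f = 31, 356    # female smokers, total females

p_m, p_f = smoke_m / n_m, smoke_f / n_f
p_pool = (smoke_m + smoke_f) / (n_m + n_f)       # pooled proportion under H0
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_m + 1 / n_f))
z = (p_m - p_f) / se

# 2x2 chi-square statistic from the same counts.
n = n_m + n_f
a, b, c, d = n_m - smoke_m, smoke_m, n_f - smoke_f, smoke_f
chi2 = n * (a * d - b * c) ** 2 / (n_m * n_f * (a + c) * (b + d))

p_two_sided_z = math.erfc(abs(z) / math.sqrt(2))   # 2 * P(Z > |z|)
p_chi2 = math.erfc(math.sqrt(chi2 / 2))            # P(X > chi2) with 1 df

print(z ** 2, chi2, p_two_sided_z, p_chi2)
```

The squared z-statistic matches the chi-square statistic, and the two p-values coincide (about 0.968, as in the Minitab output).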
Examples of Chi-Square and 2-Proportions
Gender   NSmoke   Smoke
Male     540      52
Female   325      31
Q1: Is there a difference in the proportion of males and females that smoke?
Solution: Either a Chi-Square or Test of 2 proportions is fine.
2-proportions:
H0: pm – pf = 0
Ha: pm – pf ≠ 0
Chi-Square:
H0: There is no relationship between Gender and Smoking.
Ha: There is a relationship between Gender and Smoking.
Q2: Is the proportion of males who smoke greater than the proportion of females who smoke?
Solution: Test of 2 proportions, because the alternative is one sided!
2-proportions H0: pm – pf = 0 vs Ha: pm – pf > 0
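For Q2, a one-sided p-value can be sketched as the upper-tail area of the standard normal. As an assumption not stated in the presentation, the z-statistic below uses the pooled proportion under H0.

```python
import math

smoke_m, n_m = 52, 592    # male smokers, total males
smoke_f, n_f = 31, 356    # female smokers, total females

p_m, p_f = smoke_m / n_m, smoke_f / n_f
p_pool = (smoke_m + smoke_f) / (n_m + n_f)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_m + 1 / n_f))
z = (p_m - p_f) / se

# Ha: pm - pf > 0, so the p-value is the upper-tail area P(Z > z).
p_one_sided = 0.5 * math.erfc(z / math.sqrt(2))

print(f"z = {z:.3f}, one-sided p-value = {p_one_sided:.3f}")
```

The chi-square statistic cannot express this direction, because squaring z discards its sign. That is why a one-sided question requires the test of 2 proportions.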
Race        NSmoke   Smoke
Caucasian   620      75
Black       240      41
Hispanic    130      29
Other       190      38
Q: Is there a relationship between Race and Smoking? Is there a difference in the proportion of smokers among the different races?
Solution: Chi-Square because Race has more than 2 levels!
Chi-Square Test
H0: There is no relationship between Race and Smoking.
Ha: There is a relationship between Race and Smoking.